test: add summary-cache pytest-benchmark suite with CI regression gate#120
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds summary-cache benchmarks, refreshed baselines, a regression-check script, and CI workflow steps to run, validate, and upload benchmark results. ChangesSummary Cache Benchmark Validation
Sequence Diagram(s)sequenceDiagram
participant benchmarks_job as benchmarks job
participant pytest as pytest
participant regression_script as scripts/check_benchmark_regression.py
participant upload_artifact as actions/upload-artifact
benchmarks_job->>pytest: run tests/benchmarks/ with --benchmark-only
pytest-->>benchmarks_job: benchmark-results.json
benchmarks_job->>regression_script: compare benchmark-results.json to benchmarks/baselines.json
regression_script-->>benchmarks_job: exit code 0, 1, or 2
benchmarks_job->>upload_artifact: upload benchmark-results.json with if: always()
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
scripts/check_benchmark_regression.py (1)
95-99: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick winA missing benchmark result silently passes the gate.
When a baseline benchmark has no current result, the script only prints a
WARNand continues. If a gated benchmark is renamed, removed, or fails to run, the regression gate will report success and the performance coverage silently disappears. Consider treating a missing gated benchmark as a failure (or a separate exit code) so coverage gaps are surfaced.♻️ Possible approach
failures: list[str] = [] + missing: list[str] = [] for name, base in baseline_means.items(): cur = flat.get(name) if cur is None: - print(f"WARN: no current result for baseline {name!r}; skipping") + print(f"MISSING: no current result for baseline {name!r}") + missing.append(name) continueThen fail (or warn-as-error) when
missingis non-empty, per your policy.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/check_benchmark_regression.py` around lines 95 - 99, The benchmark gating in check_benchmark_regression.py currently skips missing current results in the baseline loop, which can let renamed, removed, or failing benchmarks pass unnoticed. Update the logic around the baseline_means/flat comparison so missing gated benchmarks are collected and treated as a failure (or a distinct non-zero exit condition) instead of only printing a warning. Make the behavior explicit in the final exit path of the script so coverage gaps are surfaced alongside the existing regression checks.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@benchmarks/baselines.json`:
- Around line 1-15: The benchmark baselines in the summary-cache group are still
using Windows-derived values with 1.5x slack, but the regression gate runs on
ubuntu-latest. Refresh the values from a green ubuntu-latest benchmark artifact
and keep the _note consistent with the source platform; if platform-matched
baselines are not yet available, make the gate non-blocking until they are.
Focus on the baselines.json entries under groups.summary-cache so the
check_benchmark_regression.py comparisons are against the same OS.
In `@pyproject.toml`:
- Around line 39-44: The pytest configuration currently applies --benchmark-skip
globally in [tool.pytest.ini_options], which breaks the regular integration
pytest run because pytest-benchmark is not installed there. Update the test
setup so the benchmark plugin is available wherever this addopts setting is
used, or move the --benchmark-skip option to only the benchmark-specific job;
check the pytest configuration and the workflows that invoke python -m pytest,
especially the benchmark-related markers and options.
---
Nitpick comments:
In `@scripts/check_benchmark_regression.py`:
- Around line 95-99: The benchmark gating in check_benchmark_regression.py
currently skips missing current results in the baseline loop, which can let
renamed, removed, or failing benchmarks pass unnoticed. Update the logic around
the baseline_means/flat comparison so missing gated benchmarks are collected and
treated as a failure (or a distinct non-zero exit condition) instead of only
printing a warning. Make the behavior explicit in the final exit path of the
script so coverage gaps are surfaced alongside the existing regression checks.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1cc8514e-6127-42e8-95a6-134382f2ed25
📒 Files selected for processing (7)
.github/workflows/tests.yml.gitignorebenchmarks/baselines.jsonpyproject.tomlscripts/check_benchmark_regression.pytests/benchmarks/conftest.pytests/benchmarks/test_summary_cache_bench.py
|
@CodeRabbit, review this PR in more depth |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: #!/bin/bash
echo "=== .github/workflows/tests.yml ==="
cat -n .github/workflows/tests.yml
echo ""
echo "=== pyproject.toml ==="
cat -n pyproject.toml
echo ""
echo "=== scripts/check_benchmark_regression.py ==="
cat -n scripts/check_benchmark_regression.py
echo ""
echo "=== tests/benchmarks/conftest.py ==="
cat -n tests/benchmarks/conftest.py
echo ""
echo "=== tests/benchmarks/test_summary_cache_bench.py ==="
cat -n tests/benchmarks/test_summary_cache_bench.py
echo ""
echo "=== benchmarks/baselines.json ==="
cat -n benchmarks/baselines.json
echo ""
echo "=== .gitignore (last 20 lines) ==="
tail -20 .gitignore
echo ""
echo "=== requirements-lock.txt ==="
cat -n requirements-lock.txt
echo ""
echo "=== services/summary_cache.py ==="
cat -n services/summary_cache.py
echo ""
echo "=== Directory tree of tests/benchmarks/ ==="
tree tests/benchmarks/ 2>/dev/null || find tests/benchmarks/ -type f
echo ""
echo "=== Directory tree of scripts/ ==="
tree scripts/ 2>/dev/null || find scripts/ -type f
echo ""
echo "=== benchmarks/ directory ==="
tree benchmarks/ 2>/dev/null || find benchmarks/ -type fLength of output: 44058
Deep Review: PR
|
| Severity | Issue | File |
|---|---|---|
| 🔴 Critical | Baseline name format may mismatch pytest-benchmark output | baselines.json, check_benchmark_regression.py |
| 🔴 Critical | Baselines are future-dated and appear manually fabricated | baselines.json |
| 🔴 Critical | benchmarks job has no needs: [unittest] |
tests.yml |
| 🟡 Medium | Missing addopts = "--benchmark-disable" |
pyproject.toml |
| 🟡 Medium | Fragile sys.path manipulation |
conftest.py |
| 🟡 Medium | pytest-benchmark not pinned in lock file |
tests.yml, requirements-lock.txt |
| 🟡 Medium | stale_fingerprint relies on implicit None assumption |
conftest.py |
| 🔵 Minor | Missing __init__.py in tests/benchmarks/ |
— |
| 🔵 Minor | Silent skip of non-dict group values | check_benchmark_regression.py |
| 🔵 Minor | Redundant str() cast |
check_benchmark_regression.py |
| 🔵 Minor | Missing blank line after imports | test_summary_cache_bench.py |
| 🔵 Minor | No unit tests for the regression gate script | — |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/check_benchmark_regression.py`:
- Around line 17-19: The normalize_benchmark_name helper currently uses
rsplit("::", 1), which can incorrectly strip parts of parametrized benchmark
names when parameter values include "::". Update normalize_benchmark_name to
split on the first "::" only so pytest node prefixes are removed without
truncating names like test_x[param::v], and keep the behavior consistent for
both short and full benchmark names.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c745c7f7-791c-4802-9b4a-d7dc19639498
📒 Files selected for processing (8)
.github/workflows/tests.ymlbenchmarks/baselines.jsonpyproject.tomlscripts/check_benchmark_regression.pytests/benchmarks/__init__.pytests/benchmarks/conftest.pytests/benchmarks/test_summary_cache_bench.pytests/test_check_benchmark_regression.py
✅ Files skipped from review due to trivial changes (1)
- benchmarks/baselines.json
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/benchmarks/test_summary_cache_bench.py
- .github/workflows/tests.yml
wpak-ai
left a comment
There was a problem hiding this comment.
Update the docstrings and you're ready to go.
Closes #115
Summary
Adds a
pytest-benchmarksuite forservices/summary_cache.pywith synthetic fixtures and a CI regression gate. No production code changes.tests/benchmarks/test_summary_cache_bench.pymeasures cache hit, cache miss, fingerprint cost (10/50/200 workspace entries), and round-trip write+read latencytests/benchmarks/conftest.pybuilds synthetic workspace trees and isolates cache files to temp dirs (no real Cursor storage)benchmarks/baselines.jsonrecords initial gated means (1.5× slack from local reference run; refresh from ubuntu-latest CI artifact after first green job)scripts/check_benchmark_regression.pyfails CI when any benchmark mean exceeds baseline by >20%pyproject.toml: addpytest-benchmarkdev dep; default--benchmark-skipso normalpyteststays fast.github/workflows/tests.yml: newbenchmarksjob on ubuntu-latestSprint item #7 (June Week 4, Wednesday). Sequences before Thursday #8 (unified perf benchmark suite).
Test plan
pytest tests/benchmarks/test_summary_cache_bench.py --benchmark-only -o addopts=python scripts/check_benchmark_regression.py benchmark-results.json benchmarks/baselines.jsonpytest tests/benchmarks/test_summary_cache_bench.py -o addopts=(benchmarks skipped via--benchmark-skip)python -m unittest discover tests -qbenchmarksCI job is green on ubuntu-latestSummary by CodeRabbit
pytest-benchmarkin dev tooling and configured pytest benchmark defaults.