ci: unified benchmark suite with full baselines and regression gate#121
ci: unified benchmark suite with full baselines and regression gate#121clean6378-max-it wants to merge 7 commits into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThe PR updates benchmark fixtures and cases, adds stale and excluded benchmark gating, and introduces baseline reduction, regression checking, documentation, CI labels, and Makefile tasks for the benchmark suite. ChangesBenchmark suite and gating updates
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
scripts/check_benchmark_regression.py (1)
169-187: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick winBound the CLI ratios to the documented gate semantics.
--stale-floorcurrently accepts any float, so0disables stale detection and values>= 1make almost every benchmark stale. Likewise,--threshold <= 1flips the regression gate into “fail on non-regressions”. Please validate these inputs before callingcheck_regression().🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/check_benchmark_regression.py` around lines 169 - 187, The CLI arguments in the main parser flow allow invalid gate ratios, so add validation before calling check_regression() to enforce the documented semantics: keep --threshold strictly greater than 1 and --stale-floor strictly between 0 and 1. Update the argument handling near parser.parse_args(argv) in check_benchmark_regression.py to reject out-of-range values with a clear error message, using the existing args.threshold and args.stale_floor symbols.
🧹 Nitpick comments (1)
scripts/reduce_baselines.py (1)
73-79: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winKeep the generated
_notetruthful for local seeds.
make seed-baselines-localruns this reducer on host timings, but the emitted note always says the data came fromubuntu-latestCI. The checked-inbenchmarks/baselines.jsonalready had to override that manually. Please derive the note from the actual source (or add a--sourceflag) so the reducer output is self-describing.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/reduce_baselines.py` around lines 73 - 79, The generated _note in reduce_baselines.py is hardcoded to say the baselines came from ubuntu-latest CI, which is inaccurate for local seed runs. Update the reducer logic around the output dict construction so the note is derived from the actual input source, or add a --source option that is threaded into the _note text. Keep the note builder self-describing by using the existing reduce_baselines.py entrypoint and its output assembly, rather than relying on a fixed CI-only message.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@benchmarks/baselines.json`:
- Around line 2-4: The baseline snapshot is currently seeded from a local
Windows run, which does not match the gated Ubuntu benchmark source. Update the
entries in benchmarks/baselines.json using the ubuntu-latest CI
benchmark-results.json artifact before merging, and keep the existing
note/metadata aligned with that source so the gate compares like-for-like
results rather than OS-specific noise.
In `@tests/benchmarks/test_parse_bench.py`:
- Around line 26-27: The benchmark in benchmark(_run) only checks that projects
is a non-empty list, so it can still pass when parsing emits warnings. Update
the assertions around the _warnings result to require that the synthetic corpus
parses without warnings, alongside the existing projects check, so the parse
benchmark fails on new warnings from the parse path.
In `@tests/benchmarks/test_summary_cache_bench.py`:
- Around line 29-32: The benchmark setup in get_cached_projects is only timing
the lookup and never verifies the returned cache state, so add an assertion
after benchmark(...) in each benchmark helper to confirm the expected hit or
miss result. Use the existing symbols set_cached_projects, get_cached_projects,
workspace_fingerprint, and stale_fingerprint to check that the hit case returns
the seeded sample_projects and the miss case returns the empty/missing result,
so malformed cache behavior cannot slip through.
---
Outside diff comments:
In `@scripts/check_benchmark_regression.py`:
- Around line 169-187: The CLI arguments in the main parser flow allow invalid
gate ratios, so add validation before calling check_regression() to enforce the
documented semantics: keep --threshold strictly greater than 1 and --stale-floor
strictly between 0 and 1. Update the argument handling near
parser.parse_args(argv) in check_benchmark_regression.py to reject out-of-range
values with a clear error message, using the existing args.threshold and
args.stale_floor symbols.
---
Nitpick comments:
In `@scripts/reduce_baselines.py`:
- Around line 73-79: The generated _note in reduce_baselines.py is hardcoded to
say the baselines came from ubuntu-latest CI, which is inaccurate for local seed
runs. Update the reducer logic around the output dict construction so the note
is derived from the actual input source, or add a --source option that is
threaded into the _note text. Keep the note builder self-describing by using the
existing reduce_baselines.py entrypoint and its output assembly, rather than
relying on a fixed CI-only message.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: acdca232-8b23-4aa3-b046-022f0310fe57
📒 Files selected for processing (12)
.github/workflows/tests.ymlMakefilebenchmarks/README.mdbenchmarks/baselines.jsonscripts/check_benchmark_regression.pyscripts/reduce_baselines.pytests/benchmarks/conftest.pytests/benchmarks/test_export_bench.pytests/benchmarks/test_parse_bench.pytests/benchmarks/test_search_bench.pytests/benchmarks/test_summary_cache_bench.pytests/test_check_benchmark_regression.py
|
@CodeRabbit, please review this PR in more depth. |
|
✅ Action performedFull review finished. |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/check_benchmark_regression.py`:
- Around line 13-18: The exclusion list in EXCLUDED_FROM_GATE is stale and no
longer matches the benchmark timings recorded in baselines.json. Update
check_benchmark_regression.py to revisit the entries for
test_list_workspace_projects_nocache[composers-10] and test_search_full_corpus,
removing or narrowing exclusions where the benchmarks are no longer sub-ms noisy
so the main search benchmark is included in the regression gate.
- Around line 181-187: The threshold validation in the argument parsing path is
too permissive because `args.threshold <= 1` does not reject non-finite values
like `nan` or `inf`. Update the validation in `check_benchmark_regression.py`
around `parser.parse_args(argv)` so the threshold must be finite before
continuing, alongside the existing `> 1` check. Use the existing
`args.threshold` handling and `--threshold` error path to reject any non-finite
input with the same style of stderr message and exit code.
In `@scripts/reduce_baselines.py`:
- Around line 24-28: The _positive_float validator currently accepts non-finite
values like nan and inf because it only checks parsed <= 0. Update
_positive_float in reduce_baselines.py to reject any non-finite float before
returning, so argparse.ArgumentTypeError is raised for invalid --slack inputs.
Keep the existing positive-value check, and use the same validator path so
benchmarks/baselines.json only ever receives finite slack values.
- Around line 65-68: The baseline generation in reduce_baselines.py is silently
skipping entries whose group is not in GATED_GROUPS, which can hide typos or
missing group values and weaken regression gating. Update the logic around the
entry.get("group") check so unknown or absent groups fail fast instead of
continuing, and make sure the error clearly names the offending benchmark/group;
keep the existing groups[group][bench_name] assignment only for validated groups
in the reduce_baselines flow.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8b403ca3-029c-467f-9cc6-2196a70422fd
📒 Files selected for processing (13)
.github/workflows/tests.yml.gitignoreMakefilebenchmarks/README.mdbenchmarks/baselines.jsonscripts/check_benchmark_regression.pyscripts/reduce_baselines.pytests/benchmarks/conftest.pytests/benchmarks/test_export_bench.pytests/benchmarks/test_parse_bench.pytests/benchmarks/test_search_bench.pytests/benchmarks/test_summary_cache_bench.pytests/test_check_benchmark_regression.py
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/reduce_baselines.py`:
- Around line 67-78: The baseline aggregation in reduce_baselines.py silently
overwrites entries when normalize_benchmark_name() maps multiple raw benchmark
names to the same bench_name. Before assigning into groups[group][bench_name],
add an explicit duplicate check for the normalized key and raise
BenchmarkDataError when the same group/bench_name appears more than once so
check_regression() keeps full coverage.
In `@tests/test_reduce_baselines.py`:
- Line 64: The baseline note assertion uses the Unicode multiplication symbol,
which triggers the lint warning. Update the test in the relevant assertion to
use ASCII x instead of × so the expected note string matches the lint-safe text,
keeping the check aligned with the baseline content in the test.
- Around line 55-65: The reducer test is missing coverage for the summary-cache
benchmark, so a regression in that path could slip through. Update the
assertions in the test around reduce_baselines output to also verify the
summary-cache entry in groups, using the same pattern as the existing
parse/export/search checks and referencing the groups data returned from
reduce_baselines.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 202aeaff-4ebe-4282-9b73-81f47ad1cfb9
📒 Files selected for processing (12)
.github/workflows/tests.ymlMakefilebenchmarks/README.mdbenchmarks/baselines.jsonscripts/check_benchmark_regression.pyscripts/reduce_baselines.pytests/benchmarks/conftest.pytests/benchmarks/constants.pytests/benchmarks/test_search_bench.pytests/benchmarks/test_summary_cache_bench.pytests/test_check_benchmark_regression.pytests/test_reduce_baselines.py
✅ Files skipped from review due to trivial changes (2)
- benchmarks/README.md
- tests/benchmarks/constants.py
🚧 Files skipped from review as they are similar to previous changes (6)
- Makefile
- tests/benchmarks/test_search_bench.py
- benchmarks/baselines.json
- tests/benchmarks/test_summary_cache_bench.py
- scripts/check_benchmark_regression.py
- tests/benchmarks/conftest.py
| def _run() -> object: | ||
| return bench_client_search_corpus.get( | ||
| f"/api/search?q={BENCH_SEARCH_TERM}&all_history=1", | ||
| ) |
There was a problem hiding this comment.
Search benchmark sets CURSOR_CHAT_BROWSER_NO_SEARCH_INDEX=1, so it only measures the live-scan fallback. The FTS index path from #113 is never gated. Consider a second benchmark (or env toggle) for index-backed search, or document in README that this bench tracks live-scan only.
There was a problem hiding this comment.
Line 140-151: summary_cache_dir patches projects/composer-map files but tab summaries use CACHE_DIR + hashed filenames only. Tab benchmarks work today; add a brief comment that _tab_summaries_path relies on the CACHE_DIR monkeypatch (not a separate constant patch).
| "test_summary_cache_lookup[hit]": 9.224067718099102e-05, | ||
| "test_summary_cache_lookup[miss]": 9.128770315496628e-05, |
There was a problem hiding this comment.
test_summary_cache_lookup[hit/miss] are gated at ~92µs after 1.5× slack. Sub-millisecond benches are high-variance on shared runners even with --benchmark-min-rounds=5. Monitor for flake; if CI becomes noisy, reintroduce targeted EXCLUDED_FROM_GATE entries or raise slack for lookup-only benches.
| normalize_benchmark_name, | ||
| ) | ||
|
|
||
| GATED_GROUPS = ("parse", "export", "search", "summary-cache") |
There was a problem hiding this comment.
GATED_GROUPS is a hardcoded tuple. Adding a new @pytest.mark.benchmark(group=...) requires updating this list or reduce_baselines fails at refresh time. Document in benchmarks/README.md that new groups must be added here.
| [10, 50], | ||
| indirect=True, | ||
| ids=["composers-10", "composers-50"], |
There was a problem hiding this comment.
Export is only parametrized at 10/50 composers; parse goes to 200. Consider composers-200 export bench if export is a growth-sensitive path, or note in README why export stops at 50 (runtime budget).
| seed-baselines-local: | ||
| @echo "WARNING: seed-baselines-local uses this host's timings; CI gates on ubuntu-latest." >&2 | ||
| python -m pytest tests/benchmarks/ --benchmark-only --benchmark-json=benchmarks/_raw.json -o addopts= | ||
| python scripts/reduce_baselines.py benchmarks/_raw.json benchmarks/baselines.json --slack 1.5 --source local |
There was a problem hiding this comment.
seed-baselines-local warns on stderr but still overwrites benchmarks/baselines.json. Consider writing to benchmarks/_raw.json only unless --force, to prevent accidental commit of host-specific baselines.
test_summary_cache_round_trip calls set/get each round; OS page-cache state causes 3-5x variation between CI runs (0.000314s vs 0.001137s). Add to EXCLUDED_FROM_GATE with comment; baseline kept for observation. Regenerate baselines.json from run 28206913751 (ubuntu-latest, 1.5x slack). Update README to document the exclusion and rationale. Co-authored-by: Cursor <cursoragent@cursor.com>
Closes #110
Summary
tests/benchmarks/conftest.pywith synthetic corpora (no real Cursor storage)test_parse_bench.py,test_export_bench.py,test_search_bench.py; extend summary-cache with mergedtest_summary_cache_lookup[hit|miss]and tab-summary benchmarkscheck_benchmark_regression.py: stale-baseline detection (<50% of baseline),EXCLUDED_FROM_GATEfor noisy sub-ms benchesscripts/reduce_baselines.py,Makefile,benchmarks/README.md; populatebenchmarks/baselines.jsonfor all groupstests/benchmarks/suiteDeferred from #120 (addressed)
Test plan
pytest tests/benchmarks/ --benchmark-only -o addopts= -v— 14 passedpytest tests/test_check_benchmark_regression.py -o addopts=— 17 passedpython scripts/check_benchmark_regression.py benchmark-results.json benchmarks/baselines.json— green locallybenchmarks/baselines.jsonfrom ubuntu-latest CI artifact before mergeDepends on #120. Closes #110 / sprint item #8.
Summary by CodeRabbit