ci: unified benchmark suite with full baselines and regression gate by clean6378-max-it · Pull Request #121 · cppalliance/cppa-cursor-browser

clean6378-max-it · 2026-06-25T20:57:15Z

Closes #110

Summary

Unify parse, export, search, and summary-cache benchmarks under tests/benchmarks/conftest.py with synthetic corpora (no real Cursor storage)
Add test_parse_bench.py, test_export_bench.py, test_search_bench.py; extend summary-cache with merged test_summary_cache_lookup[hit|miss] and tab-summary benchmarks
Extend check_benchmark_regression.py: stale-baseline detection (<50% of baseline), EXCLUDED_FROM_GATE for noisy sub-ms benches
Add scripts/reduce_baselines.py, Makefile, benchmarks/README.md; populate benchmarks/baselines.json for all groups
CI benchmarks job runs full tests/benchmarks/ suite

Deferred from #120 (addressed)

Merge hit/miss cache benchmarks into parametrized lookup test
Stale baseline drift detection + unit tests
Tab-summary cache benchmarks
EXCLUDED_FROM_GATE for smallest parse corpus + search

Test plan

pytest tests/benchmarks/ --benchmark-only -o addopts= -v — 14 passed
pytest tests/test_check_benchmark_regression.py -o addopts= — 17 passed
python scripts/check_benchmark_regression.py benchmark-results.json benchmarks/baselines.json — green locally
Refresh benchmarks/baselines.json from ubuntu-latest CI artifact before merge
Inject >20% slowdown — confirm gate fails, then revert

Depends on #120. Closes #110 / sprint item #8.

Summary by CodeRabbit

New Features
- Added benchmark Makefile targets for seeding baselines, running the benchmark regression gate, and cleaning artifacts (with a deprecated alias for updating baselines).
- Expanded the benchmark suite to cover parse, export, search, and additional cache-lookup scenarios, including API export and PDF export checks in CI.
Bug Fixes
- Improved benchmark regression gating to also detect and fail on stale performance results.
Documentation
- Added/expanded benchmark README with local run steps and CI gate/baseline refresh guidance.
Tests
- Added synthetic benchmark fixtures plus new/updated benchmarks and regression-check coverage.
Chores
- Refreshed benchmark baselines/metadata, workflow labeling, and benchmark artifact ignore rules.

coderabbitai · 2026-06-25T20:57:28Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

The PR updates benchmark fixtures and cases, adds stale and excluded benchmark gating, and introduces baseline reduction, regression checking, documentation, CI labels, and Makefile tasks for the benchmark suite.

Changes

Benchmark suite and gating updates

Layer / File(s)	Summary
Synthetic benchmark corpus and cases `tests/benchmarks/conftest.py`, `tests/benchmarks/constants.py`, `tests/benchmarks/test_*_bench.py`	Synthetic storage fixtures support parse, export, search, and summary-cache benchmarks with updated benchmark groups and parametrization.
Regression checker stale gating `scripts/check_benchmark_regression.py`, `tests/test_check_benchmark_regression.py`	Regression checks add stale classification, excluded benchmarks, and a stale-floor CLI option, and the tests track the renamed gated benchmark.
Baseline reducer and reducer tests `scripts/reduce_baselines.py`, `tests/test_reduce_baselines.py`	The reducer CLI reads benchmark JSON, normalizes grouped baselines, and the unit tests cover grouping, notes, duplicate detection, and float validation.
Benchmark workflow, baselines, and docs `Makefile`, `benchmarks/baselines.json`, `benchmarks/README.md`, `.github/workflows/tests.yml`, `.gitignore`	The Makefile adds baseline generation, regression checking, and cleanup targets; benchmark baselines are refreshed; workflow labels change; and benchmark documentation and ignore rules are updated.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

cppalliance/cppa-cursor-browser#120: Updates the same benchmark gating and summary-cache naming area, including the checker, baselines, and benchmark suite flow.

Suggested reviewers

wpak-ai

Poem

A bunny bounced through benchmark land,
with baselines held in paw and hand.
Stale hops hush, fresh hops sing,
JSON carrots wobble in spring.
Hop-hop—our suite now hums just right.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 29.41% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main change: unifying benchmark CI with expanded baselines and a regression gate.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch ci/benchmark-suite-baselines

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

scripts/check_benchmark_regression.py (1)
169-187: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Bound the CLI ratios to the documented gate semantics.

--stale-floor currently accepts any float, so 0 disables stale detection and values >= 1 make almost every benchmark stale. Likewise, --threshold <= 1 flips the regression gate into “fail on non-regressions”. Please validate these inputs before calling check_regression().
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/check_benchmark_regression.py` around lines 169 - 187, The CLI
arguments in the main parser flow allow invalid gate ratios, so add validation
before calling check_regression() to enforce the documented semantics: keep
--threshold strictly greater than 1 and --stale-floor strictly between 0 and 1.
Update the argument handling near parser.parse_args(argv) in
check_benchmark_regression.py to reject out-of-range values with a clear error
message, using the existing args.threshold and args.stale_floor symbols.

🧹 Nitpick comments (1)

scripts/reduce_baselines.py (1)
73-79: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Keep the generated _note truthful for local seeds.

make seed-baselines-local runs this reducer on host timings, but the emitted note always says the data came from ubuntu-latest CI. The checked-in benchmarks/baselines.json already had to override that manually. Please derive the note from the actual source (or add a --source flag) so the reducer output is self-describing.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/reduce_baselines.py` around lines 73 - 79, The generated _note in
reduce_baselines.py is hardcoded to say the baselines came from ubuntu-latest
CI, which is inaccurate for local seed runs. Update the reducer logic around the
output dict construction so the note is derived from the actual input source, or
add a --source option that is threaded into the _note text. Keep the note
builder self-describing by using the existing reduce_baselines.py entrypoint and
its output assembly, rather than relying on a fixed CI-only message.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benchmarks/baselines.json`:
- Around line 2-4: The baseline snapshot is currently seeded from a local
Windows run, which does not match the gated Ubuntu benchmark source. Update the
entries in benchmarks/baselines.json using the ubuntu-latest CI
benchmark-results.json artifact before merging, and keep the existing
note/metadata aligned with that source so the gate compares like-for-like
results rather than OS-specific noise.

In `@tests/benchmarks/test_parse_bench.py`:
- Around line 26-27: The benchmark in benchmark(_run) only checks that projects
is a non-empty list, so it can still pass when parsing emits warnings. Update
the assertions around the _warnings result to require that the synthetic corpus
parses without warnings, alongside the existing projects check, so the parse
benchmark fails on new warnings from the parse path.

In `@tests/benchmarks/test_summary_cache_bench.py`:
- Around line 29-32: The benchmark setup in get_cached_projects is only timing
the lookup and never verifies the returned cache state, so add an assertion
after benchmark(...) in each benchmark helper to confirm the expected hit or
miss result. Use the existing symbols set_cached_projects, get_cached_projects,
workspace_fingerprint, and stale_fingerprint to check that the hit case returns
the seeded sample_projects and the miss case returns the empty/missing result,
so malformed cache behavior cannot slip through.

---

Outside diff comments:
In `@scripts/check_benchmark_regression.py`:
- Around line 169-187: The CLI arguments in the main parser flow allow invalid
gate ratios, so add validation before calling check_regression() to enforce the
documented semantics: keep --threshold strictly greater than 1 and --stale-floor
strictly between 0 and 1. Update the argument handling near
parser.parse_args(argv) in check_benchmark_regression.py to reject out-of-range
values with a clear error message, using the existing args.threshold and
args.stale_floor symbols.

---

Nitpick comments:
In `@scripts/reduce_baselines.py`:
- Around line 73-79: The generated _note in reduce_baselines.py is hardcoded to
say the baselines came from ubuntu-latest CI, which is inaccurate for local seed
runs. Update the reducer logic around the output dict construction so the note
is derived from the actual input source, or add a --source option that is
threaded into the _note text. Keep the note builder self-describing by using the
existing reduce_baselines.py entrypoint and its output assembly, rather than
relying on a fixed CI-only message.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: acdca232-8b23-4aa3-b046-022f0310fe57

📥 Commits

Reviewing files that changed from the base of the PR and between ecd08c6 and 5eef897.

📒 Files selected for processing (12)

.github/workflows/tests.yml
Makefile
benchmarks/README.md
benchmarks/baselines.json
scripts/check_benchmark_regression.py
scripts/reduce_baselines.py
tests/benchmarks/conftest.py
tests/benchmarks/test_export_bench.py
tests/benchmarks/test_parse_bench.py
tests/benchmarks/test_search_bench.py
tests/benchmarks/test_summary_cache_bench.py
tests/test_check_benchmark_regression.py

clean6378-max-it · 2026-06-25T21:33:15Z

@CodeRabbit, please review this PR in more depth.

coderabbitai · 2026-06-25T21:33:29Z

@clean6378-max-it I'll do a deeper pass over the PR, with extra attention on the benchmark fixtures, regression-gate logic, baselines, and CI wiring.

✅ Action performed

Full review finished.

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/check_benchmark_regression.py`:
- Around line 13-18: The exclusion list in EXCLUDED_FROM_GATE is stale and no
longer matches the benchmark timings recorded in baselines.json. Update
check_benchmark_regression.py to revisit the entries for
test_list_workspace_projects_nocache[composers-10] and test_search_full_corpus,
removing or narrowing exclusions where the benchmarks are no longer sub-ms noisy
so the main search benchmark is included in the regression gate.
- Around line 181-187: The threshold validation in the argument parsing path is
too permissive because `args.threshold <= 1` does not reject non-finite values
like `nan` or `inf`. Update the validation in `check_benchmark_regression.py`
around `parser.parse_args(argv)` so the threshold must be finite before
continuing, alongside the existing `> 1` check. Use the existing
`args.threshold` handling and `--threshold` error path to reject any non-finite
input with the same style of stderr message and exit code.

In `@scripts/reduce_baselines.py`:
- Around line 24-28: The _positive_float validator currently accepts non-finite
values like nan and inf because it only checks parsed <= 0. Update
_positive_float in reduce_baselines.py to reject any non-finite float before
returning, so argparse.ArgumentTypeError is raised for invalid --slack inputs.
Keep the existing positive-value check, and use the same validator path so
benchmarks/baselines.json only ever receives finite slack values.
- Around line 65-68: The baseline generation in reduce_baselines.py is silently
skipping entries whose group is not in GATED_GROUPS, which can hide typos or
missing group values and weaken regression gating. Update the logic around the
entry.get("group") check so unknown or absent groups fail fast instead of
continuing, and make sure the error clearly names the offending benchmark/group;
keep the existing groups[group][bench_name] assignment only for validated groups
in the reduce_baselines flow.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8b403ca3-029c-467f-9cc6-2196a70422fd

📥 Commits

Reviewing files that changed from the base of the PR and between ecd08c6 and fcda8ed.

📒 Files selected for processing (13)

.github/workflows/tests.yml
.gitignore
Makefile
benchmarks/README.md
benchmarks/baselines.json
scripts/check_benchmark_regression.py
scripts/reduce_baselines.py
tests/benchmarks/conftest.py
tests/benchmarks/test_export_bench.py
tests/benchmarks/test_parse_bench.py
tests/benchmarks/test_search_bench.py
tests/benchmarks/test_summary_cache_bench.py
tests/test_check_benchmark_regression.py

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/reduce_baselines.py`:
- Around line 67-78: The baseline aggregation in reduce_baselines.py silently
overwrites entries when normalize_benchmark_name() maps multiple raw benchmark
names to the same bench_name. Before assigning into groups[group][bench_name],
add an explicit duplicate check for the normalized key and raise
BenchmarkDataError when the same group/bench_name appears more than once so
check_regression() keeps full coverage.

In `@tests/test_reduce_baselines.py`:
- Line 64: The baseline note assertion uses the Unicode multiplication symbol,
which triggers the lint warning. Update the test in the relevant assertion to
use ASCII x instead of × so the expected note string matches the lint-safe text,
keeping the check aligned with the baseline content in the test.
- Around line 55-65: The reducer test is missing coverage for the summary-cache
benchmark, so a regression in that path could slip through. Update the
assertions in the test around reduce_baselines output to also verify the
summary-cache entry in groups, using the same pattern as the existing
parse/export/search checks and referencing the groups data returned from
reduce_baselines.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 202aeaff-4ebe-4282-9b73-81f47ad1cfb9

📥 Commits

Reviewing files that changed from the base of the PR and between fcda8ed and 87a747c.

📒 Files selected for processing (12)

.github/workflows/tests.yml
Makefile
benchmarks/README.md
benchmarks/baselines.json
scripts/check_benchmark_regression.py
scripts/reduce_baselines.py
tests/benchmarks/conftest.py
tests/benchmarks/constants.py
tests/benchmarks/test_search_bench.py
tests/benchmarks/test_summary_cache_bench.py
tests/test_check_benchmark_regression.py
tests/test_reduce_baselines.py

✅ Files skipped from review due to trivial changes (2)

benchmarks/README.md
tests/benchmarks/constants.py

🚧 Files skipped from review as they are similar to previous changes (6)

Makefile
tests/benchmarks/test_search_bench.py
benchmarks/baselines.json
tests/benchmarks/test_summary_cache_bench.py
scripts/check_benchmark_regression.py
tests/benchmarks/conftest.py

bradjin8 · 2026-06-25T22:29:02Z

+    def _run() -> object:
+        return bench_client_search_corpus.get(
+            f"/api/search?q={BENCH_SEARCH_TERM}&all_history=1",
+        )


Search benchmark sets CURSOR_CHAT_BROWSER_NO_SEARCH_INDEX=1, so it only measures the live-scan fallback. The FTS index path from #113 is never gated. Consider a second benchmark (or env toggle) for index-backed search, or document in README that this bench tracks live-scan only.

bradjin8 · 2026-06-25T22:29:34Z

Line 140-151: summary_cache_dir patches projects/composer-map files but tab summaries use CACHE_DIR + hashed filenames only. Tab benchmarks work today; add a brief comment that _tab_summaries_path relies on the CACHE_DIR monkeypatch (not a separate constant patch).

bradjin8 · 2026-06-25T22:29:50Z

+      "test_summary_cache_lookup[hit]": 9.224067718099102e-05,
+      "test_summary_cache_lookup[miss]": 9.128770315496628e-05,


test_summary_cache_lookup[hit/miss] are gated at ~92µs after 1.5× slack. Sub-millisecond benches are high-variance on shared runners even with --benchmark-min-rounds=5. Monitor for flake; if CI becomes noisy, reintroduce targeted EXCLUDED_FROM_GATE entries or raise slack for lookup-only benches.

bradjin8 · 2026-06-25T22:30:10Z

+    normalize_benchmark_name,
+)
+
+GATED_GROUPS = ("parse", "export", "search", "summary-cache")


GATED_GROUPS is a hardcoded tuple. Adding a new @pytest.mark.benchmark(group=...) requires updating this list or reduce_baselines fails at refresh time. Document in benchmarks/README.md that new groups must be added here.

bradjin8 · 2026-06-25T22:30:24Z

+    [10, 50],
+    indirect=True,
+    ids=["composers-10", "composers-50"],


Export is only parametrized at 10/50 composers; parse goes to 200. Consider composers-200 export bench if export is a growth-sensitive path, or note in README why export stops at 50 (runtime budget).

bradjin8 · 2026-06-25T22:30:46Z

No benchmark for get_cached_composer_id_to_ws / set_cached_composer_id_to_ws. Issue #110 scopes to summary cache from #7; composer-map cache is the remaining uncovered hot path in summary_cache.py.

bradjin8 · 2026-06-25T22:30:58Z

+seed-baselines-local:
+	@echo "WARNING: seed-baselines-local uses this host's timings; CI gates on ubuntu-latest." >&2
+	python -m pytest tests/benchmarks/ --benchmark-only --benchmark-json=benchmarks/_raw.json -o addopts=
+	python scripts/reduce_baselines.py benchmarks/_raw.json benchmarks/baselines.json --slack 1.5 --source local


seed-baselines-local warns on stderr but still overwrites benchmarks/baselines.json. Consider writing to benchmarks/_raw.json only unless --force, to prevent accidental commit of host-specific baselines.

test_summary_cache_round_trip calls set/get each round; OS page-cache state causes 3-5x variation between CI runs (0.000314s vs 0.001137s). Add to EXCLUDED_FROM_GATE with comment; baseline kept for observation. Regenerate baselines.json from run 28206913751 (ubuntu-latest, 1.5x slack). Update README to document the exclusion and rationale. Co-authored-by: Cursor <cursoragent@cursor.com>

ci: unified benchmark suite with full baselines and regression gate

5eef897

clean6378-max-it self-assigned this Jun 25, 2026

coderabbitai Bot reviewed Jun 25, 2026

View reviewed changes

Comment thread benchmarks/baselines.json Outdated

Comment thread tests/benchmarks/test_parse_bench.py Outdated

Comment thread tests/benchmarks/test_summary_cache_bench.py Outdated

fix(bench): address review feedback and seed ubuntu baselines

fcda8ed

coderabbitai Bot reviewed Jun 25, 2026

View reviewed changes

Comment thread scripts/check_benchmark_regression.py Outdated

Comment thread scripts/check_benchmark_regression.py Outdated

Comment thread scripts/reduce_baselines.py

Comment thread scripts/reduce_baselines.py

fix(bench): gate all baseline benchmarks and validate finite ratios

87a747c

coderabbitai Bot reviewed Jun 25, 2026

View reviewed changes

Comment thread scripts/reduce_baselines.py

Comment thread tests/test_reduce_baselines.py

Comment thread tests/test_reduce_baselines.py Outdated

fix(bench): harden reduce_baselines and fix Python 3.10 CI

29ad5a7

clean6378-max-it requested review from bradjin8 and wpak-ai June 25, 2026 22:07

clean6378-max-it marked this pull request as draft June 25, 2026 22:18

clean6378-max-it marked this pull request as ready for review June 25, 2026 22:22

clean6378-max-it removed the request for review from wpak-ai June 25, 2026 22:22

bradjin8 requested changes Jun 25, 2026

View reviewed changes

clean6378-max-it and others added 3 commits June 26, 2026 07:11

fix(bench): address bradjin8 review on search, cache, and baselines

c5c2066

chore(bench): refresh baselines from ubuntu CI run 28206463463

5a51d70

clean6378-max-it requested review from bradjin8 June 25, 2026 23:49

		"test_summary_cache_lookup[hit]": 9.224067718099102e-05,
		"test_summary_cache_lookup[miss]": 9.128770315496628e-05,

Uh oh!

Conversation

clean6378-max-it commented Jun 25, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Deferred from #120 (addressed)

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

clean6378-max-it commented Jun 25, 2026

Uh oh!

coderabbitai Bot commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

bradjin8 Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

bradjin8 Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

bradjin8 Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

bradjin8 Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

bradjin8 Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

bradjin8 Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

bradjin8 Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

clean6378-max-it commented Jun 25, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 25, 2026 •

edited

Loading

coderabbitai Bot commented Jun 25, 2026 •

edited

Loading