Skip to content

ci: unified benchmark suite with full baselines and regression gate#121

Open
clean6378-max-it wants to merge 7 commits into
masterfrom
ci/benchmark-suite-baselines
Open

ci: unified benchmark suite with full baselines and regression gate#121
clean6378-max-it wants to merge 7 commits into
masterfrom
ci/benchmark-suite-baselines

Conversation

@clean6378-max-it

@clean6378-max-it clean6378-max-it commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator

Closes #110

Summary

  • Unify parse, export, search, and summary-cache benchmarks under tests/benchmarks/conftest.py with synthetic corpora (no real Cursor storage)
  • Add test_parse_bench.py, test_export_bench.py, test_search_bench.py; extend summary-cache with merged test_summary_cache_lookup[hit|miss] and tab-summary benchmarks
  • Extend check_benchmark_regression.py: stale-baseline detection (<50% of baseline), EXCLUDED_FROM_GATE for noisy sub-ms benches
  • Add scripts/reduce_baselines.py, Makefile, benchmarks/README.md; populate benchmarks/baselines.json for all groups
  • CI benchmarks job runs full tests/benchmarks/ suite

Deferred from #120 (addressed)

  • Merge hit/miss cache benchmarks into parametrized lookup test
  • Stale baseline drift detection + unit tests
  • Tab-summary cache benchmarks
  • EXCLUDED_FROM_GATE for smallest parse corpus + search

Test plan

  • pytest tests/benchmarks/ --benchmark-only -o addopts= -v — 14 passed
  • pytest tests/test_check_benchmark_regression.py -o addopts= — 17 passed
  • python scripts/check_benchmark_regression.py benchmark-results.json benchmarks/baselines.json — green locally
  • Refresh benchmarks/baselines.json from ubuntu-latest CI artifact before merge
  • Inject >20% slowdown — confirm gate fails, then revert

Depends on #120. Closes #110 / sprint item #8.

Summary by CodeRabbit

  • New Features
    • Added benchmark Makefile targets for seeding baselines, running the benchmark regression gate, and cleaning artifacts (with a deprecated alias for updating baselines).
    • Expanded the benchmark suite to cover parse, export, search, and additional cache-lookup scenarios, including API export and PDF export checks in CI.
  • Bug Fixes
    • Improved benchmark regression gating to also detect and fail on stale performance results.
  • Documentation
    • Added/expanded benchmark README with local run steps and CI gate/baseline refresh guidance.
  • Tests
    • Added synthetic benchmark fixtures plus new/updated benchmarks and regression-check coverage.
  • Chores
    • Refreshed benchmark baselines/metadata, workflow labeling, and benchmark artifact ignore rules.

@clean6378-max-it clean6378-max-it self-assigned this Jun 25, 2026
@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

The PR updates benchmark fixtures and cases, adds stale and excluded benchmark gating, and introduces baseline reduction, regression checking, documentation, CI labels, and Makefile tasks for the benchmark suite.

Changes

Benchmark suite and gating updates

Layer / File(s) Summary
Synthetic benchmark corpus and cases
tests/benchmarks/conftest.py, tests/benchmarks/constants.py, tests/benchmarks/test_*_bench.py
Synthetic storage fixtures support parse, export, search, and summary-cache benchmarks with updated benchmark groups and parametrization.
Regression checker stale gating
scripts/check_benchmark_regression.py, tests/test_check_benchmark_regression.py
Regression checks add stale classification, excluded benchmarks, and a stale-floor CLI option, and the tests track the renamed gated benchmark.
Baseline reducer and reducer tests
scripts/reduce_baselines.py, tests/test_reduce_baselines.py
The reducer CLI reads benchmark JSON, normalizes grouped baselines, and the unit tests cover grouping, notes, duplicate detection, and float validation.
Benchmark workflow, baselines, and docs
Makefile, benchmarks/baselines.json, benchmarks/README.md, .github/workflows/tests.yml, .gitignore
The Makefile adds baseline generation, regression checking, and cleanup targets; benchmark baselines are refreshed; workflow labels change; and benchmark documentation and ignore rules are updated.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • wpak-ai

Poem

A bunny bounced through benchmark land,
with baselines held in paw and hand.
Stale hops hush, fresh hops sing,
JSON carrots wobble in spring.
Hop-hop—our suite now hums just right.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 29.41% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: unifying benchmark CI with expanded baselines and a regression gate.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch ci/benchmark-suite-baselines

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scripts/check_benchmark_regression.py (1)

169-187: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Bound the CLI ratios to the documented gate semantics.

--stale-floor currently accepts any float, so 0 disables stale detection and values >= 1 make almost every benchmark stale. Likewise, --threshold <= 1 flips the regression gate into “fail on non-regressions”. Please validate these inputs before calling check_regression().

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/check_benchmark_regression.py` around lines 169 - 187, The CLI
arguments in the main parser flow allow invalid gate ratios, so add validation
before calling check_regression() to enforce the documented semantics: keep
--threshold strictly greater than 1 and --stale-floor strictly between 0 and 1.
Update the argument handling near parser.parse_args(argv) in
check_benchmark_regression.py to reject out-of-range values with a clear error
message, using the existing args.threshold and args.stale_floor symbols.
🧹 Nitpick comments (1)
scripts/reduce_baselines.py (1)

73-79: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Keep the generated _note truthful for local seeds.

make seed-baselines-local runs this reducer on host timings, but the emitted note always says the data came from ubuntu-latest CI. The checked-in benchmarks/baselines.json already had to override that manually. Please derive the note from the actual source (or add a --source flag) so the reducer output is self-describing.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/reduce_baselines.py` around lines 73 - 79, The generated _note in
reduce_baselines.py is hardcoded to say the baselines came from ubuntu-latest
CI, which is inaccurate for local seed runs. Update the reducer logic around the
output dict construction so the note is derived from the actual input source, or
add a --source option that is threaded into the _note text. Keep the note
builder self-describing by using the existing reduce_baselines.py entrypoint and
its output assembly, rather than relying on a fixed CI-only message.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benchmarks/baselines.json`:
- Around line 2-4: The baseline snapshot is currently seeded from a local
Windows run, which does not match the gated Ubuntu benchmark source. Update the
entries in benchmarks/baselines.json using the ubuntu-latest CI
benchmark-results.json artifact before merging, and keep the existing
note/metadata aligned with that source so the gate compares like-for-like
results rather than OS-specific noise.

In `@tests/benchmarks/test_parse_bench.py`:
- Around line 26-27: The benchmark in benchmark(_run) only checks that projects
is a non-empty list, so it can still pass when parsing emits warnings. Update
the assertions around the _warnings result to require that the synthetic corpus
parses without warnings, alongside the existing projects check, so the parse
benchmark fails on new warnings from the parse path.

In `@tests/benchmarks/test_summary_cache_bench.py`:
- Around line 29-32: The benchmark setup in get_cached_projects is only timing
the lookup and never verifies the returned cache state, so add an assertion
after benchmark(...) in each benchmark helper to confirm the expected hit or
miss result. Use the existing symbols set_cached_projects, get_cached_projects,
workspace_fingerprint, and stale_fingerprint to check that the hit case returns
the seeded sample_projects and the miss case returns the empty/missing result,
so malformed cache behavior cannot slip through.

---

Outside diff comments:
In `@scripts/check_benchmark_regression.py`:
- Around line 169-187: The CLI arguments in the main parser flow allow invalid
gate ratios, so add validation before calling check_regression() to enforce the
documented semantics: keep --threshold strictly greater than 1 and --stale-floor
strictly between 0 and 1. Update the argument handling near
parser.parse_args(argv) in check_benchmark_regression.py to reject out-of-range
values with a clear error message, using the existing args.threshold and
args.stale_floor symbols.

---

Nitpick comments:
In `@scripts/reduce_baselines.py`:
- Around line 73-79: The generated _note in reduce_baselines.py is hardcoded to
say the baselines came from ubuntu-latest CI, which is inaccurate for local seed
runs. Update the reducer logic around the output dict construction so the note
is derived from the actual input source, or add a --source option that is
threaded into the _note text. Keep the note builder self-describing by using the
existing reduce_baselines.py entrypoint and its output assembly, rather than
relying on a fixed CI-only message.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: acdca232-8b23-4aa3-b046-022f0310fe57

📥 Commits

Reviewing files that changed from the base of the PR and between ecd08c6 and 5eef897.

📒 Files selected for processing (12)
  • .github/workflows/tests.yml
  • Makefile
  • benchmarks/README.md
  • benchmarks/baselines.json
  • scripts/check_benchmark_regression.py
  • scripts/reduce_baselines.py
  • tests/benchmarks/conftest.py
  • tests/benchmarks/test_export_bench.py
  • tests/benchmarks/test_parse_bench.py
  • tests/benchmarks/test_search_bench.py
  • tests/benchmarks/test_summary_cache_bench.py
  • tests/test_check_benchmark_regression.py

Comment thread benchmarks/baselines.json Outdated
Comment thread tests/benchmarks/test_parse_bench.py Outdated
Comment thread tests/benchmarks/test_summary_cache_bench.py Outdated
@clean6378-max-it

Copy link
Copy Markdown
Collaborator Author

@CodeRabbit, please review this PR in more depth.

@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown

@clean6378-max-it I'll do a deeper pass over the PR, with extra attention on the benchmark fixtures, regression-gate logic, baselines, and CI wiring.

✅ Action performed

Full review finished.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/check_benchmark_regression.py`:
- Around line 13-18: The exclusion list in EXCLUDED_FROM_GATE is stale and no
longer matches the benchmark timings recorded in baselines.json. Update
check_benchmark_regression.py to revisit the entries for
test_list_workspace_projects_nocache[composers-10] and test_search_full_corpus,
removing or narrowing exclusions where the benchmarks are no longer sub-ms noisy
so the main search benchmark is included in the regression gate.
- Around line 181-187: The threshold validation in the argument parsing path is
too permissive because `args.threshold <= 1` does not reject non-finite values
like `nan` or `inf`. Update the validation in `check_benchmark_regression.py`
around `parser.parse_args(argv)` so the threshold must be finite before
continuing, alongside the existing `> 1` check. Use the existing
`args.threshold` handling and `--threshold` error path to reject any non-finite
input with the same style of stderr message and exit code.

In `@scripts/reduce_baselines.py`:
- Around line 24-28: The _positive_float validator currently accepts non-finite
values like nan and inf because it only checks parsed <= 0. Update
_positive_float in reduce_baselines.py to reject any non-finite float before
returning, so argparse.ArgumentTypeError is raised for invalid --slack inputs.
Keep the existing positive-value check, and use the same validator path so
benchmarks/baselines.json only ever receives finite slack values.
- Around line 65-68: The baseline generation in reduce_baselines.py is silently
skipping entries whose group is not in GATED_GROUPS, which can hide typos or
missing group values and weaken regression gating. Update the logic around the
entry.get("group") check so unknown or absent groups fail fast instead of
continuing, and make sure the error clearly names the offending benchmark/group;
keep the existing groups[group][bench_name] assignment only for validated groups
in the reduce_baselines flow.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8b403ca3-029c-467f-9cc6-2196a70422fd

📥 Commits

Reviewing files that changed from the base of the PR and between ecd08c6 and fcda8ed.

📒 Files selected for processing (13)
  • .github/workflows/tests.yml
  • .gitignore
  • Makefile
  • benchmarks/README.md
  • benchmarks/baselines.json
  • scripts/check_benchmark_regression.py
  • scripts/reduce_baselines.py
  • tests/benchmarks/conftest.py
  • tests/benchmarks/test_export_bench.py
  • tests/benchmarks/test_parse_bench.py
  • tests/benchmarks/test_search_bench.py
  • tests/benchmarks/test_summary_cache_bench.py
  • tests/test_check_benchmark_regression.py

Comment thread scripts/check_benchmark_regression.py Outdated
Comment thread scripts/check_benchmark_regression.py Outdated
Comment thread scripts/reduce_baselines.py
Comment thread scripts/reduce_baselines.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/reduce_baselines.py`:
- Around line 67-78: The baseline aggregation in reduce_baselines.py silently
overwrites entries when normalize_benchmark_name() maps multiple raw benchmark
names to the same bench_name. Before assigning into groups[group][bench_name],
add an explicit duplicate check for the normalized key and raise
BenchmarkDataError when the same group/bench_name appears more than once so
check_regression() keeps full coverage.

In `@tests/test_reduce_baselines.py`:
- Line 64: The baseline note assertion uses the Unicode multiplication symbol,
which triggers the lint warning. Update the test in the relevant assertion to
use ASCII x instead of × so the expected note string matches the lint-safe text,
keeping the check aligned with the baseline content in the test.
- Around line 55-65: The reducer test is missing coverage for the summary-cache
benchmark, so a regression in that path could slip through. Update the
assertions in the test around reduce_baselines output to also verify the
summary-cache entry in groups, using the same pattern as the existing
parse/export/search checks and referencing the groups data returned from
reduce_baselines.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 202aeaff-4ebe-4282-9b73-81f47ad1cfb9

📥 Commits

Reviewing files that changed from the base of the PR and between fcda8ed and 87a747c.

📒 Files selected for processing (12)
  • .github/workflows/tests.yml
  • Makefile
  • benchmarks/README.md
  • benchmarks/baselines.json
  • scripts/check_benchmark_regression.py
  • scripts/reduce_baselines.py
  • tests/benchmarks/conftest.py
  • tests/benchmarks/constants.py
  • tests/benchmarks/test_search_bench.py
  • tests/benchmarks/test_summary_cache_bench.py
  • tests/test_check_benchmark_regression.py
  • tests/test_reduce_baselines.py
✅ Files skipped from review due to trivial changes (2)
  • benchmarks/README.md
  • tests/benchmarks/constants.py
🚧 Files skipped from review as they are similar to previous changes (6)
  • Makefile
  • tests/benchmarks/test_search_bench.py
  • benchmarks/baselines.json
  • tests/benchmarks/test_summary_cache_bench.py
  • scripts/check_benchmark_regression.py
  • tests/benchmarks/conftest.py

Comment thread scripts/reduce_baselines.py
Comment thread tests/test_reduce_baselines.py
Comment thread tests/test_reduce_baselines.py Outdated
@clean6378-max-it clean6378-max-it marked this pull request as draft June 25, 2026 22:18
@clean6378-max-it clean6378-max-it marked this pull request as ready for review June 25, 2026 22:22
@clean6378-max-it clean6378-max-it removed the request for review from wpak-ai June 25, 2026 22:22
Comment thread tests/benchmarks/test_search_bench.py Outdated
Comment on lines +16 to +19
def _run() -> object:
return bench_client_search_corpus.get(
f"/api/search?q={BENCH_SEARCH_TERM}&all_history=1",
)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Search benchmark sets CURSOR_CHAT_BROWSER_NO_SEARCH_INDEX=1, so it only measures the live-scan fallback. The FTS index path from #113 is never gated. Consider a second benchmark (or env toggle) for index-backed search, or document in README that this bench tracks live-scan only.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line 140-151: summary_cache_dir patches projects/composer-map files but tab summaries use CACHE_DIR + hashed filenames only. Tab benchmarks work today; add a brief comment that _tab_summaries_path relies on the CACHE_DIR monkeypatch (not a separate constant patch).

Comment thread benchmarks/baselines.json Outdated
Comment on lines +19 to +20
"test_summary_cache_lookup[hit]": 9.224067718099102e-05,
"test_summary_cache_lookup[miss]": 9.128770315496628e-05,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_summary_cache_lookup[hit/miss] are gated at ~92µs after 1.5× slack. Sub-millisecond benches are high-variance on shared runners even with --benchmark-min-rounds=5. Monitor for flake; if CI becomes noisy, reintroduce targeted EXCLUDED_FROM_GATE entries or raise slack for lookup-only benches.

normalize_benchmark_name,
)

GATED_GROUPS = ("parse", "export", "search", "summary-cache")

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GATED_GROUPS is a hardcoded tuple. Adding a new @pytest.mark.benchmark(group=...) requires updating this list or reduce_baselines fails at refresh time. Document in benchmarks/README.md that new groups must be added here.

Comment on lines +12 to +14
[10, 50],
indirect=True,
ids=["composers-10", "composers-50"],

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Export is only parametrized at 10/50 composers; parse goes to 200. Consider composers-200 export bench if export is a growth-sensitive path, or note in README why export stops at 50 (runtime budget).

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No benchmark for get_cached_composer_id_to_ws / set_cached_composer_id_to_ws. Issue #110 scopes to summary cache from #7; composer-map cache is the remaining uncovered hot path in summary_cache.py.

Comment thread Makefile Outdated
Comment on lines +6 to +9
seed-baselines-local:
@echo "WARNING: seed-baselines-local uses this host's timings; CI gates on ubuntu-latest." >&2
python -m pytest tests/benchmarks/ --benchmark-only --benchmark-json=benchmarks/_raw.json -o addopts=
python scripts/reduce_baselines.py benchmarks/_raw.json benchmarks/baselines.json --slack 1.5 --source local

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seed-baselines-local warns on stderr but still overwrites benchmarks/baselines.json. Consider writing to benchmarks/_raw.json only unless --force, to prevent accidental commit of host-specific baselines.

clean6378-max-it and others added 3 commits June 26, 2026 07:11
test_summary_cache_round_trip calls set/get each round; OS page-cache
state causes 3-5x variation between CI runs (0.000314s vs 0.001137s).
Add to EXCLUDED_FROM_GATE with comment; baseline kept for observation.

Regenerate baselines.json from run 28206913751 (ubuntu-latest, 1.5x
slack). Update README to document the exclusion and rationale.

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

cppa-cursor-browser: Performance benchmark suite — create benchmark infra + baselines

2 participants