Skip to content

test: add summary-cache pytest-benchmark suite with CI regression gate#120

Merged
clean6378-max-it merged 7 commits into
masterfrom
test/summary-cache-benchmarks
Jun 25, 2026
Merged

test: add summary-cache pytest-benchmark suite with CI regression gate#120
clean6378-max-it merged 7 commits into
masterfrom
test/summary-cache-benchmarks

Conversation

@clean6378-max-it

@clean6378-max-it clean6378-max-it commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

Closes #115

Summary

Adds a pytest-benchmark suite for services/summary_cache.py with synthetic fixtures and a CI regression gate. No production code changes.

  • New tests/benchmarks/test_summary_cache_bench.py measures cache hit, cache miss, fingerprint cost (10/50/200 workspace entries), and round-trip write+read latency
  • New tests/benchmarks/conftest.py builds synthetic workspace trees and isolates cache files to temp dirs (no real Cursor storage)
  • benchmarks/baselines.json records initial gated means (1.5× slack from local reference run; refresh from ubuntu-latest CI artifact after first green job)
  • scripts/check_benchmark_regression.py fails CI when any benchmark mean exceeds baseline by >20%
  • pyproject.toml: add pytest-benchmark dev dep; default --benchmark-skip so normal pytest stays fast
  • .github/workflows/tests.yml: new benchmarks job on ubuntu-latest

Sprint item #7 (June Week 4, Wednesday). Sequences before Thursday #8 (unified perf benchmark suite).

Test plan

  • pytest tests/benchmarks/test_summary_cache_bench.py --benchmark-only -o addopts=
  • python scripts/check_benchmark_regression.py benchmark-results.json benchmarks/baselines.json
  • pytest tests/benchmarks/test_summary_cache_bench.py -o addopts= (benchmarks skipped via --benchmark-skip)
  • python -m unittest discover tests -q
  • Confirm benchmarks CI job is green on ubuntu-latest

Summary by CodeRabbit

  • New Features
    • Added CI performance benchmarking for summary-cache (cache hit/miss, workspace fingerprinting, and cache round-trips).
  • Bug Fixes
    • Updated test execution to avoid inheriting unrelated pytest options.
  • Chores
    • Introduced a benchmark regression gate with updated baselines and CI artifact handling; refreshed benchmark baseline data and ignored generated benchmark outputs.
    • Enabled pytest-benchmark in dev tooling and configured pytest benchmark defaults.
  • Tests
    • Added new benchmark test coverage and a dedicated unit test suite for the regression gate logic.

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds summary-cache benchmarks, refreshed baselines, a regression-check script, and CI workflow steps to run, validate, and upload benchmark results.

Changes

Summary Cache Benchmark Validation

Layer / File(s) Summary
Benchmark config and fixtures
pyproject.toml, requirements-lock.txt, tests/benchmarks/conftest.py
Adds pytest-benchmark to development dependencies, configures pytest benchmark defaults and marker registration, and adds synthetic workspace fixtures for summary-cache benchmarks.
Benchmark cases
tests/benchmarks/test_summary_cache_bench.py
Adds benchmark cases for cache hits, cache misses, workspace fingerprinting at 10/50/200 entries, and a set-then-get round trip.
Regression script and baselines
benchmarks/baselines.json, scripts/check_benchmark_regression.py, tests/test_check_benchmark_regression.py
Refreshes stored benchmark timings and adds a CLI that loads benchmark results, compares means to stored baselines, and reports regressions or malformed inputs, with test coverage for comparison and validation paths.
CI benchmark job
.github/workflows/tests.yml, .gitignore
Adds a workflow job that runs the benchmarks, checks regression output, uploads benchmark-results.json, and ignores generated benchmark artifacts.

Sequence Diagram(s)

sequenceDiagram
  participant benchmarks_job as benchmarks job
  participant pytest as pytest
  participant regression_script as scripts/check_benchmark_regression.py
  participant upload_artifact as actions/upload-artifact

  benchmarks_job->>pytest: run tests/benchmarks/ with --benchmark-only
  pytest-->>benchmarks_job: benchmark-results.json
  benchmarks_job->>regression_script: compare benchmark-results.json to benchmarks/baselines.json
  regression_script-->>benchmarks_job: exit code 0, 1, or 2
  benchmarks_job->>upload_artifact: upload benchmark-results.json with if: always()
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • wpak-ai
  • bradjin8

Poem

(_/)
A rabbit hopped through benchmark snow,
With cache hits warm and misses slow.
Baselines glowed in CI light,
And regressions stayed out of sight.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 18.18% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: adding a summary-cache benchmark suite and CI regression gate.
Linked Issues check ✅ Passed The PR matches the issue goals: synthetic summary-cache benchmarks, recorded baselines, and a CI gate for >20% regressions.
Out of Scope Changes check ✅ Passed The changes are focused on benchmark validation and CI support; no unrelated feature or production code changes stand out.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch test/summary-cache-benchmarks

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
scripts/check_benchmark_regression.py (1)

95-99: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win

A missing benchmark result silently passes the gate.

When a baseline benchmark has no current result, the script only prints a WARN and continues. If a gated benchmark is renamed, removed, or fails to run, the regression gate will report success and the performance coverage silently disappears. Consider treating a missing gated benchmark as a failure (or a separate exit code) so coverage gaps are surfaced.

♻️ Possible approach
     failures: list[str] = []
+    missing: list[str] = []
     for name, base in baseline_means.items():
         cur = flat.get(name)
         if cur is None:
-            print(f"WARN: no current result for baseline {name!r}; skipping")
+            print(f"MISSING: no current result for baseline {name!r}")
+            missing.append(name)
             continue

Then fail (or warn-as-error) when missing is non-empty, per your policy.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/check_benchmark_regression.py` around lines 95 - 99, The benchmark
gating in check_benchmark_regression.py currently skips missing current results
in the baseline loop, which can let renamed, removed, or failing benchmarks pass
unnoticed. Update the logic around the baseline_means/flat comparison so missing
gated benchmarks are collected and treated as a failure (or a distinct non-zero
exit condition) instead of only printing a warning. Make the behavior explicit
in the final exit path of the script so coverage gaps are surfaced alongside the
existing regression checks.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benchmarks/baselines.json`:
- Around line 1-15: The benchmark baselines in the summary-cache group are still
using Windows-derived values with 1.5x slack, but the regression gate runs on
ubuntu-latest. Refresh the values from a green ubuntu-latest benchmark artifact
and keep the _note consistent with the source platform; if platform-matched
baselines are not yet available, make the gate non-blocking until they are.
Focus on the baselines.json entries under groups.summary-cache so the
check_benchmark_regression.py comparisons are against the same OS.

In `@pyproject.toml`:
- Around line 39-44: The pytest configuration currently applies --benchmark-skip
globally in [tool.pytest.ini_options], which breaks the regular integration
pytest run because pytest-benchmark is not installed there. Update the test
setup so the benchmark plugin is available wherever this addopts setting is
used, or move the --benchmark-skip option to only the benchmark-specific job;
check the pytest configuration and the workflows that invoke python -m pytest,
especially the benchmark-related markers and options.

---

Nitpick comments:
In `@scripts/check_benchmark_regression.py`:
- Around line 95-99: The benchmark gating in check_benchmark_regression.py
currently skips missing current results in the baseline loop, which can let
renamed, removed, or failing benchmarks pass unnoticed. Update the logic around
the baseline_means/flat comparison so missing gated benchmarks are collected and
treated as a failure (or a distinct non-zero exit condition) instead of only
printing a warning. Make the behavior explicit in the final exit path of the
script so coverage gaps are surfaced alongside the existing regression checks.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1cc8514e-6127-42e8-95a6-134382f2ed25

📥 Commits

Reviewing files that changed from the base of the PR and between 793b6f2 and c126227.

📒 Files selected for processing (7)
  • .github/workflows/tests.yml
  • .gitignore
  • benchmarks/baselines.json
  • pyproject.toml
  • scripts/check_benchmark_regression.py
  • tests/benchmarks/conftest.py
  • tests/benchmarks/test_summary_cache_bench.py

Comment thread benchmarks/baselines.json
Comment thread pyproject.toml
@clean6378-max-it

Copy link
Copy Markdown
Collaborator Author

@CodeRabbit, review this PR in more depth

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
echo "=== .github/workflows/tests.yml ==="
cat -n .github/workflows/tests.yml

echo ""
echo "=== pyproject.toml ==="
cat -n pyproject.toml

echo ""
echo "=== scripts/check_benchmark_regression.py ==="
cat -n scripts/check_benchmark_regression.py

echo ""
echo "=== tests/benchmarks/conftest.py ==="
cat -n tests/benchmarks/conftest.py

echo ""
echo "=== tests/benchmarks/test_summary_cache_bench.py ==="
cat -n tests/benchmarks/test_summary_cache_bench.py

echo ""
echo "=== benchmarks/baselines.json ==="
cat -n benchmarks/baselines.json

echo ""
echo "=== .gitignore (last 20 lines) ==="
tail -20 .gitignore

echo ""
echo "=== requirements-lock.txt ==="
cat -n requirements-lock.txt

echo ""
echo "=== services/summary_cache.py ==="
cat -n services/summary_cache.py

echo ""
echo "=== Directory tree of tests/benchmarks/ ==="
tree tests/benchmarks/ 2>/dev/null || find tests/benchmarks/ -type f

echo ""
echo "=== Directory tree of scripts/ ==="
tree scripts/ 2>/dev/null || find scripts/ -type f

echo ""
echo "=== benchmarks/ directory ==="
tree benchmarks/ 2>/dev/null || find benchmarks/ -type f

Length of output: 44058



@clean6378-max-it Here is a deep review of PR #120.


Deep Review: PR #120 — pytest-benchmark suite for summary_cache.py

The PR is well-structured overall — no production code is touched, the fixture isolation via monkeypatch is correct, and the regression gate script has solid error-handling. However, there are several issues ranging from critical to minor that should be addressed before merge.


🔴 Critical

1. Baseline name format may never match pytest-benchmark output (benchmarks/baselines.json + scripts/check_benchmark_regression.py)

The regression gate in check_regression() (Line 96) looks up current results by name from the baselines. The baselines use short names like test_summary_cache_hit and test_fingerprint_workspace_entries[10].

pytest-benchmark emits the name field as the short test node name (e.g. test_summary_cache_hit) in some configurations, but in others it emits the full node ID including the module path (tests/benchmarks/test_summary_cache_bench.py::test_summary_cache_hit). If the format doesn't match, every baseline entry will show as MISSING and the gate will always fail (exit code 1) — or always pass if the lookup silently misses.

You should verify the exact format by inspecting an actual benchmark-results.json produced by the CI job and cross-referencing the name field against the keys in baselines.json. Consider adding a normalization step in load_results() that strips the module prefix:

# In load_results(), after extracting `name`:
# Normalize "tests/benchmarks/test_x.py::test_foo[param]" → "test_foo[param]"
name = str(name).rsplit("::", 1)[-1]

2. Baselines appear manually fabricated — future-dated and unverified (benchmarks/baselines.json)

"updated": "2026-06-25T12:00:00Z",

The timestamp is in the future and the _note claims these values come from "ubuntu-latest CI benchmark-results.json", but there is no CI artifact proving that. Values like 6.3e-05 for both cache hit and cache miss (identical to 6 decimal places) suggest they were hand-written, not measured. If these baselines are too tight for a slower runner, the gate will spuriously fail; if too loose, the gate provides no real protection.

Recommendation: Do not commit hand-crafted baselines. Run the benchmarks in CI first, capture the artifact, update the file from the actual numbers, and commit that as a follow-up.

3. benchmarks CI job has no needs: dependency on unittest (.github/workflows/tests.yml Line 219)

The benchmark job runs in parallel with unit tests. If unit tests fail due to a production code regression, benchmarks can still be reported as "passing" — giving a misleading green signal.

# Add this to the benchmarks job:
needs: [unittest]

🟡 Medium

4. addopts = "--benchmark-disable" is missing from pyproject.toml (Line 39–43)

The PR objectives state: "adds default benchmark skipping for normal test runs." But the [tool.pytest.ini_options] section has no addopts at all:

[tool.pytest.ini_options]
testpaths = ["tests"]
markers = [
    "benchmark: performance benchmarks (pytest-benchmark)",
]

Since testpaths includes tests/, tests/benchmarks/ is collected on every plain pytest invocation. Without --benchmark-disable, pytest-benchmark will execute benchmark functions normally (one call each, with timing overhead). This slows down the developer loop and contradicts the stated goal.

[tool.pytest.ini_options]
testpaths = ["tests"]
addopts = "--benchmark-disable"
markers = [
    "benchmark: performance benchmarks (pytest-benchmark)",
]

Note: the CI benchmark job already overrides this with -o addopts=, so adding it won't break CI.

5. sys.path manipulation in conftest.py is fragile (Lines 12–14)

REPO_ROOT = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
if REPO_ROOT not in sys.path:
    sys.path.insert(0, REPO_ROOT)

This is three levels of dirname that breaks silently if the file moves. The idiomatic solution is to install the package in editable mode in CI (pip install -e .) and remove the manual path insertion entirely. Alternatively, the [tool.pytest.ini_options] section can set pythonpath = ["."] (pytest ≥ 7.0):

[tool.pytest.ini_options]
pythonpath = ["."]

6. pytest-benchmark not pinned in requirements-lock.txt (.github/workflows/tests.yml Line 236)

python -m pip install 'pytest>=8,<9' 'pytest-benchmark>=4,<5'

All other runtime deps are resolved from requirements-lock.txt for determinism, but pytest-benchmark is installed ad-hoc. A transitive dep (e.g. py-cpuinfo, attrs) could change versions silently. pytest-benchmark should be pinned in the lock file, or at minimum pinned to a specific patch version in the workflow.

7. stale_fingerprint relies on an undocumented assumption (tests/benchmarks/conftest.py Line 88–89)

`@pytest.fixture`
def stale_fingerprint(workspace_fingerprint: dict[str, Any]) -> dict[str, Any]:
    return {**workspace_fingerprint, "global_db_mtime_ns": 9_999_999_999}

This works because global_db_path=None causes _file_mtime_ns(None) to return None, making the real fingerprint's global_db_mtime_ns equal to None. Setting it to 9_999_999_999 creates a mismatch. But this is completely implicit — if fingerprint_workspace_storage ever changes its handling of None paths, this fixture silently stops simulating a miss.

A more robust approach is to use a sentinel that always differs:

`@pytest.fixture`
def stale_fingerprint(workspace_fingerprint: dict[str, Any]) -> dict[str, Any]:
    """Return a fingerprint guaranteed to differ from the stored one."""
    return {**workspace_fingerprint, "rules_digest": "deadbeefdeadbeef"}

This works regardless of what global_db_mtime_ns is.


🔵 Minor

8. Missing tests/benchmarks/__init__.py

tests/benchmarks/
├── conftest.py
└── test_summary_cache_bench.py

The directory has no __init__.py. While modern pytest handles this correctly, the rest of tests/ likely uses __init__.py files (consistent with python -m unittest discover). Adding one avoids potential namespace collisions if identically-named modules exist elsewhere.

9. load_baseline_means silently skips non-dict group values (scripts/check_benchmark_regression.py Line 69)

if not isinstance(value, dict):
    continue  # silently skips

A typo in baselines.json (e.g. a string value for a group) is silently ignored, which could cause a group's benchmarks to never be gated. Consider issuing a warning or raising BenchmarkDataError.

10. Redundant str() cast in load_results (scripts/check_benchmark_regression.py Line 43)

name = str(name)  # name was already set from entry["name"] above

name is set on Line 37 directly from entry["name"], which would have been accepted as str already. The redundant cast can be removed.

11. Missing blank line before first test function (tests/benchmarks/test_summary_cache_bench.py Line 16)

PEP 8 requires two blank lines between the import block and the first top-level definition:

from services.summary_cache import (
    fingerprint_workspace_storage,
    get_cached_projects,
    set_cached_projects,
)
                                         # ← two blank lines needed here
`@pytest.mark.benchmark`(group="summary-cache")
def test_summary_cache_hit(...):

12. No unit tests for check_benchmark_regression.py

The regression gate script has good internal logic (load_results, load_baseline_means, check_regression) but zero test coverage. Given that this script is the only thing blocking regressions in CI, a small test file (e.g. tests/test_check_benchmark_regression.py) covering the happy path, missing-benchmark, and threshold-exceeded cases would greatly increase confidence.


Summary table

Severity Issue File
🔴 Critical Baseline name format may mismatch pytest-benchmark output baselines.json, check_benchmark_regression.py
🔴 Critical Baselines are future-dated and appear manually fabricated baselines.json
🔴 Critical benchmarks job has no needs: [unittest] tests.yml
🟡 Medium Missing addopts = "--benchmark-disable" pyproject.toml
🟡 Medium Fragile sys.path manipulation conftest.py
🟡 Medium pytest-benchmark not pinned in lock file tests.yml, requirements-lock.txt
🟡 Medium stale_fingerprint relies on implicit None assumption conftest.py
🔵 Minor Missing __init__.py in tests/benchmarks/
🔵 Minor Silent skip of non-dict group values check_benchmark_regression.py
🔵 Minor Redundant str() cast check_benchmark_regression.py
🔵 Minor Missing blank line after imports test_summary_cache_bench.py
🔵 Minor No unit tests for the regression gate script

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/check_benchmark_regression.py`:
- Around line 17-19: The normalize_benchmark_name helper currently uses
rsplit("::", 1), which can incorrectly strip parts of parametrized benchmark
names when parameter values include "::". Update normalize_benchmark_name to
split on the first "::" only so pytest node prefixes are removed without
truncating names like test_x[param::v], and keep the behavior consistent for
both short and full benchmark names.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c745c7f7-791c-4802-9b4a-d7dc19639498

📥 Commits

Reviewing files that changed from the base of the PR and between e4d79a6 and 83bb248.

📒 Files selected for processing (8)
  • .github/workflows/tests.yml
  • benchmarks/baselines.json
  • pyproject.toml
  • scripts/check_benchmark_regression.py
  • tests/benchmarks/__init__.py
  • tests/benchmarks/conftest.py
  • tests/benchmarks/test_summary_cache_bench.py
  • tests/test_check_benchmark_regression.py
✅ Files skipped from review due to trivial changes (1)
  • benchmarks/baselines.json
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/benchmarks/test_summary_cache_bench.py
  • .github/workflows/tests.yml

Comment thread scripts/check_benchmark_regression.py Outdated
Comment thread .github/workflows/tests.yml Outdated
Comment thread requirements-lock.txt
Comment thread pyproject.toml Outdated
Comment thread pyproject.toml
Comment thread tests/benchmarks/test_summary_cache_bench.py
Comment thread tests/benchmarks/conftest.py
Comment thread benchmarks/baselines.json Outdated
Comment thread .github/workflows/tests.yml
Comment thread scripts/check_benchmark_regression.py
@bradjin8 bradjin8 requested a review from wpak-ai June 25, 2026 18:34
@cppalliance cppalliance deleted a comment from coderabbitai Bot Jun 25, 2026

@wpak-ai wpak-ai left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update the docstrings and you're ready to go.

Comment thread .github/workflows/tests.yml Outdated
Comment thread tests/benchmarks/conftest.py Outdated
@clean6378-max-it clean6378-max-it merged commit ecd08c6 into master Jun 25, 2026
17 checks passed
clean6378-max-it added a commit that referenced this pull request Jun 25, 2026
@clean6378-max-it clean6378-max-it deleted the test/summary-cache-benchmarks branch June 25, 2026 19:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

cppa-cursor-browser: Summary cache benchmark validation

3 participants