ci(bench): two-tier benchmarking — cheap-tier knobs + on-demand /bench-abba tiebreaker#712
Conversation
The single-session cached comparison can't beat the ~1% cross-session drift wall, so pushing either side past 5 runs buys little resolution. Set BENCH_RUNS_BASELINE=5 and clamp /bench N to 1-5 (the per-PR default stays 3 for fast feedback). When a PR shows a small time speedup (<1.5%) that the cheap CI can't confirm, the comment now suggests running /bench-abba. Also exclude /bench-abba from the regular bench trigger so it doesn't double-fire.
New issue_comment job (manual-only: a `/bench-abba` comment on a PR from a repo member; never auto-triggers) that runs the drift-free interleaved A/B/B/A paired benchmark and posts a paired-t CI + exact Wilcoxon test as a PR comment. It occupies the single self-hosted bench server for ~30-40 min, hence manual-only. Optional pair count via `/bench-abba N` (default 20; ~20 resolves 1%, 32 for 0.6%). Adds scripts/bench_abba.sh, which the job invokes to build both binaries (isolated worktree) and run the pairs.
|
/review-ai |
Codex Code ReviewFindings
No Rust/GPU/VM runtime code changed in this PR. I reviewed statically only, per the sandbox instructions. |
Review — two-tier benchmarking flowReviewed the new High
Low
Nothing blocking beyond the High item. The |
AI ReviewPR #712 · 3 changed files Findings
Status column reflects the verdict from the verifier: deepseek-verifier (openrouter/deepseek/deepseek-v4-pro). AI-001: Stale cached prover binaries silently reused when SHA no longer matches
Claim need_build only checks whether /tmp/abba_run/cli_A and cli_B exist on disk; it never compares the SHA written next to each cached binary (cli_A.sha / cli_B.sha) against the SHA that the current invocation requires (SHA_A / SHA_B). If a different PR runs /bench-abba, or main has advanced since the last run, the script will reuse the previous PR's PR-binary and an older main-binary, print only a soft "verify these match" notice, and produce a verdict that compares the wrong code. Evidence Lines 80-107: Suggested fix Compare SHA_A/SHA_B against the contents of "$WORK/cli_A.sha"/"$WORK/cli_B.sha" (with AI-002: HEAD ref resolution breaks for fork PRs — uses headRefName instead of headRefOid
Claim The workflow resolves the PR head via Evidence bench-abba.yml 'Resolve PR head + pair count' step: Suggested fix Resolve headRefOid in the workflow (as benchmark-pr.yml does) and pass the SHA to bench_abba.sh, or pin the ref before the build. At minimum, echo the resolved SHA_A/SHA_B into the posted GitHub comment so a force-push discrepancy is visible. AI-004: No upper bound on N_PAIRS in bench-abba workflow — bench-server DoS via /bench-abba N
Claim The Evidence bench-abba.yml line 55-56: Suggested fix Clamp the parsed N in the workflow step (mirror the [1,5] clamp pattern already used in benchmark-pr.yml), e.g. a small awk/bc expression that defaults to 20 when missing and otherwise floors to [1,64] (or whatever the documented max is), emitting a AI-007: Binary cache validation only warns, doesn't fail on SHA mismatch
Claim The script checks if cached binaries exist but only prints a warning if their recorded SHAs don't match the requested REFs. It proceeds with potentially stale/wrong binaries, producing silently incorrect benchmark results. Evidence Lines 80–107: Suggested fix Compare AI-008: Cargo build output fully suppressed, making build failures undebuggable
Claim The cargo build command redirects both stdout and stderr to /dev/null. If the build fails, the script exits with a generic error later when the binary is missing, but the actual compilation error is lost. Evidence Line 94 redirects all cargo output to /dev/null. The script otherwise relies on Suggested fix Drop the AI-009: CLI stderr suppressed during prove runs, losing failure diagnostics
Claim The run_prove function redirects stderr to /dev/null. If the prover crashes, OOMs, or errors, the diagnostic output is discarded and the script only reports 'could not parse Proving time'. Evidence Line 112: out="$("$1" prove "$ELF" --private-input "$INPUT" -o "$PROOF" --time 2>/dev/null)". On failure, lines 115-118 print a generic error without the actual CLI error message. Suggested fix Capture stderr separately or tee it to a log file. At minimum, on parse failure, include stderr in the error output: out="$("$1" prove ... 2>&1)" and check exit code before parsing. AI-010: bench-abba job has no timeout-minutes, unlike every other bench workflow
Claim The new job declares Evidence Diff shows the new bench-abba.yml step block ends at line 104 with Suggested fix Add AI-013: Silent git fetch failure may use stale refs
Claim git fetch origin --quiet || echo "(fetch failed; using local refs)" continues execution even if fetch fails. The script then resolves REFs against potentially stale local refs, building/benchmarking the wrong commits. Evidence Line 54: git fetch origin --quiet || echo " (fetch failed; using local refs)". Lines 55-56 then do git rev-parse on REFs that may not be updated. In CI the workflow does fetch-depth: 0, but manual runs or partial fetches could hit this. Suggested fix Make fetch failure a hard error in CI context, or at least warn loudly. For the workflow, the checkout with fetch-depth: 0 already handles this, but the script should be robust standalone. AI-021: Default REF_A is a hardcoded PR branch leftover
Claim The default first argument is hardcoded to Evidence Line 40 sets the default to Suggested fix Default to AI-024: Incremental build in shared target directory could cause subtle issues
Claim Both binaries are built in the same worktree/target directory. The second build (REF_A) is incremental on top of the first (REF_B)'s artifacts. If the two commits have different Cargo.lock or dependency versions, cargo may not correctly rebuild everything. Evidence Line 90-99: git worktree add at SHA_B, then build_cli SHA_B, then build_cli SHA_A in same $WT. The comment says 'shared target dir -> 2nd build is incremental'. Cargo usually handles this, but it's not guaranteed for all changes. Suggested fix Either use separate target directories (CARGO_TARGET_DIR) or do 'cargo clean' between builds. The performance cost is acceptable for a 30-40 minute benchmark. Reviewer Lanes
Verification Lanes
Native Codex and Claude reviews run separately and post their own comments. They are not included in this structured provenance report. Discarded candidates (6) — rejected by the verifier
Raw lane outputs, candidates, final issues, and model metrics are uploaded as workflow artifacts. |
…agnostics
Confirmed findings from the multi-model review:
- critical/high: SHA-aware binary cache — rebuild when cli_{A,B}.sha don't match
the requested SHAs (was existence-only, so a persistent /tmp on the self-hosted
runner could silently benchmark a previous PR's binaries).
- high: fork-PR head resolution — workflow now resolves headRefOid + fetches
pull/N/head and passes the SHA (origin/<branch> doesn't exist for forks).
- high: clamp /bench-abba N to [2,40] in the workflow (was unbounded -> DoS).
- high: build output -> per-binary log, surfaced on failure (was >/dev/null).
- high: prove runs capture stderr (2>&1) so prover failures are diagnosable.
- medium: add timeout-minutes: 120 so a hang can't strand the bench runner.
- medium: louder warning on git fetch failure.
- low: REF_A is now required (dropped the hardcoded PR #696 default).
- low: fail fast if python3 is missing (before the ~30-min build).
Deliberately kept: shared cargo target across the two worktree builds (incremental
2nd build; cargo recompiles on source change, REBUILD=1 covers dep changes).
|
Addressed the AI review in f6495fb. Mapping each confirmed finding to its fix:
The 6 verifier-rejected candidates I agree with — notably the "Wilcoxon double-counts the observed value" one: that's the intended conservative two-sided exact p (controls Type-I error), not a bug; and the awk-injection one can't trigger since |
Builds a two-tier PR benchmarking flow so we can trust small (~1%) prover deltas without slowing every PR.
Supersedes #710, which GitHub wouldn't reopen after a force-push orphaned its closing commit. Same branch/commits.
TL;DR
/bench): baseline runs3 → 5,/bench Ncapped at 5. Reliably catches ≥~1.5%; on a small (<1.5%) speedup it can't confirm, the comment suggests/bench-abba./bench-abba(manual only): drift-free interleaved A/B/B/A paired benchmark, posts paired-t CI + exact Wilcoxon. Occupies the bench server ~30–40 min, so it never auto-triggers.Changes
benchmark-pr.yml—BENCH_RUNS_BASELINE: 3 → 5, clamp/bench Nto1–5(per-PR default stays 3 for fast feedback); small-speedup escalation hint;/bench-abbaexcluded from the/benchgate.bench-abba.yml(new) +scripts/bench_abba.sh(new) — the on-demand tiebreaker. Member-gated, manual-only, posts a "server occupied" notice on start and the result as a PR comment./bench-abba Noverrides the default 20 pairs.Analysis — why this design
1. The noise floor (50 historical baselines, Apr–Jun)
All ran at n=3. Within-session run-to-run CV ≈ 0.5% (fib) / ~0.85% per run (ethrex). The old hard 5% gate is ~5–8× coarser than the measurements actually support.
2. Drift is the wall — and it's age-dependent
Isolating optimization-free windows from the 50 baselines: cross-session drift is ~0 same-day (May 22: 9 pushes, CV 0.30% ≈ the n=3 sampling floor of 0.28%; Jun 18 likewise) and grows to ~0.5% over ~2 weeks. So a cached baseline comparison (PR now vs baseline recorded earlier) is limited by drift, not by run count — adding runs can't beat a stale baseline. This is the core reason the cheap tier has a ceiling.
3. Why cap the cheap tier at 5
Cheap-tier reliable floor (fresh baseline): 3/3 → ~1.9%, 5/5 → ~1.5%, 10/10 → ~1.1%. Doubling 5→10 only shaves ~0.4pts and still can't cross the ~1% drift wall — not worth the per-PR latency. So we cap at 5 and let Tier 2 handle anything finer.
4. Why ABBA (pairing) for the tiebreaker
Interleaving A/B/B/A and analyzing the per-pair difference cancels machine drift (it hits both sides equally) — so the tiebreaker is immune to the drift wall that limits the cached path. Pairing is also far more powerful than an unpaired two-sample comparison.
5. Sizing & power (ethrex pair-diff sd ≈ 1.2%)
For 80% power: ~12 pairs → 1%, ~18 → 0.8%, ~32 → 0.6% (variance scales with 1/resolution², so 0.6% costs ~3× a 1% call).
Default 20 (solid on 0.8–1%); use
/bench-abba 32for ~0.6%.6. Stats: paired-t + exact Wilcoxon (and why both)
7. Validation on #696 (
perf/logup-fingerprint-constants)[-0.20%, +1.39%], Wilcoxon p≈0.15 (mean inflated by 2 outlier pairs).[+0.38%, +1.41%], exact Wilcoxon p≈0.002, 19/24 pairs favor the PR.Usage
/bench→ Tier 1 (3 runs, ~quick),/bench 5for max cheap precision./bench-abba→ Tier 2 (20 pairs, ~33 min),/bench-abba 32for ~0.6% resolution.Notes
/bench-abbagoes live after this merges to main (comment-triggered workflows + the script run from the default branch). The script also lives onbench/abba-manualfor manual server runs in the meantime.