Skip to content

ci(bench): two-tier benchmarking — cheap-tier knobs + on-demand /bench-abba tiebreaker#712

Merged
MauroToscano merged 4 commits into
mainfrom
bench/baseline-runs-10
Jun 26, 2026
Merged

ci(bench): two-tier benchmarking — cheap-tier knobs + on-demand /bench-abba tiebreaker#712
MauroToscano merged 4 commits into
mainfrom
bench/baseline-runs-10

Conversation

@MauroToscano

Copy link
Copy Markdown
Contributor

Builds a two-tier PR benchmarking flow so we can trust small (~1%) prover deltas without slowing every PR.

Supersedes #710, which GitHub wouldn't reopen after a force-push orphaned its closing commit. Same branch/commits.

TL;DR

  • Tier 1 — cheap screen (auto on /bench): baseline runs 3 → 5, /bench N capped at 5. Reliably catches ≥~1.5%; on a small (<1.5%) speedup it can't confirm, the comment suggests /bench-abba.
  • Tier 2 — /bench-abba (manual only): drift-free interleaved A/B/B/A paired benchmark, posts paired-t CI + exact Wilcoxon. Occupies the bench server ~30–40 min, so it never auto-triggers.

Changes

benchmark-pr.ymlBENCH_RUNS_BASELINE: 3 → 5, clamp /bench N to 1–5 (per-PR default stays 3 for fast feedback); small-speedup escalation hint; /bench-abba excluded from the /bench gate.
bench-abba.yml (new) + scripts/bench_abba.sh (new) — the on-demand tiebreaker. Member-gated, manual-only, posts a "server occupied" notice on start and the result as a PR comment. /bench-abba N overrides the default 20 pairs.

Analysis — why this design

1. The noise floor (50 historical baselines, Apr–Jun)

All ran at n=3. Within-session run-to-run CV ≈ 0.5% (fib) / ~0.85% per run (ethrex). The old hard 5% gate is ~5–8× coarser than the measurements actually support.

2. Drift is the wall — and it's age-dependent

Isolating optimization-free windows from the 50 baselines: cross-session drift is ~0 same-day (May 22: 9 pushes, CV 0.30% ≈ the n=3 sampling floor of 0.28%; Jun 18 likewise) and grows to ~0.5% over ~2 weeks. So a cached baseline comparison (PR now vs baseline recorded earlier) is limited by drift, not by run count — adding runs can't beat a stale baseline. This is the core reason the cheap tier has a ceiling.

3. Why cap the cheap tier at 5

Cheap-tier reliable floor (fresh baseline): 3/3 → ~1.9%, 5/5 → ~1.5%, 10/10 → ~1.1%. Doubling 5→10 only shaves ~0.4pts and still can't cross the ~1% drift wall — not worth the per-PR latency. So we cap at 5 and let Tier 2 handle anything finer.

4. Why ABBA (pairing) for the tiebreaker

Interleaving A/B/B/A and analyzing the per-pair difference cancels machine drift (it hits both sides equally) — so the tiebreaker is immune to the drift wall that limits the cached path. Pairing is also far more powerful than an unpaired two-sample comparison.

5. Sizing & power (ethrex pair-diff sd ≈ 1.2%)

For 80% power: ~12 pairs → 1%, ~18 → 0.8%, ~32 → 0.6% (variance scales with 1/resolution², so 0.6% costs ~3× a 1% call).

pairs ~min power 1.0% power 0.8% power 0.6%
16 27 92% 76% 52%
20 (default) 33 96% 85% 61%
24 40 98% 90% 69%
32 53 ~100% 96% 81%

Default 20 (solid on 0.8–1%); use /bench-abba 32 for ~0.6%.

6. Stats: paired-t + exact Wilcoxon (and why both)

  • paired-t → magnitude + 95% CI; efficient on clean data, but a single glitchy run can yank the mean.
  • exact Wilcoxon signed-rank → robust cross-check, computed from the exact null distribution via a pure-stdlib generating-function DP (no scipy; the normal approximation is loose at small n).
  • median alongside. The verdict reconciles them: agree → confident; disagree → an outlier is distorting the mean, trust the median + add pairs. The script also prints server-stability metrics (run-to-run jitter + within-session drift) so candidate servers can be compared.

7. Validation on #696 (perf/logup-fingerprint-constants)

  • 8 pairs: inconclusive — mean +0.60%, median +0.18%, CI [-0.20%, +1.39%], Wilcoxon p≈0.15 (mean inflated by 2 outlier pairs).
  • 24 pairs: real ~0.9% speedup — mean +0.90%, median +0.88%, CI [+0.38%, +1.41%], exact Wilcoxon p≈0.002, 19/24 pairs favor the PR.
  • Lesson baked into the defaults: don't trust sub-1% calls at n≈8; ~20+ pairs needed on this workload.

Usage

  • /bench → Tier 1 (3 runs, ~quick), /bench 5 for max cheap precision.
  • /bench-abba → Tier 2 (20 pairs, ~33 min), /bench-abba 32 for ~0.6% resolution.

Notes

  • /bench-abba goes live after this merges to main (comment-triggered workflows + the script run from the default branch). The script also lives on bench/abba-manual for manual server runs in the meantime.
  • Follow-up (not in this PR): a nightly baseline refresh keeps the Tier-1 cached baseline fresh (drift ≈ 0), which is what lets the cheap tier stay near its ~1.5% floor instead of drifting toward the ~1% wall.

The single-session cached comparison can't beat the ~1% cross-session drift wall,
so pushing either side past 5 runs buys little resolution. Set
BENCH_RUNS_BASELINE=5 and clamp /bench N to 1-5 (the per-PR default stays 3 for
fast feedback). When a PR shows a small time speedup (<1.5%) that the cheap CI
can't confirm, the comment now suggests running /bench-abba. Also exclude
/bench-abba from the regular bench trigger so it doesn't double-fire.
New issue_comment job (manual-only: a `/bench-abba` comment on a PR from a repo
member; never auto-triggers) that runs the drift-free interleaved A/B/B/A paired
benchmark and posts a paired-t CI + exact Wilcoxon test as a PR comment. It
occupies the single self-hosted bench server for ~30-40 min, hence manual-only.
Optional pair count via `/bench-abba N` (default 20; ~20 resolves 1%, 32 for
0.6%). Adds scripts/bench_abba.sh, which the job invokes to build both binaries
(isolated worktree) and run the pairs.
@MauroToscano

Copy link
Copy Markdown
Contributor Author

/review-ai

@github-actions

Copy link
Copy Markdown

Codex Code Review

Findings

  1. Medium - ABBA can reuse stale binaries for the wrong PR/base commit
    scripts/bench_abba.sh only rebuilds when /tmp/abba_run/cli_A or cli_B is missing, but it does not compare cli_A.sha/cli_B.sha to the requested SHA_A/SHA_B. On a persistent self-hosted runner, a later /bench-abba for another PR can silently benchmark the previous PR’s binaries and post a valid-looking result. Make the cache key include both SHAs, or force rebuild when the recorded SHA files do not exactly match.

  2. Medium - PR head resolution benchmarks the wrong ref for forked or colliding branch names
    .github/workflows/bench-abba.yml records only headRefName, then line 76 passes origin/$HEAD_REF. For fork PRs that ref usually does not exist; worse, if the base repo has a branch with the same name, the benchmark will run that base-repo branch instead of the PR. Use headRefOid like the existing benchmark workflow, or explicitly fetch refs/pull/<PR>/head and pass the resolved SHA to the script.

  3. Medium - /bench-abba N is unbounded and can tie up or exhaust the bench runner
    .github/workflows/bench-abba.yml accepts any digit count and scripts/bench_abba.sh runs that many pairs. Large values can occupy the single self-hosted bench runner for days, and the exact Wilcoxon DP at scripts/bench_abba.sh grows with the rank sum. Clamp to a documented safe range, e.g. 1..32 or 1..64, before starting the job.

No Rust/GPU/VM runtime code changed in this PR. I reviewed statically only, per the sandbox instructions.

Comment thread scripts/bench_abba.sh
Comment thread scripts/bench_abba.sh Outdated
@claude

claude Bot commented Jun 25, 2026

Copy link
Copy Markdown

Review — two-tier benchmarking flow

Reviewed the new bench-abba.yml + bench_abba.sh and the benchmark-pr.yml changes. This is CI/benchmark tooling (no VM/prover code), so the surface is YAML, bash, and embedded Python stats. The design and the stats (exact Wilcoxon via the generating-function DP, paired-t, ABBA interleaving) look sound. Findings:

High

  • Stale-binary reuse in the automated path (bench_abba.sh:81-84, inline). The build cache keys on file existence only, never the requested SHA, and the workflow step never sets REBUILD=1. On the persistent self-hosted runner a second /bench-abba will silently benchmark the previous run's binaries and post a confident-but-wrong result. Compare the cached .sha against SHA_A/SHA_B, or set REBUILD=1 in the workflow.

Low

  • Build failures are undiagnosable (bench_abba.sh:94, inline). cargo build … >/dev/null 2>&1 discards the error; the failure-path comment then shows no cause.
  • Verdict labeling asymmetry (Python verdict block). (lo > 0) != (p < 0.05) only catches improvement-side disagreement between the t-CI and Wilcoxon. A clear regression CI (hi < 0) with a non-significant Wilcoxon falls through to INCONCLUSIVE instead of BORDERLINE. Minor, since the median is still printed.
  • Fork PRsorigin/$HEAD_REF won't resolve for cross-fork PR branches (the head isn't pushed to origin). Fine if /bench-abba is intended for same-repo branches only, which the member-gating suggests.

Nothing blocking beyond the High item. The /bench vs /bench-abba gate exclusion and the comment-body-via-env (injection-safe) handling are both correct.

@github-actions

Copy link
Copy Markdown

AI Review

PR #712 · 3 changed files

Findings

Status Sev Location Finding Found by
confirmed critical scripts/bench_abba.sh:82 Stale cached prover binaries silently reused when SHA no longer matches moonmath
zro/minimax-m3
kimi
openrouter/moonshotai/kimi-k2.7-code
glm
openrouter/z-ai/glm-5.2
confirmed high .github/workflows/bench-abba.yml:52 HEAD ref resolution breaks for fork PRs — uses headRefName instead of headRefOid minimax
minimax/MiniMax-M3
nemotron
openrouter/nvidia/nemotron-3-ultra-550b-a55b
kimi
openrouter/moonshotai/kimi-k2.7-code
glm
openrouter/z-ai/glm-5.2
confirmed high .github/workflows/bench-abba.yml:55 No upper bound on N_PAIRS in bench-abba workflow — bench-server DoS via /bench-abba N minimax
minimax/MiniMax-M3
moonmath
zro/minimax-m3
confirmed high scripts/bench_abba.sh:82 Binary cache validation only warns, doesn't fail on SHA mismatch nemotron
openrouter/nvidia/nemotron-3-ultra-550b-a55b
minimax
minimax/MiniMax-M3
confirmed high scripts/bench_abba.sh:94 Cargo build output fully suppressed, making build failures undebuggable nemotron
openrouter/nvidia/nemotron-3-ultra-550b-a55b
kimi
openrouter/moonshotai/kimi-k2.7-code
minimax
minimax/MiniMax-M3
moonmath
zro/minimax-m3
confirmed high scripts/bench_abba.sh:112 CLI stderr suppressed during prove runs, losing failure diagnostics nemotron
openrouter/nvidia/nemotron-3-ultra-550b-a55b
confirmed medium .github/workflows/bench-abba.yml:29 bench-abba job has no timeout-minutes, unlike every other bench workflow moonmath
zro/minimax-m3
confirmed medium scripts/bench_abba.sh:54 Silent git fetch failure may use stale refs nemotron
openrouter/nvidia/nemotron-3-ultra-550b-a55b
confirmed low scripts/bench_abba.sh:40 Default REF_A is a hardcoded PR branch leftover kimi
openrouter/moonshotai/kimi-k2.7-code
nemotron
openrouter/nvidia/nemotron-3-ultra-550b-a55b
moonmath
zro/minimax-m3
confirmed low scripts/bench_abba.sh:94 Incremental build in shared target directory could cause subtle issues nemotron
openrouter/nvidia/nemotron-3-ultra-550b-a55b
confirmed low scripts/bench_abba.sh:137 python3 is invoked without checking availability — depends on implicit toolchain on bench runner minimax
minimax/MiniMax-M3

Status column reflects the verdict from the verifier: deepseek-verifier (openrouter/deepseek/deepseek-v4-pro).

AI-001: Stale cached prover binaries silently reused when SHA no longer matches
  • Status: confirmed
  • Severity: critical
  • Location: scripts/bench_abba.sh:82
  • Found by: moonmath:zro/minimax-m3, kimi:openrouter/moonshotai/kimi-k2.7-code, glm:openrouter/z-ai/glm-5.2
  • Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
  • Rejected by: -

Claim

need_build only checks whether /tmp/abba_run/cli_A and cli_B exist on disk; it never compares the SHA written next to each cached binary (cli_A.sha / cli_B.sha) against the SHA that the current invocation requires (SHA_A / SHA_B). If a different PR runs /bench-abba, or main has advanced since the last run, the script will reuse the previous PR's PR-binary and an older main-binary, print only a soft "verify these match" notice, and produce a verdict that compares the wrong code.

Evidence

Lines 80-107: need_build=0; if [ "${REBUILD:-0}" = "1" ] || [ ! -x "$WORK/cli_A" ] || [ ! -x "$WORK/cli_B" ]; then need_build=1; fi. The else branch prints cached A/B SHAs and the requested A/B SHAs but never rebuilds on mismatch. The "/tmp/abba_run" cache lives on a persistent self-hosted runner, and the workflow invokes the script once per issue (different PRs and different main SHAs over time). The verdict output drives a PR-decision comment, so silently wrong measurements are a correctness regression.

Suggested fix

Compare SHA_A/SHA_B against the contents of "$WORK/cli_A.sha"/"$WORK/cli_B.sha" (with cat ... 2&gt;/dev/null) and add them to the need_build condition, e.g.: CACHED_A=$(cat "$WORK/cli_A.sha" 2&gt;/dev/null || echo "") and require [ "$CACHED_A" = "$SHA_A" ] &amp;&amp; [ "$CACHED_B" = "$SHA_B" ]. If they don't match, force a rebuild and overwrite the cached .sha files.

AI-002: HEAD ref resolution breaks for fork PRs — uses headRefName instead of headRefOid
  • Status: confirmed
  • Severity: high
  • Location: .github/workflows/bench-abba.yml:52
  • Found by: minimax:minimax/MiniMax-M3, nemotron:openrouter/nvidia/nemotron-3-ultra-550b-a55b, kimi:openrouter/moonshotai/kimi-k2.7-code, glm:openrouter/z-ai/glm-5.2
  • Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
  • Rejected by: -

Claim

The workflow resolves the PR head via --json headRefName, then passes "origin/$HEAD_REF" to bench_abba.sh. For PRs from forks, the head branch lives in the fork, not in origin/, so git rev-parse origin/$HEAD_REF fails and the workflow exits with a cryptic error.

Evidence

bench-abba.yml 'Resolve PR head + pair count' step: HEAD_REF=$(gh pr view "$PR_NUM" --repo "$GITHUB_REPOSITORY" --json headRefName -q .headRefName) then runs scripts/bench_abba.sh "origin/$HEAD_REF" origin/main "$PAIRS". Compare benchmark-pr.yml line 84 which uses --json headRefOid -q .headRefOid and checks out by SHA. bench_abba.sh line 55 then does git rev-parse "$REF_A" at a later point in time.

Suggested fix

Resolve headRefOid in the workflow (as benchmark-pr.yml does) and pass the SHA to bench_abba.sh, or pin the ref before the build. At minimum, echo the resolved SHA_A/SHA_B into the posted GitHub comment so a force-push discrepancy is visible.

AI-004: No upper bound on N_PAIRS in bench-abba workflow — bench-server DoS via /bench-abba N
  • Status: confirmed
  • Severity: high
  • Location: .github/workflows/bench-abba.yml:55
  • Found by: minimax:minimax/MiniMax-M3, moonmath:zro/minimax-m3
  • Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
  • Rejected by: -

Claim

The pairs value is parsed from the comment body with no clamp. A member-level comment like /bench-abba 10000 (or any large number) launches 2*N prove runs (~50s each on ethrex), occupying the single self-hosted bench runner for hours or days.

Evidence

bench-abba.yml line 55-56: N=$(echo "$COMMENT_BODY" | sed -n 's|^/bench-abba[[:space:]]*\([0-9]\+\).*|\1|p'); echo "pairs=${N:-20}" &gt;&gt; "$GITHUB_OUTPUT" is unvalidated, and downstream scripts/bench_abba.sh "...$PAIRS" consumes it directly. bench_abba.sh only warns on odd parity (lines 59-61) and otherwise accepts any N. The comment-help line in the benchmark-pr.yml escalation message says "32 for ~0.6%" — implying 32 is the intended practical cap.

Suggested fix

Clamp the parsed N in the workflow step (mirror the [1,5] clamp pattern already used in benchmark-pr.yml), e.g. a small awk/bc expression that defaults to 20 when missing and otherwise floors to [1,64] (or whatever the documented max is), emitting a ::warning:: when clamping. Optionally also set jobs.abba.timeout-minutes so a runaway invocation is killed by Actions rather than relying on clamp.

AI-007: Binary cache validation only warns, doesn't fail on SHA mismatch
  • Status: confirmed
  • Severity: high
  • Location: scripts/bench_abba.sh:82
  • Found by: nemotron:openrouter/nvidia/nemotron-3-ultra-550b-a55b, minimax:minimax/MiniMax-M3
  • Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
  • Rejected by: -

Claim

The script checks if cached binaries exist but only prints a warning if their recorded SHAs don't match the requested REFs. It proceeds with potentially stale/wrong binaries, producing silently incorrect benchmark results.

Evidence

Lines 80–107: need_build=0 unless REBUILD=1 or files are missing. The cached SHAs in $WORK/cli_A.sha / cli_B.sha are never compared to the requested $SHA_A / $SHA_B; the only check is the printout. The self-hosted bench runner is shared across PRs, so the cache persists between invocations from different PRs.

Suggested fix

Compare $SHA_A to $(cat "$WORK/cli_A.sha") and $SHA_B to $(cat "$WORK/cli_B.sha"); if either differs, force rebuild (or at minimum, exit with a clear error unless an explicit --force-cache flag is set).

AI-008: Cargo build output fully suppressed, making build failures undebuggable
  • Status: confirmed
  • Severity: high
  • Location: scripts/bench_abba.sh:94
  • Found by: nemotron:openrouter/nvidia/nemotron-3-ultra-550b-a55b, kimi:openrouter/moonshotai/kimi-k2.7-code, minimax:minimax/MiniMax-M3, moonmath:zro/minimax-m3
  • Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
  • Rejected by: -

Claim

The cargo build command redirects both stdout and stderr to /dev/null. If the build fails, the script exits with a generic error later when the binary is missing, but the actual compilation error is lost.

Evidence

Line 94 redirects all cargo output to /dev/null. The script otherwise relies on set -euo pipefail so a non-zero cargo exit terminates the script; the user only sees the warning printout from the workflow step's "Last log lines" tail (the bottom 30 lines of /tmp/abba_out.txt, which exclude the discarded cargo output).

Suggested fix

Drop the &gt; /dev/null 2&gt;&amp;1 and let cargo's output stream to the captured log (tee in the workflow step already preserves it). If silence is desired for cleanliness, keep stdout quiet but log stderr to a file in $WORK and include it in failure diagnostics.

AI-009: CLI stderr suppressed during prove runs, losing failure diagnostics
  • Status: confirmed
  • Severity: high
  • Location: scripts/bench_abba.sh:112
  • Found by: nemotron:openrouter/nvidia/nemotron-3-ultra-550b-a55b
  • Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
  • Rejected by: -

Claim

The run_prove function redirects stderr to /dev/null. If the prover crashes, OOMs, or errors, the diagnostic output is discarded and the script only reports 'could not parse Proving time'.

Evidence

Line 112: out="$("$1" prove "$ELF" --private-input "$INPUT" -o "$PROOF" --time 2>/dev/null)". On failure, lines 115-118 print a generic error without the actual CLI error message.

Suggested fix

Capture stderr separately or tee it to a log file. At minimum, on parse failure, include stderr in the error output: out="$("$1" prove ... 2>&1)" and check exit code before parsing.

AI-010: bench-abba job has no timeout-minutes, unlike every other bench workflow
  • Status: confirmed
  • Severity: medium
  • Location: .github/workflows/bench-abba.yml:29
  • Found by: moonmath:zro/minimax-m3
  • Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
  • Rejected by: -

Claim

The new job declares runs-on: [self-hosted, bench] but no timeout-minutes. The other bench workflows (bench-vs-nightly.yml line 19 = 720, act4-nightly.yml line 26 = 360) all set explicit timeouts. If a build hangs (e.g. a cargo rebuild that never completes because the worktree target dir is in a bad state, or a prove run that never finishes), the job will run for Actions' default 360 minutes and block the bench runner with no automatic recovery.

Evidence

Diff shows the new bench-abba.yml step block ends at line 104 with runs-on but no timeout-minutes. Default Actions timeout is 360 minutes per job. Combined with the medium finding above (unbounded pair count), a pathological input can hold the runner for hours past the documented "~30-40 min".

Suggested fix

Add timeout-minutes: 90 (or comparable) under jobs.abba so an OOM/hang cannot strand the runner for the full default.

AI-013: Silent git fetch failure may use stale refs
  • Status: confirmed
  • Severity: medium
  • Location: scripts/bench_abba.sh:54
  • Found by: nemotron:openrouter/nvidia/nemotron-3-ultra-550b-a55b
  • Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
  • Rejected by: -

Claim

git fetch origin --quiet || echo "(fetch failed; using local refs)" continues execution even if fetch fails. The script then resolves REFs against potentially stale local refs, building/benchmarking the wrong commits.

Evidence

Line 54: git fetch origin --quiet || echo " (fetch failed; using local refs)". Lines 55-56 then do git rev-parse on REFs that may not be updated. In CI the workflow does fetch-depth: 0, but manual runs or partial fetches could hit this.

Suggested fix

Make fetch failure a hard error in CI context, or at least warn loudly. For the workflow, the checkout with fetch-depth: 0 already handles this, but the script should be robust standalone.

AI-021: Default REF_A is a hardcoded PR branch leftover
  • Status: confirmed
  • Severity: low
  • Location: scripts/bench_abba.sh:40
  • Found by: kimi:openrouter/moonshotai/kimi-k2.7-code, nemotron:openrouter/nvidia/nemotron-3-ultra-550b-a55b, moonmath:zro/minimax-m3
  • Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
  • Rejected by: -

Claim

The default first argument is hardcoded to origin/perf/logup-fingerprint-constants, a branch from PR #696. After that PR is merged, running the script without arguments will target a stale or non-existent branch.

Evidence

Line 40 sets the default to origin/perf/logup-fingerprint-constants; line 26-27 of the header comments call this out explicitly as "default: origin/perf/logup-fingerprint-constants, PR #696". The CI workflow always passes origin/$HEAD_REF, so this default only affects manual invocations, but it's a footgun.

Suggested fix

Default to HEAD (compare against the current commit's previous state) or origin/main, or fail loudly with a usage message when called with no args.

AI-024: Incremental build in shared target directory could cause subtle issues
  • Status: confirmed
  • Severity: low
  • Location: scripts/bench_abba.sh:94
  • Found by: nemotron:openrouter/nvidia/nemotron-3-ultra-550b-a55b
  • Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
  • Rejected by: -

Claim

Both binaries are built in the same worktree/target directory. The second build (REF_A) is incremental on top of the first (REF_B)'s artifacts. If the two commits have different Cargo.lock or dependency versions, cargo may not correctly rebuild everything.

Evidence

Line 90-99: git worktree add at SHA_B, then build_cli SHA_B, then build_cli SHA_A in same $WT. The comment says 'shared target dir -> 2nd build is incremental'. Cargo usually handles this, but it's not guaranteed for all changes.

Suggested fix

Either use separate target directories (CARGO_TARGET_DIR) or do 'cargo clean' between builds. The performance cost is acceptable for a 30-40 minute benchmark.

Reviewer Lanes

Lane Model Prompt Status Findings
glm openrouter/z-ai/glm-5.2 general success 2
kimi openrouter/moonshotai/kimi-k2.7-code general success 4
minimax minimax/MiniMax-M3 general success 7
moonmath zro/minimax-m3 general success 6
nemotron openrouter/nvidia/nemotron-3-ultra-550b-a55b general success 10

Verification Lanes

Lane Model Status Confirmed Rejected Uncertain
deepseek-verifier openrouter/deepseek/deepseek-v4-pro success 11 6 0

Native Codex and Claude reviews run separately and post their own comments. They are not included in this structured provenance report.

Discarded candidates (6) — rejected by the verifier
  • Shell variable expansion in awk script (code injection risk) (scripts/bench_abba.sh:132, found by nemotron:openrouter/nvidia/nemotron-3-ultra-550b-a55b) — Line 132-133: awk "BEGIN{print ($b-$a)/$b100}". The values $a and $b come from run_prove() (line 114) which extracts them via grep -o 'Proving time: [0-9.]' — only digits and dots are captured. $a and $b are first validated to be non-empty (line 115), and the regex guarantees they contain only numeric characters. Awk injection requires non-numeric content (quotes/semicolons/system calls) which can never appear. The awk call simply computes arithmetic on guaranteed-numeric inputs.
  • t-distribution critical value approximation for df not in hardcoded table (scripts/bench_abba.sh:158, found by nemotron:openrouter/nvidia/nemotron-3-ultra-550b-a55b) — For the default 20 pairs, df=19 is directly in the hardcoded table (line 153, value 20:2.086 maps to df=19). For the recommended max of 32 pairs (from line 31-33 comments), df=31 — the table has 30 (2.042) and 35 (2.030). The closest-neighbor picks 30 (2.042). The true t-critical for df=31 at 95% is ~2.040. This is a ~0.1% error in the critical value, producing a CI width difference of ~0.001 percentage points for a 1% mean effect. This is far below measurement noise (pair-noise sd ~1.2%). Not a real practical issue.
  • Concurrency group does not include workflow filename — risk of accidental cross-workflow collision (.github/workflows/bench-abba.yml:14, found by minimax:minimax/MiniMax-M3) — The concurrency group is bench-abba-${{ github.event.issue.number }}. The group only collides with workflows sharing that EXACT prefix AND the same issue.number context. The other workflows use completely different prefixes: benchmark- (benchmark-pr.yml), bench-vs-nightly- (bench-vs-nightly.yml), act4-nightly (act4-nightly.yml). A hypothetical future workflow would need to deliberately adopt the bench-abba- prefix and the same event context to collide. This is purely speculative — there is no current collision risk.
  • trap - EXIT is misleading — does not remove the trap, just resets it to default (scripts/bench_abba.sh:101, found by minimax:minimax/MiniMax-M3) — Line 101: trap - EXIT. Per bash manual: 'If arg is absent (and there is a single sigspec) or -, each specified signal is reset to its original disposition.' For EXIT, the original disposition (at shell startup) is to do nothing — no trap handler. So trap - EXIT DOES effectively unset the trap. After line 100 calls cleanup explicitly, line 101 prevents cleanup from running again on normal script exit. The finding's claim that it 'does not remove the trap' is incorrect.
  • Python analysis opens CSV but never closes it (scripts/bench_abba.sh:140, found by moonmath:zro/minimax-m3) — Line 140: rows = list(csv.DictReader(open(sys.argv[1]))). The file handle from open() is never explicitly closed. In CPython (the only Python implementation targeted here), the file is garbage collected when the process exits. The benchmark script runs once and exits, so the file is freed. This is a style nit, not a bug — no resources leak because the process terminates immediately after. CPython always flushes and closes files on interpreter shutdown.
  • Wilcoxon exact p-value double-counts observed value in tails (scripts/bench_abba.sh:199, found by nemotron:openrouter/nvidia/nemotron-3-ultra-550b-a55b) — Line 199: p = min(1.0, 2.0 * min(sum(poly[:Wp2 + 1]), sum(poly[Wp2:])) / (1 << m)). The observed W+ value IS counted in both the lower tail (poly[:Wp2+1]) and upper tail (poly[Wp2:]). This is a standard approach for two-sided exact p-values with discrete distributions — it produces a conservative p-value by design, which is the correct behavior for hypothesis testing (controls Type I error). The alternative ('mid-p') is controversial. This is not a bug but a deliberate statistical choice documented in the Wilcoxon signed-rank literature. The finding itself calls it 'slightly conservative', which is the intended property.

Raw lane outputs, candidates, final issues, and model metrics are uploaded as workflow artifacts.

…agnostics

Confirmed findings from the multi-model review:
- critical/high: SHA-aware binary cache — rebuild when cli_{A,B}.sha don't match
  the requested SHAs (was existence-only, so a persistent /tmp on the self-hosted
  runner could silently benchmark a previous PR's binaries).
- high: fork-PR head resolution — workflow now resolves headRefOid + fetches
  pull/N/head and passes the SHA (origin/<branch> doesn't exist for forks).
- high: clamp /bench-abba N to [2,40] in the workflow (was unbounded -> DoS).
- high: build output -> per-binary log, surfaced on failure (was >/dev/null).
- high: prove runs capture stderr (2>&1) so prover failures are diagnosable.
- medium: add timeout-minutes: 120 so a hang can't strand the bench runner.
- medium: louder warning on git fetch failure.
- low: REF_A is now required (dropped the hardcoded PR #696 default).
- low: fail fast if python3 is missing (before the ~30-min build).

Deliberately kept: shared cargo target across the two worktree builds (incremental
2nd build; cargo recompiles on source change, REBUILD=1 covers dep changes).
@MauroToscano

Copy link
Copy Markdown
Contributor Author

Addressed the AI review in f6495fb. Mapping each confirmed finding to its fix:

Finding Sev Resolution
AI-001 / AI-007 — stale binary reuse (SHA not checked) crit/high Fixed — cache now rebuilds when cli_{A,B}.sha ≠ requested SHA
AI-002 — fork-PR head resolution high Fixed — workflow resolves headRefOid, fetches pull/N/head, passes the SHA (works for forks; also pins against force-push races)
AI-004 — unbounded N_PAIRS (DoS) high Fixed — clamped to [2,40] in the workflow, ::warning:: on clamp
AI-008 — cargo output suppressed high Fixed — build → $WORK/build_$2.log, tail surfaced on failure
AI-009 — prover stderr suppressed high Fixed — prove runs capture 2>&1; shown on parse failure
AI-010 — no timeout-minutes med Fixedtimeout-minutes: 120 on the job
AI-013 — silent git fetch failure med Addressed — louder warning; kept non-fatal for the standalone path (CI pins the SHA + fetch-depth: 0, so it's moot there)
AI-021 — hardcoded REF_A default (PR #696) low FixedREF_A is now required
python3 availability low Fixed — fail fast if python3 missing, before the ~30-min build
AI-024 — shared cargo target dir low Kept (deliberate) — cargo recompiles on source change, so the incremental 2nd build is correct for the common prover-only PR; REBUILD=1 covers the rare dep-version-drift case. The speed benefit (no 2nd cold build) is worth it.

The 6 verifier-rejected candidates I agree with — notably the "Wilcoxon double-counts the observed value" one: that's the intended conservative two-sided exact p (controls Type-I error), not a bug; and the awk-injection one can't trigger since $a/$b are regex-validated to digits/dots.

@MauroToscano MauroToscano enabled auto-merge June 26, 2026 14:50
@MauroToscano MauroToscano added this pull request to the merge queue Jun 26, 2026
Merged via the queue into main with commit fdf3bfb Jun 26, 2026
12 checks passed
@MauroToscano MauroToscano deleted the bench/baseline-runs-10 branch June 26, 2026 18:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants