ci(bench): two-tier benchmarking — cheap-tier knobs + on-demand /bench-abba tiebreaker #710

Closed
MauroToscano wants to merge 1 commit into main from bench/baseline-runs-10

@MauroToscano commented on Jun 25, 2026

Builds a two-tier PR benchmarking flow so we can trust small (~1%) prover deltas without slowing every PR.

Tier 1 — the cheap screen (unchanged cost, auto on /bench)

  • BENCH_RUNS_BASELINE: 3 → 5; clamp /bench N to 1–5 (per-PR default stays 3 for fast feedback).
  • Why cap at 5: the single-session comparison against the cached baseline cannot resolve differences below the ~1% cross-session drift wall, so extra runs buy little. Measured run-to-run CV is ~0.5–0.9%, which gives a reliable detection floor of ~1.5% at 5 runs versus ~1.1% at 10; the gain is not worth the added latency.
  • When a PR shows a small time speedup (<1.5%) that the cheap tier can't confirm, the benchmark comment now suggests escalating to /bench-abba.
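
For illustration, the /bench argument handling described above could look roughly like the sketch below. The function name `parse_bench_runs` and the regex are hypothetical; the actual workflow may parse the comment differently (for example in shell inside the workflow file). Only the default of 3 and the 1–5 clamp come from this description.

```python
import re
from typing import Optional

DEFAULT_RUNS = 3            # per-PR default stated in this PR
MIN_RUNS, MAX_RUNS = 1, 5   # clamp range stated in this PR


def parse_bench_runs(comment: str) -> Optional[int]:
    """Return the clamped run count for a /bench comment.

    Returns None when the comment is not a plain /bench command, which
    also keeps /bench-abba from firing the cheap tier.
    """
    match = re.fullmatch(r"/bench(?:\s+(\d+))?", comment.strip())
    if match is None:
        return None
    if match.group(1) is None:
        return DEFAULT_RUNS
    return max(MIN_RUNS, min(MAX_RUNS, int(match.group(1))))
```

With this sketch, `/bench` maps to 3 runs, `/bench 10` is clamped to 5, and `/bench-abba 32` is ignored by the cheap tier.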

Tier 2 — the on-demand tiebreaker (/bench-abba, manual only)

New workflow bench-abba.yml + scripts/bench_abba.sh:

  • Triggered only by a /bench-abba comment on a PR from a repo member. Never auto-triggers — it occupies the single bench server for ~30–40 min, and posts a "server occupied" notice when it starts.
  • Runs a drift-free interleaved A/B/B/A paired benchmark (builds the PR and main CLI in an isolated worktree), then posts a paired-t confidence interval and an exact Wilcoxon signed-rank test (pure stdlib, no scipy) as a PR comment.
  • Optional pair count: /bench-abba 32 (default 20 → resolves ~1%; 32 → ~0.6%).
  • /bench-abba is excluded from the regular /bench trigger so it doesn't double-fire.
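
As a rough sketch of what "pure-stdlib" paired statistics can look like (this is not the code in scripts/bench_abba.sh; the function names are hypothetical): the exact Wilcoxon signed-rank p-value can be obtained by counting sign assignments with a small dynamic program, and the paired-t interval only needs the mean and standard deviation of the per-pair differences. The standard library has no Student-t quantile, so this sketch assumes the caller supplies the critical value (for example about 2.093 for 20 pairs at 95%).

```python
import math
import statistics


def paired_t_ci(diffs, t_crit):
    """Confidence interval for the mean paired difference.

    t_crit is the two-sided Student-t critical value for len(diffs) - 1
    degrees of freedom; it is passed in because the stdlib cannot compute it.
    """
    mean = statistics.mean(diffs)
    half = t_crit * statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean - half, mean + half


def wilcoxon_exact(diffs):
    """Return (W+, two-sided exact p-value) for paired differences."""
    d = [x for x in diffs if x != 0]  # zero differences carry no sign
    n = len(d)
    if n == 0:
        raise ValueError("all differences are zero")

    # Midranks of |d|, doubled so tied ranks stay integers.
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks2 = [0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks2[order[k]] = i + j + 2  # 2 * average of ranks i+1..j+1
        i = j + 1

    w_plus2 = sum(r for r, x in zip(ranks2, d) if x > 0)
    total2 = sum(ranks2)

    # counts[s] = number of sign assignments whose doubled W+ equals s.
    counts = [0] * (total2 + 1)
    counts[0] = 1
    for r in ranks2:
        for s in range(total2, r - 1, -1):
            counts[s] += counts[s - r]

    # The null distribution is symmetric, so double the smaller tail.
    lower = min(w_plus2, total2 - w_plus2)
    p = min(1.0, 2 * sum(counts[: lower + 1]) / 2 ** n)
    return w_plus2 / 2, p
```

Here `diffs` would be the per-pair time differences (PR minus main, or the reverse, as long as the sign convention is consistent). The dynamic program is cheap at these sizes: for 32 pairs the doubled rank sum is only 1056.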

Why two tiers

Empirically (ethrex 20-transfer, 50s/run): the cheap path resolves ≥1.5% but is drift-limited near 1%; the ABBA path cancels drift via pairing and resolves ~1% with ~20 pairs, ~0.6% with ~32 — at the cost of ~30–40 min of server time, so it's opt-in. PR #696 was confirmed a real ~0.9% speedup this way (inconclusive at 8 pairs, clean at 24).
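
To see why the interleaving cancels drift, here is a small deterministic sketch under simplifying assumptions: drift is modeled as a purely linear, additive slowdown per run slot, A stands for main and B for the PR, and the helper names are made up for this example. Real drift is not perfectly linear, so the cancellation in practice is approximate.

```python
def abba_order(pairs):
    """Alternate AB, BA, AB, ... so both arms sit equally early and late."""
    order = []
    for i in range(pairs):
        order += ["A", "B"] if i % 2 == 0 else ["B", "A"]
    return order


def sequential_order(pairs):
    """All A runs first, then all B runs (no interleaving)."""
    return ["A"] * pairs + ["B"] * pairs


def simulate(order, base, drift_per_slot):
    """Noise-free run times: base time of the arm plus linear drift."""
    return [(arm, base[arm] + drift_per_slot * slot)
            for slot, arm in enumerate(order)]


def mean_diff(runs):
    """Mean B time minus mean A time."""
    a = [t for arm, t in runs if arm == "A"]
    b = [t for arm, t in runs if arm == "B"]
    return sum(b) / len(b) - sum(a) / len(a)


base = {"A": 50.0, "B": 49.5}  # hypothetical: B is truly 0.5 s (1%) faster
drift = 0.1                    # hypothetical: machine slows 0.1 s per slot

abba = mean_diff(simulate(abba_order(4), base, drift))
seq = mean_diff(simulate(sequential_order(4), base, drift))
print(f"ABBA estimate:       {abba:+.3f} s")
print(f"sequential estimate: {seq:+.3f} s")
```

In this toy setup the ABBA estimate recovers the true -0.5 s difference, while the sequential ordering reports only about -0.1 s because the B runs absorb the later, slower slots.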

The cached baseline is the amortized side of the PR-vs-baseline comparison
(computed once per main push, off the PR critical path), yet it ran at the
same n=3 as the PR. With unequal n the smaller-n side dominates the comparison
noise: at n_pr=3 / n_base=3 the baseline carries ~50% of the variance (~77%
when a PR uses /bench 10). Bumping it to 10 lowers the 95% detection floor on
a fresh baseline with no added PR-side latency. 10 is the current clamp ceiling.
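
The variance shares quoted above follow from the variance of a difference of two means, assuming both sides have the same per-run variance (an assumption of this sketch, not something measured here). The function name is hypothetical.

```python
def baseline_variance_share(n_pr, n_base):
    """Fraction of Var(mean_pr - mean_base) contributed by the baseline.

    With equal per-run variance s^2 on both sides, the difference has
    variance s^2 * (1/n_pr + 1/n_base), so s^2 cancels out of the ratio.
    """
    return (1 / n_base) / (1 / n_base + 1 / n_pr)


print(f"{baseline_variance_share(3, 3):.0%}")   # n_pr=3,  n_base=3
print(f"{baseline_variance_share(10, 3):.0%}")  # n_pr=10, n_base=3
```

This reproduces the ~50% and ~77% figures; calling it with a larger `n_base` shows how the baseline's share shrinks as its run count grows.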

@MauroToscano changed the title from "ci(bench): raise baseline run count 3 -> 10" to "ci(bench): two-tier benchmarking — cheap-tier knobs + on-demand /bench-abba tiebreaker" on Jun 25, 2026

@MauroToscano commented:

Reopened as #712 (GitHub wouldn't reopen this one after a force-push orphaned its closing commit). Same branch, full analysis in the new description.
