ci(bench): two-tier benchmarking — cheap-tier knobs + on-demand /bench-abba tiebreaker by MauroToscano · Pull Request #710 · yetanotherco/lambda_vm

MauroToscano · 2026-06-25T15:31:40Z

Builds a two-tier PR benchmarking flow so we can trust small (~1%) prover deltas without slowing every PR.

Tier 1 — the cheap screen (unchanged cost, auto on `/bench`)

BENCH_RUNS_BASELINE: 3 → 5; clamp /bench N to 1–5 (per-PR default stays 3 for fast feedback).
Why cap at 5: the single-session cached comparison can't beat the ~1% cross-session drift wall, so more runs buy little — measured run-to-run CV is ~0.5–0.9%, giving a reliable floor of ~1.5% at 5 runs vs ~1.1% at 10. Not worth the latency.
When a PR shows a small time speedup (<1.5%) the cheap CI can't confirm, the benchmark comment now suggests escalating to /bench-abba.
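The clamp described above can be sketched as a small comment parser. This is only an illustration of the stated rule (default 3, clamp to 1–5), not the workflow's actual implementation: the regular expression, the function name `parse_bench_runs`, and the choice to clamp out-of-range values rather than reject them are all assumptions.

```python
import re

DEFAULT_RUNS = 3            # per-PR default stated in the description
MIN_RUNS, MAX_RUNS = 1, 5   # clamp range stated in the description

# Hypothetical pattern: matches "/bench" or "/bench N", but not "/bench-abba".
_BENCH_RE = re.compile(r"^/bench(?:\s+(\d+))?\s*$")


def parse_bench_runs(comment):
    """Return the clamped run count for a /bench comment, or None if it is not one."""
    match = _BENCH_RE.match(comment.strip())
    if match is None:
        return None
    requested = int(match.group(1)) if match.group(1) else DEFAULT_RUNS
    # Clamp instead of rejecting, so "/bench 10" still runs (at 5).
    return max(MIN_RUNS, min(MAX_RUNS, requested))
```

Under these assumptions `/bench 10` resolves to 5 runs, a bare `/bench` to 3, and any `/bench-abba` comment is ignored by this parser.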

Tier 2 — the on-demand tiebreaker (`/bench-abba`, manual only)

New workflow bench-abba.yml + scripts/bench_abba.sh:

Triggered only by a /bench-abba comment on a PR from a repo member. Never auto-triggers — it occupies the single bench server for ~30–40 min, and posts a "server occupied" notice when it starts.
Runs a drift-free interleaved A/B/B/A paired benchmark (builds the PR and main cli in an isolated worktree), then posts a paired-t CI + exact Wilcoxon test (pure-stdlib, no scipy) as a PR comment.
Optional pair count: /bench-abba 32 (default 20 → resolves ~1%; 32 → ~0.6%).
/bench-abba is excluded from the regular /bench trigger so it doesn't double-fire.
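For readers unfamiliar with the exact Wilcoxon signed-rank test, here is a minimal pure-stdlib sketch of how such a p-value can be computed from paired differences. It is not the code in scripts/bench_abba.sh: the function name `wilcoxon_exact` is hypothetical, and the sketch assumes no zero differences and no tied magnitudes, which a real script would have to handle.

```python
def wilcoxon_exact(diffs):
    """Two-sided exact Wilcoxon signed-rank test on paired differences.

    Returns (w_plus, p_value). Assumes no zeros and no tied |diff| values.
    """
    n = len(diffs)
    magnitudes = [abs(d) for d in diffs]
    if n == 0 or 0 in magnitudes or len(set(magnitudes)) != n:
        raise ValueError("sketch requires non-zero, untied differences")

    # Rank the differences by magnitude (1 = smallest) and sum positive ranks.
    order = sorted(range(n), key=lambda i: magnitudes[i])
    w_plus = sum(rank for rank, i in enumerate(order, start=1) if diffs[i] > 0)

    # counts[s] = number of subsets of {1..n} whose ranks sum to s.
    # Under the null hypothesis every sign pattern is equally likely (2**n total).
    max_sum = n * (n + 1) // 2
    counts = [0] * (max_sum + 1)
    counts[0] = 1
    for rank in range(1, n + 1):
        for s in range(max_sum, rank - 1, -1):
            counts[s] += counts[s - rank]

    # Two-sided p-value: double the smaller tail, capped at 1.
    w_min = min(w_plus, max_sum - w_plus)
    tail = sum(counts[: w_min + 1])
    return w_plus, min(1.0, 2.0 * tail / 2 ** n)
```

With every one of n differences pointing the same way, the two-sided p-value is 2 / 2^n, so a handful of pairs cannot reach conventional significance on its own.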

Why two tiers

Empirically (ethrex 20-transfer, ~50s/run): the cheap path resolves ≥~1.5% but is drift-limited near 1%; the ABBA path cancels drift via pairing and resolves ~1% with ~20 pairs and ~0.6% with ~32, at the cost of ~30–40 min of server time, so it's opt-in. PR #696 was confirmed a real ~0.9% speedup this way (inconclusive at 8 pairs, clean at 24).
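A toy model shows why interleaving cancels drift. The sketch below assumes a purely linear drift per run slot and no other noise, which is a simplification; the 50 s base time echoes the figure above, while the drift rate and the 1% true delta are made-up illustration values.

```python
from statistics import mean


def measured_delta(order, true_delta, drift_per_slot, base=50.0):
    """Return mean(B) - mean(A) for a run order under linear drift.

    `order` is a string such as "ABBA"; each character is one run slot.
    """
    times = {"A": [], "B": []}
    for slot, label in enumerate(order):
        runtime = base + drift_per_slot * slot  # machine gets slower over time
        if label == "B":
            runtime += true_delta               # B is the PR build
        times[label].append(runtime)
    return mean(times["B"]) - mean(times["A"])


# True effect: B is 0.5 s (1%) faster. Drift: +0.2 s per slot.
abba = measured_delta("ABBA", true_delta=-0.5, drift_per_slot=0.2)
aabb = measured_delta("AABB", true_delta=-0.5, drift_per_slot=0.2)
print(f"ABBA: {abba:+.2f} s, AABB: {aabb:+.2f} s")
```

In this model the A/B/B/A order recovers the true -0.5 s difference, while running all A runs before all B runs reports about -0.1 s, because the drift is charged entirely to B.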

The cached baseline is the amortized side of the PR-vs-baseline comparison (computed once per main push, off the PR critical path), yet it ran at the same n=3 as the PR. With unequal n the smaller-n side dominates the comparison noise: at n_pr=3 / n_base=3 the baseline carries ~50% of the variance (~77% when a PR uses /bench 10). Bumping it to 10 lowers the 95% detection floor on a fresh baseline with no added PR-side latency. 10 is the current clamp ceiling.
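The variance shares quoted above follow from the variance of a difference of two means. As a quick check, assuming independent runs with the same per-run variance on both sides (the function name is illustrative):

```python
def baseline_variance_share(n_pr, n_base):
    """Fraction of Var(mean_pr - mean_base) contributed by the baseline.

    With equal per-run variance s^2, Var(mean) = s^2 / n on each side,
    so s^2 cancels out of the ratio.
    """
    var_base = 1.0 / n_base
    var_pr = 1.0 / n_pr
    return var_base / (var_base + var_pr)


for n_pr, n_base in [(3, 3), (10, 3), (10, 10)]:
    share = baseline_variance_share(n_pr, n_base)
    print(f"n_pr={n_pr:2d} n_base={n_base:2d} -> baseline share {share:.0%}")
```

This reproduces the ~50% figure at n_pr=3 / n_base=3 and the ~77% figure at n_pr=10 / n_base=3; raising n_base to match n_pr brings the share back to 50%.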

MauroToscano · 2026-06-25T18:33:30Z

Reopened as #712 (GitHub wouldn't reopen this one after a force-push orphaned its closing commit). Same branch, full analysis in the new description.

MauroToscano closed this Jun 25, 2026

MauroToscano changed the title ~~ci(bench): raise baseline run count 3 -> 10~~ ci(bench): two-tier benchmarking — cheap-tier knobs + on-demand /bench-abba tiebreaker Jun 25, 2026

MauroToscano mentioned this pull request Jun 25, 2026

ci(bench): two-tier benchmarking — cheap-tier knobs + on-demand /bench-abba tiebreaker #712

Merged

MauroToscano wants to merge 1 commit into main from bench/baseline-runs-10
