ci(bench): two-tier benchmarking — cheap-tier knobs + on-demand /bench-abba tiebreaker #710
Closed
MauroToscano wants to merge 1 commit into
The cached baseline is the amortized side of the PR-vs-baseline comparison (computed once per main push, off the PR critical path), yet it ran at the same n=3 as the PR. With unequal n the smaller-n side dominates the comparison noise: at n_pr=3 / n_base=3 the baseline carries ~50% of the variance (~77% when a PR uses /bench 10). Bumping it to 10 lowers the 95% detection floor on a fresh baseline with no added PR-side latency. 10 is the current clamp ceiling.
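The percentages above follow from the variance of a difference of two means. As a rough sketch, assuming independent runs with the same per-run variance on both sides (an assumption not stated in the commit message), the variance of the difference is proportional to 1/n_pr + 1/n_base, and the baseline's share is its own term divided by that sum. The function name here is hypothetical and is not part of the PR.

```python
def baseline_variance_share(n_pr: int, n_base: int) -> float:
    """Fraction of Var(mean_pr - mean_base) contributed by the baseline.

    Assumes independent runs with equal per-run variance on both sides,
    so the common sigma^2 cancels out of the ratio.
    """
    if n_pr < 1 or n_base < 1:
        raise ValueError("run counts must be positive")
    return (1 / n_base) / (1 / n_pr + 1 / n_base)


if __name__ == "__main__":
    # (n_pr, n_base) pairs: the two cases quoted above, then a baseline of 10.
    for n_pr, n_base in [(3, 3), (10, 3), (3, 10)]:
        share = baseline_variance_share(n_pr, n_base)
        print(f"n_pr={n_pr:2d} n_base={n_base:2d} baseline share={share:.0%}")
```

Under the same assumption, raising the baseline to 10 runs while the PR stays at 3 drops the baseline's share to roughly 23%, so the PR side then dominates the noise.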
Reopened as #712 (GitHub wouldn't reopen this one after a force-push orphaned its closing commit). Same branch, full analysis in the new description.
Builds a two-tier PR benchmarking flow so we can trust small (~1%) prover deltas without slowing every PR.
Tier 1: the cheap screen (unchanged cost, auto on /bench)
- BENCH_RUNS_BASELINE: 3 → 5; clamp /bench N to 1–5 (per-PR default stays 3 for fast feedback).
- [...] /bench-abba.

Tier 2: the on-demand tiebreaker (/bench-abba, manual only)
New workflow bench-abba.yml + scripts/bench_abba.sh:
- [...] /bench-abba comment on a PR from a repo member. Never auto-triggers: it occupies the single bench server for ~30–40 min, and posts a "server occupied" notice when it starts.
- [...] main cli in an isolated worktree), then posts a paired-t CI + exact Wilcoxon test (pure-stdlib, no scipy) as a PR comment.
- [...] /bench-abba 32 (default 20 → resolves ~1%; 32 → ~0.6%).
- /bench-abba is excluded from the regular /bench trigger so it doesn't double-fire.

Why two tiers
Empirically (ethrex 20-transfer, 50s/run): the cheap path resolves ≥1.5% but is drift-limited near 1%; the ABBA path cancels drift via pairing and resolves ~1% with ~20 pairs and ~0.6% with ~32, at the cost of ~30–40 min of server time, so it's opt-in. PR #696 was confirmed as a real ~0.9% speedup this way (inconclusive at 8 pairs, clean at 24).
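To illustrate how an exact Wilcoxon signed-rank test on paired differences can be done with only the standard library, here is a minimal sketch. It is not the code in scripts/bench_abba.sh; the function name and the handling of zeros and ties are assumptions made for illustration. It enumerates the null distribution of the positive-rank sum by dynamic programming over the ranks 1..n, which stays cheap at the pair counts mentioned here (20 to 32).

```python
def wilcoxon_signed_rank_exact(diffs):
    """Two-sided exact Wilcoxon signed-rank test on paired differences.

    Returns (w_plus, p_value). Zero differences are dropped. Tied absolute
    differences are rejected, because the simple integer-rank enumeration
    below is only exact when every rank is distinct.
    """
    d = [x for x in diffs if x != 0]
    n = len(d)
    if n == 0:
        raise ValueError("need at least one non-zero difference")
    if len({abs(x) for x in d}) != n:
        raise ValueError("tied absolute differences are not handled here")

    # Rank the differences by absolute value (1 = smallest).
    order = sorted(range(n), key=lambda i: abs(d[i]))
    w_plus = sum(rank for rank, i in enumerate(order, start=1) if d[i] > 0)

    # counts[s] = number of sign assignments whose positive ranks sum to s.
    total = n * (n + 1) // 2
    counts = [0] * (total + 1)
    counts[0] = 1
    for r in range(1, n + 1):
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]

    # Two-sided p-value: double the smaller tail, capped at 1.
    w_min = min(w_plus, total - w_plus)
    tail = sum(counts[: w_min + 1])
    p_value = min(1.0, 2 * tail / 2 ** n)
    return w_plus, p_value


if __name__ == "__main__":
    # Hypothetical per-pair differences (PR time minus main time), in seconds.
    example = [-0.41, -0.52, 0.13, -0.38, -0.47, -0.29, 0.08, -0.55]
    print(wilcoxon_signed_rank_exact(example))
```

One property of the exact test is that the smallest attainable two-sided p-value with n pairs is 2/2^n, so very small pair counts cannot reach conventional significance however consistent the differences are.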