Skip to content

perf(stark): fuse composition half-extension onto coset_lde_full (precomputed twiddles)#700

Merged
MauroToscano merged 8 commits into
mainfrom
perf/composition-extend-fusion
Jun 26, 2026
Merged

perf(stark): fuse composition half-extension onto coset_lde_full (precomputed twiddles)#700
MauroToscano merged 8 commits into
mainfrom
perf/composition-extend-fusion

Conversation

@diegokingston

@diegokingston diegokingston commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator

What

Follow-up to #699. Now that degree-3 tables take the 2-part decompose_and_extend_d2 path, fuse its inner extend_half_to_lde.

It previously did the g²→g coset extension as two separate FFTs with an intermediate coefficient Polynomial:

let poly = Polynomial::interpolate_offset_fft(half_evals,)?;   // iFFT + alloc coeffs
evaluate_polynomial_on_lde_domain(&poly, blowup, n_trace, g)?     // forward FFT

Replaced with a single fused coset_lde_full: iFFT(n) → coset reshift g²→g → forward FFT(2n) in one pass, no intermediate coefficient Vec.

Precomputation

The weights g⁻ʲ/n (folding the 1/n iFFT normalization and the net g²→g shift) and the inverse twiddles (size lde_size/2) are precomputed once per domain in LdeTwiddles (the forward FFT reuses the existing fwd twiddles), and threaded through prove_rounds_2_to_4 → round_2 → decompose_and_extend_d2 → extend_half_to_lde. No per-call recomputation (previously the weights + both twiddle sets were rebuilt on each of the 2 halves × every table).

Byte-identical

Pure refactor of how the half-extension is computed — output unchanged.

  • test_decompose_and_extend_d2_matches_original: decompose_and_extend_d2's output equals the original interpolate_offset_fft + break_in_parts + evaluate_polynomial_on_lde_domain path.
  • New composition_extend_half_fused_matches_reference: the fused weights/twiddles reproduce the reference iFFT(g²)→FFT(g) byte-for-byte across sizes.
  • stark lib 130/130; real VM proof (fib_iterative_1200k) prove+verify OK; clippy + fmt clean.

decompose_and_extend_d2's extend_half_to_lde did iFFT(g²) → coefficient
Polynomial → evaluate_polynomial_on_lde_domain(g) as two separate FFTs with
an intermediate coefficient allocation per half. Replace with a single fused
coset_lde_full: iFFT(n) → coset reshift g²→g → forward FFT(2n=lde_size).

The weights (g⁻ʲ/n, folding the 1/n iFFT normalization and the net g²→g
shift) and the inverse twiddles (size lde_size/2) are precomputed once per
domain in LdeTwiddles (the forward FFT reuses the existing fwd twiddles), and
threaded through prove_rounds_2_to_4 → round_2 → decompose_and_extend_d2 — no
per-call recomputation. This path is now production (degree-3 tables use the
2-part decompose_and_extend_d2 after #699).

Byte-identical: test_decompose_and_extend_d2_matches_original (decompose output
== original break_in_parts path), a new formula test, stark 130/130, real VM
proof (fib_iterative_1200k) prove+verify OK, clippy + fmt clean.
@diegokingston

Copy link
Copy Markdown
Collaborator Author

/bench 5

@github-actions

github-actions Bot commented Jun 22, 2026

Copy link
Copy Markdown

Benchmark — ethrex 20 transfers (median of 3)

Table parallelism: auto (cores / 3)

Metric main PR Δ
Peak heap 81596 MB 81997 MB +401 MB (+0.5%) ⚪
Prove time 44.206s 43.329s -0.877s (-2.0%) ⚪

✅ No significant change.

✅ Low variance (time: 1.0%, heap: 1.1%)

Commit: 7e9b048 · Baseline: cached · Runner: self-hosted bench

@diegokingston diegokingston changed the title perf(stark): fuse composition half-extension onto coset_lde_full perf(stark): fuse composition half-extension onto coset_lde_full (precomputed twiddles) Jun 22, 2026
@diegokingston diegokingston marked this pull request as ready for review June 23, 2026 19:29
@diegokingston

Copy link
Copy Markdown
Collaborator Author

/bench

Comment thread crypto/stark/src/prover.rs Outdated
@MauroToscano

Copy link
Copy Markdown
Contributor

/bench

@MauroToscano

Copy link
Copy Markdown
Contributor

/bench-growth

@MauroToscano

Copy link
Copy Markdown
Contributor

/review-ai

@github-actions

Copy link
Copy Markdown

Codex Code Review

Findings

  • Medium - Unconditional composition twiddle allocation can materially increase prover memory.
    crypto/stark/src/prover.rs:326 and crypto/stark/src/prover.rs:355 now build comp_weights, comp_inv, and comp_fwd for every LdeTwiddles::new, and multi_prove constructs that cache for every domain during pre-pass at crypto/stark/src/prover.rs:1958. These are only used when number_of_parts == 2 at crypto/stark/src/prover.rs:1272. For number_of_parts == 1 AIRs, this adds large unused full/half-domain twiddle vectors and weights, which can be significant on large traces and undermines the prior release-build effort to avoid carrying full LayerTwiddles. Consider splitting/lazily constructing the composition cache only for domains that actually need the d2 decomposition path.

No build or test commands were run, per the static-review constraints.

Comment thread crypto/stark/src/prover.rs Outdated
@claude

claude Bot commented Jun 26, 2026

Copy link
Copy Markdown

Reviewed the fused composition half-extension. Looks correct and well-tested.

Correctness (verified): The fused path runs an unnormalized iFFT (bowers_ifft_opt), so after iFFT the buffer holds n·aⱼ·(g²)ʲ (input on the g²-coset). Scaling by comp_weights[j] = g⁻ʲ/n gives aⱼ·gʲ, and the forward FFT(2n) evaluates the polynomial on the g-coset — byte-identical to the old interpolate_offset_fft → evaluate_polynomial_on_lde_domain reference. The g⁻ʲ exponent (vs. the existing coset_weights' ) is right because the halves start on the g²-coset rather than the natural domain. LdeTwiddles::new computes half_size_inv = 1/half_size and offset_inv = 1/g correctly, and the comp_inv/comp_fwd orders (half_size, lde_size) match the buffer sizes in coset_lde_full_into.

Tests: composition_extend_half_fused_matches_reference asserts byte-equality against the reference across sizes, and test_decompose_and_extend_d2_matches_original is updated for the new signature. Good coverage. The debug_assert_eq!(half_evals.len(), comp_weights.len()) is a sensible guard on the internal length invariant.

Wiring: all callers of decompose_and_extend_d2/extend_half_to_lde are updated, the twiddles thread cleanly through prove_rounds_2_to_4 → round_2 → decompose_and_extend_d2, and the GPU fast path is untouched and still consistent.

One Low-severity note left inline about comp_fwd/comp_weights being built unconditionally per domain. No blocking issues.

@github-actions

Copy link
Copy Markdown

AI Review

PR #700 · 2 changed files

Findings

Status Sev Location Finding Found by
confirmed low crypto/stark/src/prover.rs:297 Unconditional comp_inv/comp_fwd twiddles add ~3·lde_size field elements per LdeTwiddles in release builds, even when only used for number_of_parts == 2 minimax
minimax/MiniMax-M3

Status column reflects the verdict from the verifier: deepseek-verifier (openrouter/deepseek/deepseek-v4-pro).

AI-003: Unconditional `comp_inv`/`comp_fwd` twiddles add ~3·lde_size field elements per `LdeTwiddles` in release builds, even when only used for `number_of_parts == 2`
  • Status: confirmed
  • Severity: low
  • Location: crypto/stark/src/prover.rs:297
  • Found by: minimax:minimax/MiniMax-M3
  • Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
  • Rejected by: -

Claim

The new comp_inv: LayerTwiddles<F> and comp_fwd: LayerTwiddles<F> fields in LdeTwiddles are not feature-gated, so they are computed and held in every release build. They are only consumed by extend_half_to_lde (called from decompose_and_extend_d2 when number_of_parts == 2). For AIRs whose composition_poly_degree_bound == trace_length (number_of_parts == 1) or > 2, these twiddles are pure dead memory — totaling ~3·lde_size FieldElement<F> per cached LdeTwiddles. The legacy inv/fwd fields right above are explicitly cfg-gated out of release builds for this exact reason.

Evidence

LdeTwiddles::new always invokes LayerTwiddles::<F>::new_inverse(half_size.trailing_zeros() as u64) and LayerTwiddles::<F>::new(lde_size.trailing_zeros() as u64) (prover.rs:355-358), unconditionally storing them in the struct at lines 297-298. extend_half_to_lde (line 1211) is only reachable from the number_of_parts == 2 branch of round_2_compute_composition_polynomial (line 1272). LayerTwiddles stores n-1 total elements across layers (math/.../bowers_fft.rs:493-516). The doc comment on the struct (line 280-283) explicitly justifies gating inv/fwd out of release to avoid carrying "the extra (forward set is size n·blowup) twiddle memory for nothing", yet the new fields do exactly that.

Suggested fix

Either gate comp_inv/comp_fwd behind a #[cfg(any(test, feature = "test-utils", feature = "debug-checks"))] like the existing inv/fwd (cheapest fix), or only compute them when needed — e.g., lazily on first use of extend_half_to_lde, or precompute only in the same code path that consumes them (R2 with number_of_parts == 2). Also consider whether comp_weights should stay cached at all; it's half_size base-field elements built by a single O(half_size) loop, which is cheap to recompute per call compared to the two LayerTwiddles allocations.

Reviewer Lanes

Lane Model Prompt Status Findings
glm openrouter/z-ai/glm-5.2 general success 0
kimi openrouter/moonshotai/kimi-k2.7-code general success 2
minimax minimax/MiniMax-M3 general success 2
moonmath zro/minimax-m3 general success 0
nemotron openrouter/nvidia/nemotron-3-ultra-550b-a55b general success 3

Verification Lanes

Lane Model Status Confirmed Rejected Uncertain
deepseek-verifier openrouter/deepseek/deepseek-v4-pro success 1 6 0

Native Codex and Claude reviews run separately and post their own comments. They are not included in this structured provenance report.

Discarded candidates (6) — rejected by the verifier
  • Panic on FFT error in new extend_half_to_lde function (crypto/stark/src/prover.rs:1227, found by nemotron:openrouter/nvidia/nemotron-3-ultra-550b-a55b) — The PR replaces two .expect() calls (interpolate_offset_fft + evaluate_polynomial_on_lde_domain) with a single .expect("coset extension") on coset_lde_full. The old code also panicked on FFT errors — the pattern is identical and pre-existing. The PR does not introduce a new panic risk; it reduces the number of panic points from 2 to 1.
  • domain_cache_bytes undercounts LdeTwiddles after adding composition twiddles (prover/src/auto_storage.rs:98, found by kimi:openrouter/moonshotai/kimi-k2.7-code) — domain_cache_bytes in prover/src/auto_storage.rs is a rough heuristic memory estimator in a separate crate, not modified by this PR. The claim that it 'undercounts' is speculative — the function already uses an approximate formula (rows * (3 + 2*blowup) * 8), and any imprecision in the estimator is pre-existing. The PR doesn't change the estimator, and the current code has since switched to OnceLock lazy initialization, making the concern moot.
  • LdeTwiddles::new panics on invalid domain parameters (crypto/stark/src/prover.rs:329, found by nemotron:openrouter/nvidia/nemotron-3-ultra-550b-a55b) — The .expect() calls in LdeTwiddles::new (e.g. 'domain_size is power of two', 'half_size and coset offset are non-zero', etc.) follow the same pattern used throughout the codebase — including the old code in the same function and in other FFT-related initialization. Domain parameters are guaranteed by Domain construction; these expects are precondition checks consistent with existing practice. The PR adds 2 new expects (comp_inv, comp_fwd) but the pattern is unchanged.
  • GPU path in decompose_and_extend_d2 bypasses precomputed twiddles (crypto/stark/src/prover.rs:1191, found by nemotron:openrouter/nvidia/nemotron-3-ultra-550b-a55b) — The GPU fast path (try_extend_two_halves_gpu) is a performance optimization path that presumably handles twiddle computation internally on device — this is by design for GPU offload. The finding is a performance observation (possible duplicated twiddle work), not a correctness bug. The finding itself notes 'GPU path is only triggered for large sizes where compute dominates,' conceding the trade-off is acceptable. Not a regression introduced by this PR.
  • coset_lde_full allocates a fresh Vec<FieldElement<E>> of size 2n per call inside extend_half_to_lde, with no coset_lde_full_into reuse (crypto/stark/src/prover.rs:1220, found by minimax:minimax/MiniMax-M3) — The old code used Polynomial::interpolate_offset_fft + evaluate_polynomial_on_lde_domain, which also allocate fresh Vecs (coefficients + evaluations). The new code uses coset_lde_full which allocates one Vec instead of two. The allocation count decreased, not increased. The finding's concern about lack of a coset_lde_full_into variant is a feature request, not a regression.
  • Stale LdeTwiddles (~32 MB) comment after memory growth (crypto/stark/src/prover.rs:1937, found by kimi:openrouter/moonshotai/kimi-k2.7-code) — The comment at line 1964 reads: 'Without dedup, each creates its own Domain (~24 MB) and LdeTwiddles (~32 MB).' This is an approximate, ballpark comment (note the tilde), and the memory increase from the PR's new fields is context-dependent. The current code uses OnceLock lazy initialization, so the composition twiddles don't inflate cache entries unless actually used. The finding is about a documentation nit too speculative to flag as a real issue.

Raw lane outputs, candidates, final issues, and model metrics are uploaded as workflow artifacts.

@MauroToscano

Copy link
Copy Markdown
Contributor

/bench

@MauroToscano MauroToscano added this pull request to the merge queue Jun 26, 2026
Merged via the queue into main with commit e3dd2d1 Jun 26, 2026
12 checks passed
@MauroToscano MauroToscano deleted the perf/composition-extend-fusion branch June 26, 2026 20:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants