perf(stark): fuse composition half-extension onto coset_lde_full (precomputed twiddles) by diegokingston · Pull Request #700 · yetanotherco/lambda_vm

diegokingston · 2026-06-22T22:22:26Z

What

Follow-up to #699. Now that degree-3 tables take the 2-part decompose_and_extend_d2 path, fuse its inner extend_half_to_lde.

It previously did the g²→g coset extension as two separate FFTs with an intermediate coefficient Polynomial:

let poly = Polynomial::interpolate_offset_fft(half_evals, g²)?;   // iFFT + alloc coeffs
evaluate_polynomial_on_lde_domain(&poly, blowup, n_trace, g)?     // forward FFT

Replaced with a single fused coset_lde_full: iFFT(n) → coset reshift g²→g → forward FFT(2n) in one pass, no intermediate coefficient Vec.

Precomputation

The weights g⁻ʲ/n (folding the 1/n iFFT normalization and the net g²→g shift) and the inverse twiddles (size lde_size/2) are precomputed once per domain in LdeTwiddles (the forward FFT reuses the existing fwd twiddles), and threaded through prove_rounds_2_to_4 → round_2 → decompose_and_extend_d2 → extend_half_to_lde. No per-call recomputation (previously the weights + both twiddle sets were rebuilt on each of the 2 halves × every table).

Byte-identical

Pure refactor of how the half-extension is computed — output unchanged.

test_decompose_and_extend_d2_matches_original: decompose_and_extend_d2's output equals the original interpolate_offset_fft + break_in_parts + evaluate_polynomial_on_lde_domain path.
New composition_extend_half_fused_matches_reference: the fused weights/twiddles reproduce the reference iFFT(g²)→FFT(g) byte-for-byte across sizes.
stark lib 130/130; real VM proof (fib_iterative_1200k) prove+verify OK; clippy + fmt clean.

decompose_and_extend_d2's extend_half_to_lde did iFFT(g²) → coefficient Polynomial → evaluate_polynomial_on_lde_domain(g) as two separate FFTs with an intermediate coefficient allocation per half. Replace with a single fused coset_lde_full: iFFT(n) → coset reshift g²→g → forward FFT(2n=lde_size). The weights (g⁻ʲ/n, folding the 1/n iFFT normalization and the net g²→g shift) and the inverse twiddles (size lde_size/2) are precomputed once per domain in LdeTwiddles (the forward FFT reuses the existing fwd twiddles), and threaded through prove_rounds_2_to_4 → round_2 → decompose_and_extend_d2 — no per-call recomputation. This path is now production (degree-3 tables use the 2-part decompose_and_extend_d2 after #699). Byte-identical: test_decompose_and_extend_d2_matches_original (decompose output == original break_in_parts path), a new formula test, stark 130/130, real VM proof (fib_iterative_1200k) prove+verify OK, clippy + fmt clean.

diegokingston · 2026-06-22T22:22:34Z

/bench 5

github-actions · 2026-06-22T22:24:37Z

Benchmark — ethrex 20 transfers (median of 3)

_{Table parallelism: auto (cores / 3)}

Metric	main	PR	Δ
Peak heap	81596 MB	81997 MB	+401 MB (+0.5%) ⚪
Prove time	44.206s	43.329s	-0.877s (-2.0%) ⚪

✅ No significant change.

✅ Low variance (time: 1.0%, heap: 1.1%)

_{Commit: 7e9b048 · Baseline: cached · Runner: self-hosted bench}

…warnings)

diegokingston · 2026-06-25T14:56:30Z

/bench

MauroToscano · 2026-06-26T20:01:27Z

/bench

MauroToscano · 2026-06-26T20:06:57Z

/bench-growth

MauroToscano · 2026-06-26T20:09:13Z

/review-ai

github-actions · 2026-06-26T20:11:14Z

Codex Code Review

Findings

Medium - Unconditional composition twiddle allocation can materially increase prover memory.
crypto/stark/src/prover.rs:326 and crypto/stark/src/prover.rs:355 now build comp_weights, comp_inv, and comp_fwd for every LdeTwiddles::new, and multi_prove constructs that cache for every domain during pre-pass at crypto/stark/src/prover.rs:1958. These are only used when number_of_parts == 2 at crypto/stark/src/prover.rs:1272. For number_of_parts == 1 AIRs, this adds large unused full/half-domain twiddle vectors and weights, which can be significant on large traces and undermines the prior release-build effort to avoid carrying full LayerTwiddles. Consider splitting/lazily constructing the composition cache only for domains that actually need the d2 decomposition path.

No build or test commands were run, per the static-review constraints.

claude · 2026-06-26T20:12:30Z

Reviewed the fused composition half-extension. Looks correct and well-tested.

Correctness (verified): The fused path runs an unnormalized iFFT (bowers_ifft_opt), so after iFFT the buffer holds n·aⱼ·(g²)ʲ (input on the g²-coset). Scaling by comp_weights[j] = g⁻ʲ/n gives aⱼ·gʲ, and the forward FFT(2n) evaluates the polynomial on the g-coset — byte-identical to the old interpolate_offset_fft → evaluate_polynomial_on_lde_domain reference. The g⁻ʲ exponent (vs. the existing coset_weights' gʲ) is right because the halves start on the g²-coset rather than the natural domain. LdeTwiddles::new computes half_size_inv = 1/half_size and offset_inv = 1/g correctly, and the comp_inv/comp_fwd orders (half_size, lde_size) match the buffer sizes in coset_lde_full_into.

Tests: composition_extend_half_fused_matches_reference asserts byte-equality against the reference across sizes, and test_decompose_and_extend_d2_matches_original is updated for the new signature. Good coverage. The debug_assert_eq!(half_evals.len(), comp_weights.len()) is a sensible guard on the internal length invariant.

Wiring: all callers of decompose_and_extend_d2/extend_half_to_lde are updated, the twiddles thread cleanly through prove_rounds_2_to_4 → round_2 → decompose_and_extend_d2, and the GPU fast path is untouched and still consistent.

One Low-severity note left inline about comp_fwd/comp_weights being built unconditionally per domain. No blocking issues.

github-actions · 2026-06-26T20:26:47Z

AI Review

PR #700 · 2 changed files

Findings

Status	Sev	Location	Finding	Found by
confirmed	low	crypto/stark/src/prover.rs:297	Unconditional `comp_inv`/`comp_fwd` twiddles add ~3·lde_size field elements per `LdeTwiddles` in release builds, even when only used for `number_of_parts == 2`	minimax minimax/MiniMax-M3

Status column reflects the verdict from the verifier: deepseek-verifier (openrouter/deepseek/deepseek-v4-pro).

AI-003: Unconditional `comp_inv`/`comp_fwd` twiddles add ~3·lde_size field elements per `LdeTwiddles` in release builds, even when only used for `number_of_parts == 2`

Status: confirmed
Severity: low
Location: crypto/stark/src/prover.rs:297
Found by: minimax:minimax/MiniMax-M3
Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
Rejected by: -

Claim

The new comp_inv: LayerTwiddles<F> and comp_fwd: LayerTwiddles<F> fields in LdeTwiddles are not feature-gated, so they are computed and held in every release build. They are only consumed by extend_half_to_lde (called from decompose_and_extend_d2 when number_of_parts == 2). For AIRs whose composition_poly_degree_bound == trace_length (number_of_parts == 1) or > 2, these twiddles are pure dead memory — totaling ~3·lde_size FieldElement<F> per cached LdeTwiddles. The legacy inv/fwd fields right above are explicitly cfg-gated out of release builds for this exact reason.

Evidence

LdeTwiddles::new always invokes LayerTwiddles::<F>::new_inverse(half_size.trailing_zeros() as u64) and LayerTwiddles::<F>::new(lde_size.trailing_zeros() as u64) (prover.rs:355-358), unconditionally storing them in the struct at lines 297-298. extend_half_to_lde (line 1211) is only reachable from the number_of_parts == 2 branch of round_2_compute_composition_polynomial (line 1272). LayerTwiddles stores n-1 total elements across layers (math/.../bowers_fft.rs:493-516). The doc comment on the struct (line 280-283) explicitly justifies gating inv/fwd out of release to avoid carrying "the extra (forward set is size n·blowup) twiddle memory for nothing", yet the new fields do exactly that.

Suggested fix

Either gate comp_inv/comp_fwd behind a #[cfg(any(test, feature = "test-utils", feature = "debug-checks"))] like the existing inv/fwd (cheapest fix), or only compute them when needed — e.g., lazily on first use of extend_half_to_lde, or precompute only in the same code path that consumes them (R2 with number_of_parts == 2). Also consider whether comp_weights should stay cached at all; it's half_size base-field elements built by a single O(half_size) loop, which is cheap to recompute per call compared to the two LayerTwiddles allocations.

Reviewer Lanes

Lane	Model	Prompt	Status	Findings
glm	openrouter/z-ai/glm-5.2	general	success	0
kimi	openrouter/moonshotai/kimi-k2.7-code	general	success	2
minimax	minimax/MiniMax-M3	general	success	2
moonmath	zro/minimax-m3	general	success	0
nemotron	openrouter/nvidia/nemotron-3-ultra-550b-a55b	general	success	3

Verification Lanes

Lane	Model	Status	Confirmed	Rejected	Uncertain
deepseek-verifier	openrouter/deepseek/deepseek-v4-pro	success	1	6	0

Native Codex and Claude reviews run separately and post their own comments. They are not included in this structured provenance report.

Discarded candidates (6) — rejected by the verifier

Panic on FFT error in new extend_half_to_lde function (crypto/stark/src/prover.rs:1227, found by nemotron:openrouter/nvidia/nemotron-3-ultra-550b-a55b) — The PR replaces two .expect() calls (interpolate_offset_fft + evaluate_polynomial_on_lde_domain) with a single .expect("coset extension") on coset_lde_full. The old code also panicked on FFT errors — the pattern is identical and pre-existing. The PR does not introduce a new panic risk; it reduces the number of panic points from 2 to 1.
domain_cache_bytes undercounts LdeTwiddles after adding composition twiddles (prover/src/auto_storage.rs:98, found by kimi:openrouter/moonshotai/kimi-k2.7-code) — domain_cache_bytes in prover/src/auto_storage.rs is a rough heuristic memory estimator in a separate crate, not modified by this PR. The claim that it 'undercounts' is speculative — the function already uses an approximate formula (rows * (3 + 2*blowup) * 8), and any imprecision in the estimator is pre-existing. The PR doesn't change the estimator, and the current code has since switched to OnceLock lazy initialization, making the concern moot.
LdeTwiddles::new panics on invalid domain parameters (crypto/stark/src/prover.rs:329, found by nemotron:openrouter/nvidia/nemotron-3-ultra-550b-a55b) — The .expect() calls in LdeTwiddles::new (e.g. 'domain_size is power of two', 'half_size and coset offset are non-zero', etc.) follow the same pattern used throughout the codebase — including the old code in the same function and in other FFT-related initialization. Domain parameters are guaranteed by Domain construction; these expects are precondition checks consistent with existing practice. The PR adds 2 new expects (comp_inv, comp_fwd) but the pattern is unchanged.
GPU path in decompose_and_extend_d2 bypasses precomputed twiddles (crypto/stark/src/prover.rs:1191, found by nemotron:openrouter/nvidia/nemotron-3-ultra-550b-a55b) — The GPU fast path (try_extend_two_halves_gpu) is a performance optimization path that presumably handles twiddle computation internally on device — this is by design for GPU offload. The finding is a performance observation (possible duplicated twiddle work), not a correctness bug. The finding itself notes 'GPU path is only triggered for large sizes where compute dominates,' conceding the trade-off is acceptable. Not a regression introduced by this PR.
coset_lde_full allocates a fresh Vec<FieldElement<E>> of size 2n per call inside extend_half_to_lde, with no coset_lde_full_into reuse (crypto/stark/src/prover.rs:1220, found by minimax:minimax/MiniMax-M3) — The old code used Polynomial::interpolate_offset_fft + evaluate_polynomial_on_lde_domain, which also allocate fresh Vecs (coefficients + evaluations). The new code uses coset_lde_full which allocates one Vec instead of two. The allocation count decreased, not increased. The finding's concern about lack of a coset_lde_full_into variant is a feature request, not a regression.
Stale LdeTwiddles (~32 MB) comment after memory growth (crypto/stark/src/prover.rs:1937, found by kimi:openrouter/moonshotai/kimi-k2.7-code) — The comment at line 1964 reads: 'Without dedup, each creates its own Domain (~24 MB) and LdeTwiddles (~32 MB).' This is an approximate, ballpark comment (note the tilde), and the memory increase from the PR's new fields is context-dependent. The current code uses OnceLock lazy initialization, so the composition twiddles don't inflate cache entries unless actually used. The finding is about a documentation nit too speculative to flag as a real issue.

Raw lane outputs, candidates, final issues, and model metrics are uploaded as workflow artifacts.

MauroToscano · 2026-06-26T20:32:42Z

/bench

diegokingston changed the title ~~perf(stark): fuse composition half-extension onto coset_lde_full~~ perf(stark): fuse composition half-extension onto coset_lde_full (precomputed twiddles) Jun 22, 2026

fix(stark): drop clone_on_copy in composition extend test (clippy -D …

536466f

…warnings)

diegokingston marked this pull request as ready for review June 23, 2026 19:29

diegokingston added 2 commits June 24, 2026 12:18

Merge branch 'main' into perf/composition-extend-fusion

f6fe5d9

Merge branch 'main' into perf/composition-extend-fusion

03713ab

diegokingston commented Jun 25, 2026

View reviewed changes

Comment thread crypto/stark/src/prover.rs Outdated

MauroToscano and others added 3 commits June 26, 2026 16:37

Merge branch 'main' into perf/composition-extend-fusion

2d715cb

fix(stark): keep composition LDE twiddles in release builds

3951ae2

Merge branch 'main' into perf/composition-extend-fusion

c8693d6

claude Bot reviewed Jun 26, 2026

View reviewed changes

Comment thread crypto/stark/src/prover.rs Outdated

fix(stark): lazy composition LDE twiddle cache

7e9b048

MauroToscano approved these changes Jun 26, 2026

View reviewed changes

MauroToscano added this pull request to the merge queue Jun 26, 2026

Merged via the queue into main with commit e3dd2d1 Jun 26, 2026
12 checks passed

MauroToscano deleted the perf/composition-extend-fusion branch June 26, 2026 20:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf(stark): fuse composition half-extension onto coset_lde_full (precomputed twiddles)#700

perf(stark): fuse composition half-extension onto coset_lde_full (precomputed twiddles)#700
MauroToscano merged 8 commits into
mainfrom
perf/composition-extend-fusion

diegokingston commented Jun 22, 2026 •

edited

Loading

Uh oh!

diegokingston commented Jun 22, 2026

Uh oh!

github-actions Bot commented Jun 22, 2026 •

edited

Loading

Uh oh!

diegokingston commented Jun 25, 2026

Uh oh!

Uh oh!

MauroToscano commented Jun 26, 2026

Uh oh!

MauroToscano commented Jun 26, 2026

Uh oh!

MauroToscano commented Jun 26, 2026

Uh oh!

github-actions Bot commented Jun 26, 2026

Uh oh!

Uh oh!

claude Bot commented Jun 26, 2026

Uh oh!

github-actions Bot commented Jun 26, 2026

Uh oh!

MauroToscano commented Jun 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

diegokingston commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Precomputation

Byte-identical

Uh oh!

diegokingston commented Jun 22, 2026

Uh oh!

github-actions Bot commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmark — ethrex 20 transfers (median of 3)

Uh oh!

diegokingston commented Jun 25, 2026

Uh oh!

Uh oh!

MauroToscano commented Jun 26, 2026

Uh oh!

MauroToscano commented Jun 26, 2026

Uh oh!

MauroToscano commented Jun 26, 2026

Uh oh!

github-actions Bot commented Jun 26, 2026

Codex Code Review

Uh oh!

Uh oh!

claude Bot commented Jun 26, 2026

Uh oh!

github-actions Bot commented Jun 26, 2026

AI Review

Findings

Reviewer Lanes

Verification Lanes

Uh oh!

MauroToscano commented Jun 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

diegokingston commented Jun 22, 2026 •

edited

Loading

github-actions Bot commented Jun 22, 2026 •

edited

Loading