Perf/cpu lde rework#650
Conversation
|
/bench |
Benchmark — ethrex 20 transfers (median of 3)Table parallelism: auto (cores / 3)
Commit: 602a36b · Baseline: built from main · Runner: self-hosted bench |
|
/bench-growth |
1 similar comment
|
/bench-growth |
|
/bench |
|
/bench |
|
/bench |
Codex Code ReviewFindings
Verification I attempted |
Code ReviewOverviewThis PR reworks Round 1 of the STARK prover for better cache behaviour and parallelism:
The algorithmic changes look mathematically sound and are well-tested with differential tests against the reference per-column path (both base and ext3). The proof is byte-identical for the lookup optimisation and for the main LDE rework. IssuesMedium (2 items) — see inline comments
Low (1 item) — see inline comment
What looks good
|
MauroToscano
left a comment
There was a problem hiding this comment.
Requesting changes pending the GPU-side review.
This reworks the CPU trace-LDE pipeline (row-major LDE + cache-blocked two-half FFT + per-domain precomputed coset twiddles), but the diff doesn't touch crypto/math-cuda — there are no GPU/CUDA changes here. The "byte-identical committed polynomial" claim has only been established for the CPU path; the same rework is not obviously equivalent on the GPU, so we need to confirm the GPU LDE/FFT produces the same commitment under the new layout/ordering (or port the rework to the GPU side) before this lands — otherwise CPU- and GPU-built proofs could diverge.
Holding merge until the GPU equivalence is sorted out (Juampi is looking at that side).
|
/bench |
|
/review-ai |
|
/bench |
Codex Code ReviewNo issues found in the PR diff. I reviewed the changed FFT/LDE paths, row-major trace storage/commitment changes, CUDA handle call sites, preprocessed split commitments, and trace-builder parallelization. I did not run builds or tests per the review constraints. |
|
Review: Row-major trace LDE + two-half FFT (PR #650) Thorough static review of the full diff. No correctness, safety, or soundness bugs found. The PR is carefully engineered and well-tested. Summary of what I checked plus a couple of low-severity notes. Safety / soundness — looks solid
Concurrency / determinism
Minor (Low)
Dead code — handled well Nice work — the differential tests against the single-column Bowers oracle give good confidence in the FFT rewrite. |
AI ReviewPR #650 · 19 changed files Findings
Status column reflects the verdict from the verifier: deepseek-verifier (openrouter/deepseek/deepseek-v4-pro). AI-002: Misleading comment about memory savings in bit_reversing.rs
Claim The comment "saves ~16 MB at log21 n=64" appears to be off by a factor of 2 on 64-bit platforms and has a confusing/typo'd wording ("log21 n=64"). Evidence At log_n=21, n=2^21=2097152. An upfront Vec<(usize, usize)> collection would use 16 bytes per pair on 64-bit (or 8 bytes on 32-bit). On 64-bit: 2^21 * 16 = 33554432 bytes ≈ 32 MB, not 16 MB. On 32-bit: 2^21 * 8 = 16 MB. The wording "log21 n=64" is unclear — likely meant "log_n=21, n=2^21=2M" or just "log_n=21". Suggested fix Update the comment to: "// No upfront Vec<(usize, usize)> collection (saves ~32 MB at log_n=21 on 64-bit)." AI-006: LdeTwiddles builds two_half_inv/two_half_fwd unconditionally even when only legacy fields are used
Claim TwoHalfTwiddles::new is always called in LdeTwiddles::new, including on builds that do not use it (e.g., debug-checks-only with precomputed test-utils) Evidence The struct Suggested fix If the intent is to keep release builds free of the legacy per-column path, this is fine. Otherwise, gate the two-half allocations under a feature flag or only build them when AI-007: commit_rows_bit_reversed_subset uses debug_assert for col_end bounds and divisibility
Claim The debug_assert checks Evidence Lines 645-646 use debug_assert! for col_end <= num_cols and data.len() % num_cols == 0; line 668 slices data[row_start+col_start..row_start+col_end] which would panic on out-of-bounds in release if invariants were violated. Suggested fix Acceptable as-is given all in-tree callers pass valid widths. Could upgrade to a checked assert if this ever becomes a public/external API. Reviewer Lanes
Verification Lanes
Native Codex and Claude reviews run separately and post their own comments. They are not included in this structured provenance report. Discarded candidates (7) — rejected by the verifier
Raw lane outputs, candidates, final issues, and model metrics are uploaded as workflow artifacts. |
|
/bench |
This PR reworks the trace LDE pipeline (Round 1) to be row-major and replaces the flat radix-2 FFT with a cache-blocked, two-half batched FFT. It also parallelizes trace generation and drops a redundant multiply in the LogUp fingerprint.
Row-major trace LDE + two-half FFT (Round 1)
Round 1 (trace LDE + Merkle commit) is ~48% of the prover wall and paid two avoidable costs: a column → row transpose of the full LDE-size buffer, and a flat FFT whose early-layer stride thrashes cache at large
n(memory-bandwidth bound, not compute bound).The forward/inverse transforms now run as a two-half decomposition —
√n-sized sub-FFTs that stay cache-resident, a twiddle pass, then the second half — and the coset-LDE twiddles are precomputed per domain. Since the traceTableis already row-major, Round 1 reads it with onememcpy(no transpose) and commits contiguous Merkle leaves. The committed polynomial is unchanged: byte-identical proof, same roots, same FRI, same verifier.crypto/math/src/fft/{bowers_fft_batch,two_half_fft}.rs,crypto/math/src/polynomial.rscrypto/stark/src/{trace,prover}.rsParallel trace generation
The ~19 independent table builds in
build_tracesran sequentially; they now run in onerayonscope, with chunked tables produced in parallel (trace build ~15.6% → ~10.9% of the wall). Disk-spill keeps the sequential generate→spill order so peak memory stays bounded. Cleanups in the same area: the dedup maps (LT/MUL/DVRM/BRANCH) move SipHash → FxHash, the DECODE pc→row map reuses the executor's U64 hasher, and register routing avoids a per-opVec.Table contents are unchanged; the dedup tables' row order can differ (hasher iteration order, already non-deterministic under SipHash's randomized seed) — any order is valid for a lookup table, so proofs stay sound but are not byte-identical to a given
mainrun.prover/src/tables/{trace_builder,types,lt,mul,dvrm,branch,decode}.rsSkip the identity multiply in LogUp fingerprints
alpha_powers[0]is always1, sobus_id · alpha_powers[0]was a multiply by one in both the witness build and the constraint evaluation. It is now the embeddedbus_iddirectly — byte-identical.crypto/stark/src/lookup.rs