Skip to content

Perf/cpu lde rework#650

Merged
MauroToscano merged 25 commits into
mainfrom
perf/cpu-lde-rework
Jun 26, 2026
Merged

Perf/cpu lde rework#650
MauroToscano merged 25 commits into
mainfrom
perf/cpu-lde-rework

Conversation

@jotabulacios

@jotabulacios jotabulacios commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator

This PR reworks the trace LDE pipeline (Round 1) to be row-major and replaces the flat radix-2 FFT with a cache-blocked, two-half batched FFT. It also parallelizes trace generation and drops a redundant multiply in the LogUp fingerprint.

Row-major trace LDE + two-half FFT (Round 1)

Round 1 (trace LDE + Merkle commit) is ~48% of the prover wall and paid two avoidable costs: a column → row transpose of the full LDE-size buffer, and a flat FFT whose early-layer stride thrashes cache at large n (memory-bandwidth bound, not compute bound).

The forward/inverse transforms now run as a two-half decomposition — √n-sized sub-FFTs that stay cache-resident, a twiddle pass, then the second half — and the coset-LDE twiddles are precomputed per domain. Since the trace Table is already row-major, Round 1 reads it with one memcpy (no transpose) and commits contiguous Merkle leaves. The committed polynomial is unchanged: byte-identical proof, same roots, same FRI, same verifier.

  • crypto/math/src/fft/{bowers_fft_batch,two_half_fft}.rs, crypto/math/src/polynomial.rs
  • crypto/stark/src/{trace,prover}.rs

Parallel trace generation

The ~19 independent table builds in build_traces ran sequentially; they now run in one rayon scope, with chunked tables produced in parallel (trace build ~15.6% → ~10.9% of the wall). Disk-spill keeps the sequential generate→spill order so peak memory stays bounded. Cleanups in the same area: the dedup maps (LT/MUL/DVRM/BRANCH) move SipHash → FxHash, the DECODE pc→row map reuses the executor's U64 hasher, and register routing avoids a per-op Vec.

Table contents are unchanged; the dedup tables' row order can differ (hasher iteration order, already non-deterministic under SipHash's randomized seed) — any order is valid for a lookup table, so proofs stay sound but are not byte-identical to a given main run.

  • prover/src/tables/{trace_builder,types,lt,mul,dvrm,branch,decode}.rs

Skip the identity multiply in LogUp fingerprints

alpha_powers[0] is always 1, so bus_id · alpha_powers[0] was a multiply by one in both the witness build and the constraint evaluation. It is now the embedded bus_id directly — byte-identical.

  • crypto/stark/src/lookup.rs

@jotabulacios

Copy link
Copy Markdown
Collaborator Author

/bench

@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown

Benchmark — ethrex 20 transfers (median of 3)

Table parallelism: auto (cores / 3)

Metric main PR Δ
Peak heap 80448 MB 81801 MB +1353 MB (+1.7%) ⚪
Prove time 50.187s 44.993s -5.194s (-10.3%) 🟢

🎉 Improvement detected — heap or time decreased by more than 5%.

✅ Low variance (time: 0.9%, heap: 1.3%)

Commit: 602a36b · Baseline: built from main · Runner: self-hosted bench

@MauroToscano

Copy link
Copy Markdown
Contributor

/bench-growth

1 similar comment
@jotabulacios

Copy link
Copy Markdown
Collaborator Author

/bench-growth

@jotabulacios

Copy link
Copy Markdown
Collaborator Author

/bench

@jotabulacios

Copy link
Copy Markdown
Collaborator Author

/bench

@jotabulacios

Copy link
Copy Markdown
Collaborator Author

/bench

@jotabulacios jotabulacios marked this pull request as ready for review June 18, 2026 14:14
@github-actions

Copy link
Copy Markdown

Codex Code Review

Findings

  • High Security: crypto/math/src/fft/bit_reversing.rs exposes a safe public function with an unsafe parallel implementation, but the required layout invariants are only debug_asserts at lines 40 and 48. With parallel enabled, invalid inputs can reach the raw-pointer swap block at lines 56-69. Safe Rust APIs must not permit UB for bad caller input. Make the divisibility and power-of-two checks runtime assert!s before the unsafe path, or return a Result; alternatively make the function unsafe and document the contract.

  • Medium Security / DoS: prover/src/tables/types.rs replaces SipHash with a deterministic non-keyed FxHasher, then uses it for unbounded op-dedup maps such as branch.rs, lt.rs, mul.rs, and dvrm.rs. If prover inputs are user-controlled, an attacker can choose VM values that collide and degrade trace generation toward quadratic behavior. Keep std::collections::HashMap for adversarial inputs, or use a keyed fast hasher.

Verification

I attempted cargo check --all-targets, but rustup could not create temp files under /home/runner/.rustup/tmp because that path is read-only in this sandbox.

Comment thread crypto/math/src/fft/bit_reversing.rs
Comment thread crypto/stark/src/prover.rs
Comment thread crypto/stark/src/prover.rs
@claude

claude Bot commented Jun 18, 2026

Copy link
Copy Markdown

Code Review

Overview

This PR reworks Round 1 of the STARK prover for better cache behaviour and parallelism:

  • Replaces the flat radix-2 Bowers FFT with a cache-blocked two-half batched FFT
  • Makes the LDE pipeline end-to-end row-major (trace → IFFT → coset scale → FFT → Merkle), eliminating the col→row transpose
  • Parallelises the ~19 independent table builds in build_traces via a rayon::scope
  • Swaps SipHash dedup maps for FxHashMap and drops a redundant multiply-by-one in LogUp fingerprints

The algorithmic changes look mathematically sound and are well-tested with differential tests against the reference per-column path (both base and ext3). The proof is byte-identical for the lookup optimisation and for the main LDE rework.


Issues

Medium (2 items) — see inline comments

  1. Incomplete unsafe justification in bit_reversing.rs — The SAFETY comment states the row-swap pairs are "pairwise disjoint" but does not prove it. The key lemma is that bit-reverse is an involution (br(br(i)) = i), so every non-trivial orbit is a 2-cycle; selecting the representative where br(i) > i then gives a partition. Without this, reviewers must independently reconstruct the argument. The comment for Ordering::Relaxed also needs a line explaining that Rayon's thread-spawn provides the happens-before edge that makes the relaxed load correct. (The code itself is safe; the documentation is the gap.)

  2. Possible compile error without debug-checksexpand_columns_to_lde is gated #[cfg(feature = "debug-checks")] but recompute_round1_from_trace appears to call it unconditionally. If that caller is not also cfg-gated (or the call site wrapped in #[cfg]), builds without debug-checks will fail. Please verify that CI exercises a build without this feature.

Low (1 item) — see inline comment

  1. CUDA path in columns_to_row_major panics on non-uniform column lengths — A debug_assert_eq! inside the inner loop would catch any future regression cheaply.

What looks good

  • The two-half FFT implementation and its bit-reversal bookkeeping are clean; the three-step structure (bit-rev → first-half layers → bit-rev → second-half layers → bit-rev) maps directly to the Plonky3 reference.
  • The LDE function validates all preconditions with proper Err returns before touching the buffer.
  • Differential tests cover base field, extension field, multiple column counts, and blowup factors — good coverage of the new code paths.
  • The rayon::scope + per-slot Option pattern in build_traces is verbose but correct; the PHASE5_RAN expect-string gives a clear panic message if the invariant breaks.
  • FxHashMap substitution for the dedup tables and the removal of the identity multiply in LogUp are both straightforward, low-risk cleanups.

@MauroToscano MauroToscano left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes pending the GPU-side review.

This reworks the CPU trace-LDE pipeline (row-major LDE + cache-blocked two-half FFT + per-domain precomputed coset twiddles), but the diff doesn't touch crypto/math-cuda — there are no GPU/CUDA changes here. The "byte-identical committed polynomial" claim has only been established for the CPU path; the same rework is not obviously equivalent on the GPU, so we need to confirm the GPU LDE/FFT produces the same commitment under the new layout/ordering (or port the rework to the GPU side) before this lands — otherwise CPU- and GPU-built proofs could diverge.

Holding merge until the GPU equivalence is sorted out (Juampi is looking at that side).

@jotabulacios

Copy link
Copy Markdown
Collaborator Author

/bench

@MauroToscano

Copy link
Copy Markdown
Contributor

/review-ai

@MauroToscano

Copy link
Copy Markdown
Contributor

/bench

@github-actions

Copy link
Copy Markdown

Codex Code Review

No issues found in the PR diff.

I reviewed the changed FFT/LDE paths, row-major trace storage/commitment changes, CUDA handle call sites, preprocessed split commitments, and trace-builder parallelization. I did not run builds or tests per the review constraints.

@claude

claude Bot commented Jun 25, 2026

Copy link
Copy Markdown

Review: Row-major trace LDE + two-half FFT (PR #650)

Thorough static review of the full diff. No correctness, safety, or soundness bugs found. The PR is carefully engineered and well-tested. Summary of what I checked plus a couple of low-severity notes.

Safety / soundness — looks solid

  • in_place_bit_reverse_permute_row_major unsafe parallel path (bit_reversing.rs): the AtomicPtr + raw-pointer swap is sound. The disjointness argument is correct — for power-of-two n, bit-reverse is an involution, so the j > i filter selects one representative per 2-cycle, giving pairwise-disjoint num_cols-wide row ranges. The power-of-two precondition is enforced with a real assert! (not debug_assert!), so a bad caller panics instead of racing. Good call making that one non-debug.
  • Table::row_major_data unsafe mmap slice (table.rs): mirrors the existing get_row contract (repr-transparent + spill writes row-major), lifetime tied to &self. Consistent with existing usage.
  • fft_batch_two_half: validates num_cols, power-of-two n, and log_n == tw.log_n before any unsafe/indexing. Second-half twiddle indexing (first_block + b) stays within bitrev_tw (len n/2). Differential tests cover fwd+inv across log_n 2..=10, m in {1,3,7}, plus the ext3 path — strong coverage.
  • Byte-identical claim: commit_rows_bit_reversed produces the same leaf bytes (same row values, same column order, same reverse_index) as the column-major path; the LDE differential tests confirm per-cell equivalence. The GPU columns_to_row_major interleave preserves the exact value get_main/get_aux returned before, so downstream R2-R4 indexing is unchanged.

Concurrency / determinism

  • Parallel chunk_and_generate + the rayon::scope fan-out in build_traces: each closure reads disjoint op lists and writes its own slot; disk-spill correctly stays sequential to keep peak memory bounded. No data races; peak heap is unchanged (all tables were live at the end already).
  • FxHashMap swap and the acknowledged non-deterministic dedup-row order are sound for multiset lookup tables. Fine.

Minor (Low)

  • commit_rows_bit_reversed_subset (prover.rs): col_end <= num_cols is only a debug_assert!. In a release build, a caller passing col_end > num_cols would silently read into adjacent-row data (wrong commitment) rather than fail loudly. Callers are all internal with controlled args, so risk is low — but since this is a commitment primitive, a hard bounds check (or returning None) would be cheap insurance.
  • count_table_lengths scratch reuse + collect_register_ops_from_cpu now appending into a caller-owned Vec is semantically equivalent to the old extend; verified both call sites (accumulator path doesn't clear, scratch path clears first). Correct.

Dead code — handled well
Legacy LayerTwiddles fields, expand_columns_to_lde, and compute_lde_from_columns_cached are properly cfg-gated to test/test-utils/debug-checks, so release builds don't carry them. No leftover unreachable paths.

Nice work — the differential tests against the single-column Bowers oracle give good confidence in the FFT rewrite.

@github-actions

Copy link
Copy Markdown

AI Review

PR #650 · 19 changed files

Findings

Status Sev Location Finding Found by
confirmed low crypto/math/src/fft/bit_reversing.rs:66 Misleading comment about memory savings in bit_reversing.rs moonmath
zro/minimax-m3
confirmed low crypto/stark/src/prover.rs:324 LdeTwiddles builds two_half_inv/two_half_fwd unconditionally even when only legacy fields are used minimax
minimax/MiniMax-M3
confirmed low crypto/stark/src/prover.rs:646 commit_rows_bit_reversed_subset uses debug_assert for col_end bounds and divisibility glm
openrouter/z-ai/glm-5.2
minimax
minimax/MiniMax-M3

Status column reflects the verdict from the verifier: deepseek-verifier (openrouter/deepseek/deepseek-v4-pro).

AI-002: Misleading comment about memory savings in bit_reversing.rs
  • Status: confirmed
  • Severity: low
  • Location: crypto/math/src/fft/bit_reversing.rs:66
  • Found by: moonmath:zro/minimax-m3
  • Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
  • Rejected by: -

Claim

The comment "saves ~16 MB at log21 n=64" appears to be off by a factor of 2 on 64-bit platforms and has a confusing/typo'd wording ("log21 n=64").

Evidence

At log_n=21, n=2^21=2097152. An upfront Vec<(usize, usize)> collection would use 16 bytes per pair on 64-bit (or 8 bytes on 32-bit). On 64-bit: 2^21 * 16 = 33554432 bytes ≈ 32 MB, not 16 MB. On 32-bit: 2^21 * 8 = 16 MB. The wording "log21 n=64" is unclear — likely meant "log_n=21, n=2^21=2M" or just "log_n=21".

Suggested fix

Update the comment to: "// No upfront Vec<(usize, usize)> collection (saves ~32 MB at log_n=21 on 64-bit)."

AI-006: LdeTwiddles builds two_half_inv/two_half_fwd unconditionally even when only legacy fields are used
  • Status: confirmed
  • Severity: low
  • Location: crypto/stark/src/prover.rs:324
  • Found by: minimax:minimax/MiniMax-M3
  • Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
  • Rejected by: -

Claim

TwoHalfTwiddles::new is always called in LdeTwiddles::new, including on builds that do not use it (e.g., debug-checks-only with precomputed test-utils)

Evidence

The struct LdeTwiddles keeps two_half_inv/two_half_fwd as unconditional fields, and LdeTwiddles::new builds them without any cfg gate (lines 324-327). In a debug-checks build that never calls coset_lde_full_expand_row_major, these twiddles are still allocated (the forward set for size n*blowup_factor is the largest single allocation in the struct).

Suggested fix

If the intent is to keep release builds free of the legacy per-column path, this is fine. Otherwise, gate the two-half allocations under a feature flag or only build them when coset_lde_full_expand_row_major will be called.

AI-007: commit_rows_bit_reversed_subset uses debug_assert for col_end bounds and divisibility
  • Status: confirmed
  • Severity: low
  • Location: crypto/stark/src/prover.rs:646
  • Found by: glm:openrouter/z-ai/glm-5.2, minimax:minimax/MiniMax-M3
  • Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
  • Rejected by: -

Claim

The debug_assert checks col_end &lt;= num_cols and data.len() % num_cols == 0 only in debug builds. In release builds, a malformed call with col_end > num_cols or a non-divisible data length would index out of bounds (row_start + col_end could exceed data.len()) causing a panic in slice indexing rather than a clean error. The production callers always pass valid inputs (total_cols from trace width), so this is not currently reachable with bad data.

Evidence

Lines 645-646 use debug_assert! for col_end <= num_cols and data.len() % num_cols == 0; line 668 slices data[row_start+col_start..row_start+col_end] which would panic on out-of-bounds in release if invariants were violated.

Suggested fix

Acceptable as-is given all in-tree callers pass valid widths. Could upgrade to a checked assert if this ever becomes a public/external API.

Reviewer Lanes

Lane Model Prompt Status Findings
glm openrouter/z-ai/glm-5.2 general success 3
kimi openrouter/moonshotai/kimi-k2.7-code general error: agentic lane timed out after 1800s 0
minimax minimax/MiniMax-M3 general success 6
moonmath zro/minimax-m3 general success 3
nemotron openrouter/nvidia/nemotron-3-ultra-550b-a55b general error: opencode failed (provider/auth/runtime error) and no findings were submitted 0

Verification Lanes

Lane Model Status Confirmed Rejected Uncertain
deepseek-verifier openrouter/deepseek/deepseek-v4-pro success 3 7 0

Native Codex and Claude reviews run separately and post their own comments. They are not included in this structured provenance report.

Discarded candidates (7) — rejected by the verifier
  • Removed test-flake targets in Makefile: test-no-compile and test-rust-no-compile (Makefile:3, found by minimax:minimax/MiniMax-M3) — The Makefile genuinely removes test-no-compile/test-asm-no-compile/test-rust-no-compile targets and simplifies test-executor to compile-programs then cargo test. The compile-programs targets were reworked with pattern rules that make incremental recompilation efficient, so the no-compile variants are no longer useful. This is an intentional design change, not a bug. The claim that it's 'not obvious from the diff alone' is commentary, not an issue.
  • Potentially confusing SAFETY comment about AtomicPtr load ordering (crypto/math/src/fft/bit_reversing.rs:68, found by moonmath:zro/minimax-m3, minimax:minimax/MiniMax-M3) — The Relaxed ordering on the AtomicPtr load is sound. The pointer is written via AtomicPtr::new (a non-atomic store) before into_par_iter() starts. All Rayon thread spawns happen after that write, and the Rayon scope guarantees the initial write is visible. This is a well-defined Rust pattern — Relaxed is sufficient when there is an external happens-before edge (thread spawn). Using Acquire would add unnecessary overhead. The claim about 'implementation-specific behavior' is incorrect; this is standard Rust memory model semantics.
  • Parallel bit-reverse raw-pointer path holds &mut slice aliasing across threads via AtomicPtr (crypto/math/src/fft/bit_reversing.rs:69, found by glm:openrouter/z-ai/glm-5.2) — This finding duplicates AI-003's concern about AtomicPtr + Relaxed ordering in the same parallel bit-reverse code path. The disjointness argument is sound: n is power-of-two (runtime asserted), bit-reverse is an involution, j>i filter selects disjoint pairs. The Relaxed load is correct because the AtomicPtr is written before par_iter starts, and thread spawn provides happens-before. The code explicitly documents this in the SAFETY comment.
  • Duplicate transpose in reconstruct_round1 (crypto/stark/src/prover.rs:988, found by moonmath:zro/minimax-m3) — The reconstruct_round1 debug path (prover.rs lines 978-996) LDE-expands columns in column-major form via expand_columns_to_lde, then manually transposes to row-major to feed from_row_major. The claim says from_columns 'already does this transpose internally' and using it would avoid duplication. But LDETraceTable::from_columns (trace.rs) does the exact same col→row transpose internally — there is no duplicate work, just a choice of where the transpose happens. The comment 'debug path: correctness over speed' correctly acknowledges this path is not performance-critical. This is not an issue.
  • spawn_into! macro moves a &mut slot into the rayon::scope closure (prover/src/tables/trace_builder.rs:3023, found by minimax:minimax/MiniMax-M3) — The spawn_into! macro (trace_builder.rs lines 3023-3028) moves a &mut reference to a local Option slot into a rayon::scope closure. This is a standard Rust pattern — each slot is a local variable declared in the preceding scope, and the borrow checker ensures they live long enough. The claim that it 'would break' if slots were different types is meaningless because Rust's type system would catch that at compile time. The code is idiomatic and correct.
  • FxHasher write() byte-wise path deviates from expected word-at-a-time hashing (prover/src/tables/types.rs:925, found by glm:openrouter/z-ai/glm-5.2) — FxHasher::write (types.rs lines 992-996) processes byte slices one byte at a time, calling self.add(b as u64) per byte. All current key types (BranchOperation, LtOperation, DvrmOperation, MulOperation) derive Hash over u64/bool fields, which Rust's derive macro dispatches to write_u64/write_u8 respectively — write() is never called for these keys. The byte-at-a-time path would be suboptimal for &[u8]/String fields, but no current key contains those. The code comment explicitly documents the fast-path design and that collisions only cost probes. The claim is not a bug, just an observation about a non-activated code path. The finding itself acknowledges this in its own evidence section.
  • Dead code: FxHasher is wrapped in a default BuildHasherDefault and never tested for hash quality (prover/src/tables/types.rs:979, found by minimax:minimax/MiniMax-M3) — The claim that FxHasher is 'never tested for hash quality' and 'operations with mostly-zero payload will all hash similarly' is speculative. The hash function rotates and multiplies by a fixed seed per word, which provides good bit mixing. For DvrmOperation with d=0, the hash state still accumulates values from the n (numerator), signed flag, and all other fields before finishing — the final hash will differ. The code explicitly documents that 'collisions only cost probes, never soundness' and the maps are bounded by max_rows per chunk. Without actual collision measurements, this is a performance hypothesis, not a bug.

Raw lane outputs, candidates, final issues, and model metrics are uploaded as workflow artifacts.

@jotabulacios

Copy link
Copy Markdown
Collaborator Author

/bench

@MauroToscano MauroToscano added this pull request to the merge queue Jun 26, 2026
Merged via the queue into main with commit 3bb9107 Jun 26, 2026
12 checks passed
@MauroToscano MauroToscano deleted the perf/cpu-lde-rework branch June 26, 2026 19:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants