Skip to content

perf/row-major trace LDE for GPU path#715

Draft
jotabulacios wants to merge 21 commits into
perf/cpu-lde-reworkfrom
perf/cpu-lde-rework_gpu
Draft

perf/row-major trace LDE for GPU path#715
jotabulacios wants to merge 21 commits into
perf/cpu-lde-reworkfrom
perf/cpu-lde-rework_gpu

Conversation

@jotabulacios

@jotabulacios jotabulacios commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator

Builds on #650. Ports the row major LDE rework to the GPU path. Round 1 GPU was still paying two host-side O(N×M) layout copies that #650 eliminates on CPU:

extract_columns_main()   ← row → column-major (strided reads)
columns_to_row_major()   ← column → row-major (after D2H)

Both are replaced by a single H2D and a single D2H, with the NTT running natively on row-major data.

New kernels

  • bit_reverse_row_major, ntt_dit_level_row_major, pointwise_mul_row_major: NTT pipeline on row-major buffers. threadIdx.x = column → consecutive threads access consecutive columns of the same row → coalesced. Grid-stride loops throughout (gridDim.y capped at 65535).
  • keccak256_leaves_base_row_major: each leaf reads a contiguous row slice of M elements.
  • matrix_transpose_strided: tiled 32×32 transpose applied once per LDE call to convert the output back to column-major for the GpuLdeBase/GpuLdeExt3 handle (DEEP, barycentric expect column-major).

For ext3 (aux trace): Fp3 = [u64; 3] in memory, so row-major ext3 with M columns is identical to row-major base-field with m = M×3 — the same kernels handle both without modification.

Bug fixed

launch_keccak_base_row_major was launched with 1024 threads/block. The Keccak kernel's 25-element uint64_t state exceeds the per-block register budget → CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES → silent CPU fallback → no GpuLdeBase handle → R3 OOD and R4 DEEP also fell back. Fixed to 128 threads/block (matching the column-major path). Diagnostic: Main commit (Merkle, CPU only) ≈ 0.25s when GPU runs; ≈ 1.9s on fallback.

Results

fib_iterative_8M, RTX 5090, median of 4 runs:

Baseline This PR Δ
Default TP 21.46s 19.43s −2.0s (−9.5%)
TP=1 32.76s 29.46s −3.3s (−10%)

All in Round 1 (main −1.73s, aux −1.24s). Rounds 2–4 unchanged.

GPU/CPU equivalence

The two-half FFT from #650 was prototyped for GPU and measured at zero improvement: it targets CPU cache thrash from large FFT strides, which doesn't apply on GPU since the row-major NTT already achieves coalesced access. Twiddle precomputation was already in the GPU backend.

Tests added to confirm GPU and CPU produce identical Merkle roots and OOD evaluations for the same inputs: merkle_root_parity.rs (base + ext3), barycentric_cpu_gpu_parity.rs (base + ext3), cuda_path_integration::gpu_and_cpu_proofs_both_verify. cuda_path_integration 2/2 on hardware.

@jotabulacios jotabulacios marked this pull request as ready for review June 25, 2026 20:19
@jotabulacios jotabulacios changed the title Perf/cpu lde rework gpu perf/row-major trace LDE for GPU path Jun 25, 2026
@MauroToscano

Copy link
Copy Markdown
Contributor

/review-ai

@github-actions

Copy link
Copy Markdown

Codex Code Review

Findings

High - crypto/math-cuda/src/lde.rs:474 and crypto/math-cuda/src/lde.rs:582 return device handles immediately after enqueueing matrix_transpose_strided, with no stream synchronization. The returned handle is later consumed on streams selected independently by backend().next_stream(), and event tracking is disabled in Backend::init, so downstream barycentric/DEEP kernels can read col_major_dev before the transpose completes. There is also a lifetime risk because the source row-major buf is dropped as the function returns while the queued transpose may still be reading it. Synchronize after the transpose, or enqueue the transpose before an existing D2H+sync that drains the same stream.

Low - prover/tests/cuda_path_integration.rs:83 does not actually force the CPU path. gpu_lde_threshold() is cached with OnceLock in crypto/stark/src/gpu_lde.rs:43, and this test first runs prove() on the GPU path before setting LAMBDA_VM_GPU_LDE_THRESHOLD, so the later env var mutation is ignored. The test currently verifies two proofs but does not prove GPU-vs-CPU coverage. Set the env var before any threshold read, or add a test-only reset/injection path.

No other issues found in the reviewed diff.

// Transpose row-major buf → column-major for the handle. Downstream kernels
// (DEEP, barycentric) expect buf[c * lde_size + r] (column-major).
let col_major_dev = launch_row_to_col_major(&stream, be, &buf, lde_size, m, lde_u64)?;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High — possible cross-stream race on the returned handle buffer.

launch_row_to_col_major enqueues the transpose into col_major_dev but does not synchronize (its doc says so explicitly), and this function returns the GpuLdeBase handle immediately afterward. Downstream consumers (barycentric.rs, deep.rs) each call be.next_stream() and read handle.buf on a different stream, with no event/sync against the transpose stream — so they can read col_major_dev while the transpose kernel is still in flight (or before it runs at all, since the buffer is alloc_zeros). That yields intermittently wrong OOD/DEEP evaluations.

The existing coset_lde_batch_base_into_with_merkle_tree_keep path does not have this problem: there the handle wraps buf, which is fully produced before the final stream.synchronize() in d2h_bytes_via_pinned_hashes, so the host blocks until the buffer is complete and globally visible. This new path adds a kernel after the last sync.

Add a stream.synchronize()? after the transpose (here and in the ext3 variant below) before constructing the handle, to restore the "handle buffer is ready" contract.

Vec::from_raw_parts(
v.as_mut_ptr() as *mut FieldElement<E>,
v.len() / 3,
v.capacity() / 3,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Low — from_raw_parts capacity divisibility is an unstated invariant.

lde_u64 (a Vec<u64>) is reinterpreted as Vec<FieldElement<Fp3>> ([u64;3], 24 bytes) using len()/3 and capacity()/3. This is sound only if capacity() is a multiple of 3 — otherwise the dealloc layout ((capacity/3)*24) won't match the original allocation (capacity*8) → mismatched Layout → UB.

It holds today because the producing function builds the vec via pinned[..lde_size*m3].to_vec(), which currently yields capacity == len (a multiple of 3). But to_vec() capacity is not a contractual guarantee. A debug_assert!(v.capacity() % 3 == 0 && v.len() % 3 == 0) would pin the invariant.

@claude

claude Bot commented Jun 25, 2026

Copy link
Copy Markdown

Review: row-major trace LDE for GPU path

Solid, well-documented change with good parity-test coverage (Merkle root + barycentric, base & ext3). The row-major NTT/Keccak/transpose kernels look correct: grid-stride loops are bounded, the bit-reverse swap is race-free (the row < rev guard), butterflies touch disjoint rows per level, levels are separate (serialized) launches, and the tiled transpose syncs tile correctly with a block-uniform loop bound. Kernel index math stays within the allocated buffers, and assert_u32_domain guards the u32 grid/transpose dimensions.

High

  • Missing synchronize after the row->col transpose (crypto/math-cuda/src/lde.rs, base + ext3 *_with_merkle_tree_keep). The handle is returned wrapping col_major_dev while launch_row_to_col_major (intentionally) does not synchronize. Downstream OOD/DEEP rounds read handle.buf on a different next_stream() with no cross-stream ordering, so they can read the buffer mid-transpose -> intermittently wrong evaluations. The existing batched path does not have this because its handle buffer is complete before the final host synchronize(). A stream.synchronize()? after the transpose restores that contract. (inline)

Low

  • ext3 Vec::from_raw_parts(..., len/3, capacity/3) (gpu_lde.rs) is sound only while the source capacity stays a multiple of 3; relies on an unstated to_vec() property. A debug_assert! would pin it. (inline)
  • Dead code to confirm/remove: this PR replaces both call sites of the old try_expand_leaf_and_tree_batched_keep / try_expand_leaf_and_tree_batched_ext3_keep and columns_to_row_major. If those were their only callers, they are now dead and should be removed. (extract_columns_main/aux are presumably still used by the CPU fallback.)
  • Comment removals in lookup.rs (compute_fingerprint_from_step) and trace_builder.rs (build_traces) drop useful "why" context with no code change; minor readability regression.

The instruments relabeling and the two_half_fft/polynomial comment trims are fine.

@github-actions

Copy link
Copy Markdown

AI Review

PR #715 · 15 changed files

Findings

Status Sev Location Finding Found by
confirmed critical prover/tests/cuda_path_integration.rs:75 gpu_and_cpu_proofs_both_verify test silently uses GPU for both proofs minimax
minimax/MiniMax-M3
moonmath
zro/minimax-m3
kimi
openrouter/moonshotai/kimi-k2.7-code
glm
openrouter/z-ai/glm-5.2
confirmed high crypto/math-cuda/src/lde.rs:474 Returned GPU LDE handle is unsafe to read across streams (missing synchronize after transpose) kimi
openrouter/moonshotai/kimi-k2.7-code
confirmed medium crypto/math-cuda/tests/merkle_root_parity.rs:42 No direct GPU parity test for the new row-major LDE pipeline moonmath
zro/minimax-m3
glm
openrouter/z-ai/glm-5.2
minimax
minimax/MiniMax-M3
confirmed medium crypto/stark/src/gpu_lde.rs:459 try_expand_leaf_and_tree_batched_keep and _ext3_keep are dead code minimax
minimax/MiniMax-M3
moonmath
zro/minimax-m3
glm
openrouter/z-ai/glm-5.2
kimi
openrouter/moonshotai/kimi-k2.7-code
confirmed low crypto/math-cuda/src/lde.rs:280 Implicit u32 truncation of m / n in row-major launch configs (no domain check) moonmath
zro/minimax-m3
confirmed low prover/tests/cuda_path_integration.rs:83 Potentially unsafe std::env mutation in tests without --test-threads=1 moonmath
zro/minimax-m3

Status column reflects the verdict from the verifier: deepseek-verifier (openrouter/deepseek/deepseek-v4-pro).

AI-001: gpu_and_cpu_proofs_both_verify test silently uses GPU for both proofs
  • Status: confirmed
  • Severity: critical
  • Location: prover/tests/cuda_path_integration.rs:75
  • Found by: minimax:minimax/MiniMax-M3, moonmath:zro/minimax-m3, kimi:openrouter/moonshotai/kimi-k2.7-code, glm:openrouter/z-ai/glm-5.2
  • Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
  • Rejected by: -

Claim

The new gpu_and_cpu_proofs_both_verify test claims to prove and verify a CPU-only proof, but the second prove() still runs on the GPU because gpu_lde_threshold() caches its first read in a OnceLock and ignores later changes to the LAMBDA_VM_GPU_LDE_THRESHOLD env var. The test passes only because GPU proofs are themselves valid, so it provides no CPU-path coverage while claiming to.

Evidence

crypto/stark/src/gpu_lde.rs:43-51 declares static CACHED: OnceLock&lt;usize&gt; = OnceLock::new(); and caches the threshold on first call. The cuda_path_integration test runs prove(&amp;elf) for warm-up (which calls gpu_lde_threshold() and caches the default 2^19), then std::env::set_var("LAMBDA_VM_GPU_LDE_THRESHOLD", "999999999"), then calls prove(&amp;elf) again — but the OnceLock still returns the original default, so the GPU path still fires. Both proof_gpu and proof_cpu are GPU proofs. The test will pass but cannot catch CPU-path regressions.

Suggested fix

Either change gpu_lde_threshold to re-read the env var on every call (drop the OnceLock, or invalidate it on env changes), or rewrite the test to force the CPU path via a different mechanism (e.g. compile-time cfg, or temporarily disable CUDA init, or run the test under a separate binary that starts with the env var already set). Do not ship the test as-is — its name and assertions are misleading.

AI-002: Returned GPU LDE handle is unsafe to read across streams (missing synchronize after transpose)
  • Status: confirmed
  • Severity: high
  • Location: crypto/math-cuda/src/lde.rs:474
  • Found by: kimi:openrouter/moonshotai/kimi-k2.7-code
  • Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
  • Rejected by: -

Claim

coset_lde_row_major_with_merkle_tree_keep and coset_lde_ext3_row_major_with_merkle_tree_keep return GpuLdeBase/Ext3 handles whose device buffer is populated by a final matrix_transpose_strided kernel without a stream synchronize. Downstream R3/R4 barycentric and DEEP consumers call backend().next_stream() and immediately read the handle buffer on that new stream. Because Backend disables cudarc event tracking and no explicit cudaEventRecord/cudaStreamWaitEvent is used, kernels on different pool streams can overlap. This creates a data race where the consumer may read the LDE buffer before the producer's transpose finishes.

Evidence

launch_row_to_col_major is documented with "No synchronize — callers on the same stream are ordered; other streams must synchronize themselves" and returns right after stream.launch_builder(...).launch(...). The LDE-row-major D2H and node D2H both call stream.synchronize(), but nothing synchronizes after the transpose. Downstream math_cuda::barycentric::barycentric_base_on_device and the r3_ctx variant call let stream = be.next_stream() and then launch kernels reading main_handle.buf/aux_handle.buf.

Suggested fix

Add stream.synchronize()? after launch_row_to_col_major in both coset_lde_row_major_with_merkle_tree_keep and coset_lde_ext3_row_major_with_merkle_tree_keep, before returning the handle. If the intended design is to keep the transpose async, attach a CUDA event to the producing stream and make consumers wait on it, but synchronizing here is the simpler and safer fix.

AI-005: No direct GPU parity test for the new row-major LDE pipeline
  • Status: confirmed
  • Severity: medium
  • Location: crypto/math-cuda/tests/merkle_root_parity.rs:42
  • Found by: moonmath:zro/minimax-m3, glm:openrouter/z-ai/glm-5.2, minimax:minimax/MiniMax-M3
  • Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
  • Rejected by: -

Claim

The new merkle_root_parity.rs test only exercises the existing coset_lde_batch_base / coset_lde_batch_ext3_into paths. It does not exercise the new coset_lde_row_major_with_merkle_tree_keep / coset_lde_ext3_row_major_with_merkle_tree_keep functions introduced by this PR, so a regression in the new path would slip through the test gate.

Evidence

merkle_root_parity.rs gpu_merkle_root calls math_cuda::lde::coset_lde_batch_base and math_cuda::merkle::keccak_leaves_base (the batched, non-row-major functions), not coset_lde_row_major_with_merkle_tree_keep. gpu_ext3_merkle_root likewise uses coset_lde_batch_ext3_into + keccak_leaves_ext3. The row-major entry points added in this PR are not invoked by any test in the file.

Suggested fix

Add unit parity tests that call coset_lde_row_major_with_merkle_tree_keep and coset_lde_ext3_row_major_with_merkle_tree_keep directly and compare the Merkle root (and ideally the row-major LDE values) against the CPU coset_lde_full_expand_row_major + commit_rows_bit_reversed reference used in the test's CPU branch. The new keccak, NTT, and transpose kernels are easy targets for unit tests and would catch a regression in any of the three.

AI-006: try_expand_leaf_and_tree_batched_keep and _ext3_keep are dead code
  • Status: confirmed
  • Severity: medium
  • Location: crypto/stark/src/gpu_lde.rs:459
  • Found by: minimax:minimax/MiniMax-M3, moonmath:zro/minimax-m3, glm:openrouter/z-ai/glm-5.2, kimi:openrouter/moonshotai/kimi-k2.7-code
  • Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
  • Rejected by: -

Claim

try_expand_leaf_and_tree_batched_keep (line 459) and try_expand_leaf_and_tree_batched_ext3_keep (line 652) are no longer called anywhere after the PR replaced the R1 prover call sites with the row-major variants. Their definitions (plus the comment blocks above them) are dead and should be deleted to avoid future drift.

Evidence

A grep across the workspace for try_expand_leaf_and_tree_batched_keep finds only the function definition itself — no remaining call sites in prover.rs or elsewhere. Similarly try_expand_leaf_and_tree_batched_ext3_keep at gpu_lde.rs:652 is only referenced by its own body. Their consumers coset_lde_batch_base_into_with_merkle_tree_keep (lde.rs:1023) and coset_lde_batch_ext3_into_with_merkle_tree_keep (lde.rs:1255) are likewise dead within the workspace.

Suggested fix

Remove try_expand_leaf_and_tree_batched_keep, try_expand_leaf_and_tree_batched_ext3_keep, their math-cuda callees coset_lde_batch_base_into_with_merkle_tree_keep and coset_lde_batch_ext3_into_with_merkle_tree_keep if also unused, and the now-unused columns_to_row_major helper in prover.rs.

AI-010: Implicit u32 truncation of `m` / `n` in row-major launch configs (no domain check)
  • Status: confirmed
  • Severity: low
  • Location: crypto/math-cuda/src/lde.rs:280
  • Found by: moonmath:zro/minimax-m3
  • Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
  • Rejected by: -

Claim

run_row_major_ntt_body casts m as u32, (n &gt;&gt; 1) as u32, n_u64 for grid.y, etc., without an assert_u32_domain-style guard. Only lde_size (= n * blowup_factor) is checked. If a future caller passes m or n > u32::MAX, the kernel launches will silently truncate and produce wrong results instead of failing fast.

Evidence

lde.rs:280-302 casts m as u32, (n &gt;&gt; 1) as u32, (m as u32).div_ceil(col_tile) in the launch config. lde.rs:230-241 similarly casts m as u32 in launch_bit_reverse_row_major/launch_pointwise_mul_row_major. Only lde_size is guarded by assert_u32_domain.

Suggested fix

Add assert!(m &lt;= u32::MAX as u64, ...), assert!(n &lt;= u32::MAX as u64, ...), and assert!(m3 &lt;= u32::MAX as u64, ...) in the public entry points (or a helper) so out-of-range inputs panic instead of silently truncating.

AI-016: Potentially unsafe std::env mutation in tests without --test-threads=1
  • Status: confirmed
  • Severity: low
  • Location: prover/tests/cuda_path_integration.rs:83
  • Found by: moonmath:zro/minimax-m3
  • Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
  • Rejected by: -

Claim

The new test mutates the process-wide env var LAMBDA_VM_GPU_LDE_THRESHOLD via unsafe std::env::set_var. Rust's std::env documents this as not threadsafe, and cargo runs integration test binaries concurrently with library tests that may also touch the same env var through gpu_lde_threshold.

Evidence

std::env::set_var is unsafe since Rust 1.78 because environment variables are not threadsafe. The test comment claims "no other thread reads this env var during the test" but other test binaries (or even other tests in the same binary running in parallel) may invoke gpu_lde_threshold, which calls std::env::var on the same key — a concurrent read/write race.

Suggested fix

Run only this test (cargo test --test cuda_path_integration -- --ignored --nocapture) in a separate invocation, or use a synchronization mechanism (e.g. a process-level mutex), or refactor gpu_lde_threshold so it doesn't read env at runtime.

Reviewer Lanes

Lane Model Prompt Status Findings
glm openrouter/z-ai/glm-5.2 general success 3
kimi openrouter/moonshotai/kimi-k2.7-code general success 3
minimax minimax/MiniMax-M3 general success 4
moonmath zro/minimax-m3 general success 6
nemotron openrouter/nvidia/nemotron-3-ultra-550b-a55b general error: opencode failed (provider/auth/runtime error) and no findings were submitted 0

Verification Lanes

Lane Model Status Confirmed Rejected Uncertain
deepseek-verifier openrouter/deepseek/deepseek-v4-pro success 6 1 0

Native Codex and Claude reviews run separately and post their own comments. They are not included in this structured provenance report.

Discarded candidates (1) — rejected by the verifier
  • matrix_transpose_strided reads uninitialized shared memory when cols < MTILE (crypto/math-cuda/kernels/ntt.cu:386, found by minimax:minimax/MiniMax-M3) — The matrix_transpose_strided kernel at ntt.cu:386-413 is correct. For any reader thread (tix, tiy) that satisfies tx < rows AND ty < cols, the corresponding writer thread at (tiy, tix) has x_writer = blockIdx.x*MTILE + tiy = ty < cols, and y_writer = row_base + tix = tx < rows. Both writer conditions are satisfied, so writer always writes tile[tiy][tix] before __syncthreads() and reader reads tile[tix][tiy] (= same transpose slot) after the barrier. The claimed uninitialized read cannot occur because the writer/reader pair always covers the same shared memory slot.

Raw lane outputs, candidates, final issues, and model metrics are uploaded as workflow artifacts.

@jotabulacios jotabulacios marked this pull request as draft June 26, 2026 15:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants