perf/row-major trace LDE for GPU path#715
Conversation
…via on-device transpose" This reverts commit 38f5600.
…loc, trim stale comments
|
/review-ai |
Codex Code ReviewFindings High - Low - No other issues found in the reviewed diff. |
| // Transpose row-major buf → column-major for the handle. Downstream kernels | ||
| // (DEEP, barycentric) expect buf[c * lde_size + r] (column-major). | ||
| let col_major_dev = launch_row_to_col_major(&stream, be, &buf, lde_size, m, lde_u64)?; | ||
|
|
There was a problem hiding this comment.
High — possible cross-stream race on the returned handle buffer.
launch_row_to_col_major enqueues the transpose into col_major_dev but does not synchronize (its doc says so explicitly), and this function returns the GpuLdeBase handle immediately afterward. Downstream consumers (barycentric.rs, deep.rs) each call be.next_stream() and read handle.buf on a different stream, with no event/sync against the transpose stream — so they can read col_major_dev while the transpose kernel is still in flight (or before it runs at all, since the buffer is alloc_zeros). That yields intermittently wrong OOD/DEEP evaluations.
The existing coset_lde_batch_base_into_with_merkle_tree_keep path does not have this problem: there the handle wraps buf, which is fully produced before the final stream.synchronize() in d2h_bytes_via_pinned_hashes, so the host blocks until the buffer is complete and globally visible. This new path adds a kernel after the last sync.
Add a stream.synchronize()? after the transpose (here and in the ext3 variant below) before constructing the handle, to restore the "handle buffer is ready" contract.
| Vec::from_raw_parts( | ||
| v.as_mut_ptr() as *mut FieldElement<E>, | ||
| v.len() / 3, | ||
| v.capacity() / 3, |
There was a problem hiding this comment.
Low — from_raw_parts capacity divisibility is an unstated invariant.
lde_u64 (a Vec<u64>) is reinterpreted as Vec<FieldElement<Fp3>> ([u64;3], 24 bytes) using len()/3 and capacity()/3. This is sound only if capacity() is a multiple of 3 — otherwise the dealloc layout ((capacity/3)*24) won't match the original allocation (capacity*8) → mismatched Layout → UB.
It holds today because the producing function builds the vec via pinned[..lde_size*m3].to_vec(), which currently yields capacity == len (a multiple of 3). But to_vec() capacity is not a contractual guarantee. A debug_assert!(v.capacity() % 3 == 0 && v.len() % 3 == 0) would pin the invariant.
|
Review: row-major trace LDE for GPU path Solid, well-documented change with good parity-test coverage (Merkle root + barycentric, base & ext3). The row-major NTT/Keccak/transpose kernels look correct: grid-stride loops are bounded, the bit-reverse swap is race-free (the High
Low
The instruments relabeling and the two_half_fft/polynomial comment trims are fine. |
AI ReviewPR #715 · 15 changed files Findings
Status column reflects the verdict from the verifier: deepseek-verifier (openrouter/deepseek/deepseek-v4-pro). AI-001: gpu_and_cpu_proofs_both_verify test silently uses GPU for both proofs
Claim The new Evidence crypto/stark/src/gpu_lde.rs:43-51 declares Suggested fix Either change AI-002: Returned GPU LDE handle is unsafe to read across streams (missing synchronize after transpose)
Claim coset_lde_row_major_with_merkle_tree_keep and coset_lde_ext3_row_major_with_merkle_tree_keep return GpuLdeBase/Ext3 handles whose device buffer is populated by a final matrix_transpose_strided kernel without a stream synchronize. Downstream R3/R4 barycentric and DEEP consumers call backend().next_stream() and immediately read the handle buffer on that new stream. Because Backend disables cudarc event tracking and no explicit cudaEventRecord/cudaStreamWaitEvent is used, kernels on different pool streams can overlap. This creates a data race where the consumer may read the LDE buffer before the producer's transpose finishes. Evidence launch_row_to_col_major is documented with "No synchronize — callers on the same stream are ordered; other streams must synchronize themselves" and returns right after stream.launch_builder(...).launch(...). The LDE-row-major D2H and node D2H both call stream.synchronize(), but nothing synchronizes after the transpose. Downstream math_cuda::barycentric::barycentric_base_on_device and the r3_ctx variant call let stream = be.next_stream() and then launch kernels reading main_handle.buf/aux_handle.buf. Suggested fix Add stream.synchronize()? after launch_row_to_col_major in both coset_lde_row_major_with_merkle_tree_keep and coset_lde_ext3_row_major_with_merkle_tree_keep, before returning the handle. If the intended design is to keep the transpose async, attach a CUDA event to the producing stream and make consumers wait on it, but synchronizing here is the simpler and safer fix. AI-005: No direct GPU parity test for the new row-major LDE pipeline
Claim The new Evidence merkle_root_parity.rs Suggested fix Add unit parity tests that call AI-006: try_expand_leaf_and_tree_batched_keep and _ext3_keep are dead code
Claim
Evidence A grep across the workspace for Suggested fix Remove try_expand_leaf_and_tree_batched_keep, try_expand_leaf_and_tree_batched_ext3_keep, their math-cuda callees coset_lde_batch_base_into_with_merkle_tree_keep and coset_lde_batch_ext3_into_with_merkle_tree_keep if also unused, and the now-unused columns_to_row_major helper in prover.rs. AI-010: Implicit u32 truncation of `m` / `n` in row-major launch configs (no domain check)
Claim
Evidence lde.rs:280-302 casts Suggested fix Add AI-016: Potentially unsafe std::env mutation in tests without --test-threads=1
Claim The new test mutates the process-wide env var LAMBDA_VM_GPU_LDE_THRESHOLD via Evidence std::env::set_var is Suggested fix Run only this test ( Reviewer Lanes
Verification Lanes
Native Codex and Claude reviews run separately and post their own comments. They are not included in this structured provenance report. Discarded candidates (1) — rejected by the verifier
Raw lane outputs, candidates, final issues, and model metrics are uploaded as workflow artifacts. |
Builds on #650. Ports the row major LDE rework to the GPU path. Round 1 GPU was still paying two host-side O(N×M) layout copies that #650 eliminates on CPU:
Both are replaced by a single H2D and a single D2H, with the NTT running natively on row-major data.
New kernels
bit_reverse_row_major,ntt_dit_level_row_major,pointwise_mul_row_major: NTT pipeline on row-major buffers.threadIdx.x = column→ consecutive threads access consecutive columns of the same row → coalesced. Grid-stride loops throughout (gridDim.y capped at 65535).keccak256_leaves_base_row_major: each leaf reads a contiguous row slice of M elements.matrix_transpose_strided: tiled 32×32 transpose applied once per LDE call to convert the output back to column-major for theGpuLdeBase/GpuLdeExt3handle (DEEP, barycentric expect column-major).For ext3 (aux trace):
Fp3 = [u64; 3]in memory, so row-major ext3 with M columns is identical to row-major base-field withm = M×3— the same kernels handle both without modification.Bug fixed
launch_keccak_base_row_majorwas launched with 1024 threads/block. The Keccak kernel's 25-elementuint64_tstate exceeds the per-block register budget →CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES→ silent CPU fallback → noGpuLdeBasehandle → R3 OOD and R4 DEEP also fell back. Fixed to 128 threads/block (matching the column-major path). Diagnostic:Main commit (Merkle, CPU only)≈ 0.25s when GPU runs; ≈ 1.9s on fallback.Results
fib_iterative_8M, RTX 5090, median of 4 runs:All in Round 1 (main −1.73s, aux −1.24s). Rounds 2–4 unchanged.
GPU/CPU equivalence
The two-half FFT from #650 was prototyped for GPU and measured at zero improvement: it targets CPU cache thrash from large FFT strides, which doesn't apply on GPU since the row-major NTT already achieves coalesced access. Twiddle precomputation was already in the GPU backend.
Tests added to confirm GPU and CPU produce identical Merkle roots and OOD evaluations for the same inputs:
merkle_root_parity.rs(base + ext3),barycentric_cpu_gpu_parity.rs(base + ext3),cuda_path_integration::gpu_and_cpu_proofs_both_verify.cuda_path_integration2/2 on hardware.