perf/row-major trace LDE for GPU path by jotabulacios · Pull Request #715 · yetanotherco/lambda_vm

jotabulacios · 2026-06-25T20:19:23Z

Builds on #650. Ports the row major LDE rework to the GPU path. Round 1 GPU was still paying two host-side O(N×M) layout copies that #650 eliminates on CPU:

extract_columns_main()   ← row → column-major (strided reads)
columns_to_row_major()   ← column → row-major (after D2H)

Both are replaced by a single H2D and a single D2H, with the NTT running natively on row-major data.

New kernels

bit_reverse_row_major, ntt_dit_level_row_major, pointwise_mul_row_major: NTT pipeline on row-major buffers. threadIdx.x = column → consecutive threads access consecutive columns of the same row → coalesced. Grid-stride loops throughout (gridDim.y capped at 65535).
keccak256_leaves_base_row_major: each leaf reads a contiguous row slice of M elements.
matrix_transpose_strided: tiled 32×32 transpose applied once per LDE call to convert the output back to column-major for the GpuLdeBase/GpuLdeExt3 handle (DEEP, barycentric expect column-major).

For ext3 (aux trace): Fp3 = [u64; 3] in memory, so row-major ext3 with M columns is identical to row-major base-field with m = M×3 — the same kernels handle both without modification.

Bug fixed

launch_keccak_base_row_major was launched with 1024 threads/block. The Keccak kernel's 25-element uint64_t state exceeds the per-block register budget → CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES → silent CPU fallback → no GpuLdeBase handle → R3 OOD and R4 DEEP also fell back. Fixed to 128 threads/block (matching the column-major path). Diagnostic: Main commit (Merkle, CPU only) ≈ 0.25s when GPU runs; ≈ 1.9s on fallback.

Results

fib_iterative_8M, RTX 5090, median of 4 runs:

	Baseline	This PR	Δ
Default TP	21.46s	19.43s	−2.0s (−9.5%)
TP=1	32.76s	29.46s	−3.3s (−10%)

All in Round 1 (main −1.73s, aux −1.24s). Rounds 2–4 unchanged.

GPU/CPU equivalence

The two-half FFT from #650 was prototyped for GPU and measured at zero improvement: it targets CPU cache thrash from large FFT strides, which doesn't apply on GPU since the row-major NTT already achieves coalesced access. Twiddle precomputation was already in the GPU backend.

Tests added to confirm GPU and CPU produce identical Merkle roots and OOD evaluations for the same inputs: merkle_root_parity.rs (base + ext3), barycentric_cpu_gpu_parity.rs (base + ext3), cuda_path_integration::gpu_and_cpu_proofs_both_verify. cuda_path_integration 2/2 on hardware.

…evice transpose

…via on-device transpose" This reverts commit 38f5600.

…loc, trim stale comments

MauroToscano · 2026-06-25T22:08:20Z

/review-ai

github-actions · 2026-06-25T22:09:48Z

Codex Code Review

Findings

High - crypto/math-cuda/src/lde.rs:474 and crypto/math-cuda/src/lde.rs:582 return device handles immediately after enqueueing matrix_transpose_strided, with no stream synchronization. The returned handle is later consumed on streams selected independently by backend().next_stream(), and event tracking is disabled in Backend::init, so downstream barycentric/DEEP kernels can read col_major_dev before the transpose completes. There is also a lifetime risk because the source row-major buf is dropped as the function returns while the queued transpose may still be reading it. Synchronize after the transpose, or enqueue the transpose before an existing D2H+sync that drains the same stream.

Low - prover/tests/cuda_path_integration.rs:83 does not actually force the CPU path. gpu_lde_threshold() is cached with OnceLock in crypto/stark/src/gpu_lde.rs:43, and this test first runs prove() on the GPU path before setting LAMBDA_VM_GPU_LDE_THRESHOLD, so the later env var mutation is ignored. The test currently verifies two proofs but does not prove GPU-vs-CPU coverage. Set the env var before any threshold read, or add a test-only reset/injection path.

No other issues found in the reviewed diff.

claude · 2026-06-25T22:13:44Z

+    // Transpose row-major buf → column-major for the handle. Downstream kernels
+    // (DEEP, barycentric) expect buf[c * lde_size + r] (column-major).
+    let col_major_dev = launch_row_to_col_major(&stream, be, &buf, lde_size, m, lde_u64)?;
+


High — possible cross-stream race on the returned handle buffer.

launch_row_to_col_major enqueues the transpose into col_major_dev but does not synchronize (its doc says so explicitly), and this function returns the GpuLdeBase handle immediately afterward. Downstream consumers (barycentric.rs, deep.rs) each call be.next_stream() and read handle.buf on a different stream, with no event/sync against the transpose stream — so they can read col_major_dev while the transpose kernel is still in flight (or before it runs at all, since the buffer is alloc_zeros). That yields intermittently wrong OOD/DEEP evaluations.

The existing coset_lde_batch_base_into_with_merkle_tree_keep path does not have this problem: there the handle wraps buf, which is fully produced before the final stream.synchronize() in d2h_bytes_via_pinned_hashes, so the host blocks until the buffer is complete and globally visible. This new path adds a kernel after the last sync.

Add a stream.synchronize()? after the transpose (here and in the ext3 variant below) before constructing the handle, to restore the "handle buffer is ready" contract.

claude · 2026-06-25T22:14:00Z

+        Vec::from_raw_parts(
+            v.as_mut_ptr() as *mut FieldElement<E>,
+            v.len() / 3,
+            v.capacity() / 3,


Low — from_raw_parts capacity divisibility is an unstated invariant.

lde_u64 (a Vec<u64>) is reinterpreted as Vec<FieldElement<Fp3>> ([u64;3], 24 bytes) using len()/3 and capacity()/3. This is sound only if capacity() is a multiple of 3 — otherwise the dealloc layout ((capacity/3)*24) won't match the original allocation (capacity*8) → mismatched Layout → UB.

It holds today because the producing function builds the vec via pinned[..lde_size*m3].to_vec(), which currently yields capacity == len (a multiple of 3). But to_vec() capacity is not a contractual guarantee. A debug_assert!(v.capacity() % 3 == 0 && v.len() % 3 == 0) would pin the invariant.

claude · 2026-06-25T22:15:43Z

Review: row-major trace LDE for GPU path

Solid, well-documented change with good parity-test coverage (Merkle root + barycentric, base & ext3). The row-major NTT/Keccak/transpose kernels look correct: grid-stride loops are bounded, the bit-reverse swap is race-free (the row < rev guard), butterflies touch disjoint rows per level, levels are separate (serialized) launches, and the tiled transpose syncs tile correctly with a block-uniform loop bound. Kernel index math stays within the allocated buffers, and assert_u32_domain guards the u32 grid/transpose dimensions.

High

Missing synchronize after the row->col transpose (crypto/math-cuda/src/lde.rs, base + ext3 *_with_merkle_tree_keep). The handle is returned wrapping col_major_dev while launch_row_to_col_major (intentionally) does not synchronize. Downstream OOD/DEEP rounds read handle.buf on a different next_stream() with no cross-stream ordering, so they can read the buffer mid-transpose -> intermittently wrong evaluations. The existing batched path does not have this because its handle buffer is complete before the final host synchronize(). A stream.synchronize()? after the transpose restores that contract. (inline)

Low

ext3 Vec::from_raw_parts(..., len/3, capacity/3) (gpu_lde.rs) is sound only while the source capacity stays a multiple of 3; relies on an unstated to_vec() property. A debug_assert! would pin it. (inline)
Dead code to confirm/remove: this PR replaces both call sites of the old try_expand_leaf_and_tree_batched_keep / try_expand_leaf_and_tree_batched_ext3_keep and columns_to_row_major. If those were their only callers, they are now dead and should be removed. (extract_columns_main/aux are presumably still used by the CPU fallback.)
Comment removals in lookup.rs (compute_fingerprint_from_step) and trace_builder.rs (build_traces) drop useful "why" context with no code change; minor readability regression.

The instruments relabeling and the two_half_fft/polynomial comment trims are fine.

github-actions · 2026-06-25T22:35:56Z

AI Review

PR #715 · 15 changed files

Findings

Status	Sev	Location	Finding	Found by
confirmed	critical	prover/tests/cuda_path_integration.rs:75	gpu_and_cpu_proofs_both_verify test silently uses GPU for both proofs	minimax minimax/MiniMax-M3 moonmath zro/minimax-m3 kimi openrouter/moonshotai/kimi-k2.7-code glm openrouter/z-ai/glm-5.2
confirmed	high	crypto/math-cuda/src/lde.rs:474	Returned GPU LDE handle is unsafe to read across streams (missing synchronize after transpose)	kimi openrouter/moonshotai/kimi-k2.7-code
confirmed	medium	crypto/math-cuda/tests/merkle_root_parity.rs:42	No direct GPU parity test for the new row-major LDE pipeline	moonmath zro/minimax-m3 glm openrouter/z-ai/glm-5.2 minimax minimax/MiniMax-M3
confirmed	medium	crypto/stark/src/gpu_lde.rs:459	try_expand_leaf_and_tree_batched_keep and _ext3_keep are dead code	minimax minimax/MiniMax-M3 moonmath zro/minimax-m3 glm openrouter/z-ai/glm-5.2 kimi openrouter/moonshotai/kimi-k2.7-code
confirmed	low	crypto/math-cuda/src/lde.rs:280	Implicit u32 truncation of `m` / `n` in row-major launch configs (no domain check)	moonmath zro/minimax-m3
confirmed	low	prover/tests/cuda_path_integration.rs:83	Potentially unsafe std::env mutation in tests without --test-threads=1	moonmath zro/minimax-m3

Status column reflects the verdict from the verifier: deepseek-verifier (openrouter/deepseek/deepseek-v4-pro).

AI-001: gpu_and_cpu_proofs_both_verify test silently uses GPU for both proofs

Status: confirmed
Severity: critical
Location: prover/tests/cuda_path_integration.rs:75
Found by: minimax:minimax/MiniMax-M3, moonmath:zro/minimax-m3, kimi:openrouter/moonshotai/kimi-k2.7-code, glm:openrouter/z-ai/glm-5.2
Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
Rejected by: -

Claim

The new gpu_and_cpu_proofs_both_verify test claims to prove and verify a CPU-only proof, but the second prove() still runs on the GPU because gpu_lde_threshold() caches its first read in a OnceLock and ignores later changes to the LAMBDA_VM_GPU_LDE_THRESHOLD env var. The test passes only because GPU proofs are themselves valid, so it provides no CPU-path coverage while claiming to.

Evidence

crypto/stark/src/gpu_lde.rs:43-51 declares static CACHED: OnceLock<usize> = OnceLock::new(); and caches the threshold on first call. The cuda_path_integration test runs prove(&elf) for warm-up (which calls gpu_lde_threshold() and caches the default 2^19), then std::env::set_var("LAMBDA_VM_GPU_LDE_THRESHOLD", "999999999"), then calls prove(&elf) again — but the OnceLock still returns the original default, so the GPU path still fires. Both proof_gpu and proof_cpu are GPU proofs. The test will pass but cannot catch CPU-path regressions.

Suggested fix

Either change gpu_lde_threshold to re-read the env var on every call (drop the OnceLock, or invalidate it on env changes), or rewrite the test to force the CPU path via a different mechanism (e.g. compile-time cfg, or temporarily disable CUDA init, or run the test under a separate binary that starts with the env var already set). Do not ship the test as-is — its name and assertions are misleading.

AI-002: Returned GPU LDE handle is unsafe to read across streams (missing synchronize after transpose)

Status: confirmed
Severity: high
Location: crypto/math-cuda/src/lde.rs:474
Found by: kimi:openrouter/moonshotai/kimi-k2.7-code
Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
Rejected by: -

Claim

coset_lde_row_major_with_merkle_tree_keep and coset_lde_ext3_row_major_with_merkle_tree_keep return GpuLdeBase/Ext3 handles whose device buffer is populated by a final matrix_transpose_strided kernel without a stream synchronize. Downstream R3/R4 barycentric and DEEP consumers call backend().next_stream() and immediately read the handle buffer on that new stream. Because Backend disables cudarc event tracking and no explicit cudaEventRecord/cudaStreamWaitEvent is used, kernels on different pool streams can overlap. This creates a data race where the consumer may read the LDE buffer before the producer's transpose finishes.

Evidence

launch_row_to_col_major is documented with "No synchronize — callers on the same stream are ordered; other streams must synchronize themselves" and returns right after stream.launch_builder(...).launch(...). The LDE-row-major D2H and node D2H both call stream.synchronize(), but nothing synchronizes after the transpose. Downstream math_cuda::barycentric::barycentric_base_on_device and the r3_ctx variant call let stream = be.next_stream() and then launch kernels reading main_handle.buf/aux_handle.buf.

Suggested fix

Add stream.synchronize()? after launch_row_to_col_major in both coset_lde_row_major_with_merkle_tree_keep and coset_lde_ext3_row_major_with_merkle_tree_keep, before returning the handle. If the intended design is to keep the transpose async, attach a CUDA event to the producing stream and make consumers wait on it, but synchronizing here is the simpler and safer fix.

AI-005: No direct GPU parity test for the new row-major LDE pipeline

Status: confirmed
Severity: medium
Location: crypto/math-cuda/tests/merkle_root_parity.rs:42
Found by: moonmath:zro/minimax-m3, glm:openrouter/z-ai/glm-5.2, minimax:minimax/MiniMax-M3
Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
Rejected by: -

Claim

The new merkle_root_parity.rs test only exercises the existing coset_lde_batch_base / coset_lde_batch_ext3_into paths. It does not exercise the new coset_lde_row_major_with_merkle_tree_keep / coset_lde_ext3_row_major_with_merkle_tree_keep functions introduced by this PR, so a regression in the new path would slip through the test gate.

Evidence

merkle_root_parity.rs gpu_merkle_root calls math_cuda::lde::coset_lde_batch_base and math_cuda::merkle::keccak_leaves_base (the batched, non-row-major functions), not coset_lde_row_major_with_merkle_tree_keep. gpu_ext3_merkle_root likewise uses coset_lde_batch_ext3_into + keccak_leaves_ext3. The row-major entry points added in this PR are not invoked by any test in the file.

Suggested fix

Add unit parity tests that call coset_lde_row_major_with_merkle_tree_keep and coset_lde_ext3_row_major_with_merkle_tree_keep directly and compare the Merkle root (and ideally the row-major LDE values) against the CPU coset_lde_full_expand_row_major + commit_rows_bit_reversed reference used in the test's CPU branch. The new keccak, NTT, and transpose kernels are easy targets for unit tests and would catch a regression in any of the three.

AI-006: try_expand_leaf_and_tree_batched_keep and _ext3_keep are dead code

Status: confirmed
Severity: medium
Location: crypto/stark/src/gpu_lde.rs:459
Found by: minimax:minimax/MiniMax-M3, moonmath:zro/minimax-m3, glm:openrouter/z-ai/glm-5.2, kimi:openrouter/moonshotai/kimi-k2.7-code
Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
Rejected by: -

Claim

try_expand_leaf_and_tree_batched_keep (line 459) and try_expand_leaf_and_tree_batched_ext3_keep (line 652) are no longer called anywhere after the PR replaced the R1 prover call sites with the row-major variants. Their definitions (plus the comment blocks above them) are dead and should be deleted to avoid future drift.

Evidence

A grep across the workspace for try_expand_leaf_and_tree_batched_keep finds only the function definition itself — no remaining call sites in prover.rs or elsewhere. Similarly try_expand_leaf_and_tree_batched_ext3_keep at gpu_lde.rs:652 is only referenced by its own body. Their consumers coset_lde_batch_base_into_with_merkle_tree_keep (lde.rs:1023) and coset_lde_batch_ext3_into_with_merkle_tree_keep (lde.rs:1255) are likewise dead within the workspace.

Suggested fix

Remove try_expand_leaf_and_tree_batched_keep, try_expand_leaf_and_tree_batched_ext3_keep, their math-cuda callees coset_lde_batch_base_into_with_merkle_tree_keep and coset_lde_batch_ext3_into_with_merkle_tree_keep if also unused, and the now-unused columns_to_row_major helper in prover.rs.

AI-010: Implicit u32 truncation of `m` / `n` in row-major launch configs (no domain check)

Status: confirmed
Severity: low
Location: crypto/math-cuda/src/lde.rs:280
Found by: moonmath:zro/minimax-m3
Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
Rejected by: -

Claim

run_row_major_ntt_body casts m as u32, (n >> 1) as u32, n_u64 for grid.y, etc., without an assert_u32_domain-style guard. Only lde_size (= n * blowup_factor) is checked. If a future caller passes m or n > u32::MAX, the kernel launches will silently truncate and produce wrong results instead of failing fast.

Evidence

lde.rs:280-302 casts m as u32, (n >> 1) as u32, (m as u32).div_ceil(col_tile) in the launch config. lde.rs:230-241 similarly casts m as u32 in launch_bit_reverse_row_major/launch_pointwise_mul_row_major. Only lde_size is guarded by assert_u32_domain.

Suggested fix

Add assert!(m <= u32::MAX as u64, ...), assert!(n <= u32::MAX as u64, ...), and assert!(m3 <= u32::MAX as u64, ...) in the public entry points (or a helper) so out-of-range inputs panic instead of silently truncating.

AI-016: Potentially unsafe std::env mutation in tests without --test-threads=1

Status: confirmed
Severity: low
Location: prover/tests/cuda_path_integration.rs:83
Found by: moonmath:zro/minimax-m3
Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
Rejected by: -

Claim

The new test mutates the process-wide env var LAMBDA_VM_GPU_LDE_THRESHOLD via unsafe std::env::set_var. Rust's std::env documents this as not threadsafe, and cargo runs integration test binaries concurrently with library tests that may also touch the same env var through gpu_lde_threshold.

Evidence

std::env::set_var is unsafe since Rust 1.78 because environment variables are not threadsafe. The test comment claims "no other thread reads this env var during the test" but other test binaries (or even other tests in the same binary running in parallel) may invoke gpu_lde_threshold, which calls std::env::var on the same key — a concurrent read/write race.

Suggested fix

Run only this test (cargo test --test cuda_path_integration -- --ignored --nocapture) in a separate invocation, or use a synchronization mechanism (e.g. a process-level mutex), or refactor gpu_lde_threshold so it doesn't read env at runtime.

Reviewer Lanes

Lane	Model	Prompt	Status	Findings
glm	openrouter/z-ai/glm-5.2	general	success	3
kimi	openrouter/moonshotai/kimi-k2.7-code	general	success	3
minimax	minimax/MiniMax-M3	general	success	4
moonmath	zro/minimax-m3	general	success	6
nemotron	openrouter/nvidia/nemotron-3-ultra-550b-a55b	general	error: opencode failed (provider/auth/runtime error) and no findings were submitted	0

Verification Lanes

Lane	Model	Status	Confirmed	Rejected	Uncertain
deepseek-verifier	openrouter/deepseek/deepseek-v4-pro	success	6	1	0

Native Codex and Claude reviews run separately and post their own comments. They are not included in this structured provenance report.

Discarded candidates (1) — rejected by the verifier

matrix_transpose_strided reads uninitialized shared memory when cols < MTILE (crypto/math-cuda/kernels/ntt.cu:386, found by minimax:minimax/MiniMax-M3) — The matrix_transpose_strided kernel at ntt.cu:386-413 is correct. For any reader thread (tix, tiy) that satisfies tx < rows AND ty < cols, the corresponding writer thread at (tiy, tix) has x_writer = blockIdx.x*MTILE + tiy = ty < cols, and y_writer = row_base + tix = tx < rows. Both writer conditions are satisfied, so writer always writes tile[tiy][tix] before __syncthreads() and reader reads tile[tix][tiy] (= same transpose slot) after the barrier. The claimed uninitialized read cannot occur because the writer/reader pair always covers the same shared memory slot.

Raw lane outputs, candidates, final issues, and model metrics are uploaded as workflow artifacts.

…andle

jotabulacios added 16 commits June 24, 2026 14:34

add gpu tests

94b3f29

Add GPU/CPU Merkle root parity test

8dc297f

Add GPU/CPU Merkle root parity tests for base and ext3 aux trace

b3d915d

Add GPU/CPU barycentric OOD parity tests

fd5da8b

Fix ext3 pre-strided layout in barycentric parity test

c795edc

Fix instruments double-billing GPU fused pipeline in R1

b3d7366

Add test verifying GPU and CPU proofs both pass verification

cb84e23

Clean up verbose comments in parity tests

627856b

GPU R1 GPU: eliminate extract_columns + columns_to_row_major via on-d…

38f5600

…evice transpose

Revert "GPU R1 GPU: eliminate extract_columns + columns_to_row_major …

85db0e9

…via on-device transpose" This reverts commit 38f5600.

GPU R1: row-major NTT kernel — no transpose, coalesced column access

74f97db

Fix GPU R1 row-major: transpose buf to col-major for device handle

71c310b

Fix keccak row-major launch config: use 128-thread block, not 1024

a52f737

GPU R1 aux: row-major ext3 NTT reusing base-field kernels with m*3

1fd3ad9

Clean up GPU row-major LDE: extract transpose helper, fix zero-pad al…

1b24c66

…loc, trim stale comments

Clean up GPU row-major LDE: extract helper, fix alloc, trim comments

ff24aae

jotabulacios marked this pull request as ready for review June 25, 2026 20:19

jotabulacios changed the title ~~Perf/cpu lde rework gpu~~ perf/row-major trace LDE for GPU path Jun 25, 2026

claude Bot reviewed Jun 25, 2026

View reviewed changes

jotabulacios added 5 commits June 25, 2026 22:22

Fix gpu_lde_threshold OnceLock: re-read env var in test builds

ff71a65

Fix cross-stream race: synchronize after transpose before returning h…

09bc2dc

…andle

Add parity tests for new row-major GPU pipeline

53af5d9

Remove dead batched-keep GPU LDE functions

2613718

fix lint

e5d5952

jotabulacios marked this pull request as draft June 26, 2026 15:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf/row-major trace LDE for GPU path#715

perf/row-major trace LDE for GPU path#715
jotabulacios wants to merge 21 commits into
perf/cpu-lde-reworkfrom
perf/cpu-lde-rework_gpu

jotabulacios commented Jun 25, 2026 •

edited

Loading

Uh oh!

MauroToscano commented Jun 25, 2026

Uh oh!

github-actions Bot commented Jun 25, 2026

Uh oh!

claude Bot Jun 25, 2026

Uh oh!

claude Bot Jun 25, 2026

Uh oh!

claude Bot commented Jun 25, 2026

Uh oh!

github-actions Bot commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

jotabulacios commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

New kernels

Bug fixed

Results

GPU/CPU equivalence

Uh oh!

MauroToscano commented Jun 25, 2026

Uh oh!

github-actions Bot commented Jun 25, 2026

Codex Code Review

Uh oh!

claude Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot commented Jun 25, 2026

Uh oh!

github-actions Bot commented Jun 25, 2026

AI Review

Findings

Reviewer Lanes

Verification Lanes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jotabulacios commented Jun 25, 2026 •

edited

Loading