Skip to content

Continuations (Approach 2): prove executions epoch-by-epoch#685

Open
nicole-graus wants to merge 65 commits into
mainfrom
continuations-local-to-global
Open

Continuations (Approach 2): prove executions epoch-by-epoch#685
nicole-graus wants to merge 65 commits into
mainfrom
continuations-local-to-global

Conversation

@nicole-graus

@nicole-graus nicole-graus commented Jun 18, 2026

Copy link
Copy Markdown
Collaborator

Motivation

Monolithic proving holds the whole execution trace in memory, so program size is bounded by RAM. Continuations split a long execution into fixed-size epochs that are proven independently, keeping peak memory flat as the program grows. Only memory and registers have to be linked across epoch boundaries.

What this does

  • Splits execution into N-cycle epochs, each proven on its own; tables are dropped from memory after each epoch.
  • Introduces the local-to-global (L2G) table: one row per memory cell an epoch touches. Inside an epoch it bookends the Memory bus (replacing PAGE, which is switched off in continuation epochs); across epochs it carries each cell's init/fini claims.
  • Links epochs for memory with one global proof over a GlobalMemory LogUp plus a global_memory table that anchors genesis (initial memory, recomputed from the ELF by the verifier — not prover-supplied) and finalization (final values).
  • Links epochs for registers by preprocessing the REGISTER FINI column (= the epoch's final register file) and reusing it as the next epoch's preprocessed INIT, so init(epoch i+1) == fini(epoch i) by construction. Registers are few and fixed, so no telescoping/global bus is needed.
  • Links epochs for the commit index (x254): each epoch indexes its committed bytes from the carried x254 (the COMMIT trace and the verifier's commit-bus offset both start from it), and the driver concatenates each epoch's committed bytes into the run-wide public output.
  • Binds every continuation proof to its statement in Fiat-Shamir (see below).
  • Per-epoch perf: skip PAGE where the L2G bookend covers it, carry the memory image forward instead of re-snapshotting, and store memory in dense per-page arrays.

Soundness mechanisms

  • Range checks on L2G columns — only what isn't already pinned by MEMW: cross-epoch-only fields (init_epoch, init_timestamp) via IsHalfword halfwords; value bytes via AreBytes; address/fini_timestamp need no extra check (matched against MEMW, like PAGE).
  • fini_epoch as a verifier-supplied per-table constant (1-based labels, genesis = 0) — the prover can't choose it.
  • Cross-epoch ordering init_epoch < fini_epoch, checked via IsB20[fini_epoch − 1 − init_epoch], preventing a row from referencing a future/own epoch to inject an unjustified value.
  • MU selector (1 on real rows, 0 on padding) gates every L2G interaction. Gating the epoch-local Memory bus is load-bearing: MU=0 on a genuinely-touched cell dangles its MEMW tokens, so the epoch proof fails — which forces MU=1 on every touched cell, forces every touching epoch into the cross-epoch chain, and so makes the chain complete and the prover-supplied finalization trustworthy. (An earlier variant gating MU only on the GlobalMemory bus was unsound — a prover could orphan a touched epoch and drop a write — and was reverted.)
  • Register binding: FINI is a preprocessed column, so the verifier recomputes its commitment and the prover can't deviate from the public value; the Memory bus (REG-C2 ↔ MEMW) pins that value to the register's real last write. The same value is reused as the next epoch's preprocessed INIT.
  • Fiat-Shamir statement binding: each epoch proof and the global proof seed their transcript with the statement before the challenges are drawn — each epoch binds the ELF, public output, table layout, and its epoch label (position); the global proof binds the ELF and the epoch count — so a proof can't be replayed at a different position or program. The monolithic encoding is unchanged.

Full write-up, including the MU-wiring designs, the register binding, the Fiat-Shamir binding, and the adversarial-review reasoning, in docs/continuations_design.md.

Status — draft

Memory and register continuity, the cross-epoch commit index, and a standalone (split) prover/verifier are complete and bound. prove_continuation emits a self-contained ContinuationProof; verify_continuation(elf, &bundle) checks it using only the bundle and the ELF — it enumerates the epochs itself and turns the cross-epoch bindings into explicit verifier checks rather than values reused from the prover's memory: the register/x254 chain (epoch i+1's INIT derived from epoch i's bound FINI), the L2G commitment binding, ELF-genesis for both registers and memory, and the epoch label/count from a trusted enumeration; table_counts is validated and per-epoch page configs are not trusted. The public output is reconstructed from the per-epoch bound commit slices. The bundle serializes with bincode (like a monolithic VmProof) and is driven from the CLI (prove / verify --continuations). (Adversarially reviewed: register/x254 chain, L2G root binding, and completeness-by-enumeration each survived a construct-a-break audit.)

Deferred:

  • Succinctness (recursion/aggregation) — the split verifier checks all N epoch proofs plus the global proof, so continuations keep peak prover memory flat but do not yet shrink proof size or verify time; a single succinct proof needs a recursion/aggregation layer.

Usage (CLI)

Continuations are driven through the existing prove / verify commands with --continuations (binary at target/release/cli, or cargo run --release -p cli --):

# prove → writes the serialized ContinuationProof bundle
cli prove <elf> --output proof.cont --continuations [--epoch-size N | --num-epochs K] [--blowup B]

# verify ← checks the bundle using only the bundle + ELF
cli verify proof.cont <elf> --continuations [--blowup B]
  • --epoch-size N — epoch length in cycles (rounded up to a power of two, min 4).
  • --num-epochs K — split into K epochs (epoch_size = ceil(cycles / K) via a cycle pre-pass). Mutually exclusive with --epoch-size.
  • Neither flag → defaults to 4 epochs. A fixed --epoch-size is what keeps peak prover memory flat as the program grows; the epoch-count modes are convenience defaults.
  • --blowup B — FRI blowup factor (power of two ≥ 2, default 2), same as the monolithic path. Security stays at 128-bit regardless (the FRI query count is derived from the blowup), so a higher blowup just trades more proving work for fewer queries. Applied to every epoch proof and the global proof, and must match between prove and verify.
  • verify --continuations needs only the bundle + ELF (the bundle carries the private inputs). It prints the reconstructed output and exits non-zero if invalid.

Example:

cli prove  executor/program_artifacts/asm/test_commit_split.elf -o /tmp/cont.proof --continuations --num-epochs 2
cli verify /tmp/cont.proof executor/program_artifacts/asm/test_commit_split.elf --continuations
# → Verification succeeded!  Output: aabbccdd

How to test

cargo test --release -p lambda-vm-prover continuation

…orst-case local-to-global

  memory stress benchmark
… count by running the executor only, as a no-proving proxy for monolithic proving memory
@nicole-graus nicole-graus marked this pull request as ready for review June 25, 2026 14:33
MauroToscano and others added 12 commits June 25, 2026 17:31
Keep the follow-up scoped to non-performance cleanup while preserving the soundness regression coverage.

- Add L2G/global-memory regression tests for MU selector behavior, chain truncation, l2g-root binding, and private-input continuations.

- Fix stale continuation/global-memory docs and comments.

- Replace bare x254 byte address literals with register_base_address(254).

- Remove the unused DEFAULT_EPOCH_SIZE constant and document run_epochs as a test/bench helper.
Replace the continuation CLI's raw --epoch-size / --num-epochs controls with --epoch-size-log2.

The CLI now computes an exact power-of-two epoch size directly, defaults to 2^20, rejects tiny log2 values below 18, and no longer runs a cycle-count pre-pass to split into a target epoch count.

Update the continuation design doc and help text with the ethrex 10-transfer memory sweep as guidance.
@yetanotherco yetanotherco deleted a comment from github-actions Bot Jun 26, 2026
@MauroToscano

Copy link
Copy Markdown
Contributor

/review-ai

@github-actions

Copy link
Copy Markdown

Codex Code Review

Found 2 issues in the PR diff:

  • Medium: prover/src/continuation.rs allows unbounded epoch counts, but prover/src/tables/local_to_global.rs only debug_assert!s the < 2^20 IsB20 epoch-gap bound and then truncates to 20 bits. With the library API’s minimum epoch_size_log2 = 2, an honest >4M-cycle run can exceed the bound, spend time proving, and produce an unverifiable bundle. Add a hard runtime cap before building epoch labels/boundaries, or make collect_bitwise_from_l2g return an error instead of truncating.

  • Low: bin/cli/src/main.rs supports --cycles with continuations, but it calls count_cycles, which uses Executor::run() and stores the entire log vector before continuation proving. That defeats the flat-memory reason to use continuations and can OOM on large runs. Count cycles by streaming resume_with_limit and accumulating lengths instead.

Static review only; I did not build or run tests per the requested constraints.

Comment thread prover/src/continuation.rs Outdated
is_first: bool,
is_final: bool,
) -> VmAirs {
let register_init_arg = if is_first { None } else { Some(register_init) };

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

register_init_arg (and therefore the is_first branch here) is dead. build_epoch_airs always passes register_preprocessed = Some(...) to VmAirs::new, and VmAirs::new only consults its register_init parameter in the else branch (when register_preprocessed is None). So this value is computed and threaded through but never used, and the if is_first wrongly implies epoch-0 register handling differs (it doesn't — genesis vs. carried init is already fully captured by the preprocessed commitment built from register_init). Suggest dropping register_init_arg/is_first from build_epoch_airs and passing None for VmAirs::new's register_init.

@claude

claude Bot commented Jun 26, 2026

Copy link
Copy Markdown

Review: Continuations (Approach 2)

Reviewed the continuation prover/verifier, the L2G + GlobalMemory tables, the register/x254 cross-epoch binding, the Fiat-Shamir statement binding, paged_mem, and the executor/CLI wiring. This is a large, carefully-constructed change with strong documentation and an unusually good set of adversarial negative tests (dropped/reordered epochs, tampered reg_fini/l2g_root/private inputs, malformed lengths, inflated table counts, oversized inputs). Reviewed statically from the diff.

Soundness — no issues found

The main soundness arguments hold up:

  • Completeness by enumeration: epoch 0's INIT is bound to ELF genesis via the preprocessed REGISTER/GLOBAL_MEMORY commitments; the last epoch is built is_final (HALT included), so truncation or appended epochs break a bus or the HALT chip.
  • Ordering init_epoch less-than fini_epoch: the IsB20 check on fini_epoch-1-init_epoch is sound even against a malicious init_epoch — halfword columns bound it to under 2^32, and any init_epoch >= fini_epoch wraps near the field modulus, outside the 20-bit range. Within an epoch, touched_cells_from_memory_state yields one row per address (PagedMem dedups), so no fini-1-init underflow.
  • x254 / commit index chains via the carried register_init at X254_INDEX on both the COMMIT trace and compute_commit_bus_offset; per-epoch public_output is bus-bound.
  • L2G to global binding: verify_l2g_commitment_binding ties the first N global sub-proof roots to the epoch roots; prover-supplied boundary only selects the touched-page set, and both over- and under-claiming a page fail (self-cancel vs. unmatched GlobalMemory token).
  • Verifier robustness: untrusted reg_fini length and oversized private_inputs are validated before indexing, returning Ok(None)/Err rather than panicking.

Minor

  1. Dead/misleading code in build_epoch_airs (see inline): register_init_arg / is_first are computed but unused because register_preprocessed is always Some. Suggest removing.
  2. --cycles with --continuations runs a full executor.run() pre-pass holding the entire log stream in memory, partially defeating the flat-peak-memory goal. Diagnostic/opt-in, so low priority — worth a note in help text.
  3. epoch_touched_cells (trace_builder.rs) is now only used by tests; production reads traces.touched_memory_cells directly. It is pub so not flagged — consider cfg(test) or a test-only doc note.
  4. Monolithic --blowup Option to u8: the old None/prove_with_inputs path was already unreachable (clap default_value 2 always yielded Some(2)), so this is a no-op cleanup — just confirm with_blowup(2) matches the prior default options.

Nothing blocking from a correctness/safety standpoint.

@github-actions

Copy link
Copy Markdown

AI Review

PR #685 · 31 changed files

Warning: the diff was truncated before review.

Findings

Status Sev Location Finding Found by
confirmed high prover/src/continuation.rs:279 Continuation proof bundle leaks private inputs in the clear kimi
openrouter/moonshotai/kimi-k2.7-code
nemotron
openrouter/nvidia/nemotron-3-ultra-550b-a55b
confirmed low bin/cli/src/main.rs:696 Continuation bundle deserialization lacks size limits on untrusted fields kimi
openrouter/moonshotai/kimi-k2.7-code
confirmed low prover/src/continuation.rs:306 Dead code: register_init_arg and is_first in build_epoch_airs moonmath
zro/minimax-m3

Status column reflects the verdict from the verifier: deepseek-verifier (openrouter/deepseek/deepseek-v4-pro).

AI-001: Continuation proof bundle leaks private inputs in the clear
  • Status: confirmed
  • Severity: high
  • Location: prover/src/continuation.rs:279
  • Found by: kimi:openrouter/moonshotai/kimi-k2.7-code, nemotron:openrouter/nvidia/nemotron-3-ultra-550b-a55b
  • Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
  • Rejected by: -

Claim

ContinuationProof serializes the full private_inputs vector and the standalone verifier uses it to rebuild private-input page genesis. Exposing the witness in the proof bundle breaks the confidentiality expected of private inputs, and it is a regression from the monolithic VmProof path where private-input pages are kept non-preprocessed.

Evidence

prover/src/continuation.rs:279 declares private_inputs: Vec&lt;u8&gt; inside ContinuationProof; verify_continuation passes it to verify_global, which calls global_memory_configs(..., &amp;bundle.private_inputs). bin/cli/src/main.rs:696 deserializes this bundle for verification, and test_split_verify_rejects_tampered_private_input_genesis demonstrates correctness depends on these exact bytes.

Suggested fix

Do not ship raw private inputs in the self-contained proof bundle. Carry the precomputed private-input page commitments (or a single commitment) and have the verifier check the global-memory preprocessed columns against them, or explicitly document that continuation proofs reveal private inputs.

AI-007: Continuation bundle deserialization lacks size limits on untrusted fields
  • Status: confirmed
  • Severity: low
  • Location: bin/cli/src/main.rs:696
  • Found by: kimi:openrouter/moonshotai/kimi-k2.7-code
  • Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
  • Rejected by: -

Claim

cmd_verify_continuation deserializes an untrusted ContinuationProof with bincode::deserialize before validating most field sizes. A malicious bundle could carry very large public_output, boundary, or epoch vectors and cause excessive memory allocation or OOM before verification rejects them.

Evidence

bin/cli/src/main.rs:696 calls bincode::deserialize(&amp;proof_bytes). verify_continuation only validates private_inputs length and reg_fini length; it does not bound the number of epochs, boundary sizes, or per-epoch public outputs before processing them.

Suggested fix

Add upfront size/length caps when deserializing or before processing (num_epochs, public_output bytes per epoch, boundary entries), or switch to a bounded/streaming bincode config for untrusted bundles.

AI-011: Dead code: register_init_arg and is_first in build_epoch_airs
  • Status: confirmed
  • Severity: low
  • Location: prover/src/continuation.rs:306
  • Found by: moonmath:zro/minimax-m3
  • Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
  • Rejected by: -

Claim

register_init_arg is computed from is_first and register_init but is never used in this code path: register_preprocessed is always Some(...) on the next two lines, and VmAirs::new only consumes register_init when register_preprocessed is None. So both is_first (in build_epoch_airs / verify_epoch) and the local register_init_arg are dead. The cross-epoch register carry is enforced entirely via the preprocessed INIT/FINI commitment, not via is_first.

Evidence

continuation.rs:306 computes register_init_arg = if is_first { None } else { Some(register_init) }; but the very next block sets register_preprocessed = Some(...) and VmAirs::new only reads register_init in the register_preprocessed.is_none() branch (lib.rs:506-516). is_first is threaded through verify_epoch, prove_continuation, and verify_continuation solely to compute this dead value.

Suggested fix

Either drop the is_first parameter from build_epoch_airs / EpochStart / verify_epoch (and the matching call sites), or remove register_preprocessed and use register_init for the monolithic fallback to give is_first real meaning. The current state is misleading to readers.

Reviewer Lanes

Lane Model Prompt Status Findings
glm openrouter/z-ai/glm-5.2 general error: agentic lane timed out after 1800s 0
kimi openrouter/moonshotai/kimi-k2.7-code general success 3
minimax minimax/MiniMax-M3 general success 2
moonmath zro/minimax-m3 general success 5
nemotron openrouter/nvidia/nemotron-3-ultra-550b-a55b general success 5

Verification Lanes

Lane Model Status Confirmed Rejected Uncertain
deepseek-verifier openrouter/deepseek/deepseek-v4-pro success 3 9 0

Native Codex and Claude reviews run separately and post their own comments. They are not included in this structured provenance report.

Discarded candidates (9) — rejected by the verifier
  • Inconsistent epoch size validation between CLI and core prover (bin/cli/src/main.rs:155, found by nemotron:openrouter/nvidia/nemotron-3-ultra-550b-a55b, minimax:minimax/MiniMax-M3, moonmath:zro/minimax-m3) — The CLI enforces epoch_size_log2 >= 18 (main.rs:22, validated at line 769) while the core library enforces >= 2 (continuation.rs:645). However, the CLI's stricter bound is a documented UX guardrail: the README states 'values below 18 are rejected' and the CLI help explains tiny epochs are dominated by fixed overhead. The library tests call prove_continuation directly (not through CLI), so the CLI limit doesn't apply to them. This is intentional separation of concerns, not an inconsistency bug.
  • PagedMem refactor changes hot path memory tracking without explicit correctness tests (prover/src/paged_mem.rs:1, found by nemotron:openrouter/nvidia/nemotron-3-ultra-550b-a55b) — The claim is that there are 'no explicit integration tests comparing PagedMem vs HashMap behavior.' PagedMem has its own unit tests in paged_mem.rs and is exercised end-to-end by all continuation tests (which pass with epoch_size_log2=3,4,5). The finding identifies no actual bug or behavioral difference - it is purely speculative about what might go wrong. The test suite provides adequate coverage for the data structure change.
  • README example for continuation cycles is slightly stale (bin/cli/README.md:66, found by moonmath:zro/minimax-m3) — The finding itself says 'Cosmetic only.' The README correctly documents that --cycles is 'Also supported with --continuations' and includes a concrete example command (cargo run -p cli --release -- prove program.elf -o /tmp/cont.bin --continuations --cycles). The claim that the prose is 'awkward' is purely a stylistic opinion, not a code or documentation defect.
  • cmd_prove_continuation runs the executor twice when --cycles is set (bin/cli/src/main.rs:595, found by minimax:minimax/MiniMax-M3) — This is intentional, documented behavior. The --cycles flag is documented as 'Run one extra execution outside the timer and print the dynamic instruction count.' The same dual-execution pattern exists in cmd_prove for monolithic proofs (the cycle count is a pre-pass before the timed prove). The finding correctly observes two executions happen, but this is by design to separate metric collection from proving time measurement.
  • run_epochs clones the entire Memory at every epoch boundary (executor/src/vm/execution.rs:160, found by moonmath:zro/minimax-m3) — The docstring at execution.rs:158-159 says 'Test/bench helper — the production continuation prover streams epochs via resume_with_limit directly.' The finding itself acknowledges this and notes the test uses 'basic_program.elf, which is small enough to be fine.' The warning about future misuse is purely speculative. There is no current bug or misuse of this test helper.
  • Executor::run_epochs panics on zero epoch_size instead of returning an error (executor/src/vm/execution.rs:161, found by kimi:openrouter/moonshotai/kimi-k2.7-code) — run_epochs at execution.rs:158-159 is documented as a 'Test/bench helper.' Using assert! for input validation in test helpers is standard Rust practice - it catches misuse at the call site with a clear message rather than silently returning an error that a test might ignore. This is not a production API; the production path uses resume_with_limit directly.
  • Missing validation that epoch_size is power of two in core (only enforced by log2 API) (prover/src/continuation.rs:650, found by nemotron:openrouter/nvidia/nemotron-3-ultra-550b-a55b) — The code computes epoch_size = 1 << epoch_size_log2 (continuation.rs:646), which is always a power of two by construction. The API uses epoch_size_log2 specifically to enforce power-of-two sizing (as the design doc states at §3.5). The finding speculates about a hypothetical future API change to raw epoch_size - this is not a current issue.
  • Documentation/test drift: x254 carry regression test comment assumes brittle program layout (prover/src/continuation.rs:1064, found by moonmath:zro/minimax-m3) — The test_commit_across_epochs_verifies test (continuation.rs:1064) does assert bundle.num_epochs() > 1 and then verifies the output bytes equal the expected [0xAA, 0xBB, 0xCC, 0xDD]. If a maintainer adds cycles to the ASM program, the assertion would still pass (or fail if they remove cycles), and the output verification confirms correctness. The test_ecsm_split test similarly verifies the split works via prove_and_verify. These tests are robust: they verify the actual output bytes, not just that epochs exist. The comments about cycle counts are documentation, not brittle assertions.
  • Two implementations of initial memory image building could diverge (prover/src/tables/trace_builder.rs:1773, found by nemotron:openrouter/nvidia/nemotron-3-ultra-550b-a55b) — Both build_initial_image (trace_builder.rs:1856, HashMap) and build_initial_image_paged (trace_builder.rs:1881, PagedMem) implement the same logic: iterate ELF segments, split words into bytes, insert into the store, then add private input bytes. The docstring on build_initial_image_paged says the 'byte values are identical to build_initial_image.' While code duplication is a valid code smell, there is no current divergence and the comment explicitly calls out the equivalence. This is a maintenance concern, not a bug introduced by this PR.

Raw lane outputs, candidates, final issues, and model metrics are uploaded as workflow artifacts.

…734)

The cross-epoch ordering check proves `init_epoch < fini_epoch` via an IsB20
(20-bit) lookup on `fini_epoch - 1 - init_epoch`, so a run can have at most
2^20 epochs. Beyond that the IsB20 bus cannot balance and no honest proof
exists. Previously this was guarded only by a debug_assert in the prover's
bitwise emission, so a release build would build an unprovable trace and fail
cryptically — reachable via the library API with a small epoch size (the CLI's
min epoch size keeps it out of reach there).

Add a hard check in `prove_continuation`'s epoch loop returning
`Error::InvalidContinuationEpochSize` with a clear message once the epoch count
would exceed the range. This is a prover-side guard only: the verifier already
rejects any such proof (the IsB20 table is preprocessed and the ordering sender
is rebuilt verifier-side from a positional epoch label), so soundness is
unchanged — it just turns a confusing failure into a clean error.

Introduce `local_to_global::MAX_EPOCHS` as the single source of truth, used by
both the new check and the existing debug_assert (replacing the `1 << 20`
literal).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants