Perf/trace gen parallel #707
Draft
diegokingston wants to merge 6 commits into
Collaborator
Author
/bench 5
The spec types LT/BRANCH μ as a Bit (lt.toml, branch.toml), i.e. one trace
row per operation with μ ∈ {0,1}. The impl deduplicated ops and stored a
count in μ — a divergence from the spec (and an unsound count in a Bit-typed
column). Drop the dedup: one row per op, μ = 1 (0 for padding). MUL/DVRM keep
their dedup since the spec types those multiplicities as BaseField counts.
Also makes LT/BRANCH trace gen deterministic (no HashMap iteration order) and
aligns it with the bitwise collector (which already runs over raw ops).
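A minimal sketch of the "one row per op" rule, for reviewers. This is not the PR's code: `LtOp`, `LtRow`, `lt_rows` and the power-of-two padding are assumptions made only for illustration.

```rust
/// Hypothetical logged LT operation (field names are illustrative).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct LtOp {
    pub a: u32,
    pub b: u32,
}

/// Hypothetical trace row: operands, comparison result, and the Bit-typed multiplicity mu.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct LtRow {
    pub a: u32,
    pub b: u32,
    pub lt: u8,
    pub mu: u8,
}

/// Emits one row per op in program order with mu = 1 (no dedup, no HashMap),
/// then pads with mu = 0 rows. Padding to a power of two is an assumption here.
pub fn lt_rows(ops: &[LtOp]) -> Vec<LtRow> {
    let mut rows: Vec<LtRow> = ops
        .iter()
        .map(|op| LtRow { a: op.a, b: op.b, lt: (op.a < op.b) as u8, mu: 1 })
        .collect();
    let padded_len = rows.len().max(1).next_power_of_two();
    rows.resize(padded_len, LtRow { a: 0, b: 0, lt: 0, mu: 0 });
    rows
}
```

Repeated ops produce repeated rows, so μ never exceeds 1 and the row order depends only on the op log.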
chunk_and_generate built each table's chunks sequentially. Chunks are independent, so generate them with rayon (gated on the `parallel` feature); `collect` into Result<Vec<_>> preserves chunk order, so output is byte-identical. Tables are still generated one at a time (no all-tables-parallel), keeping it compatible with sequential / on-demand commit.
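A sketch of why an ordered collect keeps the output identical. To stay dependency-free it uses `std::thread::scope` in place of rayon, and `generate_chunk` plus both driver functions are hypothetical stand-ins, not the PR's `chunk_and_generate`.

```rust
use std::thread;

/// Hypothetical per-chunk generator: doubles each value, failing on overflow.
fn generate_chunk(chunk: &[u64]) -> Result<Vec<u64>, String> {
    chunk
        .iter()
        .map(|v| v.checked_mul(2).ok_or_else(|| format!("overflow on {v}")))
        .collect()
}

/// Sequential baseline. `chunk_size` must be non-zero (`chunks` panics on 0).
pub fn generate_chunks_sequential(rows: &[u64], chunk_size: usize) -> Result<Vec<Vec<u64>>, String> {
    rows.chunks(chunk_size).map(generate_chunk).collect()
}

/// One scoped thread per chunk. Handles are joined in spawn order, so the
/// collected Vec has the same chunk order as the sequential version.
pub fn generate_chunks_parallel(rows: &[u64], chunk_size: usize) -> Result<Vec<Vec<u64>>, String> {
    thread::scope(|s| {
        let handles: Vec<_> = rows
            .chunks(chunk_size)
            .map(|c| s.spawn(move || generate_chunk(c)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("chunk worker panicked"))
            .collect()
    })
}
```

Collecting into `Result<Vec<_>, _>` keeps successful chunks in input order and yields an error if any chunk fails, which is the property the commit relies on.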
Pre-size the dedup map (with_capacity) to avoid rehashing as it grows. Byte-identical (same dedup result).
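For illustration, a pre-sized dedup map might look like the sketch below. `dedup_counts` and the `(u32, u32)` key are hypothetical; the only point is that the op count is an upper bound on the number of distinct keys, so a map sized to it should not need to grow during insertion.

```rust
use std::collections::HashMap;

/// Counts how often each (hypothetical) operand pair occurs.
/// `ops.len()` bounds the number of distinct keys, so the map is sized once up front.
pub fn dedup_counts(ops: &[(u32, u32)]) -> HashMap<(u32, u32), u64> {
    let mut counts = HashMap::with_capacity(ops.len());
    for &op in ops {
        *counts.entry(op).or_insert(0) += 1;
    }
    counts
}
```

The resulting key/count pairs are the same as with `HashMap::new()`; only the allocation pattern changes.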
generate_dvrm_trace called compute_remainder() ~6× per row (via n_sub_r/abs_r/sign_r/sign_n_sub_r, each re-running the integer division). Derive sign_r, n_sub_r, sign_n_sub_r and abs_r from the single r computed up front. Byte-identical (same formulas).
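The shape of that refactor, sketched under assumptions: the real column formulas are not shown in this PR text, so the definitions below (RISC-V style remainder, `n - r` widened to `i64`, sign bits as `bool`) are illustrative guesses, and `DvrmDerived` / `derive_row` are hypothetical names.

```rust
/// Hypothetical bundle of the per-row values derived from the remainder.
#[derive(Debug, PartialEq, Eq)]
pub struct DvrmDerived {
    pub r: i32,
    pub sign_r: bool,
    pub abs_r: u32,
    pub n_sub_r: i64,
    pub sign_n_sub_r: bool,
}

/// Assumed RISC-V style signed remainder: n when d == 0, 0 on i32::MIN % -1.
pub fn compute_remainder(n: i32, d: i32) -> i32 {
    if d == 0 { n } else { n.wrapping_rem(d) }
}

/// Runs the division once and derives every dependent value from that single `r`.
pub fn derive_row(n: i32, d: i32) -> DvrmDerived {
    let r = compute_remainder(n, d); // the only division for this row
    let n_sub_r = i64::from(n) - i64::from(r); // widened so the subtraction cannot overflow
    DvrmDerived {
        r,
        sign_r: r < 0,
        abs_r: r.unsigned_abs(),
        n_sub_r,
        sign_n_sub_r: n_sub_r < 0,
    }
}
```

Because every derived value is a pure function of `r`, computing `r` once cannot change any of them.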
Benchmark: ethrex 20 transfers (median of 3)
Table parallelism: auto (cores / 3)
Commit: b4cd600 · Baseline: cached · Runner: self-hosted bench
110aa76 to 270bb71
collect_ops_from_cpu interleaved state-dependent work (MEMW/register/commit/keccak/ecsm, which thread memory/register state and are inherently serial) with state-free work (CPU range-check bitwise lookups + CPU32/LT/SHIFT dispatch, derived purely from each logged op). Split them: the state-free chips are now collected in a parallel pass (collect_state_free_ops, rayon under the parallel feature) while the serial loop keeps only the state-threaded work. For CPU-heavy programs the per-op bitwise range-check collection is a large state-free chunk that now runs off the serial path. Output is unchanged: LT/SHIFT/CPU32 stay in program order (ordered collect); the bitwise multiplicity accumulation is order-independent.
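A toy version of the split, assuming a simplified op log. The `Op` enum and both collectors are hypothetical, and `std::thread::scope` stands in for rayon so the sketch has no dependencies.

```rust
use std::collections::HashMap;
use std::thread;

/// Hypothetical logged op; the real log carries far more detail.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Op {
    Lt(u32, u32),
    Shift(u32, u32),
    MemWrite { addr: u32, value: u32 },
}

/// State-free work: the record depends only on the op itself.
fn state_free_record(op: &Op) -> Option<u32> {
    match *op {
        Op::Lt(a, b) => Some((a < b) as u32),
        Op::Shift(a, s) => Some(a.wrapping_shl(s)),
        Op::MemWrite { .. } => None,
    }
}

/// Parallel pass: workers are joined in spawn order, so records stay in program order.
pub fn collect_state_free(ops: &[Op], workers: usize) -> Vec<u32> {
    let w = workers.max(1);
    let chunk = ((ops.len() + w - 1) / w).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = ops
            .chunks(chunk)
            .map(|c| s.spawn(move || c.iter().filter_map(state_free_record).collect::<Vec<u32>>()))
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker panicked"))
            .collect()
    })
}

/// Serial pass: each write record needs the previous value at that address,
/// so the memory state has to be threaded through in program order.
pub fn collect_stateful(ops: &[Op]) -> Vec<(u32, u32, u32)> {
    let mut memory: HashMap<u32, u32> = HashMap::new();
    let mut records = Vec::new();
    for op in ops {
        if let Op::MemWrite { addr, value } = *op {
            let prev = memory.insert(addr, value).unwrap_or(0);
            records.push((addr, prev, value));
        }
    }
    records
}
```

The state-free output does not depend on the worker count, while the memory records can only be produced by walking the log in order.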
Collaborator
Author
/bench 5
Collaborator
Author
/bench 5
a9a1b15 to 04356e1
Collaborator
Author
/bench 5
e88ac71 to 04356e1
Collaborator
Author
/bench
Summary
Trace-generation performance work (plus one spec-compliance fix), all in the trace builder. Tables are generated one at a time and committed in order — no all-tables-parallel — so this stays compatible with sequential / on-demand commit.
Changes
- LT/BRANCH: the spec types μ as a Bit (one row per op, μ ∈ {0,1}); the impl deduplicated and stored a count in μ. Dropped the dedup → one row per op, μ = 1. Deterministic (no HashMap order) + aligned with the bitwise collector. MUL/DVRM keep dedup (spec types those as BaseField counts).
- State-free chips are collected in a parallel pass (collect_state_free_ops); the serial loop keeps only the state-threaded work (MEMW/register/commit/keccak/ecsm). For CPU-heavy programs the per-op bitwise collection is the big state-free chunk, so this is the main mover.
- Per-table chunks are generated in parallel (chunk_and_generate, byte-identical via ordered collect).
- Dedup map is pre-sized (with_capacity). Byte-identical.
Validation
- --no-default-features) + clippy (-D warnings) + fmt clean.
- fib_iterative_8M: prove −5.0%, heap −0.4% (low variance 2.2%). fib is addition-only (no MUL/DVRM/SHIFT), so the table-specific changes are better measured on an arithmetic-heavy program; the fib delta comes from the chunk + state-free-split parallelization.
Notes
- μ has no IS_BIT<μ> constraint, matching the spec. Provider multiplicities need no range-constraint for LogUp soundness; correctness comes from the value constraints + bus balance.