Skip to content

perf(ecsm): replace num-bigint with crypto-bigint#716

Draft
Oppen wants to merge 6 commits into
mainfrom
perf/ecsm_crypto-bigint
Draft

perf(ecsm): replace num-bigint with crypto-bigint#716
Oppen wants to merge 6 commits into
mainfrom
perf/ecsm_crypto-bigint

Conversation

@Oppen

@Oppen Oppen commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Replaces num-bigint/num-traits with crypto-bigint (already a transitive dep via elliptic-curve) in the ecsm crate
  • AffinePoint coordinates and the scalar k are now U256 (fixed-size, no heap allocation)
  • Witness quotient computation uses U1024 intermediates — necessary because the shifted-quotient offset 3p² exceeds U512
  • All tests rewritten without num-bigint; test-only reference_field.rs uses DynResidue for modular Fp arithmetic
  • Follow-up to the note in abfb0cb: "port the field/curve primitives we need to drop the num-bigint reference path entirely"

Test plan

  • cargo test -p ecsm — 15/15 pass
  • make test — all non-CUDA tests pass (CUDA failures are pre-existing on this machine)
  • /bench on CI to measure impact

AffinePoint coordinates, the scalar k, and all quotients in the witness
are now fixed-size crypto-bigint types (U256/U512/U1024) instead of
heap-allocated BigUint/BigInt.

- curve.rs: AffinePoint{x,y} and lambda are U256; k256 bridge uses
  to_be_bytes()/from_be_slice(); msb_position and schedule use
  bits_vartime()/bit_vartime()
- lib.rs: p(), n() return U256; prepare() uses U256; to_le_32() removed
  (callers use .to_le_bytes() directly)
- witness.rs: quotient computation uses U1024 intermediates (needed
  because 3p² exceeds U512); shifted_quotient takes explicit pos/neg
  U1024 terms and an offset of r²; carries and limb arrays unchanged
- tests: rewritten without num-bigint; reference_field.rs uses
  DynResidue for Fp arithmetic; prover ecdas_tests.rs adapted to U256
@Oppen

Oppen commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator Author

/bench

@github-actions

github-actions Bot commented Jun 25, 2026

Copy link
Copy Markdown

Benchmark — ethrex 20 transfers (median of 3)

Table parallelism: auto (cores / 3)

Metric main PR Δ
Peak heap 79957 MB 80227 MB +270 MB (+0.3%) ⚪
Prove time 50.577s 49.859s -0.718s (-1.4%) ⚪

✅ No significant change.

✅ Low variance (time: 1.2%, heap: 1.5%)

Commit: cb14979 · Baseline: built from main · Runner: self-hosted bench

Oppen added 5 commits June 25, 2026 19:08
- p() and n() are now const fn (using from_be_hex)
- P_SQ (p²) and THREE_P_SQ (3p²) are U1024 constants, eliminating the
  runtime square_wide + widening that happened on every compute_witness call
- shifted_quotient takes pos/neg as U1024; callers widen via w1024() only
  where needed (products stay as U512, only sums that could overflow 2^512
  are widened before addition)
- Test p_sq_constants_match_computed verifies P_SQ and THREE_P_SQ against
  runtime computation of p.square_wide()
…Fp params

- crypto-bigint 0.5.5 → 0.7.5, k256 0.13 → 0.14.0-rc.14
- mul_wide → widening_mul, square_wide → widening_square
- to_le_bytes() now returns EncodedUint; add .into() at call sites
- EncodedPoint/FieldElement no longer at k256 root; bridge in curve.rs
  replaced with Sec1Point / FromSec1Point / ToSec1Point
- DynResidue/DynResidueParams → FixedMontyForm/FixedMontyParams in 0.7.5
- Fp in curve.rs and reference_field.rs now use const_monty_params! macro,
  computing Montgomery reduction constants (R², R³, mod_neg_inv) at
  compile time rather than on every field operation
…_quotient

The shifted quotient r + num/p has num < p² < 2^512 and r ≤ 3p < 2^258.
Using I576 = Int<9> (9 × 64-bit limbs) rather than U1024 (16 limbs) halves
the limb count for the division, reducing schoolbook div work from 16-limb
to 9-limb arithmetic.

- shifted_quotient now takes pos/neg as Uint<9> and computes num = pos - neg
  directly as a signed I576, divides by p as NonZero<Int<5>> (p < 2^320),
  then adds r_offset (p or 3p as Uint<5>) to produce the positive quotient
- All intermediate sums that can exceed 2^512 (2λyA, 3xA²) are now computed
  in Uint<9> to avoid overflow; U512 is only used for single products < p²
- P_SQ / THREE_P_SQ constants removed (no longer needed)
- R_BYTES (3p, 33 LE bytes) used directly to construct the Uint<5> offset
- Hoist every per-step modulus operand to a compile-time const: p, 3p,
  p as NonZero<Int<5>>, the U512 divisor, and the zero-extended p/3p/b
  limb arrays. build_step(s) now takes no runtime modulus parameters.
- Merge shifted_quotient + to_le_33 into a direct [u8;33] return,
  dropping the intermediate U512 round-trip.
- Replace Fermat pow(p-2) inversion with crypto-bigint safegcd invert().
- Materialize p/n as const P and const N (single compile-time eval).
- Shrink the convolution/term/ext limb arrays from i128 to i32 (terms
  stay below 2^25; ~64x headroom), cutting build_step live limb working
  set ~4x (~8KiB to ~2KiB). Public carry arrays stay [i64;64].
- Use checked_div_rem_vartime (witness gen is already variable-time).
…rsions)

- batch_invert: acc *= / inv *= instead of x = x * y
- msb_position: drop redundant u32 cast on bits_vartime()
- drop useless .into() on checked_div_rem quotient (already Int<9>)
@Oppen

Oppen commented Jun 26, 2026

Copy link
Copy Markdown
Collaborator Author

/bench

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant