perf(rust-kernel): Intern-assigned uids, symbolic Nat offsets, environment-machine WHNF reducer by samuelburnham · Pull Request #442 · argumentcomputer/ix

samuelburnham · 2026-06-12T16:19:50Z

This pull request implements Rust-kernel and ixon optimizations for the Zisk guest, measured against the base #411 as deterministic guest cycle counts (ziskemu -m) over dumped shard inputs (zisk/scripts/bench-cycles.sh), with every counted run's committed failures publics word verified zero.

Key Results

Int16.instRxcHasSize_eq: 5.70 B → 42.0 M guest steps (135.6×); Int32/Int64 variants confirmed at the same magnitude (~42 M) — in-circuit cost is now integer-width-independent
Cycle suite total: 17.53 B → 5.77 B steps (3.04×, −67.1%)
Environment-machine WHNF (final two kernel commits): −19.3% suite net; Vector.extract_append._proof_1 −32.4%; the Init Array/Vector Extract proof family −22% to −33% each (five multi-billion-step constants)
Std.Tactic.BVDecide.BVExpr.bitblast.goCache_Inv_of_Inv._mutual (28.35 B steps, the most expensive constant measured in Init+Std): previously crashed the 512 MB guest (OOM-class null read) → now completes as a valid kernel pass — lazy closures cut peak RAM, not just cycles
Kernel hardened against adversarial inputs: term identity moved off per-node content hashing with the threat model documented (docs/kernel_identity.md), blob bytes verified at load in all deserializers

input	base (#411)	this PR	Δ	speedup
_private.Init.Data.Range.Polymorphic.SInt.0.Int16.instRxcHasSize_eq	5,697,166,200	42,012,168	−99.3%	135.6×
_private.Init.Data.Vector.Extract.0.Vector.extract_append._proof_1	7,690,887,211	3,344,459,375	−56.5%	2.3×
List.mergesort synthetic env	4,005,320,347	2,290,310,593	−42.8%	1.7×
Nat.add_comm	18,690,815	11,532,048	−38.3%	1.6×
Array.binSearchAux._unary	95,800,411	62,862,965	−34.4%	1.5×
Nat.gcd_comm	6,669,821	5,550,752	−16.8%	1.2×
Batteries.RBMap	13,940,412	12,321,092	−11.6%	1.1×
String.append	4,560,304	4,070,450	−10.7%	1.1×
total	17,533,035,521	5,773,119,443	−67.1%	3.04×

Fifteen Commits (oldest first)

bench: guest cycle harness + native single-constant check
- bench-cycles.sh suite over dumped guest stdins with failures-word validation; check_one example mirrors the guest's reuse-mode check natively
kernel: keep symbolic Nat offsets compact (whnf stuck + offset def-eq)
- Nat.add x lit / Nat.div|mod x k stay stuck as compact offsets instead of materializing succ^n(x) chains or the division algorithm; bulk offset def-eq decides offset pairs directly
- The Int16.instRxcHasSize_eq collapse (5.70 B → 56 M) and the UTF-8-codec-class fix
kernel: intern-assigned uids replace per-node blake3 content hashing
- Term identity becomes intern-table-assigned sequential u64s; shallow structural keys for hash-consing; blake3 stays only at the Ixon boundary (content addresses, Merkle roots, proof-carrying-code claims)
- The broadest single win: −30–36% on every reduction-heavy input
kernel: memoized prim-family dispatch in the WHNF/def-eq reduction loops
- Constant heads classified once per address (Native/BitVec/Nat/Decidable/Str); at most one family recognizer runs per iteration instead of the five-probe gauntlet
ixon: defer per-constant address verification to first materialization
- Constants verify on first get(); blob bytes still verified eagerly at load
bench: add Int32/Int64 instRxcHasSize_eq to the cycle suite
docs: kernel uid identity vs Ixon content addressing
- Two-layer identity model and why uid collisions are not an attack surface (uids are assigned, never computed from input)
kernel: fix PrimFamily visibility/qualification warnings
kernel: port jcb/fixes H-15 whnf probe pre-filter + H-12 nat output caps
- Allocation-free transient-nat spine probe ahead of the cache lookup; caps on Nat primitive output sizes
kernel+ixon: harden term identity against adversarial inputs
- Blob bytes verified against their address at load in all three deserializers; structural equality audited against the adversarial model
kernel: privatize uid-accepting constructors; document cross-shard uid story
docs: design for the environment-machine WHNF port
- Full design for the Krivine-style machine (ported from the proven IxVM implementation), including the exit-to-outer-loop contract and the known regression class
kernel: environment-machine WHNF (Phase A — lazy substitution on the beta path)
- whnf_core's App arm enters a closure machine when a beta fires: beta/zeta are O(1) environment pushes; substitution materializes only at machine exits (clo_subst readback), so work is proportional to what the reduction consumes
- Every exit re-enters the existing loop: ambient zeta, iota, prim dispatch, cheap-mode def-eq flags, and all caches stay byte-identical
kernel: closure-iota at the machine's recursor exit (Phase B)
- Recursor spines consumed lazily: only the major premise materializes; params/motives/minors and post-major args ride through as closures, so unselected minors (dropped match/Decidable branches) are never substituted and never read back
- K recursors, literal majors, and struct-eta deliberately miss to the unchanged plain path, preserving the Nat transient-work and linear-rec/offset shortcuts exactly
docs: env-machine WHNF design is implemented (Phases A+B)

Move Rust sources from src/ into per-crate trees under crates/ (aiur, common, compile, ffi, ixon, kernel) and convert the root Cargo.toml into a workspace manifest with shared dependencies, lints, and profiles.

Fix CI Fix licenses Updates Fixups

The `#[quickcheck]` attribute macro lives in `quickcheck_macros`, not `quickcheck`. Without this import the ixon test target fails to compile.

Re-home jcb's monolithic `crate::ix::{profile,shard}` into `crates/kernel` (ProfileSink must be reachable from KEnv) and rewrite imports to the workspace convention (ix_common::address, ixon::merkle, crate::{profile,shard}, ix_kernel::{profile,shard} from the FFI). No logic changes to the ported code. ix-kernel builds; all 12 shard partitioner tests + the profile-sink delta-edge test pass. (Pre-existing, unrelated: ix-ffi fails to build on kernel-riscv too — private IOBuffer fields in aiur/protocol.rs from the workspace split.)

…olding Replace the static range/byte partition heuristics with the offline profiler/partitioner plan as the primary sharding path: * --shard-plan <manifest.ixes>: prove a `.ixes` partition (from `ix profile` + `ix shard`, i.e. real per-block heartbeats balanced + cross-shard delta-ingress minimized). Each manifest shard → one closure-injected leaf proof; leaves tree-fold into one aggregate whose committed subject is bound to the env. Maps manifest block addresses to work items by ingress-block (a Muts work item's `primary` is a member projection, not the block addr), validates full coverage, and composes with --execute and --only-shard. * New `ix-kernel` example `shard_plan`: profile a .ixe + partition → .ixes, out of circuit, without the Lean/FFI layer (handy until the pre-existing ix-ffi build break is fixed). Validated end-to-end in --execute on nataddcomm: 34 work items → 4 balanced shards (heartbeats 78–120), full coverage, 0 failures. Removed as superseded by the planner: * --topo + topo_order_items (the graph cut subsumes within-env scheduling) * --closure-stats / closure_report (manifest reports exact cross_ingress) * --overlap-stats / overlap_report (Design-C measurement scaffolding) The host `shard` partitioner uses rayon + std::time, so the module is gated to non-riscv targets (it runs in the host planner, never the guest); `profile` stays available everywhere since KEnv holds a ProfileSink (always None on the guest). --shard-consts/--shard-bytes remain as no-profile fallbacks.

Add `partition_for_cycle_cap(profile, max_cycles, cycles_per_heartbeat, eps)`: pick the smallest N whose heaviest shard fits a Zisk guest-cycle budget, then partition. Converts the cap to a heartbeat cap (cycles ÷ cycles/hb), estimates N from total load, and grows N proportionally to any overshoot — re-partitioning each round (cheap) to converge on the minimal feasible N. Detects the atomic-block floor: when the largest mutual block alone exceeds the cap, flags `infeasible_atomic_floor` (no N can fit it; split upstream or raise the cap). The cap is the leaf prover's trace size, which sets peak prover RAM; doc gives the conversion from a host-RAM budget (peak_GB ≈ 45 + 32·cycles_B, measured on this prover) so users think in RAM, pass cycles, and get N. Wire it into the `shard_plan` example: `--max-cycles C` (alternative to `--shards N`), `--cycles-per-heartbeat K` (default 208k, mergesort-calibrated; recalibrate per env via one --execute). Verified on mergesort: 4.5e9 → N=2 (heaviest 3.9B cyc), 2e9 → N=5 (1.76B cyc); largest atomic block 2553 hb.

…→215k) Measured whole-env Zisk cycles ÷ profiler heartbeats across 12 envs: - large shardable envs (>20k hb) cluster 194-208k whole-env - per-shard ratio runs ~5-8% higher (a shard memoizes less than the whole env); mergesort's heaviest shard is ~216k - tiny arithmetic envs ~130-160k; heavy-def-eq envs (array/string-assoc) ~258k The budget needs the per-shard heaviest ratio (so it never under-predicts a leaf's cycles → never OOMs). Set the default to 215k (matches the measured mergesort heaviest shard; predicts 4.03B vs 4.042B actual). Genuinely workload-dependent, so the doc + --cycles-per-heartbeat recalibration guidance stand; the cap's RAM headroom absorbs the residual. mergesort still → N=2.

Add cycle_cap_for_ram(ram_gb): invert the measured single-leaf prover model (peak_GB ≈ 45 + 32·cycles_B) at ~78% of RAM to a per-shard cycle cap; 0 if the box can't hold the ~45 GB prover base. Compose with the existing partition_for_cycle_cap so RAM → cap → N from total heartbeats. shard_plan example: default mode (no flags) reads /proc/meminfo and sizes N automatically — the user just passes the .ixe. --shards / --max-cycles remain as explicit overrides; --ram-gb for what-if. Verified: 250 GB → N=2 (mergesort), 64 GB → N=47 + atomic-floor warning (its biggest block won't fit), 40 GB refused.

The recursive min-cut partitioner now returns its bisection tree (AggNode: leaves are shard ids, internal nodes are min-cut splits) via partition_with_tree; rec_bisect builds it as it recurses, so the tree mirrors the partition exactly. agg_plan lowers the binary tree to an arity-bounded post-order fold plan (FoldOp::Leaf/Agg), collapsing whole subtrees so each agg call folds up to `arity` children — agg-call count stays ~N/(arity-1), same as a flat fold, while children remain subtree-aligned and in shard order (keeping the subject merkle-fold left-associative). The tree is serialized as an optional trailing .ixes section: pre-tree manifests decode to tree=None and provers fall back to a flat fold. partition_for_cycle_cap and shard_esp populate it; the shard_plan example writes it (omitted for --naive baselines). Folding along this tree lets aggregation discharge each cross-shard assumption at the lowest common ancestor of the shards that share it, where the assumption sets are smallest.

The agg guest gains a mode word. Mode 0 is the legacy O(1) merkle_join fold of opaque subject/assumption roots (used by the flat fallback, range, and reuse paths — unchanged semantics). Mode 1 implements set discharge: the host supplies each child's subject/assumption preimage address lists as untrusted witness; the guest re-derives every list's canonical merkle root and asserts it equals the child's committed root, then commits canonical roots over union(subjects) and union(assumptions) \ union(subjects). Assumptions proven by any sibling are discharged in-circuit at the lowest level of the tree where producer and consumer meet; when the whole env folds, the final assumptions root collapses to the zero sentinel and the aggregate claim is unconditional. Mode-1 outputs are canonical set commitments, so agg-of-agg witness verification is uniform at every level, and the mode is committed in field_b. The shard-plan host path folds along the manifest's bisection tree (pruned to the shards actually proven), executes the arity-collapsed plan, supplies the discharge witness per agg call, mirrors the union/discharge host-side, binds the root subject to the canonical root over the covered union, and binds the root assumptions to the mirror — reporting closure ("unconditional") or the outstanding count. Each leaf additionally checks the host's assumption mirror against the guest's committed root at prove time, so a rule divergence fails fast rather than as a witness panic deep in the agg. Measured on mergesort (8 shards, identical leaves, warm GPU): discharge adds 2.5s to the single arity-16 agg call (19.0s -> 21.5s, +0.9% total run) and resolves all 2439 cross-shard assumptions to zero. Also: the Assembly executor is now the default (--emulator opts out). It is ~4-6x faster at trace generation and a prerequisite for the hints stream. The historical multi-program breakage (second setup() clobbers the first program's ASM ROM histogram -> rom.rs:178 panic) is fixed by ordering: the agg program is set up only after the leaf loop, never before; the Emulator keeps the upfront setup. Validated end to end on single-leaf, balanced (4-leaf), unbalanced mixed-vk (3-leaf), and 8-shard runs.

Simplifications, no behavior change (validated: the 3-shard discharge prove reproduces the same leaf steps, discharge counts, and final subject root as before): - Drop the flat mode-0 fallback fold from run_shard_plan. It was reachable only via manifests written before the tree section existed, which regenerate in seconds; aggregation now requires the bisection tree and says so. One fold path instead of two. - Move tree pruning into the kernel as AggNode::prune (where the type lives), with the unit test it was missing. - Drop the leaf_subjects vector (redundant with leaf_subject_sets) and the duplicate cover->Address conversion in the leaf loop. - Extract discharge() — the host mirror of the agg guest's union/resolve — out of the fold loop. - Name build_inputs' return triple (LeafInputs) and tighten build_plan's frontier expansion. - Zero compiler and clippy warnings across ix-kernel and zisk-host (was 12): unused mut, documented dead-code allow on the publics layout field, redundant to_string in format args, indexed loop, then -> then_some. Audited and kept: every CLI flag is read; subject_of_cover, fold_subjects, and tree_partition remain in use by the only-const, range, and reuse paths; the agg guest's two-mode structure is already minimal.

…euse path --store-dir now composes with --shard-plan (and requires it). A manifest shard whose certified targets are all covered by stored proofs is not re-proven: the covering stored proofs are folded into the aggregate as extra children, where the agg guest verifies them (vk-pinned, witness-root-checked) and discharges assumptions against their subjects — reuse by verification, never by trust. Partially-covered shards are re-proven whole (the shard is the proof unit). Freshly proven shards are banked per-shard (proofs/<n>.{proof,cover,asm}) as they complete, so an interrupted run resumes where it left off; .asm carries the assumption preimages mode-1 discharge needs, and the proofs directory is the store's single source of truth (proven.bin is gone). The fold is one op-list executor: novel leaves fold along the pruned bisection tree as before; stored proofs join as extra slots; a synthesized tail fold reduces the forest to one verified root, with an agg-of-1 when a lone stored proof would otherwise go unverified. Mode-1 commitments are canonical sets, so fold shape never changes the final claim — validated: fresh-store, full-reuse (0 leaves proven), and mixed (novel tree + stored tail) runs all commit the identical subject root and discharge to an unconditional claim. The legacy standalone reuse path (mode-0 join fold, naive count/byte chunking, proven.bin) is deleted, along with its private flags (--closure-inject, --limit-items) and now-orphaned helpers (fold_subjects, plan_chunks_by_closure, addr_bytes, load_store, save_store). --require-closed derives the covered set from the proof index. Net -392 lines; zero compiler/clippy warnings.

…reuse The planner (shard_plan --store-dir) now diffs the env against the proof store before profiling: work items whose targets are all store-covered are skipped entirely — not typechecked (planning gets faster), not partitioned, and their blocks excluded from the hypergraph (a novel→covered delta edge is an assumption discharged at aggregation, not a cut to minimize). The manifest covers only novel work. The host accepts such manifests: grouping requires every work item to be either manifest-assigned or store-covered, and the covering-proof selection generalizes from "reused manifest shards" to "every needed target the novel shards won't certify" — subsuming reused shards and unmanifested covered work. A fixpoint additionally folds stored proofs that cover the assumptions of already-included stored proofs, so a closed store yields an unconditional claim even cross-env. This fixes the cross-env conservatism where the min-cut objective pulled already-proven library constants into novel shards (clustering dependents with dependencies) and one novel constant forced a whole mixed shard to re-prove. Validated: with nataddcomm proven into a store, planning natmulcomm skips 34 of 47 work items, proves 2 novel shards (13 items), folds the 3 stored proofs (verified in-circuit, assumption-closure fixpoint), and discharges to an unconditional claim — 72% of the work reused across envs. Same-env full reuse regression unchanged (identical subject root).

The open-assumption filter scanned stored×subjects per assumption per round — O(assumptions × proofs × subjects), seconds-to-minutes of host CPU at init-store scale (~206 proofs × 51k subjects). One HashSet of included-proof subjects per round makes it linear and reads better. Re-validated: full-reuse fold of the cross-env store commits the identical subject root, outstanding 0, unconditional.

…_utf8EncodeChar_append`

zisk/scripts/bench-cycles.sh runs the standard dumped-input suite (~/benchdata/zisk-inputs) through ziskemu, records steps, and verifies the committed `failures` publics word is zero. check_one mirrors the Zisk guest's one-work-item path natively so IX_* instrumentation can target a single expensive constant.

Port of IxVM dbc4177/5dcab7f/f7cfe23 to the Rust kernel, three coordinated changes: - whnf (try_nat_offset_stuck): leave `Nat.add base (Lit n)` (symbolic base, n>0) and `Nat.div/mod base (Lit k)` (k>=2) STUCK in compact form instead of delta-unfolding into succ^n towers / the division algorithm. Iota over such majors already works via cleanup_nat_offset_major. - linear-rec: collapse `succ^o(Nat.rec base succ-step (Lit n))` with a SYMBOLIC base to `Nat.add base (Lit n+o)` instead of declining into n iota steps. - def-eq (try_def_eq_offset): decompose both sides to base + offset (literals, succ layers, add-lit read in O(1)) and strip the shared offset in ONE step. The previous one-succ-peel recursed through full is_def_eq per layer, which blew MAX_DEF_EQ_DEPTH=2000 on deep offsets. Guest cycles (ziskemu, vs d2ea3d9 baseline): Int16.instRxcHasSize_eq 5,697,166,200 -> 56,028,269 (-99.0%) binsearch 95,800,411 -> 87,735,518 (-8.4%) Nat.add_comm 18,690,815 -> 17,805,210 (-4.7%) mergesort shard 4,005,320,347 -> 3,920,521,476 (-2.1%) Vector.extract_append._proof_1 unchanged (+0.02%), rest ~0. 604 kernel tests pass (one test updated: whnf of `x + 2` now asserts the compact stuck form; new test covers tower-vs-compact def-eq).

Every KExpr/KUniv construction used to compute a blake3 hash of its content for identity (intern key, cache keys, equality) — ~20% of guest cycles on reduction-heavy constants (app_hash 22-33% cumulative). Identity is now an intern-assigned u64 uid: - uids come from a process-global atomic counter and are NEVER reused, so uid equality implies structural equality and cache keys cannot alias across intern-table clears (a stale key can only miss). - InternTable keys on shallow structural keys (ExprKey/UnivKey): variant tag + child uids + semantic payload, mirroring exactly what the old content hash covered (display names, binder info, mdata excluded). intern_expr/intern_univ canonicalize children recursively when handed nodes built outside the table, preserving the old "structural identity" interning semantics. - PartialEq for KExpr/KUniv is now structural with the uid fast path (canonical-vs-canonical compares O(1)); hash_eq remains the fast-but-incomplete uid check for cache/quick-path callers where a false negative only costs a fallback. - Cache ctx components stay blake3 (CtxAddr) but hash uids instead of content addrs; push_local/push_let intern their frame types so equal context suffixes share cache entries. - ingress's hash-first intern probe becomes key-first (same skip-the- construction win, no hash). Soundness note: affirmative identity (equal uid => equal term) is what def-eq quick paths and caches rely on, and it holds by construction; incomplete negatives only cost cache misses. Nat/Str literals keep their blake3 blob Addresses (ingress-meaningful). Guest cycles (ziskemu, vs previous commit): mergesort shard 3,920,521,476 -> 2,526,226,208 (-35.6%) Vector.extract_..._1 7,692,517,694 -> 5,369,879,578 (-30.2%) Nat.add_comm 17,805,210 -> 12,687,822 (-28.7%) binsearch 87,735,518 -> 64,562,040 (-26.4%) Int16.instRxc..._eq 56,028,269 -> 45,024,529 (-19.6%) natgcdcomm/stringappend/rbmap: -11.2% / -6.9% / -2.7% 604 kernel tests pass (16 hash-determinism tests updated to assert the preserved property — identity ignores names/bi/mdata — via structural equality, shallow-key equality, or shared-table interning).

The WHNF loops (full + no-delta) and the def-eq lazy-delta loop probed all five primitive recognizers (native/bitvec/nat/decidable/string) every iteration; each collects its own app spine and runs its own gauntlet of 32-byte address compares — pure tax on ordinary constant-headed terms. Classify the head once per iteration via an allocation-free app-chain walk + per-address memo (KEnv::prim_family_cache), and call at most the one matching family reducer. Family head-address sets are disjoint, so dispatch order semantics are unchanged. The nat-offset-stuck probe gates on the Nat family too. Mirrors the IxVM prim_family dispatch memo (ix-aiur, measured there as part of -28% on Vector._proof_2). Guest cycles (ziskemu, vs previous commit): mergesort shard 2,526,226,208 -> 2,399,597,758 (-5.0%) Vector.extract_..._1 5,369,879,578 -> 5,112,313,900 (-4.8%) Nat.add_comm 12,687,822 -> 12,249,578 (-3.5%) Int16.instRxc..._eq 45,024,529 -> 43,835,207 (-2.6%) binsearch 64,562,040 -> 64,095,836 (-0.7%) others ~0 (+25 steps cache-population noise on two inputs) 604 kernel tests pass.

Env::get_anon paid one blake3 content-hash per constant at parse time. The binding check moves into LazyConstant::get() (pending_addr field): constants shipped in a closure but never forced by the typechecker are never hashed, while everything the kernel certifies is still verified on materialization — checking a constant forces it. The compile-side from_constant path (pre-populated cache) is exempt as before. Guest cycles: rbmap -9.5%, natgcdcomm -6.4%, stringappend -4.2%, Int16 -3.7%, binsearch -1.6%, nataddcomm -0.3%; mergesort/vector +0.01% (re-verification of the few constants materialized more than once). 200 ixon + 604 kernel tests pass (tampering test updated to assert rejection at get() instead of at parse).

Both complete at ~42M steps, matching Int16 — the nat-offset compaction makes the cost range-independent (pre-fix, Int16's 2^16 range alone was 5.7B steps and Int32/Int64 would not have completed).

Records the two-identity-layer boundary after 1e3029d: constants/blobs keep their blake3 Ixon Addresses (claims, Merkle roots, proof store, cross-run reuse, and per-constant inclusion proofs are all unchanged), while the removed blake3 hashing was the kernel-internal per-node scheme used only for intern/cache identity inside one checker run — never serialized, never part of a proof artifact. Documents the soundness argument (equal uid => equal term, never reused), the boundary rule (uids must not escape into serialized artifacts), and what a future sub-constant commitment feature would need to recompute at the egress boundary.

PrimFamily is referenced by the public KEnv::prim_family_cache field, so it must be pub; drop a redundant path qualification.

H-15 (perf): the transient-nat probes run on every whnf/whnf_core call BEFORE the cache lookup; they paid two collect_app_spine Vec allocations plus a KConst::Recr clone per call on the overwhelmingly common non-Nat-recursor path. Pre-filter with an allocation-free spine_head_and_len; spine collection + Recr lookup now happen only for actual Nat-recursor heads. Also drop the redundant rules.clone() in the iota IotaInfo build (rules is already owned from the try_get_const clone). H-12 (hardening): output-size cap (2^26 bits) on natively computed shiftLeft/pow results — a ~12-byte adversarial term could otherwise demand a multi-GiB allocation (cycle explosion in-guest). Oversized results stay stuck, like the existing pow exponent guard. Ported from jcb/fixes 3763356 (pre-workspace layout), adapted to kernel-riscv. Guest cycles (ziskemu, vs previous commit): mergesort shard 2,399,808,204 -> 2,313,445,541 (-3.6%) Vector.extract_..._1 5,111,066,294 -> 4,949,833,244 (-3.2%) Nat.add_comm 12,217,888 -> 11,798,013 (-3.4%) Int16/32/64 rxc ~42.2M -> ~41.3M (-2.1%) binsearch 63,115,445 -> 62,640,316 (-0.75%) small inputs ~0. 604 kernel tests pass.

Closes the adversarial-binding review of the uid-interning design: - ixon: blob bytes are blake3-verified against their address at load in all three env deserializers (ports jcb/fixes H-1). This is the one place adversarial CONTENT enters kernel identity: ExprKey::Nat/Str and literal structural equality key on the blob Address (mirroring the old content hash), which is sound only if address binds bytes. Costs +0.06..+0.29% guest cycles (blobs are a small fraction of env bytes; constants stay lazily verified). - kernel: fresh_uid aborts on counter exhaustion instead of wrapping (>2^67 guest cycles to reach; identity soundness should not rest on "probably won't happen"). - kernel: literal structural-equality arms compare values as well as blob addresses (defense in depth — redundant under the load-time check, degrades to inequality if that invariant were ever violated). - kernel: intern hits assert structural equality in debug builds. - docs/kernel_identity.md: adversarial-model section — why uid identity cannot be cache-poisoned (uids are assigned by a sequential counter, never computed from content, so there is no hash function to collide; intern keys are exact structural Eq, never digests; cache hits require uid equality; map-bucket collisions cost a key compare, not a wrong hit). 604 kernel + 200 ixon tests pass.

…d story The *_mdata_with_addr constructors were the only API through which a caller-supplied (hence potentially duplicated) uid could enter a node; nothing outside expr.rs uses them since the ingress key-first rework, so make them private — uid uniqueness per process is now enforced by visibility, not convention. Docs: spell out the within-shard (impossible by assignment) vs between-shards (certain and harmless; uids never cross the process boundary) collision analysis.

Krivine-style machine for the structural WHNF fragment, ported from the proven IxVM implementation (ix-aiur d0d3375 Phase A hybrid entry, b4c4ea3 Phase B closure-iota; measured there Int16 -33.8%, Vector -17.1%, mergeSort -6.8%). Covers the Rust-specific mapping the IxVM side doesn't have: cheap_proj/cheap_rec threading, fuel retention (adversarial posture), ambient let/LDecl zeta via the exit-to-outer- loop contract, iterative loop for stack discipline, clo_subst readback with lbr guards + per-call scratch, unchanged cache/uid semantics. Phased A (beta/zeta) then B (closure-iota, the UTF-8-class payoff), with the known curried-sharing regression class added to the bench suite before implementation.

…beta path) Port of the IxVM environment machine (~/ix-aiur d0d3375; design in docs/env_machine_whnf.md). whnf_core's App arm enters a Krivine-style machine when a beta fires: beta/zeta become O(1) environment pushes onto a persistent cons-list of closures, and substitution materializes only at machine exits (clo_subst readback) — work proportional to what the reduction consumes, not to every beta performed. A beta chain ending in another beta never materializes its intermediate bodies. - subst.rs: Clo/MEnv closure types (carried env length; per-Clo OnceCell readback memo standing in for Aiur's global memoization), clo_subst/clo_readback with lbr guards, per-call (uid, depth) memo from a new InternTable scratch pool, results interned like subst. - whnf.rs: iterative machine_whnf loop + machine_exit, replacing the multi-lambda peel + simul_subst block. Fuel charged on beta/zeta at the eager path's one-substitution-event granularity. Every exit reads back and re-enters the outer loop, so ambient let/LDecl zeta, iota, prim dispatch, and cheap_proj/cheap_rec def-eq discipline are unchanged; all caches and keys untouched. - bench-cycles.sh: add natreclinear/natfoldsucc (foldAdd-class sentinels for the known curried-sharing regression class). 609 kernel + 200 ixon tests pass; native check_one true on Nat.add_comm, Nat.gcd_comm, Int16.instRxcHasSize_eq, natRecLinearCheck, Nat.fold_succ; perf_check whole-env failure parity with HEAD on 11 envs. Guest cycle benchmarks pending (box busy).

Port of IxVM try_iota_c (~/ix-aiur 097e9d5). The machine's Const exit no longer reads its spine back when the head is a recursor on the main ctor-rule path: only the major premise materializes (readback + the same full whnf the eager path uses), and on a hit the rule RHS re-enters the machine with params/motives/minors and post-major arguments as their ORIGINAL closures plus the ctor fields wrapped closed. The rule's Lam-chain betas push them straight into an environment, so unselected minors — dropped match/Decidable branches, the UTF-8 codec class — are never substituted and never read back. Off-main-path cases miss to the plain readback exit by construction, preserving semantics exactly: K recursors (ctor synthesis needs infer + def-eq), Nat literal majors (transient-work cache discipline and the linear-rec/offset shortcuts stay on the plain path), Str literal majors, struct-eta candidates, and stuck majors. cheap_rec mode skips machine-iota entirely, mirroring the eager gating. A miss costs one readback of the major; the plain path's whnf of it then hits a warm cache. Machine-native delta (IxVM Phase C1.5) does NOT port at this layer: their machine lives in a whnf that includes delta, while ours sits inside whnf_core, which must stay delta-free for def-eq's lazy-delta unfold ordering. Spanning the whnf delta loop with closures is a separate, larger restructure. 610 kernel + 200 ixon tests pass (new closure-iota-via-beta unit test covers the spine arithmetic incl. post-major shift); native check_one true on the usual five constants; perf_check whole-env failure parity on 11 envs. Guest cycle benchmarks pending alongside Phase A's.

map_or_else over map().unwrap_or_else(), pass ExprKey by reference in timed_intern_or_build, explicit clone for spec_params dedup snapshot.

samuelburnham added 30 commits June 12, 2026 15:46

refactor: Convert to Cargo workspace with subcrates

d8af326

Move Rust sources from src/ into per-crate trees under crates/ (aiur, common, compile, ffi, ixon, kernel) and convert the root Cargo.toml into a workspace manifest with shared dependencies, lints, and profiles.

feat: Add SP1 and Zisk kernel proving backends

54b19bc

Fix CI Fix licenses Updates Fixups

fix: Import quickcheck_macros::quickcheck in ixon merkle tests

880d5de

The `#[quickcheck]` attribute macro lives in `quickcheck_macros`, not `quickcheck`. Without this import the ixon test target fails to compile.

Updates

2d45433

fix: Add NatSuccMode::Stuck cache to prove `ByteArray.utf8DecodeChar?…

735d07c

…_utf8EncodeChar_append`

chore: apply cargo fmt

bf449db

chore: fix clippy warnings in shard examples

017b77c

bench: add Int32/Int64 instRxcHasSize_eq to the cycle suite

2ed9b04

Both complete at ~42M steps, matching Int16 — the nat-offset compaction makes the cost range-independent (pre-fix, Int16's 2^16 range alone was 5.7B steps and Int32/Int64 would not have completed).

kernel: fix PrimFamily visibility/qualification warnings

1781268

PrimFamily is referenced by the public KEnv::prim_family_cache field, so it must be pub; drop a redundant path qualification.

samuelburnham added 5 commits June 12, 2026 16:32

docs: env-machine WHNF design is implemented (Phases A+B)

4e893e6

chore: apply cargo fmt

96cd080

chore: fix clippy warnings (workspace, all targets)

4d2452d

map_or_else over map().unwrap_or_else(), pass ExprKey by reference in timed_intern_or_build, explicit clone for spec_params dedup snapshot.

samuelburnham force-pushed the sb/kernel-perf branch from fbf4c7a to 4d2452d Compare June 12, 2026 16:41

samuelburnham force-pushed the kernel-riscv branch 3 times, most recently from 1076c91 to 73be557 Compare June 25, 2026 15:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf(rust-kernel): Intern-assigned uids, symbolic Nat offsets, environment-machine WHNF reducer#442

perf(rust-kernel): Intern-assigned uids, symbolic Nat offsets, environment-machine WHNF reducer#442
samuelburnham wants to merge 35 commits into
kernel-riscvfrom
sb/kernel-perf

samuelburnham commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

samuelburnham commented Jun 12, 2026

Key Results

Fifteen Commits (oldest first)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant