Skip to content

perf(rust-kernel): Intern-assigned uids, symbolic Nat offsets, environment-machine WHNF reducer#442

Open
samuelburnham wants to merge 35 commits into
kernel-riscvfrom
sb/kernel-perf
Open

perf(rust-kernel): Intern-assigned uids, symbolic Nat offsets, environment-machine WHNF reducer#442
samuelburnham wants to merge 35 commits into
kernel-riscvfrom
sb/kernel-perf

Conversation

@samuelburnham

Copy link
Copy Markdown
Member

This pull request implements Rust-kernel and ixon optimizations for the Zisk guest, measured against the base #411 as deterministic guest cycle counts (ziskemu -m) over dumped shard inputs (zisk/scripts/bench-cycles.sh), with every counted run's committed failures publics word verified zero.

Key Results

  • Int16.instRxcHasSize_eq: 5.70 B → 42.0 M guest steps (135.6×); Int32/Int64 variants confirmed at the same magnitude (~42 M) — in-circuit cost is now integer-width-independent
  • Cycle suite total: 17.53 B → 5.77 B steps (3.04×, −67.1%)
  • Environment-machine WHNF (final two kernel commits): −19.3% suite net; Vector.extract_append._proof_1 −32.4%; the Init Array/Vector Extract proof family −22% to −33% each (five multi-billion-step constants)
  • Std.Tactic.BVDecide.BVExpr.bitblast.goCache_Inv_of_Inv._mutual (28.35 B steps, the most expensive constant measured in Init+Std): previously crashed the 512 MB guest (OOM-class null read) → now completes as a valid kernel pass — lazy closures cut peak RAM, not just cycles
  • Kernel hardened against adversarial inputs: term identity moved off per-node content hashing with the threat model documented (docs/kernel_identity.md), blob bytes verified at load in all deserializers
input base (#411) this PR Δ speedup
_private.Init.Data.Range.Polymorphic.SInt.0.Int16.instRxcHasSize_eq 5,697,166,200 42,012,168 −99.3% 135.6×
_private.Init.Data.Vector.Extract.0.Vector.extract_append._proof_1 7,690,887,211 3,344,459,375 −56.5% 2.3×
List.mergesort synthetic env 4,005,320,347 2,290,310,593 −42.8% 1.7×
Nat.add_comm 18,690,815 11,532,048 −38.3% 1.6×
Array.binSearchAux._unary 95,800,411 62,862,965 −34.4% 1.5×
Nat.gcd_comm 6,669,821 5,550,752 −16.8% 1.2×
Batteries.RBMap 13,940,412 12,321,092 −11.6% 1.1×
String.append 4,560,304 4,070,450 −10.7% 1.1×
total 17,533,035,521 5,773,119,443 −67.1% 3.04×

Fifteen Commits (oldest first)

  1. bench: guest cycle harness + native single-constant check

    • bench-cycles.sh suite over dumped guest stdins with failures-word validation; check_one example mirrors the guest's reuse-mode check natively
  2. kernel: keep symbolic Nat offsets compact (whnf stuck + offset def-eq)

    • Nat.add x lit / Nat.div|mod x k stay stuck as compact offsets instead of materializing succ^n(x) chains or the division algorithm; bulk offset def-eq decides offset pairs directly
    • The Int16.instRxcHasSize_eq collapse (5.70 B → 56 M) and the UTF-8-codec-class fix
  3. kernel: intern-assigned uids replace per-node blake3 content hashing

    • Term identity becomes intern-table-assigned sequential u64s; shallow structural keys for hash-consing; blake3 stays only at the Ixon boundary (content addresses, Merkle roots, proof-carrying-code claims)
    • The broadest single win: −30–36% on every reduction-heavy input
  4. kernel: memoized prim-family dispatch in the WHNF/def-eq reduction loops

    • Constant heads classified once per address (Native/BitVec/Nat/Decidable/Str); at most one family recognizer runs per iteration instead of the five-probe gauntlet
  5. ixon: defer per-constant address verification to first materialization

    • Constants verify on first get(); blob bytes still verified eagerly at load
  6. bench: add Int32/Int64 instRxcHasSize_eq to the cycle suite

  7. docs: kernel uid identity vs Ixon content addressing

    • Two-layer identity model and why uid collisions are not an attack surface (uids are assigned, never computed from input)
  8. kernel: fix PrimFamily visibility/qualification warnings

  9. kernel: port jcb/fixes H-15 whnf probe pre-filter + H-12 nat output caps

    • Allocation-free transient-nat spine probe ahead of the cache lookup; caps on Nat primitive output sizes
  10. kernel+ixon: harden term identity against adversarial inputs

    • Blob bytes verified against their address at load in all three deserializers; structural equality audited against the adversarial model
  11. kernel: privatize uid-accepting constructors; document cross-shard uid story

  12. docs: design for the environment-machine WHNF port

    • Full design for the Krivine-style machine (ported from the proven IxVM implementation), including the exit-to-outer-loop contract and the known regression class
  13. kernel: environment-machine WHNF (Phase A — lazy substitution on the beta path)

    • whnf_core's App arm enters a closure machine when a beta fires: beta/zeta are O(1) environment pushes; substitution materializes only at machine exits (clo_subst readback), so work is proportional to what the reduction consumes
    • Every exit re-enters the existing loop: ambient zeta, iota, prim dispatch, cheap-mode def-eq flags, and all caches stay byte-identical
  14. kernel: closure-iota at the machine's recursor exit (Phase B)

    • Recursor spines consumed lazily: only the major premise materializes; params/motives/minors and post-major args ride through as closures, so unselected minors (dropped match/Decidable branches) are never substituted and never read back
    • K recursors, literal majors, and struct-eta deliberately miss to the unchanged plain path, preserving the Nat transient-work and linear-rec/offset shortcuts exactly
  15. docs: env-machine WHNF design is implemented (Phases A+B)

Move Rust sources from src/ into per-crate trees under crates/
(aiur, common, compile, ffi, ixon, kernel) and convert the root
Cargo.toml into a workspace manifest with shared dependencies, lints,
and profiles.
Fix CI

Fix licenses

Updates

Fixups
The `#[quickcheck]` attribute macro lives in `quickcheck_macros`, not
`quickcheck`. Without this import the ixon test target fails to
compile.
Re-home jcb's monolithic `crate::ix::{profile,shard}` into `crates/kernel`
(ProfileSink must be reachable from KEnv) and rewrite imports to the
workspace convention (ix_common::address, ixon::merkle, crate::{profile,shard},
ix_kernel::{profile,shard} from the FFI). No logic changes to the ported code.

ix-kernel builds; all 12 shard partitioner tests + the profile-sink
delta-edge test pass. (Pre-existing, unrelated: ix-ffi fails to build on
kernel-riscv too — private IOBuffer fields in aiur/protocol.rs from the
workspace split.)
…olding

Replace the static range/byte partition heuristics with the offline
profiler/partitioner plan as the primary sharding path:

* --shard-plan <manifest.ixes>: prove a `.ixes` partition (from `ix profile`
  + `ix shard`, i.e. real per-block heartbeats balanced + cross-shard
  delta-ingress minimized). Each manifest shard → one closure-injected leaf
  proof; leaves tree-fold into one aggregate whose committed subject is bound
  to the env. Maps manifest block addresses to work items by ingress-block
  (a Muts work item's `primary` is a member projection, not the block addr),
  validates full coverage, and composes with --execute and --only-shard.
* New `ix-kernel` example `shard_plan`: profile a .ixe + partition → .ixes,
  out of circuit, without the Lean/FFI layer (handy until the pre-existing
  ix-ffi build break is fixed). Validated end-to-end in --execute on
  nataddcomm: 34 work items → 4 balanced shards (heartbeats 78–120), full
  coverage, 0 failures.

Removed as superseded by the planner:
* --topo + topo_order_items (the graph cut subsumes within-env scheduling)
* --closure-stats / closure_report (manifest reports exact cross_ingress)
* --overlap-stats / overlap_report (Design-C measurement scaffolding)

The host `shard` partitioner uses rayon + std::time, so the module is gated
to non-riscv targets (it runs in the host planner, never the guest); `profile`
stays available everywhere since KEnv holds a ProfileSink (always None on the
guest). --shard-consts/--shard-bytes remain as no-profile fallbacks.
Add `partition_for_cycle_cap(profile, max_cycles, cycles_per_heartbeat, eps)`:
pick the smallest N whose heaviest shard fits a Zisk guest-cycle budget, then
partition. Converts the cap to a heartbeat cap (cycles ÷ cycles/hb), estimates
N from total load, and grows N proportionally to any overshoot — re-partitioning
each round (cheap) to converge on the minimal feasible N. Detects the
atomic-block floor: when the largest mutual block alone exceeds the cap, flags
`infeasible_atomic_floor` (no N can fit it; split upstream or raise the cap).

The cap is the leaf prover's trace size, which sets peak prover RAM; doc gives
the conversion from a host-RAM budget (peak_GB ≈ 45 + 32·cycles_B, measured on
this prover) so users think in RAM, pass cycles, and get N.

Wire it into the `shard_plan` example: `--max-cycles C` (alternative to
`--shards N`), `--cycles-per-heartbeat K` (default 208k, mergesort-calibrated;
recalibrate per env via one --execute). Verified on mergesort: 4.5e9 → N=2
(heaviest 3.9B cyc), 2e9 → N=5 (1.76B cyc); largest atomic block 2553 hb.
…→215k)

Measured whole-env Zisk cycles ÷ profiler heartbeats across 12 envs:
- large shardable envs (>20k hb) cluster 194-208k whole-env
- per-shard ratio runs ~5-8% higher (a shard memoizes less than the whole env);
  mergesort's heaviest shard is ~216k
- tiny arithmetic envs ~130-160k; heavy-def-eq envs (array/string-assoc) ~258k

The budget needs the per-shard heaviest ratio (so it never under-predicts a
leaf's cycles → never OOMs). Set the default to 215k (matches the measured
mergesort heaviest shard; predicts 4.03B vs 4.042B actual). Genuinely
workload-dependent, so the doc + --cycles-per-heartbeat recalibration guidance
stand; the cap's RAM headroom absorbs the residual. mergesort still → N=2.
Add cycle_cap_for_ram(ram_gb): invert the measured single-leaf prover model
(peak_GB ≈ 45 + 32·cycles_B) at ~78% of RAM to a per-shard cycle cap; 0 if the
box can't hold the ~45 GB prover base. Compose with the existing
partition_for_cycle_cap so RAM → cap → N from total heartbeats.

shard_plan example: default mode (no flags) reads /proc/meminfo and sizes N
automatically — the user just passes the .ixe. --shards / --max-cycles remain
as explicit overrides; --ram-gb for what-if. Verified: 250 GB → N=2 (mergesort),
64 GB → N=47 + atomic-floor warning (its biggest block won't fit), 40 GB refused.
The recursive min-cut partitioner now returns its bisection tree
(AggNode: leaves are shard ids, internal nodes are min-cut splits) via
partition_with_tree; rec_bisect builds it as it recurses, so the tree
mirrors the partition exactly. agg_plan lowers the binary tree to an
arity-bounded post-order fold plan (FoldOp::Leaf/Agg), collapsing whole
subtrees so each agg call folds up to `arity` children — agg-call count
stays ~N/(arity-1), same as a flat fold, while children remain
subtree-aligned and in shard order (keeping the subject merkle-fold
left-associative).

The tree is serialized as an optional trailing .ixes section: pre-tree
manifests decode to tree=None and provers fall back to a flat fold.
partition_for_cycle_cap and shard_esp populate it; the shard_plan
example writes it (omitted for --naive baselines).

Folding along this tree lets aggregation discharge each cross-shard
assumption at the lowest common ancestor of the shards that share it,
where the assumption sets are smallest.
The agg guest gains a mode word. Mode 0 is the legacy O(1) merkle_join
fold of opaque subject/assumption roots (used by the flat fallback,
range, and reuse paths — unchanged semantics). Mode 1 implements set
discharge: the host supplies each child's subject/assumption preimage
address lists as untrusted witness; the guest re-derives every list's
canonical merkle root and asserts it equals the child's committed root,
then commits canonical roots over union(subjects) and
union(assumptions) \ union(subjects). Assumptions proven by any sibling
are discharged in-circuit at the lowest level of the tree where
producer and consumer meet; when the whole env folds, the final
assumptions root collapses to the zero sentinel and the aggregate claim
is unconditional. Mode-1 outputs are canonical set commitments, so
agg-of-agg witness verification is uniform at every level, and the mode
is committed in field_b.

The shard-plan host path folds along the manifest's bisection tree
(pruned to the shards actually proven), executes the arity-collapsed
plan, supplies the discharge witness per agg call, mirrors the
union/discharge host-side, binds the root subject to the canonical root
over the covered union, and binds the root assumptions to the mirror —
reporting closure ("unconditional") or the outstanding count. Each leaf
additionally checks the host's assumption mirror against the guest's
committed root at prove time, so a rule divergence fails fast rather
than as a witness panic deep in the agg.

Measured on mergesort (8 shards, identical leaves, warm GPU): discharge
adds 2.5s to the single arity-16 agg call (19.0s -> 21.5s, +0.9% total
run) and resolves all 2439 cross-shard assumptions to zero.

Also: the Assembly executor is now the default (--emulator opts out).
It is ~4-6x faster at trace generation and a prerequisite for the hints
stream. The historical multi-program breakage (second setup() clobbers
the first program's ASM ROM histogram -> rom.rs:178 panic) is fixed by
ordering: the agg program is set up only after the leaf loop, never
before; the Emulator keeps the upfront setup. Validated end to end on
single-leaf, balanced (4-leaf), unbalanced mixed-vk (3-leaf), and
8-shard runs.
Simplifications, no behavior change (validated: the 3-shard discharge
prove reproduces the same leaf steps, discharge counts, and final
subject root as before):

- Drop the flat mode-0 fallback fold from run_shard_plan. It was
  reachable only via manifests written before the tree section existed,
  which regenerate in seconds; aggregation now requires the bisection
  tree and says so. One fold path instead of two.
- Move tree pruning into the kernel as AggNode::prune (where the type
  lives), with the unit test it was missing.
- Drop the leaf_subjects vector (redundant with leaf_subject_sets) and
  the duplicate cover->Address conversion in the leaf loop.
- Extract discharge() — the host mirror of the agg guest's
  union/resolve — out of the fold loop.
- Name build_inputs' return triple (LeafInputs) and tighten
  build_plan's frontier expansion.
- Zero compiler and clippy warnings across ix-kernel and zisk-host
  (was 12): unused mut, documented dead-code allow on the publics
  layout field, redundant to_string in format args, indexed loop,
  then -> then_some.

Audited and kept: every CLI flag is read; subject_of_cover,
fold_subjects, and tree_partition remain in use by the only-const,
range, and reuse paths; the agg guest's two-mode structure is already
minimal.
…euse path

--store-dir now composes with --shard-plan (and requires it). A manifest
shard whose certified targets are all covered by stored proofs is not
re-proven: the covering stored proofs are folded into the aggregate as
extra children, where the agg guest verifies them (vk-pinned,
witness-root-checked) and discharges assumptions against their subjects
— reuse by verification, never by trust. Partially-covered shards are
re-proven whole (the shard is the proof unit). Freshly proven shards are
banked per-shard (proofs/<n>.{proof,cover,asm}) as they complete, so an
interrupted run resumes where it left off; .asm carries the assumption
preimages mode-1 discharge needs, and the proofs directory is the
store's single source of truth (proven.bin is gone).

The fold is one op-list executor: novel leaves fold along the pruned
bisection tree as before; stored proofs join as extra slots; a
synthesized tail fold reduces the forest to one verified root, with an
agg-of-1 when a lone stored proof would otherwise go unverified.
Mode-1 commitments are canonical sets, so fold shape never changes the
final claim — validated: fresh-store, full-reuse (0 leaves proven), and
mixed (novel tree + stored tail) runs all commit the identical subject
root and discharge to an unconditional claim.

The legacy standalone reuse path (mode-0 join fold, naive count/byte
chunking, proven.bin) is deleted, along with its private flags
(--closure-inject, --limit-items) and now-orphaned helpers
(fold_subjects, plan_chunks_by_closure, addr_bytes, load_store,
save_store). --require-closed derives the covered set from the proof
index. Net -392 lines; zero compiler/clippy warnings.
…reuse

The planner (shard_plan --store-dir) now diffs the env against the proof
store before profiling: work items whose targets are all store-covered
are skipped entirely — not typechecked (planning gets faster), not
partitioned, and their blocks excluded from the hypergraph (a
novel→covered delta edge is an assumption discharged at aggregation,
not a cut to minimize). The manifest covers only novel work.

The host accepts such manifests: grouping requires every work item to
be either manifest-assigned or store-covered, and the covering-proof
selection generalizes from "reused manifest shards" to "every needed
target the novel shards won't certify" — subsuming reused shards and
unmanifested covered work. A fixpoint additionally folds stored proofs
that cover the assumptions of already-included stored proofs, so a
closed store yields an unconditional claim even cross-env.

This fixes the cross-env conservatism where the min-cut objective pulled
already-proven library constants into novel shards (clustering
dependents with dependencies) and one novel constant forced a whole
mixed shard to re-prove. Validated: with nataddcomm proven into a
store, planning natmulcomm skips 34 of 47 work items, proves 2 novel
shards (13 items), folds the 3 stored proofs (verified in-circuit,
assumption-closure fixpoint), and discharges to an unconditional claim
— 72% of the work reused across envs. Same-env full reuse regression
unchanged (identical subject root).
The open-assumption filter scanned stored×subjects per assumption per
round — O(assumptions × proofs × subjects), seconds-to-minutes of host
CPU at init-store scale (~206 proofs × 51k subjects). One HashSet of
included-proof subjects per round makes it linear and reads better.
Re-validated: full-reuse fold of the cross-env store commits the
identical subject root, outstanding 0, unconditional.
zisk/scripts/bench-cycles.sh runs the standard dumped-input suite
(~/benchdata/zisk-inputs) through ziskemu, records steps, and verifies
the committed `failures` publics word is zero. check_one mirrors the
Zisk guest's one-work-item path natively so IX_* instrumentation can
target a single expensive constant.
Port of IxVM dbc4177/5dcab7f/f7cfe23 to the Rust kernel, three
coordinated changes:

- whnf (try_nat_offset_stuck): leave `Nat.add base (Lit n)` (symbolic
  base, n>0) and `Nat.div/mod base (Lit k)` (k>=2) STUCK in compact form
  instead of delta-unfolding into succ^n towers / the division
  algorithm. Iota over such majors already works via
  cleanup_nat_offset_major.
- linear-rec: collapse `succ^o(Nat.rec base succ-step (Lit n))` with a
  SYMBOLIC base to `Nat.add base (Lit n+o)` instead of declining into n
  iota steps.
- def-eq (try_def_eq_offset): decompose both sides to base + offset
  (literals, succ layers, add-lit read in O(1)) and strip the shared
  offset in ONE step. The previous one-succ-peel recursed through full
  is_def_eq per layer, which blew MAX_DEF_EQ_DEPTH=2000 on deep offsets.

Guest cycles (ziskemu, vs d2ea3d9 baseline):
  Int16.instRxcHasSize_eq   5,697,166,200 -> 56,028,269  (-99.0%)
  binsearch                    95,800,411 -> 87,735,518  (-8.4%)
  Nat.add_comm                 18,690,815 -> 17,805,210  (-4.7%)
  mergesort shard           4,005,320,347 -> 3,920,521,476 (-2.1%)
  Vector.extract_append._proof_1 unchanged (+0.02%), rest ~0.

604 kernel tests pass (one test updated: whnf of `x + 2` now asserts
the compact stuck form; new test covers tower-vs-compact def-eq).
Every KExpr/KUniv construction used to compute a blake3 hash of its
content for identity (intern key, cache keys, equality) — ~20% of guest
cycles on reduction-heavy constants (app_hash 22-33% cumulative).
Identity is now an intern-assigned u64 uid:

- uids come from a process-global atomic counter and are NEVER reused,
  so uid equality implies structural equality and cache keys cannot
  alias across intern-table clears (a stale key can only miss).
- InternTable keys on shallow structural keys (ExprKey/UnivKey): variant
  tag + child uids + semantic payload, mirroring exactly what the old
  content hash covered (display names, binder info, mdata excluded).
  intern_expr/intern_univ canonicalize children recursively when handed
  nodes built outside the table, preserving the old "structural identity"
  interning semantics.
- PartialEq for KExpr/KUniv is now structural with the uid fast path
  (canonical-vs-canonical compares O(1)); hash_eq remains the
  fast-but-incomplete uid check for cache/quick-path callers where a
  false negative only costs a fallback.
- Cache ctx components stay blake3 (CtxAddr) but hash uids instead of
  content addrs; push_local/push_let intern their frame types so equal
  context suffixes share cache entries.
- ingress's hash-first intern probe becomes key-first (same skip-the-
  construction win, no hash).

Soundness note: affirmative identity (equal uid => equal term) is what
def-eq quick paths and caches rely on, and it holds by construction;
incomplete negatives only cost cache misses. Nat/Str literals keep their
blake3 blob Addresses (ingress-meaningful).

Guest cycles (ziskemu, vs previous commit):
  mergesort shard      3,920,521,476 -> 2,526,226,208  (-35.6%)
  Vector.extract_..._1 7,692,517,694 -> 5,369,879,578  (-30.2%)
  Nat.add_comm            17,805,210 ->    12,687,822  (-28.7%)
  binsearch                87,735,518 ->    64,562,040  (-26.4%)
  Int16.instRxc..._eq      56,028,269 ->    45,024,529  (-19.6%)
  natgcdcomm/stringappend/rbmap: -11.2% / -6.9% / -2.7%

604 kernel tests pass (16 hash-determinism tests updated to assert the
preserved property — identity ignores names/bi/mdata — via structural
equality, shallow-key equality, or shared-table interning).
The WHNF loops (full + no-delta) and the def-eq lazy-delta loop probed
all five primitive recognizers (native/bitvec/nat/decidable/string)
every iteration; each collects its own app spine and runs its own
gauntlet of 32-byte address compares — pure tax on ordinary
constant-headed terms. Classify the head once per iteration via an
allocation-free app-chain walk + per-address memo
(KEnv::prim_family_cache), and call at most the one matching family
reducer. Family head-address sets are disjoint, so dispatch order
semantics are unchanged. The nat-offset-stuck probe gates on the Nat
family too. Mirrors the IxVM prim_family dispatch memo (ix-aiur,
measured there as part of -28% on Vector._proof_2).

Guest cycles (ziskemu, vs previous commit):
  mergesort shard      2,526,226,208 -> 2,399,597,758  (-5.0%)
  Vector.extract_..._1 5,369,879,578 -> 5,112,313,900  (-4.8%)
  Nat.add_comm            12,687,822 ->    12,249,578  (-3.5%)
  Int16.instRxc..._eq      45,024,529 ->    43,835,207  (-2.6%)
  binsearch                64,562,040 ->    64,095,836  (-0.7%)
  others ~0 (+25 steps cache-population noise on two inputs)

604 kernel tests pass.
Env::get_anon paid one blake3 content-hash per constant at parse time.
The binding check moves into LazyConstant::get() (pending_addr field):
constants shipped in a closure but never forced by the typechecker are
never hashed, while everything the kernel certifies is still verified
on materialization — checking a constant forces it. The compile-side
from_constant path (pre-populated cache) is exempt as before.

Guest cycles: rbmap -9.5%, natgcdcomm -6.4%, stringappend -4.2%,
Int16 -3.7%, binsearch -1.6%, nataddcomm -0.3%; mergesort/vector
+0.01% (re-verification of the few constants materialized more than
once). 200 ixon + 604 kernel tests pass (tampering test updated to
assert rejection at get() instead of at parse).
Both complete at ~42M steps, matching Int16 — the nat-offset compaction
makes the cost range-independent (pre-fix, Int16's 2^16 range alone was
5.7B steps and Int32/Int64 would not have completed).
Records the two-identity-layer boundary after 1e3029d: constants/blobs
keep their blake3 Ixon Addresses (claims, Merkle roots, proof store,
cross-run reuse, and per-constant inclusion proofs are all unchanged),
while the removed blake3 hashing was the kernel-internal per-node
scheme used only for intern/cache identity inside one checker run —
never serialized, never part of a proof artifact. Documents the
soundness argument (equal uid => equal term, never reused), the
boundary rule (uids must not escape into serialized artifacts), and
what a future sub-constant commitment feature would need to recompute
at the egress boundary.
PrimFamily is referenced by the public KEnv::prim_family_cache field, so
it must be pub; drop a redundant path qualification.
H-15 (perf): the transient-nat probes run on every whnf/whnf_core call
BEFORE the cache lookup; they paid two collect_app_spine Vec allocations
plus a KConst::Recr clone per call on the overwhelmingly common
non-Nat-recursor path. Pre-filter with an allocation-free
spine_head_and_len; spine collection + Recr lookup now happen only for
actual Nat-recursor heads. Also drop the redundant rules.clone() in the
iota IotaInfo build (rules is already owned from the try_get_const
clone).

H-12 (hardening): output-size cap (2^26 bits) on natively computed
shiftLeft/pow results — a ~12-byte adversarial term could otherwise
demand a multi-GiB allocation (cycle explosion in-guest). Oversized
results stay stuck, like the existing pow exponent guard.

Ported from jcb/fixes 3763356 (pre-workspace layout), adapted to
kernel-riscv.

Guest cycles (ziskemu, vs previous commit):
  mergesort shard      2,399,808,204 -> 2,313,445,541  (-3.6%)
  Vector.extract_..._1 5,111,066,294 -> 4,949,833,244  (-3.2%)
  Nat.add_comm            12,217,888 ->    11,798,013  (-3.4%)
  Int16/32/64 rxc          ~42.2M    ->    ~41.3M      (-2.1%)
  binsearch                63,115,445 ->    62,640,316  (-0.75%)
  small inputs ~0. 604 kernel tests pass.
Closes the adversarial-binding review of the uid-interning design:

- ixon: blob bytes are blake3-verified against their address at load in
  all three env deserializers (ports jcb/fixes H-1). This is the one
  place adversarial CONTENT enters kernel identity: ExprKey::Nat/Str and
  literal structural equality key on the blob Address (mirroring the old
  content hash), which is sound only if address binds bytes. Costs
  +0.06..+0.29% guest cycles (blobs are a small fraction of env bytes;
  constants stay lazily verified).
- kernel: fresh_uid aborts on counter exhaustion instead of wrapping
  (>2^67 guest cycles to reach; identity soundness should not rest on
  "probably won't happen").
- kernel: literal structural-equality arms compare values as well as
  blob addresses (defense in depth — redundant under the load-time
  check, degrades to inequality if that invariant were ever violated).
- kernel: intern hits assert structural equality in debug builds.
- docs/kernel_identity.md: adversarial-model section — why uid identity
  cannot be cache-poisoned (uids are assigned by a sequential counter,
  never computed from content, so there is no hash function to collide;
  intern keys are exact structural Eq, never digests; cache hits require
  uid equality; map-bucket collisions cost a key compare, not a wrong
  hit).

604 kernel + 200 ixon tests pass.
…d story

The *_mdata_with_addr constructors were the only API through which a
caller-supplied (hence potentially duplicated) uid could enter a node;
nothing outside expr.rs uses them since the ingress key-first rework, so
make them private — uid uniqueness per process is now enforced by
visibility, not convention. Docs: spell out the within-shard (impossible
by assignment) vs between-shards (certain and harmless; uids never cross
the process boundary) collision analysis.
Krivine-style machine for the structural WHNF fragment, ported from
the proven IxVM implementation (ix-aiur d0d3375 Phase A hybrid entry,
b4c4ea3 Phase B closure-iota; measured there Int16 -33.8%, Vector
-17.1%, mergeSort -6.8%). Covers the Rust-specific mapping the IxVM
side doesn't have: cheap_proj/cheap_rec threading, fuel retention
(adversarial posture), ambient let/LDecl zeta via the exit-to-outer-
loop contract, iterative loop for stack discipline, clo_subst readback
with lbr guards + per-call scratch, unchanged cache/uid semantics.
Phased A (beta/zeta) then B (closure-iota, the UTF-8-class payoff),
with the known curried-sharing regression class added to the bench
suite before implementation.
…beta path)

Port of the IxVM environment machine (~/ix-aiur d0d3375; design in
docs/env_machine_whnf.md). whnf_core's App arm enters a Krivine-style
machine when a beta fires: beta/zeta become O(1) environment pushes
onto a persistent cons-list of closures, and substitution materializes
only at machine exits (clo_subst readback) — work proportional to what
the reduction consumes, not to every beta performed. A beta chain
ending in another beta never materializes its intermediate bodies.

- subst.rs: Clo/MEnv closure types (carried env length; per-Clo
  OnceCell readback memo standing in for Aiur's global memoization),
  clo_subst/clo_readback with lbr guards, per-call (uid, depth) memo
  from a new InternTable scratch pool, results interned like subst.
- whnf.rs: iterative machine_whnf loop + machine_exit, replacing the
  multi-lambda peel + simul_subst block. Fuel charged on beta/zeta at
  the eager path's one-substitution-event granularity. Every exit
  reads back and re-enters the outer loop, so ambient let/LDecl zeta,
  iota, prim dispatch, and cheap_proj/cheap_rec def-eq discipline are
  unchanged; all caches and keys untouched.
- bench-cycles.sh: add natreclinear/natfoldsucc (foldAdd-class
  sentinels for the known curried-sharing regression class).

609 kernel + 200 ixon tests pass; native check_one true on
Nat.add_comm, Nat.gcd_comm, Int16.instRxcHasSize_eq, natRecLinearCheck,
Nat.fold_succ; perf_check whole-env failure parity with HEAD on 11
envs. Guest cycle benchmarks pending (box busy).
Port of IxVM try_iota_c (~/ix-aiur 097e9d5). The machine's Const exit
no longer reads its spine back when the head is a recursor on the main
ctor-rule path: only the major premise materializes (readback + the
same full whnf the eager path uses), and on a hit the rule RHS
re-enters the machine with params/motives/minors and post-major
arguments as their ORIGINAL closures plus the ctor fields wrapped
closed. The rule's Lam-chain betas push them straight into an
environment, so unselected minors — dropped match/Decidable branches,
the UTF-8 codec class — are never substituted and never read back.

Off-main-path cases miss to the plain readback exit by construction,
preserving semantics exactly: K recursors (ctor synthesis needs
infer + def-eq), Nat literal majors (transient-work cache discipline
and the linear-rec/offset shortcuts stay on the plain path), Str
literal majors, struct-eta candidates, and stuck majors. cheap_rec
mode skips machine-iota entirely, mirroring the eager gating. A miss
costs one readback of the major; the plain path's whnf of it then
hits a warm cache.

Machine-native delta (IxVM Phase C1.5) does NOT port at this layer:
their machine lives in a whnf that includes delta, while ours sits
inside whnf_core, which must stay delta-free for def-eq's lazy-delta
unfold ordering. Spanning the whnf delta loop with closures is a
separate, larger restructure.

610 kernel + 200 ixon tests pass (new closure-iota-via-beta unit test
covers the spine arithmetic incl. post-major shift); native check_one
true on the usual five constants; perf_check whole-env failure parity
on 11 envs. Guest cycle benchmarks pending alongside Phase A's.
map_or_else over map().unwrap_or_else(), pass ExprKey by reference in
timed_intern_or_build, explicit clone for spec_params dedup snapshot.
@samuelburnham samuelburnham force-pushed the kernel-riscv branch 3 times, most recently from 1076c91 to 73be557 Compare June 25, 2026 15:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant