perf(rust-kernel): Intern-assigned uids, symbolic Nat offsets, environment-machine WHNF reducer#442
Open
samuelburnham wants to merge 35 commits into
Open
perf(rust-kernel): Intern-assigned uids, symbolic Nat offsets, environment-machine WHNF reducer#442samuelburnham wants to merge 35 commits into
samuelburnham wants to merge 35 commits into
Conversation
Move Rust sources from src/ into per-crate trees under crates/ (aiur, common, compile, ffi, ixon, kernel) and convert the root Cargo.toml into a workspace manifest with shared dependencies, lints, and profiles.
Fix CI Fix licenses Updates Fixups
The `#[quickcheck]` attribute macro lives in `quickcheck_macros`, not `quickcheck`. Without this import the ixon test target fails to compile.
Re-home jcb's monolithic `crate::ix::{profile,shard}` into `crates/kernel`
(ProfileSink must be reachable from KEnv) and rewrite imports to the
workspace convention (ix_common::address, ixon::merkle, crate::{profile,shard},
ix_kernel::{profile,shard} from the FFI). No logic changes to the ported code.
ix-kernel builds; all 12 shard partitioner tests + the profile-sink
delta-edge test pass. (Pre-existing, unrelated: ix-ffi fails to build on
kernel-riscv too — private IOBuffer fields in aiur/protocol.rs from the
workspace split.)
…olding Replace the static range/byte partition heuristics with the offline profiler/partitioner plan as the primary sharding path: * --shard-plan <manifest.ixes>: prove a `.ixes` partition (from `ix profile` + `ix shard`, i.e. real per-block heartbeats balanced + cross-shard delta-ingress minimized). Each manifest shard → one closure-injected leaf proof; leaves tree-fold into one aggregate whose committed subject is bound to the env. Maps manifest block addresses to work items by ingress-block (a Muts work item's `primary` is a member projection, not the block addr), validates full coverage, and composes with --execute and --only-shard. * New `ix-kernel` example `shard_plan`: profile a .ixe + partition → .ixes, out of circuit, without the Lean/FFI layer (handy until the pre-existing ix-ffi build break is fixed). Validated end-to-end in --execute on nataddcomm: 34 work items → 4 balanced shards (heartbeats 78–120), full coverage, 0 failures. Removed as superseded by the planner: * --topo + topo_order_items (the graph cut subsumes within-env scheduling) * --closure-stats / closure_report (manifest reports exact cross_ingress) * --overlap-stats / overlap_report (Design-C measurement scaffolding) The host `shard` partitioner uses rayon + std::time, so the module is gated to non-riscv targets (it runs in the host planner, never the guest); `profile` stays available everywhere since KEnv holds a ProfileSink (always None on the guest). --shard-consts/--shard-bytes remain as no-profile fallbacks.
Add `partition_for_cycle_cap(profile, max_cycles, cycles_per_heartbeat, eps)`: pick the smallest N whose heaviest shard fits a Zisk guest-cycle budget, then partition. Converts the cap to a heartbeat cap (cycles ÷ cycles/hb), estimates N from total load, and grows N proportionally to any overshoot — re-partitioning each round (cheap) to converge on the minimal feasible N. Detects the atomic-block floor: when the largest mutual block alone exceeds the cap, flags `infeasible_atomic_floor` (no N can fit it; split upstream or raise the cap). The cap is the leaf prover's trace size, which sets peak prover RAM; doc gives the conversion from a host-RAM budget (peak_GB ≈ 45 + 32·cycles_B, measured on this prover) so users think in RAM, pass cycles, and get N. Wire it into the `shard_plan` example: `--max-cycles C` (alternative to `--shards N`), `--cycles-per-heartbeat K` (default 208k, mergesort-calibrated; recalibrate per env via one --execute). Verified on mergesort: 4.5e9 → N=2 (heaviest 3.9B cyc), 2e9 → N=5 (1.76B cyc); largest atomic block 2553 hb.
…→215k) Measured whole-env Zisk cycles ÷ profiler heartbeats across 12 envs: - large shardable envs (>20k hb) cluster 194-208k whole-env - per-shard ratio runs ~5-8% higher (a shard memoizes less than the whole env); mergesort's heaviest shard is ~216k - tiny arithmetic envs ~130-160k; heavy-def-eq envs (array/string-assoc) ~258k The budget needs the per-shard heaviest ratio (so it never under-predicts a leaf's cycles → never OOMs). Set the default to 215k (matches the measured mergesort heaviest shard; predicts 4.03B vs 4.042B actual). Genuinely workload-dependent, so the doc + --cycles-per-heartbeat recalibration guidance stand; the cap's RAM headroom absorbs the residual. mergesort still → N=2.
Add cycle_cap_for_ram(ram_gb): invert the measured single-leaf prover model (peak_GB ≈ 45 + 32·cycles_B) at ~78% of RAM to a per-shard cycle cap; 0 if the box can't hold the ~45 GB prover base. Compose with the existing partition_for_cycle_cap so RAM → cap → N from total heartbeats. shard_plan example: default mode (no flags) reads /proc/meminfo and sizes N automatically — the user just passes the .ixe. --shards / --max-cycles remain as explicit overrides; --ram-gb for what-if. Verified: 250 GB → N=2 (mergesort), 64 GB → N=47 + atomic-floor warning (its biggest block won't fit), 40 GB refused.
The recursive min-cut partitioner now returns its bisection tree (AggNode: leaves are shard ids, internal nodes are min-cut splits) via partition_with_tree; rec_bisect builds it as it recurses, so the tree mirrors the partition exactly. agg_plan lowers the binary tree to an arity-bounded post-order fold plan (FoldOp::Leaf/Agg), collapsing whole subtrees so each agg call folds up to `arity` children — agg-call count stays ~N/(arity-1), same as a flat fold, while children remain subtree-aligned and in shard order (keeping the subject merkle-fold left-associative). The tree is serialized as an optional trailing .ixes section: pre-tree manifests decode to tree=None and provers fall back to a flat fold. partition_for_cycle_cap and shard_esp populate it; the shard_plan example writes it (omitted for --naive baselines). Folding along this tree lets aggregation discharge each cross-shard assumption at the lowest common ancestor of the shards that share it, where the assumption sets are smallest.
The agg guest gains a mode word. Mode 0 is the legacy O(1) merkle_join
fold of opaque subject/assumption roots (used by the flat fallback,
range, and reuse paths — unchanged semantics). Mode 1 implements set
discharge: the host supplies each child's subject/assumption preimage
address lists as untrusted witness; the guest re-derives every list's
canonical merkle root and asserts it equals the child's committed root,
then commits canonical roots over union(subjects) and
union(assumptions) \ union(subjects). Assumptions proven by any sibling
are discharged in-circuit at the lowest level of the tree where
producer and consumer meet; when the whole env folds, the final
assumptions root collapses to the zero sentinel and the aggregate claim
is unconditional. Mode-1 outputs are canonical set commitments, so
agg-of-agg witness verification is uniform at every level, and the mode
is committed in field_b.
The shard-plan host path folds along the manifest's bisection tree
(pruned to the shards actually proven), executes the arity-collapsed
plan, supplies the discharge witness per agg call, mirrors the
union/discharge host-side, binds the root subject to the canonical root
over the covered union, and binds the root assumptions to the mirror —
reporting closure ("unconditional") or the outstanding count. Each leaf
additionally checks the host's assumption mirror against the guest's
committed root at prove time, so a rule divergence fails fast rather
than as a witness panic deep in the agg.
Measured on mergesort (8 shards, identical leaves, warm GPU): discharge
adds 2.5s to the single arity-16 agg call (19.0s -> 21.5s, +0.9% total
run) and resolves all 2439 cross-shard assumptions to zero.
Also: the Assembly executor is now the default (--emulator opts out).
It is ~4-6x faster at trace generation and a prerequisite for the hints
stream. The historical multi-program breakage (second setup() clobbers
the first program's ASM ROM histogram -> rom.rs:178 panic) is fixed by
ordering: the agg program is set up only after the leaf loop, never
before; the Emulator keeps the upfront setup. Validated end to end on
single-leaf, balanced (4-leaf), unbalanced mixed-vk (3-leaf), and
8-shard runs.
Simplifications, no behavior change (validated: the 3-shard discharge prove reproduces the same leaf steps, discharge counts, and final subject root as before): - Drop the flat mode-0 fallback fold from run_shard_plan. It was reachable only via manifests written before the tree section existed, which regenerate in seconds; aggregation now requires the bisection tree and says so. One fold path instead of two. - Move tree pruning into the kernel as AggNode::prune (where the type lives), with the unit test it was missing. - Drop the leaf_subjects vector (redundant with leaf_subject_sets) and the duplicate cover->Address conversion in the leaf loop. - Extract discharge() — the host mirror of the agg guest's union/resolve — out of the fold loop. - Name build_inputs' return triple (LeafInputs) and tighten build_plan's frontier expansion. - Zero compiler and clippy warnings across ix-kernel and zisk-host (was 12): unused mut, documented dead-code allow on the publics layout field, redundant to_string in format args, indexed loop, then -> then_some. Audited and kept: every CLI flag is read; subject_of_cover, fold_subjects, and tree_partition remain in use by the only-const, range, and reuse paths; the agg guest's two-mode structure is already minimal.
…euse path
--store-dir now composes with --shard-plan (and requires it). A manifest
shard whose certified targets are all covered by stored proofs is not
re-proven: the covering stored proofs are folded into the aggregate as
extra children, where the agg guest verifies them (vk-pinned,
witness-root-checked) and discharges assumptions against their subjects
— reuse by verification, never by trust. Partially-covered shards are
re-proven whole (the shard is the proof unit). Freshly proven shards are
banked per-shard (proofs/<n>.{proof,cover,asm}) as they complete, so an
interrupted run resumes where it left off; .asm carries the assumption
preimages mode-1 discharge needs, and the proofs directory is the
store's single source of truth (proven.bin is gone).
The fold is one op-list executor: novel leaves fold along the pruned
bisection tree as before; stored proofs join as extra slots; a
synthesized tail fold reduces the forest to one verified root, with an
agg-of-1 when a lone stored proof would otherwise go unverified.
Mode-1 commitments are canonical sets, so fold shape never changes the
final claim — validated: fresh-store, full-reuse (0 leaves proven), and
mixed (novel tree + stored tail) runs all commit the identical subject
root and discharge to an unconditional claim.
The legacy standalone reuse path (mode-0 join fold, naive count/byte
chunking, proven.bin) is deleted, along with its private flags
(--closure-inject, --limit-items) and now-orphaned helpers
(fold_subjects, plan_chunks_by_closure, addr_bytes, load_store,
save_store). --require-closed derives the covered set from the proof
index. Net -392 lines; zero compiler/clippy warnings.
…reuse The planner (shard_plan --store-dir) now diffs the env against the proof store before profiling: work items whose targets are all store-covered are skipped entirely — not typechecked (planning gets faster), not partitioned, and their blocks excluded from the hypergraph (a novel→covered delta edge is an assumption discharged at aggregation, not a cut to minimize). The manifest covers only novel work. The host accepts such manifests: grouping requires every work item to be either manifest-assigned or store-covered, and the covering-proof selection generalizes from "reused manifest shards" to "every needed target the novel shards won't certify" — subsuming reused shards and unmanifested covered work. A fixpoint additionally folds stored proofs that cover the assumptions of already-included stored proofs, so a closed store yields an unconditional claim even cross-env. This fixes the cross-env conservatism where the min-cut objective pulled already-proven library constants into novel shards (clustering dependents with dependencies) and one novel constant forced a whole mixed shard to re-prove. Validated: with nataddcomm proven into a store, planning natmulcomm skips 34 of 47 work items, proves 2 novel shards (13 items), folds the 3 stored proofs (verified in-circuit, assumption-closure fixpoint), and discharges to an unconditional claim — 72% of the work reused across envs. Same-env full reuse regression unchanged (identical subject root).
The open-assumption filter scanned stored×subjects per assumption per round — O(assumptions × proofs × subjects), seconds-to-minutes of host CPU at init-store scale (~206 proofs × 51k subjects). One HashSet of included-proof subjects per round makes it linear and reads better. Re-validated: full-reuse fold of the cross-env store commits the identical subject root, outstanding 0, unconditional.
…_utf8EncodeChar_append`
zisk/scripts/bench-cycles.sh runs the standard dumped-input suite (~/benchdata/zisk-inputs) through ziskemu, records steps, and verifies the committed `failures` publics word is zero. check_one mirrors the Zisk guest's one-work-item path natively so IX_* instrumentation can target a single expensive constant.
Port of IxVM dbc4177/5dcab7f/f7cfe23 to the Rust kernel, three coordinated changes: - whnf (try_nat_offset_stuck): leave `Nat.add base (Lit n)` (symbolic base, n>0) and `Nat.div/mod base (Lit k)` (k>=2) STUCK in compact form instead of delta-unfolding into succ^n towers / the division algorithm. Iota over such majors already works via cleanup_nat_offset_major. - linear-rec: collapse `succ^o(Nat.rec base succ-step (Lit n))` with a SYMBOLIC base to `Nat.add base (Lit n+o)` instead of declining into n iota steps. - def-eq (try_def_eq_offset): decompose both sides to base + offset (literals, succ layers, add-lit read in O(1)) and strip the shared offset in ONE step. The previous one-succ-peel recursed through full is_def_eq per layer, which blew MAX_DEF_EQ_DEPTH=2000 on deep offsets. Guest cycles (ziskemu, vs d2ea3d9 baseline): Int16.instRxcHasSize_eq 5,697,166,200 -> 56,028,269 (-99.0%) binsearch 95,800,411 -> 87,735,518 (-8.4%) Nat.add_comm 18,690,815 -> 17,805,210 (-4.7%) mergesort shard 4,005,320,347 -> 3,920,521,476 (-2.1%) Vector.extract_append._proof_1 unchanged (+0.02%), rest ~0. 604 kernel tests pass (one test updated: whnf of `x + 2` now asserts the compact stuck form; new test covers tower-vs-compact def-eq).
Every KExpr/KUniv construction used to compute a blake3 hash of its content for identity (intern key, cache keys, equality) — ~20% of guest cycles on reduction-heavy constants (app_hash 22-33% cumulative). Identity is now an intern-assigned u64 uid: - uids come from a process-global atomic counter and are NEVER reused, so uid equality implies structural equality and cache keys cannot alias across intern-table clears (a stale key can only miss). - InternTable keys on shallow structural keys (ExprKey/UnivKey): variant tag + child uids + semantic payload, mirroring exactly what the old content hash covered (display names, binder info, mdata excluded). intern_expr/intern_univ canonicalize children recursively when handed nodes built outside the table, preserving the old "structural identity" interning semantics. - PartialEq for KExpr/KUniv is now structural with the uid fast path (canonical-vs-canonical compares O(1)); hash_eq remains the fast-but-incomplete uid check for cache/quick-path callers where a false negative only costs a fallback. - Cache ctx components stay blake3 (CtxAddr) but hash uids instead of content addrs; push_local/push_let intern their frame types so equal context suffixes share cache entries. - ingress's hash-first intern probe becomes key-first (same skip-the- construction win, no hash). Soundness note: affirmative identity (equal uid => equal term) is what def-eq quick paths and caches rely on, and it holds by construction; incomplete negatives only cost cache misses. Nat/Str literals keep their blake3 blob Addresses (ingress-meaningful). Guest cycles (ziskemu, vs previous commit): mergesort shard 3,920,521,476 -> 2,526,226,208 (-35.6%) Vector.extract_..._1 7,692,517,694 -> 5,369,879,578 (-30.2%) Nat.add_comm 17,805,210 -> 12,687,822 (-28.7%) binsearch 87,735,518 -> 64,562,040 (-26.4%) Int16.instRxc..._eq 56,028,269 -> 45,024,529 (-19.6%) natgcdcomm/stringappend/rbmap: -11.2% / -6.9% / -2.7% 604 kernel tests pass (16 hash-determinism tests updated to assert the preserved property — identity ignores names/bi/mdata — via structural equality, shallow-key equality, or shared-table interning).
The WHNF loops (full + no-delta) and the def-eq lazy-delta loop probed all five primitive recognizers (native/bitvec/nat/decidable/string) every iteration; each collects its own app spine and runs its own gauntlet of 32-byte address compares — pure tax on ordinary constant-headed terms. Classify the head once per iteration via an allocation-free app-chain walk + per-address memo (KEnv::prim_family_cache), and call at most the one matching family reducer. Family head-address sets are disjoint, so dispatch order semantics are unchanged. The nat-offset-stuck probe gates on the Nat family too. Mirrors the IxVM prim_family dispatch memo (ix-aiur, measured there as part of -28% on Vector._proof_2). Guest cycles (ziskemu, vs previous commit): mergesort shard 2,526,226,208 -> 2,399,597,758 (-5.0%) Vector.extract_..._1 5,369,879,578 -> 5,112,313,900 (-4.8%) Nat.add_comm 12,687,822 -> 12,249,578 (-3.5%) Int16.instRxc..._eq 45,024,529 -> 43,835,207 (-2.6%) binsearch 64,562,040 -> 64,095,836 (-0.7%) others ~0 (+25 steps cache-population noise on two inputs) 604 kernel tests pass.
Env::get_anon paid one blake3 content-hash per constant at parse time. The binding check moves into LazyConstant::get() (pending_addr field): constants shipped in a closure but never forced by the typechecker are never hashed, while everything the kernel certifies is still verified on materialization — checking a constant forces it. The compile-side from_constant path (pre-populated cache) is exempt as before. Guest cycles: rbmap -9.5%, natgcdcomm -6.4%, stringappend -4.2%, Int16 -3.7%, binsearch -1.6%, nataddcomm -0.3%; mergesort/vector +0.01% (re-verification of the few constants materialized more than once). 200 ixon + 604 kernel tests pass (tampering test updated to assert rejection at get() instead of at parse).
Both complete at ~42M steps, matching Int16 — the nat-offset compaction makes the cost range-independent (pre-fix, Int16's 2^16 range alone was 5.7B steps and Int32/Int64 would not have completed).
Records the two-identity-layer boundary after 1e3029d: constants/blobs keep their blake3 Ixon Addresses (claims, Merkle roots, proof store, cross-run reuse, and per-constant inclusion proofs are all unchanged), while the removed blake3 hashing was the kernel-internal per-node scheme used only for intern/cache identity inside one checker run — never serialized, never part of a proof artifact. Documents the soundness argument (equal uid => equal term, never reused), the boundary rule (uids must not escape into serialized artifacts), and what a future sub-constant commitment feature would need to recompute at the egress boundary.
PrimFamily is referenced by the public KEnv::prim_family_cache field, so it must be pub; drop a redundant path qualification.
H-15 (perf): the transient-nat probes run on every whnf/whnf_core call BEFORE the cache lookup; they paid two collect_app_spine Vec allocations plus a KConst::Recr clone per call on the overwhelmingly common non-Nat-recursor path. Pre-filter with an allocation-free spine_head_and_len; spine collection + Recr lookup now happen only for actual Nat-recursor heads. Also drop the redundant rules.clone() in the iota IotaInfo build (rules is already owned from the try_get_const clone). H-12 (hardening): output-size cap (2^26 bits) on natively computed shiftLeft/pow results — a ~12-byte adversarial term could otherwise demand a multi-GiB allocation (cycle explosion in-guest). Oversized results stay stuck, like the existing pow exponent guard. Ported from jcb/fixes 3763356 (pre-workspace layout), adapted to kernel-riscv. Guest cycles (ziskemu, vs previous commit): mergesort shard 2,399,808,204 -> 2,313,445,541 (-3.6%) Vector.extract_..._1 5,111,066,294 -> 4,949,833,244 (-3.2%) Nat.add_comm 12,217,888 -> 11,798,013 (-3.4%) Int16/32/64 rxc ~42.2M -> ~41.3M (-2.1%) binsearch 63,115,445 -> 62,640,316 (-0.75%) small inputs ~0. 604 kernel tests pass.
Closes the adversarial-binding review of the uid-interning design: - ixon: blob bytes are blake3-verified against their address at load in all three env deserializers (ports jcb/fixes H-1). This is the one place adversarial CONTENT enters kernel identity: ExprKey::Nat/Str and literal structural equality key on the blob Address (mirroring the old content hash), which is sound only if address binds bytes. Costs +0.06..+0.29% guest cycles (blobs are a small fraction of env bytes; constants stay lazily verified). - kernel: fresh_uid aborts on counter exhaustion instead of wrapping (>2^67 guest cycles to reach; identity soundness should not rest on "probably won't happen"). - kernel: literal structural-equality arms compare values as well as blob addresses (defense in depth — redundant under the load-time check, degrades to inequality if that invariant were ever violated). - kernel: intern hits assert structural equality in debug builds. - docs/kernel_identity.md: adversarial-model section — why uid identity cannot be cache-poisoned (uids are assigned by a sequential counter, never computed from content, so there is no hash function to collide; intern keys are exact structural Eq, never digests; cache hits require uid equality; map-bucket collisions cost a key compare, not a wrong hit). 604 kernel + 200 ixon tests pass.
…d story The *_mdata_with_addr constructors were the only API through which a caller-supplied (hence potentially duplicated) uid could enter a node; nothing outside expr.rs uses them since the ingress key-first rework, so make them private — uid uniqueness per process is now enforced by visibility, not convention. Docs: spell out the within-shard (impossible by assignment) vs between-shards (certain and harmless; uids never cross the process boundary) collision analysis.
Krivine-style machine for the structural WHNF fragment, ported from the proven IxVM implementation (ix-aiur d0d3375 Phase A hybrid entry, b4c4ea3 Phase B closure-iota; measured there Int16 -33.8%, Vector -17.1%, mergeSort -6.8%). Covers the Rust-specific mapping the IxVM side doesn't have: cheap_proj/cheap_rec threading, fuel retention (adversarial posture), ambient let/LDecl zeta via the exit-to-outer- loop contract, iterative loop for stack discipline, clo_subst readback with lbr guards + per-call scratch, unchanged cache/uid semantics. Phased A (beta/zeta) then B (closure-iota, the UTF-8-class payoff), with the known curried-sharing regression class added to the bench suite before implementation.
…beta path) Port of the IxVM environment machine (~/ix-aiur d0d3375; design in docs/env_machine_whnf.md). whnf_core's App arm enters a Krivine-style machine when a beta fires: beta/zeta become O(1) environment pushes onto a persistent cons-list of closures, and substitution materializes only at machine exits (clo_subst readback) — work proportional to what the reduction consumes, not to every beta performed. A beta chain ending in another beta never materializes its intermediate bodies. - subst.rs: Clo/MEnv closure types (carried env length; per-Clo OnceCell readback memo standing in for Aiur's global memoization), clo_subst/clo_readback with lbr guards, per-call (uid, depth) memo from a new InternTable scratch pool, results interned like subst. - whnf.rs: iterative machine_whnf loop + machine_exit, replacing the multi-lambda peel + simul_subst block. Fuel charged on beta/zeta at the eager path's one-substitution-event granularity. Every exit reads back and re-enters the outer loop, so ambient let/LDecl zeta, iota, prim dispatch, and cheap_proj/cheap_rec def-eq discipline are unchanged; all caches and keys untouched. - bench-cycles.sh: add natreclinear/natfoldsucc (foldAdd-class sentinels for the known curried-sharing regression class). 609 kernel + 200 ixon tests pass; native check_one true on Nat.add_comm, Nat.gcd_comm, Int16.instRxcHasSize_eq, natRecLinearCheck, Nat.fold_succ; perf_check whole-env failure parity with HEAD on 11 envs. Guest cycle benchmarks pending (box busy).
Port of IxVM try_iota_c (~/ix-aiur 097e9d5). The machine's Const exit no longer reads its spine back when the head is a recursor on the main ctor-rule path: only the major premise materializes (readback + the same full whnf the eager path uses), and on a hit the rule RHS re-enters the machine with params/motives/minors and post-major arguments as their ORIGINAL closures plus the ctor fields wrapped closed. The rule's Lam-chain betas push them straight into an environment, so unselected minors — dropped match/Decidable branches, the UTF-8 codec class — are never substituted and never read back. Off-main-path cases miss to the plain readback exit by construction, preserving semantics exactly: K recursors (ctor synthesis needs infer + def-eq), Nat literal majors (transient-work cache discipline and the linear-rec/offset shortcuts stay on the plain path), Str literal majors, struct-eta candidates, and stuck majors. cheap_rec mode skips machine-iota entirely, mirroring the eager gating. A miss costs one readback of the major; the plain path's whnf of it then hits a warm cache. Machine-native delta (IxVM Phase C1.5) does NOT port at this layer: their machine lives in a whnf that includes delta, while ours sits inside whnf_core, which must stay delta-free for def-eq's lazy-delta unfold ordering. Spanning the whnf delta loop with closures is a separate, larger restructure. 610 kernel + 200 ixon tests pass (new closure-iota-via-beta unit test covers the spine arithmetic incl. post-major shift); native check_one true on the usual five constants; perf_check whole-env failure parity on 11 envs. Guest cycle benchmarks pending alongside Phase A's.
map_or_else over map().unwrap_or_else(), pass ExprKey by reference in timed_intern_or_build, explicit clone for spec_params dedup snapshot.
fbf4c7a to
4d2452d
Compare
1076c91 to
73be557
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This pull request implements Rust-kernel and ixon optimizations for the Zisk guest, measured against the base #411 as deterministic guest cycle counts (
ziskemu -m) over dumped shard inputs (zisk/scripts/bench-cycles.sh), with every counted run's committedfailurespublics word verified zero.Key Results
Int16.instRxcHasSize_eq: 5.70 B → 42.0 M guest steps (135.6×);Int32/Int64variants confirmed at the same magnitude (~42 M) — in-circuit cost is now integer-width-independentVector.extract_append._proof_1−32.4%; the Init Array/VectorExtractproof family −22% to −33% each (five multi-billion-step constants)Std.Tactic.BVDecide.BVExpr.bitblast.goCache_Inv_of_Inv._mutual(28.35 B steps, the most expensive constant measured in Init+Std): previously crashed the 512 MB guest (OOM-class null read) → now completes as a valid kernel pass — lazy closures cut peak RAM, not just cyclesdocs/kernel_identity.md), blob bytes verified at load in all deserializersFifteen Commits (oldest first)
bench: guest cycle harness + native single-constant check
bench-cycles.shsuite over dumped guest stdins with failures-word validation;check_oneexample mirrors the guest's reuse-mode check nativelykernel: keep symbolic Nat offsets compact (whnf stuck + offset def-eq)
Nat.add x lit/Nat.div|mod x kstay stuck as compact offsets instead of materializingsucc^n(x)chains or the division algorithm; bulk offset def-eq decides offset pairs directlyInt16.instRxcHasSize_eqcollapse (5.70 B → 56 M) and the UTF-8-codec-class fixkernel: intern-assigned uids replace per-node blake3 content hashing
kernel: memoized prim-family dispatch in the WHNF/def-eq reduction loops
ixon: defer per-constant address verification to first materialization
get(); blob bytes still verified eagerly at loadbench: add Int32/Int64 instRxcHasSize_eq to the cycle suite
docs: kernel uid identity vs Ixon content addressing
kernel: fix PrimFamily visibility/qualification warnings
kernel: port jcb/fixes H-15 whnf probe pre-filter + H-12 nat output caps
kernel+ixon: harden term identity against adversarial inputs
kernel: privatize uid-accepting constructors; document cross-shard uid story
docs: design for the environment-machine WHNF port
kernel: environment-machine WHNF (Phase A — lazy substitution on the beta path)
whnf_core's App arm enters a closure machine when a beta fires: beta/zeta are O(1) environment pushes; substitution materializes only at machine exits (clo_substreadback), so work is proportional to what the reduction consumeskernel: closure-iota at the machine's recursor exit (Phase B)
Decidablebranches) are never substituted and never read backdocs: env-machine WHNF design is implemented (Phases A+B)