Perf/cpu lde rework by jotabulacios · Pull Request #650 · yetanotherco/lambda_vm

jotabulacios · 2026-06-08T14:43:40Z

This PR reworks the trace LDE pipeline (Round 1) to be row-major and replaces the flat radix-2 FFT with a cache-blocked, two-half batched FFT. It also parallelizes trace generation and drops a redundant multiply in the LogUp fingerprint.

Row-major trace LDE + two-half FFT (Round 1)

Round 1 (trace LDE + Merkle commit) is ~48% of the prover wall and paid two avoidable costs: a column → row transpose of the full LDE-size buffer, and a flat FFT whose early-layer stride thrashes cache at large n (memory-bandwidth bound, not compute bound).

The forward/inverse transforms now run as a two-half decomposition — √n-sized sub-FFTs that stay cache-resident, a twiddle pass, then the second half — and the coset-LDE twiddles are precomputed per domain. Since the trace Table is already row-major, Round 1 reads it with one memcpy (no transpose) and commits contiguous Merkle leaves. The committed polynomial is unchanged: byte-identical proof, same roots, same FRI, same verifier.

crypto/math/src/fft/{bowers_fft_batch,two_half_fft}.rs, crypto/math/src/polynomial.rs
crypto/stark/src/{trace,prover}.rs

Parallel trace generation

The ~19 independent table builds in build_traces ran sequentially; they now run in one rayon scope, with chunked tables produced in parallel (trace build ~15.6% → ~10.9% of the wall). Disk-spill keeps the sequential generate→spill order so peak memory stays bounded. Cleanups in the same area: the dedup maps (LT/MUL/DVRM/BRANCH) move SipHash → FxHash, the DECODE pc→row map reuses the executor's U64 hasher, and register routing avoids a per-op Vec.

Table contents are unchanged; the dedup tables' row order can differ (hasher iteration order, already non-deterministic under SipHash's randomized seed) — any order is valid for a lookup table, so proofs stay sound but are not byte-identical to a given main run.

prover/src/tables/{trace_builder,types,lt,mul,dvrm,branch,decode}.rs

Skip the identity multiply in LogUp fingerprints

alpha_powers[0] is always 1, so bus_id · alpha_powers[0] was a multiply by one in both the witness build and the constraint evaluation. It is now the embedded bus_id directly — byte-identical.

crypto/stark/src/lookup.rs

jotabulacios · 2026-06-08T14:43:55Z

/bench

github-actions · 2026-06-08T14:45:39Z

Benchmark — ethrex 20 transfers (median of 3)

_{Table parallelism: auto (cores / 3)}

Metric	main	PR	Δ
Peak heap	80448 MB	81801 MB	+1353 MB (+1.7%) ⚪
Prove time	50.187s	44.993s	-5.194s (-10.3%) 🟢

🎉 Improvement detected — heap or time decreased by more than 5%.

✅ Low variance (time: 0.9%, heap: 1.3%)

_{Commit: 602a36b · Baseline: built from main · Runner: self-hosted bench}

MauroToscano · 2026-06-08T15:07:48Z

/bench-growth

jotabulacios · 2026-06-08T15:23:32Z

/bench-growth

…file

jotabulacios · 2026-06-10T18:43:47Z

/bench

jotabulacios · 2026-06-16T18:55:07Z

/bench

jotabulacios · 2026-06-18T14:10:11Z

/bench

github-actions · 2026-06-18T14:17:20Z

Codex Code Review

Findings

High Security: crypto/math/src/fft/bit_reversing.rs exposes a safe public function with an unsafe parallel implementation, but the required layout invariants are only debug_asserts at lines 40 and 48. With parallel enabled, invalid inputs can reach the raw-pointer swap block at lines 56-69. Safe Rust APIs must not permit UB for bad caller input. Make the divisibility and power-of-two checks runtime assert!s before the unsafe path, or return a Result; alternatively make the function unsafe and document the contract.
Medium Security / DoS: prover/src/tables/types.rs replaces SipHash with a deterministic non-keyed FxHasher, then uses it for unbounded op-dedup maps such as branch.rs, lt.rs, mul.rs, and dvrm.rs. If prover inputs are user-controlled, an attacker can choose VM values that collide and degrade trace generation toward quadratic behavior. Keep std::collections::HashMap for adversarial inputs, or use a keyed fast hasher.

Verification

I attempted cargo check --all-targets, but rustup could not create temp files under /home/runner/.rustup/tmp because that path is read-only in this sandbox.

claude · 2026-06-18T14:19:36Z

Code Review

Overview

This PR reworks Round 1 of the STARK prover for better cache behaviour and parallelism:

Replaces the flat radix-2 Bowers FFT with a cache-blocked two-half batched FFT
Makes the LDE pipeline end-to-end row-major (trace → IFFT → coset scale → FFT → Merkle), eliminating the col→row transpose
Parallelises the ~19 independent table builds in build_traces via a rayon::scope
Swaps SipHash dedup maps for FxHashMap and drops a redundant multiply-by-one in LogUp fingerprints

The algorithmic changes look mathematically sound and are well-tested with differential tests against the reference per-column path (both base and ext3). The proof is byte-identical for the lookup optimisation and for the main LDE rework.

Issues

Medium (2 items) — see inline comments

Incomplete unsafe justification in bit_reversing.rs — The SAFETY comment states the row-swap pairs are "pairwise disjoint" but does not prove it. The key lemma is that bit-reverse is an involution (br(br(i)) = i), so every non-trivial orbit is a 2-cycle; selecting the representative where br(i) > i then gives a partition. Without this, reviewers must independently reconstruct the argument. The comment for Ordering::Relaxed also needs a line explaining that Rayon's thread-spawn provides the happens-before edge that makes the relaxed load correct. (The code itself is safe; the documentation is the gap.)
Possible compile error without debug-checks — expand_columns_to_lde is gated #[cfg(feature = "debug-checks")] but recompute_round1_from_trace appears to call it unconditionally. If that caller is not also cfg-gated (or the call site wrapped in #[cfg]), builds without debug-checks will fail. Please verify that CI exercises a build without this feature.

Low (1 item) — see inline comment

CUDA path in columns_to_row_major panics on non-uniform column lengths — A debug_assert_eq! inside the inner loop would catch any future regression cheaply.

What looks good

The two-half FFT implementation and its bit-reversal bookkeeping are clean; the three-step structure (bit-rev → first-half layers → bit-rev → second-half layers → bit-rev) maps directly to the Plonky3 reference.
The LDE function validates all preconditions with proper Err returns before touching the buffer.
Differential tests cover base field, extension field, multiple column counts, and blowup factors — good coverage of the new code paths.
The rayon::scope + per-slot Option pattern in build_traces is verbose but correct; the PHASE5_RAN expect-string gives a clear panic message if the invariant breaks.
FxHashMap substitution for the dedup tables and the removal of the identity multiply in LogUp are both straightforward, low-risk cleanups.

MauroToscano

Requesting changes pending the GPU-side review.

This reworks the CPU trace-LDE pipeline (row-major LDE + cache-blocked two-half FFT + per-domain precomputed coset twiddles), but the diff doesn't touch crypto/math-cuda — there are no GPU/CUDA changes here. The "byte-identical committed polynomial" claim has only been established for the CPU path; the same rework is not obviously equivalent on the GPU, so we need to confirm the GPU LDE/FFT produces the same commitment under the new layout/ordering (or port the rework to the GPU side) before this lands — otherwise CPU- and GPU-built proofs could diverge.

Holding merge until the GPU equivalence is sorted out (Juampi is looking at that side).

jotabulacios · 2026-06-25T13:49:24Z

/bench

MauroToscano · 2026-06-25T22:08:43Z

/review-ai

MauroToscano · 2026-06-25T22:09:03Z

/bench

github-actions · 2026-06-25T22:10:47Z

Codex Code Review

No issues found in the PR diff.

I reviewed the changed FFT/LDE paths, row-major trace storage/commitment changes, CUDA handle call sites, preprocessed split commitments, and trace-builder parallelization. I did not run builds or tests per the review constraints.

claude · 2026-06-25T22:14:01Z

Review: Row-major trace LDE + two-half FFT (PR #650)

Thorough static review of the full diff. No correctness, safety, or soundness bugs found. The PR is carefully engineered and well-tested. Summary of what I checked plus a couple of low-severity notes.

Safety / soundness — looks solid

in_place_bit_reverse_permute_row_major unsafe parallel path (bit_reversing.rs): the AtomicPtr + raw-pointer swap is sound. The disjointness argument is correct — for power-of-two n, bit-reverse is an involution, so the j > i filter selects one representative per 2-cycle, giving pairwise-disjoint num_cols-wide row ranges. The power-of-two precondition is enforced with a real assert! (not debug_assert!), so a bad caller panics instead of racing. Good call making that one non-debug.
Table::row_major_data unsafe mmap slice (table.rs): mirrors the existing get_row contract (repr-transparent + spill writes row-major), lifetime tied to &self. Consistent with existing usage.
fft_batch_two_half: validates num_cols, power-of-two n, and log_n == tw.log_n before any unsafe/indexing. Second-half twiddle indexing (first_block + b) stays within bitrev_tw (len n/2). Differential tests cover fwd+inv across log_n 2..=10, m in {1,3,7}, plus the ext3 path — strong coverage.
Byte-identical claim: commit_rows_bit_reversed produces the same leaf bytes (same row values, same column order, same reverse_index) as the column-major path; the LDE differential tests confirm per-cell equivalence. The GPU columns_to_row_major interleave preserves the exact value get_main/get_aux returned before, so downstream R2-R4 indexing is unchanged.

Concurrency / determinism

Parallel chunk_and_generate + the rayon::scope fan-out in build_traces: each closure reads disjoint op lists and writes its own slot; disk-spill correctly stays sequential to keep peak memory bounded. No data races; peak heap is unchanged (all tables were live at the end already).
FxHashMap swap and the acknowledged non-deterministic dedup-row order are sound for multiset lookup tables. Fine.

Minor (Low)

commit_rows_bit_reversed_subset (prover.rs): col_end <= num_cols is only a debug_assert!. In a release build, a caller passing col_end > num_cols would silently read into adjacent-row data (wrong commitment) rather than fail loudly. Callers are all internal with controlled args, so risk is low — but since this is a commitment primitive, a hard bounds check (or returning None) would be cheap insurance.
count_table_lengths scratch reuse + collect_register_ops_from_cpu now appending into a caller-owned Vec is semantically equivalent to the old extend; verified both call sites (accumulator path doesn't clear, scratch path clears first). Correct.

Dead code — handled well
Legacy LayerTwiddles fields, expand_columns_to_lde, and compute_lde_from_columns_cached are properly cfg-gated to test/test-utils/debug-checks, so release builds don't carry them. No leftover unreachable paths.

Nice work — the differential tests against the single-column Bowers oracle give good confidence in the FFT rewrite.

github-actions · 2026-06-25T22:43:17Z

AI Review

PR #650 · 19 changed files

Findings

Status	Sev	Location	Finding	Found by
confirmed	low	crypto/math/src/fft/bit_reversing.rs:66	Misleading comment about memory savings in bit_reversing.rs	moonmath zro/minimax-m3
confirmed	low	crypto/stark/src/prover.rs:324	LdeTwiddles builds two_half_inv/two_half_fwd unconditionally even when only legacy fields are used	minimax minimax/MiniMax-M3
confirmed	low	crypto/stark/src/prover.rs:646	commit_rows_bit_reversed_subset uses debug_assert for col_end bounds and divisibility	glm openrouter/z-ai/glm-5.2 minimax minimax/MiniMax-M3

Status column reflects the verdict from the verifier: deepseek-verifier (openrouter/deepseek/deepseek-v4-pro).

AI-002: Misleading comment about memory savings in bit_reversing.rs

Status: confirmed
Severity: low
Location: crypto/math/src/fft/bit_reversing.rs:66
Found by: moonmath:zro/minimax-m3
Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
Rejected by: -

Claim

The comment "saves ~16 MB at log21 n=64" appears to be off by a factor of 2 on 64-bit platforms and has a confusing/typo'd wording ("log21 n=64").

Evidence

At log_n=21, n=2^21=2097152. An upfront Vec<(usize, usize)> collection would use 16 bytes per pair on 64-bit (or 8 bytes on 32-bit). On 64-bit: 2^21 * 16 = 33554432 bytes ≈ 32 MB, not 16 MB. On 32-bit: 2^21 * 8 = 16 MB. The wording "log21 n=64" is unclear — likely meant "log_n=21, n=2^21=2M" or just "log_n=21".

Suggested fix

Update the comment to: "// No upfront Vec<(usize, usize)> collection (saves ~32 MB at log_n=21 on 64-bit)."

AI-006: LdeTwiddles builds two_half_inv/two_half_fwd unconditionally even when only legacy fields are used

Status: confirmed
Severity: low
Location: crypto/stark/src/prover.rs:324
Found by: minimax:minimax/MiniMax-M3
Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
Rejected by: -

Claim

TwoHalfTwiddles::new is always called in LdeTwiddles::new, including on builds that do not use it (e.g., debug-checks-only with precomputed test-utils)

Evidence

The struct LdeTwiddles keeps two_half_inv/two_half_fwd as unconditional fields, and LdeTwiddles::new builds them without any cfg gate (lines 324-327). In a debug-checks build that never calls coset_lde_full_expand_row_major, these twiddles are still allocated (the forward set for size n*blowup_factor is the largest single allocation in the struct).

Suggested fix

If the intent is to keep release builds free of the legacy per-column path, this is fine. Otherwise, gate the two-half allocations under a feature flag or only build them when coset_lde_full_expand_row_major will be called.

AI-007: commit_rows_bit_reversed_subset uses debug_assert for col_end bounds and divisibility

Status: confirmed
Severity: low
Location: crypto/stark/src/prover.rs:646
Found by: glm:openrouter/z-ai/glm-5.2, minimax:minimax/MiniMax-M3
Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
Rejected by: -

Claim

The debug_assert checks col_end <= num_cols and data.len() % num_cols == 0 only in debug builds. In release builds, a malformed call with col_end > num_cols or a non-divisible data length would index out of bounds (row_start + col_end could exceed data.len()) causing a panic in slice indexing rather than a clean error. The production callers always pass valid inputs (total_cols from trace width), so this is not currently reachable with bad data.

Evidence

Lines 645-646 use debug_assert! for col_end <= num_cols and data.len() % num_cols == 0; line 668 slices data[row_start+col_start..row_start+col_end] which would panic on out-of-bounds in release if invariants were violated.

Suggested fix

Acceptable as-is given all in-tree callers pass valid widths. Could upgrade to a checked assert if this ever becomes a public/external API.

Reviewer Lanes

Lane	Model	Prompt	Status	Findings
glm	openrouter/z-ai/glm-5.2	general	success	3
kimi	openrouter/moonshotai/kimi-k2.7-code	general	error: agentic lane timed out after 1800s	0
minimax	minimax/MiniMax-M3	general	success	6
moonmath	zro/minimax-m3	general	success	3
nemotron	openrouter/nvidia/nemotron-3-ultra-550b-a55b	general	error: opencode failed (provider/auth/runtime error) and no findings were submitted	0

Verification Lanes

Lane	Model	Status	Confirmed	Rejected	Uncertain
deepseek-verifier	openrouter/deepseek/deepseek-v4-pro	success	3	7	0

Native Codex and Claude reviews run separately and post their own comments. They are not included in this structured provenance report.

Discarded candidates (7) — rejected by the verifier

Removed test-flake targets in Makefile: test-no-compile and test-rust-no-compile (Makefile:3, found by minimax:minimax/MiniMax-M3) — The Makefile genuinely removes test-no-compile/test-asm-no-compile/test-rust-no-compile targets and simplifies test-executor to compile-programs then cargo test. The compile-programs targets were reworked with pattern rules that make incremental recompilation efficient, so the no-compile variants are no longer useful. This is an intentional design change, not a bug. The claim that it's 'not obvious from the diff alone' is commentary, not an issue.
Potentially confusing SAFETY comment about AtomicPtr load ordering (crypto/math/src/fft/bit_reversing.rs:68, found by moonmath:zro/minimax-m3, minimax:minimax/MiniMax-M3) — The Relaxed ordering on the AtomicPtr load is sound. The pointer is written via AtomicPtr::new (a non-atomic store) before into_par_iter() starts. All Rayon thread spawns happen after that write, and the Rayon scope guarantees the initial write is visible. This is a well-defined Rust pattern — Relaxed is sufficient when there is an external happens-before edge (thread spawn). Using Acquire would add unnecessary overhead. The claim about 'implementation-specific behavior' is incorrect; this is standard Rust memory model semantics.
Parallel bit-reverse raw-pointer path holds &mut slice aliasing across threads via AtomicPtr (crypto/math/src/fft/bit_reversing.rs:69, found by glm:openrouter/z-ai/glm-5.2) — This finding duplicates AI-003's concern about AtomicPtr + Relaxed ordering in the same parallel bit-reverse code path. The disjointness argument is sound: n is power-of-two (runtime asserted), bit-reverse is an involution, j>i filter selects disjoint pairs. The Relaxed load is correct because the AtomicPtr is written before par_iter starts, and thread spawn provides happens-before. The code explicitly documents this in the SAFETY comment.
Duplicate transpose in reconstruct_round1 (crypto/stark/src/prover.rs:988, found by moonmath:zro/minimax-m3) — The reconstruct_round1 debug path (prover.rs lines 978-996) LDE-expands columns in column-major form via expand_columns_to_lde, then manually transposes to row-major to feed from_row_major. The claim says from_columns 'already does this transpose internally' and using it would avoid duplication. But LDETraceTable::from_columns (trace.rs) does the exact same col→row transpose internally — there is no duplicate work, just a choice of where the transpose happens. The comment 'debug path: correctness over speed' correctly acknowledges this path is not performance-critical. This is not an issue.
spawn_into! macro moves a &mut slot into the rayon::scope closure (prover/src/tables/trace_builder.rs:3023, found by minimax:minimax/MiniMax-M3) — The spawn_into! macro (trace_builder.rs lines 3023-3028) moves a &mut reference to a local Option slot into a rayon::scope closure. This is a standard Rust pattern — each slot is a local variable declared in the preceding scope, and the borrow checker ensures they live long enough. The claim that it 'would break' if slots were different types is meaningless because Rust's type system would catch that at compile time. The code is idiomatic and correct.
FxHasher write() byte-wise path deviates from expected word-at-a-time hashing (prover/src/tables/types.rs:925, found by glm:openrouter/z-ai/glm-5.2) — FxHasher::write (types.rs lines 992-996) processes byte slices one byte at a time, calling self.add(b as u64) per byte. All current key types (BranchOperation, LtOperation, DvrmOperation, MulOperation) derive Hash over u64/bool fields, which Rust's derive macro dispatches to write_u64/write_u8 respectively — write() is never called for these keys. The byte-at-a-time path would be suboptimal for &[u8]/String fields, but no current key contains those. The code comment explicitly documents the fast-path design and that collisions only cost probes. The claim is not a bug, just an observation about a non-activated code path. The finding itself acknowledges this in its own evidence section.
Dead code: FxHasher is wrapped in a default BuildHasherDefault and never tested for hash quality (prover/src/tables/types.rs:979, found by minimax:minimax/MiniMax-M3) — The claim that FxHasher is 'never tested for hash quality' and 'operations with mostly-zero payload will all hash similarly' is speculative. The hash function rotates and multiplies by a fixed seed per word, which provides good bit mixing. For DvrmOperation with d=0, the hash state still accumulates values from the n (numerator), signed flag, and all other fields before finishing — the final hash will differ. The code explicitly documents that 'collisions only cost probes, never soundness' and the maps are bounded by max_rows per chunk. Without actual collision measurements, this is a performance hypothesis, not a bug.

Raw lane outputs, candidates, final issues, and model metrics are uploaded as workflow artifacts.

jotabulacios · 2026-06-26T19:03:33Z

/bench

jotabulacios added 4 commits June 7, 2026 21:36

add row major batched lde fft primitives

cc48f43

Make LDETraceTable row-major

46c0080

Wire prover to row-major batched LDE

a8db4ee

read trace row major

c55dc05

jotabulacios and others added 7 commits June 8, 2026 17:57

Move the batched-FFT and row-major-LDE unit tests into corresponding …

59f7c0d

…file

resolve conflicts

4bd73d4

Merge branch 'main' into perf/cpu-lde-rework

ab786f5

fix disk-spill EmptyCommitment in row-major LDE

08eaa8b

Merge branch 'main' into perf/cpu-lde-rework

8dc41fb

Parallelize trace build and speed up op-dedup bookkeeping

bd1b9fd

Skip the identity multiply by alpha_powers[0] in LogUp fingerprints

e6a7297

solve conflicts

26a2bcd

Remove dead FFT module and gate legacy twiddles

665d82a

jotabulacios marked this pull request as ready for review June 18, 2026 14:14

claude Bot reviewed Jun 18, 2026

View reviewed changes

Comment thread crypto/math/src/fft/bit_reversing.rs

claude Bot reviewed Jun 18, 2026

View reviewed changes

Comment thread crypto/stark/src/prover.rs

claude Bot reviewed Jun 18, 2026

View reviewed changes

Comment thread crypto/stark/src/prover.rs

jotabulacios and others added 4 commits June 18, 2026 11:44

Harden parallel row-major bit-reverse permute

b69edd3

Guard columns_to_row_major; clarify hasher doc

47ded68

solve conflicts

14bac6b

Merge branch 'main' into perf/cpu-lde-rework

18d1058

Merge branch 'main' into perf/cpu-lde-rework

56fe834

diegokingston approved these changes Jun 18, 2026

View reviewed changes

Merge branch 'main' into perf/cpu-lde-rework

3770047

MauroToscano requested changes Jun 24, 2026

View reviewed changes

jotabulacios added 2 commits June 24, 2026 12:39

Deduplicate commit_rows_bit_reversed and bit_reverse_vec

2191f65

solve conflicts

ea32589

jotabulacios mentioned this pull request Jun 25, 2026

perf/row-major trace LDE for GPU path #715

Draft

Merge branch 'main' into perf/cpu-lde-rework

b5eb9d2

MauroToscano mentioned this pull request Jun 25, 2026

Perf/cpu lde rework with default hasher #722

Closed

MauroToscano and others added 3 commits June 25, 2026 19:57

Fix bit-reverse memory savings comment

aa524ec

use default hasher for op dedup maps

882ee4c

Use std HashMap directly for op-dedup maps

602a36b

MauroToscano approved these changes Jun 26, 2026

View reviewed changes

MauroToscano added this pull request to the merge queue Jun 26, 2026

Merged via the queue into main with commit 3bb9107 Jun 26, 2026
12 checks passed

MauroToscano deleted the perf/cpu-lde-rework branch June 26, 2026 19:26

Uh oh!

Conversation

jotabulacios commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Row-major trace LDE + two-half FFT (Round 1)

Parallel trace generation

Skip the identity multiply in LogUp fingerprints

Uh oh!

jotabulacios commented Jun 8, 2026

Uh oh!

github-actions Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmark — ethrex 20 transfers (median of 3)

Uh oh!

MauroToscano commented Jun 8, 2026

Uh oh!

jotabulacios commented Jun 8, 2026

Uh oh!

jotabulacios commented Jun 10, 2026

Uh oh!

jotabulacios commented Jun 16, 2026

Uh oh!

jotabulacios commented Jun 18, 2026

Uh oh!

github-actions Bot commented Jun 18, 2026

Codex Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

claude Bot commented Jun 18, 2026

Code Review

Overview

Issues

What looks good

Uh oh!

MauroToscano left a comment

Choose a reason for hiding this comment

Uh oh!

jotabulacios commented Jun 25, 2026

Uh oh!

MauroToscano commented Jun 25, 2026

Uh oh!

MauroToscano commented Jun 25, 2026

Uh oh!

github-actions Bot commented Jun 25, 2026

Codex Code Review

Uh oh!

claude Bot commented Jun 25, 2026

Uh oh!

github-actions Bot commented Jun 25, 2026

AI Review

Findings

Reviewer Lanes

Verification Lanes

Uh oh!

jotabulacios commented Jun 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jotabulacios commented Jun 8, 2026 •

edited

Loading

github-actions Bot commented Jun 8, 2026 •

edited

Loading