Zo support #192

Draft
JanCSEM wants to merge 174 commits into pulp-platform:devel from JanCSEM:zo-support

@JanCSEM JanCSEM commented May 12, 2026

Describe the intent of your PR here.

Added

Changed

Fixed

PR Merge Checklist

  1. The PR is rebased on the latest devel commit and pointing to devel.
  2. Your PR is reviewed and approved.
  3. All checks are passing.
  4. The CHANGELOG.md file has been updated.
  5. If the docker was modified, change back its link after review.

runwangdl and others added 30 commits February 12, 2026 15:27
- Cleanup Docker flow to use a temporary build folder
- Install zsh and oh-my-zsh plugin
- Adapt GAP9 Docker Flow to support ARM64
- Add GAP9 Docker GitHub Build Flow
- Add GAP9 Run script to use real hardware
- Update README
- Fix missing pre-commit dependency

WIP
We still need x86 compilation for Autotiler
JanCSEM and others added 30 commits June 14, 2026 15:32
deeploytest.c classified pointers by address alone: `ptr >= 0x10000000` (inputs) /
`< 0x10000000` (outputs). HyperRAM/L3 addresses (from cl_ram_malloc) are also
>= 0x10000000 but are NOT CPU-addressable, so in `--defaultMemLevel L3` tests on
real silicon, main did a raw memcpy / CPU dereference of an L3 pointer and hit an
'Invalid fetch' fault (e.g. MatMul L3 on board: fault at the cl_ram_malloc'd input
address). GVSoC models HyperRAM as flat RAM, so the test passed there and masked
the bug.

Add IS_L1/IS_L2 on-chip-window macros and use them:
- Inputs: only memcpy on-chip (IS_L2) inputs with a non-NULL testInputVector;
  L3 inputs are loaded from the readfs hex in InitNetwork (testInputVector is
  NULL) and already live in HyperRAM, so skip them.
- Outputs: ram_read L3 outputs into an L2 scratch before the compare (and free
  it); on-chip outputs compared in place. Paired malloc/free kept in sync.

Verified MatMul --defaultMemLevel L3 on GVSoC: 0/256 (unchanged). On-chip (L2)
tests behave identically; only L3 paths change.
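
The classification described in this commit can be modelled outside the C harness. The sketch below is a Python illustration only, not the code in deeploytest.c: the window bases and sizes (`L1_BASE`, `L1_SIZE`, `L2_BASE`, `L2_SIZE`) and the function names are assumptions chosen for the example, and the real values must come from the GAP9 memory map.

```python
# Hypothetical on-chip address windows (assumed for illustration only).
L1_BASE, L1_SIZE = 0x10000000, 128 * 1024
L2_BASE, L2_SIZE = 0x1C000000, 1536 * 1024


def is_l1(ptr: int) -> bool:
    """True if ptr falls inside the assumed L1 window."""
    return L1_BASE <= ptr < L1_BASE + L1_SIZE


def is_l2(ptr: int) -> bool:
    """True if ptr falls inside the assumed L2 window."""
    return L2_BASE <= ptr < L2_BASE + L2_SIZE


def input_action(ptr: int, has_test_vector: bool) -> str:
    """Only on-chip (L2) inputs with a test vector are copied by the CPU."""
    if is_l2(ptr) and has_test_vector:
        return "memcpy"
    # L3 inputs were already loaded elsewhere; the CPU must not touch them.
    return "skip"


def output_action(ptr: int) -> str:
    """On-chip outputs are compared in place; anything else goes via scratch."""
    if is_l1(ptr) or is_l2(ptr):
        return "compare_in_place"
    return "ram_read_to_l2_scratch_then_compare"
```

The point of the window check is that an address outside both on-chip windows is never dereferenced by the CPU, regardless of how it compares to a single threshold.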

The L3<->L2 tiling used one DMA backend for both single-buffering (SB) and
double-buffering (DB). Async DMA only helps double-buffering (it overlaps the
next-tile prefetch with compute); single-buffering waits on each tile before
computing, so async gives SB no benefit but all the risk: strided 2D L3 transfers
(pi_cl_ram_copy_2d) can corrupt data under deferred waits.

- PULPL3Tiling: add optional `dbDma` (defaults to `dma`) so SB and DB can use
  different backends. Backward compatible.
- GAP9 bindings: SB keeps the blocking gap9L3DmaHack; DB uses async GAP9L3Dma for
  real L3<->L2 prefetch overlap.
- GAP9L3Dma: reset future `.size`=0 after copy-wait (so a completed future isn't
  waited twice) and cast `${ext}` to uint32_t in the 2D transfer.

Verified MatMul --defaultMemLevel L3 on GVSoC: 0/256.
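
The backward-compatible default described above can be sketched as follows. This is a minimal illustration, not the PULPL3Tiling implementation: the class name `L3TilingSketch` and the method `backend_for` are hypothetical, and the backends are represented as plain strings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class L3TilingSketch:
    """Hypothetical stand-in for a tiling pass with separate SB/DB backends."""

    dma: str                      # backend used for single-buffering
    dbDma: Optional[str] = None   # backend for double-buffering; optional

    def __post_init__(self) -> None:
        # Omitting dbDma reproduces the old behaviour: one backend for both.
        if self.dbDma is None:
            self.dbDma = self.dma

    def backend_for(self, double_buffered: bool) -> str:
        """Pick the backend for the requested buffering mode."""
        return self.dbDma if double_buffered else self.dma


# Existing call sites pass only `dma` and keep a single backend.
legacy = L3TilingSketch(dma="gap9L3DmaHack")
# New call sites can split blocking SB from async DB.
split = L3TilingSketch(dma="gap9L3DmaHack", dbDma="GAP9L3Dma")
```

Defaulting the new parameter to the existing one means callers that never pass `dbDma` see no change in behaviour.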
…mple

- TargetLibraries/GAP9: compile the hot forward kernels (Convolution_fp32,
  DWConvolution_fp32, Gemm) at -O3, appended last so it wins over the SDK's
  default -Os. These dominate GAP9 inference cycles.
- deeploytest.c / CMake / sdk config: make the GAP9 example show the three
  L1-memory knobs that let conv-heavy nets fit, with explanatory comments:
    A. slave (PE) stacks -> L2: hand the cluster task a static L2 buffer
       (SET_SLAVE_STACK) so the SDK skips its L1 slave-stack alloc (~30 KB L1).
    B. shrink the SDK's L1 slave stacks via CONFIG_CL_SLAVE_CORE_STACK_SIZE
       (sdk_gvsoc.config) -- alternative to A.
    C. size the cluster-controller stack via conf.cc_stack_size, overridable
       from the build with -DCC_STACK_SIZE=<bytes> (new CMake option).

Verified MatMul --defaultMemLevel L3 -DCC_STACK_SIZE=8192 on GVSoC: 0/256.
…stack)

The tiling argument structs were stack-locals in the dispatching function. The
cluster fork runtime writes its descriptor near the top of the CC/master stack;
a stack-local arg struct placed there can be clobbered before the forked cores
read it (a GAP9 cluster-fork crash, e.g. MobileNetV1). Declare the struct
`static` and assign separately so it lives in static storage, stable across the
forked call.

Generic codegen (ArgumentStructGeneration); benign on other targets. Verified
MatMul --defaultMemLevel L3 on GVSoC: 0/256.
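
A rough sketch of the "declare `static`, assign separately" pattern, written as a small Python emitter of C text. It is not the ArgumentStructGeneration code: the helper `emit_static_args` and the struct and field names are hypothetical. The separate assignments matter because C requires initializers of static-storage objects to be constant expressions, so run-time pointers cannot appear in the declaration itself.

```python
from typing import Dict


def emit_static_args(struct_type: str, var_name: str, fields: Dict[str, str]) -> str:
    """Emit C code that declares a static struct, then assigns each field."""
    # The declaration carries no initializer so it stays valid for
    # values that are only known at run time.
    lines = [f"static {struct_type} {var_name};"]
    # One assignment per field, in the order given by the caller.
    lines += [f"{var_name}.{name} = {value};" for name, value in fields.items()]
    return "\n".join(lines)


# Hypothetical struct and field names, for illustration only.
code = emit_static_args(
    "tile_args_t", "args", {"in": "input_ptr", "out": "output_ptr"}
)
```

The emitted struct lives in static storage rather than on the dispatching function's stack, which is the property the commit relies on.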

GAP9 mchan allocates a fresh channel on every descriptor enqueue. The previous
DirectionWaitingStrategy shared one future (one mchan_transfer_get_id) across all
same-direction tensors of a tile, so a tile with >1 input emitted one get_id but
multiple pushes -> the extra transfers ran on channels that were never waited on
or freed -> mchan_transfer_wait() hung (e.g. the optimizer weight+grad stall).
Switch to PerTensorWaitingStrategy so each tensor gets its own
get_id : push : wait : free, matching the mchan contract.

Verified MatMul --defaultMemLevel L3 on GVSoC: 0/256.
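
The channel leak can be reproduced with a toy model. The class below is an assumption-laden simplification, not the mchan API: `ToyMchan` and its methods are hypothetical, and the only behaviour it models is the one stated above, namely that every push occupies a new channel and only a channel whose id was captured can be waited on and freed.

```python
class ToyMchan:
    """Toy model: each push occupies a fresh channel id."""

    def __init__(self) -> None:
        self._next = 0
        self.in_flight = set()

    def get_id(self) -> int:
        # Id of the channel that the next push will occupy.
        return self._next

    def push(self) -> None:
        self.in_flight.add(self._next)
        self._next += 1

    def wait_and_free(self, cid: int) -> None:
        self.in_flight.discard(cid)


def leaked_with_shared_future(num_tensors: int) -> int:
    """One get_id for the whole tile, one push per tensor."""
    m = ToyMchan()
    cid = m.get_id()
    for _ in range(num_tensors):
        m.push()
    m.wait_and_free(cid)
    return len(m.in_flight)


def leaked_with_per_tensor_future(num_tensors: int) -> int:
    """One get_id per tensor, so every channel is waited on and freed."""
    m = ToyMchan()
    ids = []
    for _ in range(num_tensors):
        ids.append(m.get_id())
        m.push()
    for cid in ids:
        m.wait_and_free(cid)
    return len(m.in_flight)
```

In this model a tile with a single tensor is fine under either strategy, which is consistent with the problem only appearing for tiles with more than one same-direction tensor.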

Explain each GAP9 backend change in this branch: the L3-aware board harness, the
SB/DB L3 DMA split, the -O3 forward kernels, the three L1-memory knobs (cc_stack /
slave-stack size / slave-stack->L2), the static cluster fork/closure args, and the
per-tensor mchan DMA waiting strategy. Each entry gives the problem, fix, file, and
takeaway. Also add a short GAP9 memory-model primer.

Add gap9_memcheck.py and run it from run_complete_test after the build and before
the simulation, on GAP9 only. It models every L1/L2 consumer that the tiler does
not account for (CC master stack, PE slave stacks, ELF sections, tile arena,
promoted pool) and scans InitNetwork for the pi_l2_malloc-after-cl_ram_malloc
alloc-order race, so over-subscription fails fast and names the exact knob to
adjust instead of ending in a multi-minute GVSoC hang. Bypass with
DEPLOY_SKIP_MEMCHECK=1. Verified MatMul L3: gate
runs (PASS) and test is 0/256.
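
The kind of gate described here can be sketched in a few functions. This is not the contents of gap9_memcheck.py: the function names, the way consumers are passed in, and the representation of InitNetwork as a list of call names are all assumptions made for illustration. Only the environment variable name and the two malloc names come from the commit message.

```python
import os
from typing import Dict, List, Mapping, Tuple


def check_budget(level: str, capacity: int, consumers: Dict[str, int]) -> Tuple[bool, str]:
    """Sum all consumers of one memory level and compare against capacity."""
    used = sum(consumers.values())
    if used <= capacity:
        return True, f"{level}: {used}/{capacity} bytes"
    biggest = max(consumers, key=consumers.get)
    over = used - capacity
    return False, f"{level} over-subscribed by {over} bytes; largest consumer: {biggest}"


def has_alloc_order_race(init_calls: List[str]) -> bool:
    """True if any pi_l2_malloc appears after a cl_ram_malloc."""
    seen_ram_malloc = False
    for call in init_calls:
        if call == "cl_ram_malloc":
            seen_ram_malloc = True
        elif call == "pi_l2_malloc" and seen_ram_malloc:
            return True
    return False


def memcheck_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """The gate runs unless DEPLOY_SKIP_MEMCHECK=1 is set."""
    return env.get("DEPLOY_SKIP_MEMCHECK") != "1"
```

Running such a check between build and simulation turns an over-subscribed configuration into an immediate, readable failure rather than a simulator hang.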
4 participants