Zo support by JanCSEM · Pull Request #192 · pulp-platform/Deeploy

JanCSEM · 2026-05-12T06:59:26Z

Describe the intent of your PR here.

Added

Changed

Fixed

PR Merge Checklist

The PR is rebased on the latest devel commit and pointing to devel.
Your PR is reviewed and approved.
All checks are passing.
The CHANGELOG.md file has been updated.
If the docker was modified, change back its link after review.

- Cleanup Docker flow to use a temporary build folder
- Install zsh and oh-my-zsh plugin
- Adapt GAP9 Docker Flow to support ARM64
- Add GAP9 Docker GitHub Build Flow
- Add GAP9 Run script to use real hardware
- Update README
- Fix missing pre-commit dependency

WIP

deeploytest.c classified memory by `ptr >= 0x10000000` (inputs) / `< 0x10000000` (outputs). HyperRAM/L3 addresses (cl_ram_malloc) are also >= 0x10000000 but are NOT CPU-addressable, so for `--defaultMemLevel L3` tests on real silicon main did a raw memcpy / CPU-deref of an L3 pointer -> 'Invalid fetch' fault in main (e.g. MatMul L3 on board: fault at the cl_ram_malloc'd input address). GVSoC models HyperRAM as flat RAM so it passed there, masking the bug.

Add IS_L1/IS_L2 on-chip-window macros and use them:
- Inputs: only memcpy on-chip (IS_L2) inputs with a non-NULL testInputVector; L3 inputs are loaded from the readfs hex in InitNetwork (testInputVector is NULL) and already live in HyperRAM, so skip them.
- Outputs: ram_read L3 outputs into an L2 scratch before the compare (and free it); on-chip outputs compared in place.

Paired malloc/free kept in sync.

Verified MatMul --defaultMemLevel L3 on GVSoC: 0/256 (unchanged). On-chip (L2) tests behave identically; only L3 paths change.
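The classification described above can be modelled in a few lines. This is only a sketch in Python, not the C macros from deeploytest.c: the window bounds, the example pointers, and all function names are placeholder assumptions chosen for illustration, and only the `0x10000000` threshold comes from the commit message.

```python
# Toy model of the IS_L1 / IS_L2 classification (the real macros live in C).
# All window bounds below are placeholder assumptions, not the GAP9 memory map.
L1_WINDOW = (0x1000_0000, 0x1002_0000)   # hypothetical [start, end) of cluster L1
L2_WINDOW = (0x1C00_0000, 0x1C18_0000)   # hypothetical [start, end) of on-chip L2


def is_l1(addr: int) -> bool:
    return L1_WINDOW[0] <= addr < L1_WINDOW[1]


def is_l2(addr: int) -> bool:
    return L2_WINDOW[0] <= addr < L2_WINDOW[1]


def old_is_input(addr: int) -> bool:
    # The single-threshold test from the message: it cannot tell an
    # on-chip pointer from an L3 one, since both are >= 0x10000000.
    return addr >= 0x1000_0000


def should_memcpy_input(addr: int, has_test_vector: bool) -> bool:
    # Only on-chip L2 inputs with a non-NULL test vector are copied by the CPU.
    return is_l2(addr) and has_test_vector


def output_compare_plan(addr: int) -> str:
    # On-chip outputs are compared in place; anything else is treated as L3
    # and staged through an L2 scratch buffer first.
    if is_l1(addr) or is_l2(addr):
        return "compare_in_place"
    return "ram_read_to_l2_scratch_then_compare"


L2_PTR = 0x1C00_1000   # hypothetical on-chip address
L3_PTR = 0x3000_0000   # hypothetical cl_ram_malloc result

print(old_is_input(L2_PTR), old_is_input(L3_PTR))
print(should_memcpy_input(L3_PTR, False), output_compare_plan(L3_PTR))
```

The point of the model is that the old threshold returns the same answer for both example pointers, while the window checks separate them and route the L3 pointer away from any direct CPU access.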

The L3<->L2 tiling used one DMA backend for both single- and double-buffering. Async DMA only helps double-buffering (it overlaps the next-tile prefetch with compute); single-buffering waits on each tile before computing, so async gives SB no benefit but all the risk: strided 2D L3 transfers (pi_cl_ram_copy_2d) can corrupt under deferred waits.

- PULPL3Tiling: add optional `dbDma` (defaults to `dma`) so SB and DB can use different backends. Backward compatible.
- GAP9 bindings: SB keeps the blocking gap9L3DmaHack; DB uses async GAP9L3Dma for real L3<->L2 prefetch overlap.
- GAP9L3Dma: reset future `.size`=0 after copy-wait (so a completed future isn't waited twice) and cast `${ext}` to uint32_t in the 2D transfer.

Verified MatMul --defaultMemLevel L3 on GVSoC: 0/256.
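The backward-compatible `dbDma` default can be illustrated with a small stand-alone sketch. The classes below are hypothetical stand-ins written for this example; they are not the actual PULPL3Tiling or GAP9L3Dma classes, and only the backend names and the "defaults to `dma`" behaviour are taken from the message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DmaBackend:
    """Hypothetical stand-in for a DMA code-generation backend."""
    name: str
    blocking: bool


class L3TilingSketch:
    """Illustrative only: picks a backend per buffering mode."""

    def __init__(self, dma: DmaBackend, db_dma: Optional[DmaBackend] = None):
        self.dma = dma
        # When no double-buffering backend is given, fall back to `dma`,
        # so existing callers keep their old behaviour.
        self.db_dma = db_dma if db_dma is not None else dma

    def backend_for(self, double_buffered: bool) -> DmaBackend:
        return self.db_dma if double_buffered else self.dma


blocking = DmaBackend("gap9L3DmaHack", blocking=True)
asynchronous = DmaBackend("GAP9L3Dma", blocking=False)

legacy = L3TilingSketch(blocking)               # one backend for both modes
split = L3TilingSketch(blocking, asynchronous)  # SB blocking, DB async

print(legacy.backend_for(True).name, split.backend_for(True).name)
```

Keeping the second argument optional means a caller that passes a single backend sees no change, which is what makes the split opt-in.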

…mple

- TargetLibraries/GAP9: compile the hot forward kernels (Convolution_fp32, DWConvolution_fp32, Gemm) at -O3, appended last so it wins over the SDK's default -Os. These dominate GAP9 inference cycles.
- deeploytest.c / CMake / sdk config: make the GAP9 example show the three L1-memory knobs that let conv-heavy nets fit, with explanatory comments:
  A. slave (PE) stacks -> L2: hand the cluster task a static L2 buffer (SET_SLAVE_STACK) so the SDK skips its L1 slave-stack alloc (~30 KB L1).
  B. shrink the SDK's L1 slave stacks via CONFIG_CL_SLAVE_CORE_STACK_SIZE (sdk_gvsoc.config) -- alternative to A.
  C. size the cluster-controller stack via conf.cc_stack_size, overridable from the build with -DCC_STACK_SIZE=<bytes> (new CMake option).

Verified MatMul --defaultMemLevel L3 -DCC_STACK_SIZE=8192 on GVSoC: 0/256.

…stack)

The tiling argument structs were stack-locals in the dispatching function. The cluster fork runtime writes its descriptor near the top of the CC/master stack; a stack-local arg struct placed there can be clobbered before the forked cores read it (a GAP9 cluster-fork crash, e.g. MobileNetV1).

Declare the struct `static` and assign separately so it lives in static storage, stable across the forked call. Generic codegen (ArgumentStructGeneration); benign on other targets.

Verified MatMul --defaultMemLevel L3 on GVSoC: 0/256.

GAP9 mchan allocates a fresh channel on every descriptor enqueue. The previous DirectionWaitingStrategy shares one future (one mchan_transfer_get_id) across all same-direction tensors of a tile, so a tile with >1 input emits one get_id but multiple pushes -> the extra transfers run on channels that are never waited or freed -> mchan_transfer_wait() hangs (e.g. the optimizer weight+grad stall).

Switch to PerTensorWaitingStrategy so each tensor gets its own get_id : push : wait : free, matching the mchan contract.

Verified MatMul --defaultMemLevel L3 on GVSoC: 0/256.
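The channel leak described above can be reproduced with a toy model. This is a simplified sketch, not the mchan driver or Deeploy's waiting-strategy classes: `MchanModel` and both helper functions are hypothetical, and the model assumes only what the message states, namely that every enqueue takes a fresh channel that must later be waited on and freed.

```python
class MchanModel:
    """Toy allocator: each push takes a fresh channel until it is freed."""

    def __init__(self) -> None:
        self._next_id = 0
        self.busy = set()

    def push(self) -> int:
        channel = self._next_id
        self._next_id += 1
        self.busy.add(channel)
        return channel

    def wait_and_free(self, channel: int) -> None:
        self.busy.discard(channel)


def leaked_with_shared_future(num_tensors: int) -> int:
    """One future per direction: only the first channel is tracked."""
    dma = MchanModel()
    tracked = None
    for _ in range(num_tensors):
        channel = dma.push()
        if tracked is None:
            tracked = channel
    if tracked is not None:
        dma.wait_and_free(tracked)
    return len(dma.busy)  # channels nobody waits on or frees


def leaked_with_per_tensor_future(num_tensors: int) -> int:
    """One future per tensor: every channel is waited on and freed."""
    dma = MchanModel()
    channels = [dma.push() for _ in range(num_tensors)]
    for channel in channels:
        dma.wait_and_free(channel)
    return len(dma.busy)


print(leaked_with_shared_future(2), leaked_with_per_tensor_future(2))
```

In this model the shared-future variant is only safe for tiles with a single tensor per direction, which matches the observation that the hang appears once a tile has more than one input.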

Explain each GAP9 backend change in this branch (L3-aware board harness, SB/DB L3 DMA split, -O3 forward kernels, the three L1-memory knobs (cc_stack / slave-stack size / slave-stack->L2), static cluster fork/closure args, and the per-tensor mchan DMA waiting strategy) with problem, fix, file, and takeaway, plus a short GAP9 memory-model primer.

Add gap9_memcheck.py and run it from run_complete_test after the build, before the simulation, on GAP9. It models every consumer of L1/L2 the tiler doesn't (CC master stack, PE slave stacks, ELF sections, tile arena, promoted pool) and scans InitNetwork for the pi_l2_malloc-after-cl_ram_malloc alloc-order race, so over-subscription fails fast with the exact knob instead of a multi-minute GVSoC hang. GAP9-only; bypass with DEPLOY_SKIP_MEMCHECK=1. Verified MatMul L3: gate runs (PASS) and test is 0/256.
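The two checks the message describes, a budget sum over memory consumers and an allocation-order scan, might look roughly like the sketch below. It is not the code in gap9_memcheck.py: the function names, the capacity, and the tile-arena size are assumptions made up for illustration, and the scan is a deliberately naive text search.

```python
def over_budget(capacity_bytes: int, consumers: dict) -> int:
    """Return by how many bytes the consumers exceed capacity (0 if they fit)."""
    return max(0, sum(consumers.values()) - capacity_bytes)


def has_alloc_order_race(init_network_src: str) -> bool:
    """Flag any pi_l2_malloc that appears after the first cl_ram_malloc."""
    first_l3_alloc = init_network_src.find("cl_ram_malloc")
    if first_l3_alloc == -1:
        return False
    return "pi_l2_malloc" in init_network_src[first_l3_alloc:]


# Hypothetical L1 budget; the sizes are illustrative, not measured values.
L1_CAPACITY = 128 * 1024
l1_consumers = {
    "cc_master_stack": 8192,
    "pe_slave_stacks": 30 * 1024,
    "tile_arena": 100 * 1024,
}

excess = over_budget(L1_CAPACITY, l1_consumers)
if excess:
    print(f"L1 over-subscribed by {excess} bytes")

racy_src = "in = cl_ram_malloc(64); buf = pi_l2_malloc(64);"
safe_src = "buf = pi_l2_malloc(64); in = cl_ram_malloc(64);"
print(has_alloc_order_race(racy_src), has_alloc_order_race(safe_src))
```

Running a check like this before the simulator starts is what turns a long hang into an immediate, attributable failure.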

runwangdl and others added 30 commits February 12, 2026 15:27

Deeploy Microbenchmark with GVSoC CSR and Demo on GEMM

49cddd2

Add float concat and Change padding pattern of ConV

b260e4e

Merge branch 'devel' into sleepvit

7b55d0a

Support SleepViT on Gap9

5a79c79

Add microbenchmark to codepass

609179c

Fix spelling mistakes and remove dependencies from fork

97c2d2b

Fix Missing Version Link

3423c54

Temporarily disable GAP9 on forks

d6b6ac9

Add Shell Format pre-commit

daf8cda

Update to GAP9 SDK v5.21.1-staging-1

f1c7d57

Print memory usage by default

646563d

Cleanup Makefile

b2b43a5

Try to fix private GAP9 SDK access issue

1a075d0

Use pre-build GAP9 GCC

2bb1bf5

Fix Typos

af405a2

Partially revert a16c1c7

c6bc2c6

We still need x86 compilation for Autotiler

Build AutoTiler

d325f79

CodeAIRabbit Feedback

53b4bb9

Update Changelog

6d1c8c3

CodeAIRabbit Feedback

6db1c52

Add single kernel tests for random perturbations

ba6c1e8

Add ZO model tests

90edc44

Cherry picked NCHW->NHWC transform issue

4bcefd3

FIX: Bug in conv2D parser assigning kernel shape regardless of channels_first state

44699b2

Fix compilation bugs caused by ZO nodes

2b5f611

Add option to deploy on the board for the GAP9 platform

846d4dd

WIP: Better error message when attaching usbip and vid + pid of the new evk GAP9 board

ae3c4d1

remove prints

02ac6c9

Remove debugging test, add Eggroll test+

5911375

JanCSEM and others added 30 commits June 14, 2026 15:32

Tentative FiX in matrixVectorTemplate

8501663

Fix mixed precision RQS graphs

e5c72a7

Add more prints to the MatMul RQ kernel

529dba2

Further hang debug prints

50afbd2

Replace debug prints with memory markers

f370a1f

Remove debug logic

e3aedb1

Fixed issue in the Unary tile constraint where the output tensor has an added dimension...

b448073

Add flag for parallel vs sequential Rademacher

63cf5c9

Add microbenchmark to the sequential RAD kernel for running

92665fa

Add microbenchmark to the sequential RAD kernel for running

65213fd

Fix declaration issue for microbench mesurement

fdbc7bb

Fix cluster team barrier redundancy causing hang in GEMM

6d098e8

Add pass to fold ReLU6 into RequantShift to avoid cast to Fp32

ca2d78b

Remove debug models, update Quantized models

cd60df1

cherry pick Tiling tables storage move from L1 to L2.

ec3b566

fix issue with input quant nodes in MCUNet

7f22b01

fold transpose into GEMM to save memory for QTSDRZO

b49f8ad

merge GAP9 L3 patches

d37fe89

typo fix in CMakelists + add peak mem extraction script

ff6fbd1

FIX hardcoded readelf path in memcheck

98cf147

Regenerate SleepConViT with Rad

acf24a8

Fix SleepConViTZO

450b6da

regen QSLeepConvIT

a81eab2

JanCSEM wants to merge 174 commits into pulp-platform:devel from JanCSEM:zo-support


4 participants
