Zo support#192
Draft
JanCSEM wants to merge 174 commits into
- Clean up Docker flow to use a temporary build folder
- Install zsh and oh-my-zsh plugin
- Adapt GAP9 Docker Flow to support ARM64
- Add GAP9 Docker GitHub Build Flow
- Add GAP9 Run script to use real hardware
- Update README
- Fix missing pre-commit dependency

WIP
We still need x86 compilation for Autotiler
…ew evk GAP9 board
…an added dimension...
deeploytest.c classified memory by `ptr >= 0x10000000` (inputs) / `< 0x10000000` (outputs). HyperRAM/L3 addresses (cl_ram_malloc) are also >= 0x10000000 but are NOT CPU-addressable, so for `--defaultMemLevel L3` tests on real silicon, `main` did a raw memcpy / CPU dereference of an L3 pointer, which raised an 'Invalid fetch' fault in `main` (e.g. MatMul L3 on board: fault at the cl_ram_malloc'd input address). GVSoC models HyperRAM as flat RAM, so the test passed there, masking the bug.

Add IS_L1/IS_L2 on-chip-window macros and use them:
- Inputs: only memcpy on-chip (IS_L2) inputs with a non-NULL testInputVector; L3 inputs are loaded from the readfs hex in InitNetwork (testInputVector is NULL) and already live in HyperRAM, so skip them.
- Outputs: ram_read L3 outputs into an L2 scratch buffer before the compare (and free it afterwards); on-chip outputs are compared in place. Paired malloc/free kept in sync.

Verified MatMul --defaultMemLevel L3 on GVSoC: 0/256 (unchanged). On-chip (L2) tests behave identically; only L3 paths change.
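The classification change can be modelled outside the C harness. The sketch below is a Python model, not the actual IS_L1/IS_L2 macros: the window bounds, the sample HyperRAM address, and the helper names `plan_input` / `plan_output` are assumptions made for illustration only.

```python
# Hypothetical on-chip windows. The real bounds come from the GAP9 memory
# map and the macros added to deeploytest.c; these values are assumptions.
L1_BASE, L1_SIZE = 0x1000_0000, 0x0002_0000
L2_BASE, L2_SIZE = 0x1C00_0000, 0x0018_0000


def is_l1(addr: int) -> bool:
    return L1_BASE <= addr < L1_BASE + L1_SIZE


def is_l2(addr: int) -> bool:
    return L2_BASE <= addr < L2_BASE + L2_SIZE


def is_on_chip(addr: int) -> bool:
    return is_l1(addr) or is_l2(addr)


def old_is_input(addr: int) -> bool:
    # Previous rule: a single threshold, which an L3 address also passes.
    return addr >= 0x1000_0000


def plan_input(addr: int, has_test_vector: bool) -> str:
    # Only on-chip L2 inputs that come with a test vector are copied.
    return "memcpy" if is_l2(addr) and has_test_vector else "skip"


def plan_output(addr: int) -> str:
    # L3 outputs must be staged through an L2 scratch buffer first.
    return "compare_in_place" if is_on_chip(addr) else "ram_read_then_compare"


l3_addr = 0x2000_0000  # assumed example of a cl_ram_malloc'd address
print(old_is_input(l3_addr), is_on_chip(l3_addr))
print(plan_input(l3_addr, False), plan_output(l3_addr))
```

The point of the window check is that it answers "can the CPU dereference this pointer?" directly, where the threshold only separated two buffers that happened to sit on either side of it.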
The L3<->L2 tiling used one DMA backend for both single- and double-buffering.
Async DMA only helps double-buffering (it overlaps the next-tile prefetch with
compute); single-buffering waits on each tile before computing, so async gives
SB no benefit but all the risk — strided 2D L3 transfers (pi_cl_ram_copy_2d) can
corrupt under deferred waits.
- PULPL3Tiling: add optional `dbDma` (defaults to `dma`) so SB and DB can use
different backends. Backward compatible.
- GAP9 bindings: SB keeps the blocking gap9L3DmaHack; DB uses async GAP9L3Dma for
real L3<->L2 prefetch overlap.
- GAP9L3Dma: reset future `.size`=0 after copy-wait (so a completed future isn't
waited twice) and cast `${ext}` to uint32_t in the 2D transfer.
Verified MatMul --defaultMemLevel L3 on GVSoC: 0/256.
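The backward-compatible `dbDma` default described above can be sketched as follows. This is a minimal illustration, not the PULPL3Tiling class itself: `L3TilingSketch`, `db_dma`, and `backend_for` are hypothetical names, and the backends are represented by plain strings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class L3TilingSketch:
    """Hypothetical stand-in for a tiling pass that takes one or two DMA backends."""
    dma: str                      # backend used for single-buffering
    db_dma: Optional[str] = None  # backend used for double-buffering

    def __post_init__(self) -> None:
        # Existing callers pass only `dma`, so both modes share it by default.
        if self.db_dma is None:
            self.db_dma = self.dma

    def backend_for(self, double_buffered: bool) -> str:
        return self.db_dma if double_buffered else self.dma


legacy = L3TilingSketch(dma="gap9L3DmaHack")
split = L3TilingSketch(dma="gap9L3DmaHack", db_dma="GAP9L3Dma")
print(legacy.backend_for(True), split.backend_for(True), split.backend_for(False))
```

Defaulting the second backend to the first keeps every existing binding unchanged, so only targets that opt in (here GAP9) see different behaviour.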
…mple
- TargetLibraries/GAP9: compile the hot forward kernels (Convolution_fp32,
  DWConvolution_fp32, Gemm) at -O3, appended last so it wins over the SDK's
  default -Os. These dominate GAP9 inference cycles.
- deeploytest.c / CMake / sdk config: make the GAP9 example show the three
  L1-memory knobs that let conv-heavy nets fit, with explanatory comments:
  A. slave (PE) stacks -> L2: hand the cluster task a static L2 buffer
     (SET_SLAVE_STACK) so the SDK skips its L1 slave-stack alloc (~30 KB L1).
  B. shrink the SDK's L1 slave stacks via CONFIG_CL_SLAVE_CORE_STACK_SIZE
     (sdk_gvsoc.config) -- alternative to A.
  C. size the cluster-controller stack via conf.cc_stack_size, overridable
     from the build with -DCC_STACK_SIZE=<bytes> (new CMake option).

Verified MatMul --defaultMemLevel L3 -DCC_STACK_SIZE=8192 on GVSoC: 0/256.
…stack)

The tiling argument structs were stack-locals in the dispatching function. The cluster fork runtime writes its descriptor near the top of the CC/master stack; a stack-local arg struct placed there can be clobbered before the forked cores read it (a GAP9 cluster-fork crash, e.g. MobileNetV1).

Declare the struct `static` and assign separately so it lives in static storage, stable across the forked call. Generic codegen (ArgumentStructGeneration); benign on other targets.

Verified MatMul --defaultMemLevel L3 on GVSoC: 0/256.
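The "declare `static`, assign separately" pattern can be illustrated with a small code-emitting helper. This is a sketch, not the ArgumentStructGeneration pass: `emit_arg_struct` and the struct and field names are hypothetical. The assignment is kept separate because C requires constant initializers for objects with static storage, while the argument values are only known at run time.

```python
def emit_arg_struct(struct_type: str, var_name: str, fields: dict,
                    use_static: bool = True) -> str:
    """Return C source that declares and fills one argument struct."""
    init = ", ".join(f".{name} = {value}" for name, value in fields.items())
    if use_static:
        # Static storage: the struct stays off the dispatching function's stack.
        # Runtime values are assigned afterwards via a compound literal.
        return (f"static {struct_type} {var_name};\n"
                f"{var_name} = ({struct_type}){{{init}}};")
    # Old form: a stack-local struct initialised in place.
    return f"{struct_type} {var_name} = {{{init}}};"


print(emit_arg_struct("tile_args_t", "args", {"in": "pIn", "len": "16"}))
```

One trade-off worth keeping in mind: a `static` argument struct is shared across calls, so the pattern assumes the dispatching function is not re-entered while forked cores are still reading it.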
GAP9 mchan allocates a fresh channel on every descriptor enqueue. The previous DirectionWaitingStrategy shares one future (one mchan_transfer_get_id) across all same-direction tensors of a tile, so a tile with >1 input emits one get_id but multiple pushes. The extra transfers run on channels that are never waited on or freed, and mchan_transfer_wait() hangs (e.g. the optimizer weight+grad stall).

Switch to PerTensorWaitingStrategy so each tensor gets its own get_id : push : wait : free, matching the mchan contract.

Verified MatMul --defaultMemLevel L3 on GVSoC: 0/256.
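The channel leak can be reproduced with a toy model. The sketch below is not the mchan driver or the Deeploy waiting strategies: `MchanModel`, `shared_future`, and `per_tensor` are hypothetical names, and the only behaviour modelled is the one stated above, namely that every enqueue occupies a fresh channel until it is waited on and freed.

```python
class MchanModel:
    """Toy model: each enqueue takes a new channel that stays busy until freed."""

    def __init__(self) -> None:
        self._next_id = 0
        self.busy = set()

    def enqueue(self) -> int:
        cid = self._next_id
        self._next_id += 1
        self.busy.add(cid)
        return cid

    def wait_and_free(self, cid: int) -> None:
        self.busy.discard(cid)


def shared_future(tensors) -> set:
    """One tracked id per direction: later transfers are never waited on."""
    hw = MchanModel()
    tracked = None
    for _ in tensors:
        cid = hw.enqueue()
        if tracked is None:
            tracked = cid
    if tracked is not None:
        hw.wait_and_free(tracked)
    return hw.busy  # channels still occupied after the tile


def per_tensor(tensors) -> set:
    """One id per tensor: every transfer is waited on and freed."""
    hw = MchanModel()
    ids = [hw.enqueue() for _ in tensors]
    for cid in ids:
        hw.wait_and_free(cid)
    return hw.busy


print(shared_future(["weight", "grad"]), per_tensor(["weight", "grad"]))
```

With a single input the two strategies behave the same in this model, which is consistent with the bug only surfacing on tiles that have more than one tensor per direction.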
Explain each GAP9 backend change in this branch — L3-aware board harness, SB/DB L3 DMA split, -O3 forward kernels, the three L1-memory knobs (cc_stack / slave-stack size / slave-stack->L2), static cluster fork/closure args, and the per-tensor mchan DMA waiting strategy — with problem, fix, file, and takeaway, plus a short GAP9 memory-model primer.
Add gap9_memcheck.py and run it from run_complete_test after the build, before the simulation, on GAP9.

It models every consumer of L1/L2 the tiler doesn't (CC master stack, PE slave stacks, ELF sections, tile arena, promoted pool) and scans InitNetwork for the pi_l2_malloc-after-cl_ram_malloc alloc-order race, so over-subscription fails fast with the exact knob instead of a multi-minute GVSoC hang.

GAP9-only; bypass with DEPLOY_SKIP_MEMCHECK=1.

Verified MatMul L3: gate runs (PASS) and test is 0/256.
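The two checks can be sketched in a few lines. This is not gap9_memcheck.py: the function names, the 128 KiB L1 capacity, the byte counts, and the sample InitNetwork body are assumptions, and the race scan simply flags any pi_l2_malloc that appears textually after the first cl_ram_malloc, which is one plausible reading of the alloc-order check.

```python
import re


def check_budget(capacity: int, consumers: dict) -> tuple:
    """Return (fits, used_bytes) for one memory level."""
    used = sum(consumers.values())
    return used <= capacity, used


def find_alloc_order_race(init_network_src: str) -> list:
    """Return 1-based line numbers of pi_l2_malloc calls after a cl_ram_malloc."""
    seen_l3 = False
    hits = []
    for lineno, line in enumerate(init_network_src.splitlines(), start=1):
        if re.search(r"\bcl_ram_malloc\b", line):
            seen_l3 = True
        elif seen_l3 and re.search(r"\bpi_l2_malloc\b", line):
            hits.append(lineno)
    return hits


# Assumed numbers, for illustration only.
l1_consumers = {"cc_stack": 8192, "pe_stacks": 30720, "tile_arena": 65536}
print(check_budget(128 * 1024, l1_consumers))

sample_src = (
    "a = pi_l2_malloc(64);\n"
    "b = cl_ram_malloc(128);\n"
    "c = pi_l2_malloc(32);\n"
)
print(find_alloc_order_race(sample_src))
```

Running a gate like this before the simulation turns an over-subscribed configuration into an immediate failure that names the consumer to shrink.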