Fix GAP9 L3 Performance and Memory Transfer#197
Conversation
deeploytest.c classified memory by `ptr >= 0x10000000` (inputs) / `< 0x10000000` (outputs). HyperRAM/L3 addresses (cl_ram_malloc) are also >= 0x10000000 but are NOT CPU-addressable, so for `--defaultMemLevel L3` tests on real silicon main did a raw memcpy / CPU-deref of an L3 pointer -> 'Invalid fetch' fault in main (e.g. MatMul L3 on board: fault at the cl_ram_malloc'd input address). GVSoC models HyperRAM as flat RAM so it passed there, masking the bug. Add IS_L1/IS_L2 on-chip-window macros and use them: - Inputs: only memcpy on-chip (IS_L2) inputs with a non-NULL testInputVector; L3 inputs are loaded from the readfs hex in InitNetwork (testInputVector is NULL) and already live in HyperRAM, so skip them. - Outputs: ram_read L3 outputs into an L2 scratch before the compare (and free it); on-chip outputs compared in place. Paired malloc/free kept in sync. Verified MatMul --defaultMemLevel L3 on GVSoC: 0/256 (unchanged). On-chip (L2) tests behave identically; only L3 paths change.
The L3<->L2 tiling used one DMA backend for both single- and double-buffering.
Async DMA only helps double-buffering (it overlaps the next-tile prefetch with
compute); single-buffering waits on each tile before computing, so async gives
SB no benefit but all the risk — strided 2D L3 transfers (pi_cl_ram_copy_2d) can
corrupt under deferred waits.
- PULPL3Tiling: add optional `dbDma` (defaults to `dma`) so SB and DB can use
different backends. Backward compatible.
- GAP9 bindings: SB keeps the blocking gap9L3DmaHack; DB uses async GAP9L3Dma for
real L3<->L2 prefetch overlap.
- GAP9L3Dma: reset future `.size`=0 after copy-wait (so a completed future isn't
waited twice) and cast `${ext}` to uint32_t in the 2D transfer.
Verified MatMul --defaultMemLevel L3 on GVSoC: 0/256.
📝 WalkthroughWalkthrough
ChangesAsync DB DMA engine split in PULPL3Tiling and GAP9 wiring
IS_L1/IS_L2 macro-based address detection in GAP9 test harness
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
DeeployTest/Platforms/GAP9/src/deeploytest.c (1)
178-184:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winMissing NULL check on
pi_l2_mallocmay cause crash.If L2 memory is exhausted,
pi_l2_mallocreturns NULL, and the subsequentram_readand comparison will dereference NULL, causing a fault. Adding a check with an error message provides clearer diagnostics than an opaque crash.🛡️ Proposed fix to handle allocation failure
if (!IS_L2(DeeployNetwork_outputs[buf])) { compbuf = pi_l2_malloc(DeeployNetwork_outputs_bytes[buf]); + if (compbuf == NULL) { + printf("Error: failed to allocate L2 scratch for output %u (%u bytes)\r\n", + buf, DeeployNetwork_outputs_bytes[buf]); + return -1; + } ram_read(compbuf, DeeployNetwork_outputs[buf], DeeployNetwork_outputs_bytes[buf]); } else {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@DeeployTest/Platforms/GAP9/src/deeploytest.c` around lines 178 - 184, The pi_l2_malloc call may return NULL when L2 memory is exhausted, but the code does not check for this failure before passing compbuf to ram_read, which will cause a NULL pointer dereference. Add a NULL check immediately after the pi_l2_malloc call (when IS_L2(DeeployNetwork_outputs[buf]) is false) and handle the error case with an appropriate error message or failure path to prevent the crash.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@DeeployTest/Platforms/GAP9/src/deeploytest.c`:
- Around line 178-184: The pi_l2_malloc call may return NULL when L2 memory is
exhausted, but the code does not check for this failure before passing compbuf
to ram_read, which will cause a NULL pointer dereference. Add a NULL check
immediately after the pi_l2_malloc call (when IS_L2(DeeployNetwork_outputs[buf])
is false) and handle the error case with an appropriate error message or failure
path to prevent the crash.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 31b0183f-da2c-45e2-99d4-f43e1f5d2bbe
📒 Files selected for processing (4)
Deeploy/Targets/GAP9/Bindings.pyDeeploy/Targets/GAP9/DMA/L3Dma.pyDeeploy/Targets/PULPOpen/CodeTransformationPasses/PULPL3Tiling.pyDeeployTest/Platforms/GAP9/src/deeploytest.c
…mple
- TargetLibraries/GAP9: compile the hot forward kernels (Convolution_fp32,
DWConvolution_fp32, Gemm) at -O3, appended last so it wins over the SDK's
default -Os. These dominate GAP9 inference cycles.
- deeploytest.c / CMake / sdk config: make the GAP9 example show the three
L1-memory knobs that let conv-heavy nets fit, with explanatory comments:
A. slave (PE) stacks -> L2: hand the cluster task a static L2 buffer
(SET_SLAVE_STACK) so the SDK skips its L1 slave-stack alloc (~30 KB L1).
B. shrink the SDK's L1 slave stacks via CONFIG_CL_SLAVE_CORE_STACK_SIZE
(sdk_gvsoc.config) -- alternative to A.
C. size the cluster-controller stack via conf.cc_stack_size, overridable
from the build with -DCC_STACK_SIZE=<bytes> (new CMake option).
Verified MatMul --defaultMemLevel L3 -DCC_STACK_SIZE=8192 on GVSoC: 0/256.
…stack) The tiling argument structs were stack-locals in the dispatching function. The cluster fork runtime writes its descriptor near the top of the CC/master stack; a stack-local arg struct placed there can be clobbered before the forked cores read it (a GAP9 cluster-fork crash, e.g. MobileNetV1). Declare the struct `static` and assign separately so it lives in static storage, stable across the forked call. Generic codegen (ArgumentStructGeneration); benign on other targets. Verified MatMul --defaultMemLevel L3 on GVSoC: 0/256.
GAP9 mchan allocates a fresh channel on every descriptor enqueue. The previous DirectionWaitingStrategy shares one future (one mchan_transfer_get_id) across all same-direction tensors of a tile, so a tile with >1 input emits one get_id but multiple pushes -> the extra transfers run on channels that are never waited or freed -> mchan_transfer_wait() hangs (e.g. the optimizer weight+grad stall). Switch to PerTensorWaitingStrategy so each tensor gets its own get_id : push : wait : free, matching the mchan contract. Verified MatMul --defaultMemLevel L3 on GVSoC: 0/256.
Explain each GAP9 backend change in this branch — L3-aware board harness, SB/DB L3 DMA split, -O3 forward kernels, the three L1-memory knobs (cc_stack / slave-stack size / slave-stack->L2), static cluster fork/closure args, and the per-tensor mchan DMA waiting strategy — with problem, fix, file, and takeaway, plus a short GAP9 memory-model primer.
Add gap9_memcheck.py and run it from run_complete_test after the build, before the simulation, on GAP9. It models every consumer of L1/L2 the tiler doesn't (CC master stack, PE slave stacks, ELF sections, tile arena, promoted pool) and scans InitNetwork for the pi_l2_malloc-after-cl_ram_malloc alloc-order race, so over-subscription fails fast with the exact knob instead of a multi-minute GVSoC hang. GAP9-only; bypass with DEPLOY_SKIP_MEMCHECK=1. Verified MatMul L3: gate runs (PASS) and test is 0/256.
Six GAP9 backend fixes around L1 (TCDM) L2 L3 budgeting several faults. Each is explained in
docs/gap9_backend_fixes.md. Verified on GVSoC (MatMul --defaultMemLevel L3: 0/256); on-chip behaviour unchanged.Added
-O3on the hot forward kernels (Convolution_fp32/DWConvolution_fp32/Gemm), appended after the SDK's-Os.SET_SLAVE_STACK),CONFIG_CL_SLAVE_CORE_STACK_SIZE, andconf.cc_stack_sizevia a new-DCC_STACK_SIZECMake option.Changed
PULPL3TilingoptionaldbDma, defaults todma).static(off-stack) so the fork descriptor can't clobber them.Fixed
memcpy/CPU-deref HyperRAM pointers (IS_L1/IS_L2gating) — was a board-onlyInvalid fetch.PerTensorWaitingStrategy—DirectionWaitingStrategyleaked channels on multi-input tiles →mchan_transfer_waithang..sizereset after wait (avoid double-wait).PR Merge Checklist
develcommit and pointing todevel.CHANGELOG.mdfile has been updated.