Fix GAP9 L3 Performance and Memory Transfer by runwangdl · Pull Request #197 · pulp-platform/Deeploy

runwangdl · 2026-06-19T00:00:09Z

Six GAP9 backend fixes around L1 (TCDM) L2 L3 budgeting several faults. Each is explained in docs/gap9_backend_fixes.md. Verified on GVSoC (MatMul --defaultMemLevel L3: 0/256); on-chip behaviour unchanged.

Added

-O3 on the hot forward kernels (Convolution_fp32/DWConvolution_fp32/Gemm), appended after the SDK's -Os.
L1-memory knobs in the GAP9 example: slave-stacks→L2 (SET_SLAVE_STACK), CONFIG_CL_SLAVE_CORE_STACK_SIZE, and conf.cc_stack_size via a new -DCC_STACK_SIZE CMake option.

Changed

L3 tiling DMA: single-buffer→blocking, double-buffer→async (PULPL3Tiling optional dbDma, defaults to dma).
Temporary cluster fork/closure argument structs emitted static (off-stack) so the fork descriptor can't clobber them.

Fixed

GAP9 L3 board tests: no longer memcpy/CPU-deref HyperRAM pointers (IS_L1/IS_L2 gating) — was a board-only Invalid fetch.
GAP9 cluster (mchan) DMA: PerTensorWaitingStrategy — DirectionWaitingStrategy leaked channels on multi-input tiles → mchan_transfer_wait hang.
L3 DMA future .size reset after wait (avoid double-wait).

PR Merge Checklist

The PR is rebased on the latest devel commit and pointing to devel.
Your PR reviewed and approved.
All checks are passing.
The CHANGELOG.md file has been updated.
If the docker was modified, change back its link after review. (docker not modified)

deeploytest.c classified memory by `ptr >= 0x10000000` (inputs) / `< 0x10000000` (outputs). HyperRAM/L3 addresses (cl_ram_malloc) are also >= 0x10000000 but are NOT CPU-addressable, so for `--defaultMemLevel L3` tests on real silicon main did a raw memcpy / CPU-deref of an L3 pointer -> 'Invalid fetch' fault in main (e.g. MatMul L3 on board: fault at the cl_ram_malloc'd input address). GVSoC models HyperRAM as flat RAM so it passed there, masking the bug. Add IS_L1/IS_L2 on-chip-window macros and use them: - Inputs: only memcpy on-chip (IS_L2) inputs with a non-NULL testInputVector; L3 inputs are loaded from the readfs hex in InitNetwork (testInputVector is NULL) and already live in HyperRAM, so skip them. - Outputs: ram_read L3 outputs into an L2 scratch before the compare (and free it); on-chip outputs compared in place. Paired malloc/free kept in sync. Verified MatMul --defaultMemLevel L3 on GVSoC: 0/256 (unchanged). On-chip (L2) tests behave identically; only L3 paths change.

The L3<->L2 tiling used one DMA backend for both single- and double-buffering. Async DMA only helps double-buffering (it overlaps the next-tile prefetch with compute); single-buffering waits on each tile before computing, so async gives SB no benefit but all the risk — strided 2D L3 transfers (pi_cl_ram_copy_2d) can corrupt under deferred waits. - PULPL3Tiling: add optional `dbDma` (defaults to `dma`) so SB and DB can use different backends. Backward compatible. - GAP9 bindings: SB keeps the blocking gap9L3DmaHack; DB uses async GAP9L3Dma for real L3<->L2 prefetch overlap. - GAP9L3Dma: reset future `.size`=0 after copy-wait (so a completed future isn't waited twice) and cast `${ext}` to uint32_t in the 2D transfer. Verified MatMul --defaultMemLevel L3 on GVSoC: 0/256.

coderabbitai · 2026-06-19T00:04:20Z

📝 Walkthrough

Walkthrough

PULPL3Tiling gains an optional dbDma parameter so DB tiling can use a different async DMA backend than SB tiling. Two code-generation bugs in GAP9L3Dma are fixed (descriptor size reset and uint32_t cast). Both GAP9 transformers are wired to pass GAP9L3Dma() as dbDma. The GAP9 test harness replaces hardcoded address thresholds with IS_L1/IS_L2 macros.

Changes

Async DB DMA engine split in PULPL3Tiling and GAP9 wiring

Layer / File(s)	Summary
PULPL3Tiling optional dbDma parameter and wiring `Deeploy/Targets/PULPOpen/CodeTransformationPasses/PULPL3Tiling.py`	`PULPL3Tiling.__init__` now accepts `dbDma: Optional[AsyncDma] = None`, defaulting to `dma`. `SB`/`profilingSB` use `dma`; `DB`/`profilingDB` use `dbDma`.
GAP9L3Dma codegen bug fixes `Deeploy/Targets/GAP9/DMA/L3Dma.py`	`_waitTemplate` emits `${name}.size = 0` after `pi_cl_ram_copy_wait`. The direction-2 transfer template casts `${ext}` to `uint32_t`.
GAP9Transformer and GAP9ClusterTransformer wiring `Deeploy/Targets/GAP9/Bindings.py`	Imports `GAP9L3Dma`; both transformers now pass `dbDma=GAP9L3Dma()` to `PULPL3Tiling`, separating async DB transfers from blocking SB transfers.

IS_L1/IS_L2 macro-based address detection in GAP9 test harness

Layer / File(s)	Summary
IS_L1/IS_L2 macros and updated input/output handling `DeeployTest/Platforms/GAP9/src/deeploytest.c`	Defines `IS_L1`/`IS_L2` macros as on-chip address window checks and replaces all hardcoded numeric address thresholds for input copy, output DMA-read scratch allocation, and `compbuf` deallocation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

pulp-platform/Deeploy#143: Introduced the GAP9 DMA/platform infrastructure (GAP9L3Dma, GAP9L3DmaFuture, PULPL3Tiling) that this PR directly extends with the async DB DMA split and codegen fixes.

Suggested reviewers

Victor-Jung

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'Fix GAP9 L3 Performance and Memory Transfer' accurately captures both main components of the changeset: L3 memory handling fixes and L3 tiling DMA performance optimization.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The PR description is directly related to the changeset, covering L3 memory handling fixes, DMA optimizations, and test harness corrections that align with the code changes shown.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

DeeployTest/Platforms/GAP9/src/deeploytest.c (1)

178-184: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Missing NULL check on pi_l2_malloc may cause crash.

If L2 memory is exhausted, pi_l2_malloc returns NULL, and the subsequent ram_read and comparison will dereference NULL, causing a fault. Adding a check with an error message provides clearer diagnostics than an opaque crash.

🛡️ Proposed fix to handle allocation failure

     if (!IS_L2(DeeployNetwork_outputs[buf])) {
       compbuf = pi_l2_malloc(DeeployNetwork_outputs_bytes[buf]);
+      if (compbuf == NULL) {
+        printf("Error: failed to allocate L2 scratch for output %u (%u bytes)\r\n",
+               buf, DeeployNetwork_outputs_bytes[buf]);
+        return -1;
+      }
       ram_read(compbuf, DeeployNetwork_outputs[buf],
                DeeployNetwork_outputs_bytes[buf]);
     } else {

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@DeeployTest/Platforms/GAP9/src/deeploytest.c` around lines 178 - 184, The
pi_l2_malloc call may return NULL when L2 memory is exhausted, but the code does
not check for this failure before passing compbuf to ram_read, which will cause
a NULL pointer dereference. Add a NULL check immediately after the pi_l2_malloc
call (when IS_L2(DeeployNetwork_outputs[buf]) is false) and handle the error
case with an appropriate error message or failure path to prevent the crash.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@DeeployTest/Platforms/GAP9/src/deeploytest.c`:
- Around line 178-184: The pi_l2_malloc call may return NULL when L2 memory is
exhausted, but the code does not check for this failure before passing compbuf
to ram_read, which will cause a NULL pointer dereference. Add a NULL check
immediately after the pi_l2_malloc call (when IS_L2(DeeployNetwork_outputs[buf])
is false) and handle the error case with an appropriate error message or failure
path to prevent the crash.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 31b0183f-da2c-45e2-99d4-f43e1f5d2bbe

📥 Commits

Reviewing files that changed from the base of the PR and between 15a7841 and db36d89.

📒 Files selected for processing (4)

Deeploy/Targets/GAP9/Bindings.py
Deeploy/Targets/GAP9/DMA/L3Dma.py
Deeploy/Targets/PULPOpen/CodeTransformationPasses/PULPL3Tiling.py
DeeployTest/Platforms/GAP9/src/deeploytest.c

…mple - TargetLibraries/GAP9: compile the hot forward kernels (Convolution_fp32, DWConvolution_fp32, Gemm) at -O3, appended last so it wins over the SDK's default -Os. These dominate GAP9 inference cycles. - deeploytest.c / CMake / sdk config: make the GAP9 example show the three L1-memory knobs that let conv-heavy nets fit, with explanatory comments: A. slave (PE) stacks -> L2: hand the cluster task a static L2 buffer (SET_SLAVE_STACK) so the SDK skips its L1 slave-stack alloc (~30 KB L1). B. shrink the SDK's L1 slave stacks via CONFIG_CL_SLAVE_CORE_STACK_SIZE (sdk_gvsoc.config) -- alternative to A. C. size the cluster-controller stack via conf.cc_stack_size, overridable from the build with -DCC_STACK_SIZE=<bytes> (new CMake option). Verified MatMul --defaultMemLevel L3 -DCC_STACK_SIZE=8192 on GVSoC: 0/256.

…stack) The tiling argument structs were stack-locals in the dispatching function. The cluster fork runtime writes its descriptor near the top of the CC/master stack; a stack-local arg struct placed there can be clobbered before the forked cores read it (a GAP9 cluster-fork crash, e.g. MobileNetV1). Declare the struct `static` and assign separately so it lives in static storage, stable across the forked call. Generic codegen (ArgumentStructGeneration); benign on other targets. Verified MatMul --defaultMemLevel L3 on GVSoC: 0/256.

GAP9 mchan allocates a fresh channel on every descriptor enqueue. The previous DirectionWaitingStrategy shares one future (one mchan_transfer_get_id) across all same-direction tensors of a tile, so a tile with >1 input emits one get_id but multiple pushes -> the extra transfers run on channels that are never waited or freed -> mchan_transfer_wait() hangs (e.g. the optimizer weight+grad stall). Switch to PerTensorWaitingStrategy so each tensor gets its own get_id : push : wait : free, matching the mchan contract. Verified MatMul --defaultMemLevel L3 on GVSoC: 0/256.

Explain each GAP9 backend change in this branch — L3-aware board harness, SB/DB L3 DMA split, -O3 forward kernels, the three L1-memory knobs (cc_stack / slave-stack size / slave-stack->L2), static cluster fork/closure args, and the per-tensor mchan DMA waiting strategy — with problem, fix, file, and takeaway, plus a short GAP9 memory-model primer.

Add gap9_memcheck.py and run it from run_complete_test after the build, before the simulation, on GAP9. It models every consumer of L1/L2 the tiler doesn't (CC master stack, PE slave stacks, ELF sections, tile arena, promoted pool) and scans InitNetwork for the pi_l2_malloc-after-cl_ram_malloc alloc-order race, so over-subscription fails fast with the exact knob instead of a multi-minute GVSoC hang. GAP9-only; bypass with DEPLOY_SKIP_MEMCHECK=1. Verified MatMul L3: gate runs (PASS) and test is 0/256.

runwangdl added 2 commits June 18, 2026 23:39

runwangdl requested review from Victor-Jung and Xeratec as code owners June 19, 2026 00:00

coderabbitai Bot reviewed Jun 19, 2026

View reviewed changes

runwangdl marked this pull request as draft June 19, 2026 00:05

runwangdl added 5 commits June 19, 2026 00:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix GAP9 L3 Performance and Memory Transfer#197

Fix GAP9 L3 Performance and Memory Transfer#197
runwangdl wants to merge 7 commits into
pulp-platform:develfrom
runwangdl:fix-gap9-l3-board

runwangdl commented Jun 19, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 19, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

runwangdl commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Added

Changed

Fixed

PR Merge Checklist

Uh oh!

coderabbitai Bot commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

runwangdl commented Jun 19, 2026 •

edited

Loading

coderabbitai Bot commented Jun 19, 2026 •

edited

Loading