Skip to content

Fix GAP9 L3 Performance and Memory Transfer#197

Draft
runwangdl wants to merge 7 commits into
pulp-platform:develfrom
runwangdl:fix-gap9-l3-board
Draft

Fix GAP9 L3 Performance and Memory Transfer#197
runwangdl wants to merge 7 commits into
pulp-platform:develfrom
runwangdl:fix-gap9-l3-board

Conversation

@runwangdl

@runwangdl runwangdl commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Six GAP9 backend fixes around L1 (TCDM) L2 L3 budgeting several faults. Each is explained in docs/gap9_backend_fixes.md. Verified on GVSoC (MatMul --defaultMemLevel L3: 0/256); on-chip behaviour unchanged.

Added

  • -O3 on the hot forward kernels (Convolution_fp32/DWConvolution_fp32/Gemm), appended after the SDK's -Os.
  • L1-memory knobs in the GAP9 example: slave-stacks→L2 (SET_SLAVE_STACK), CONFIG_CL_SLAVE_CORE_STACK_SIZE, and conf.cc_stack_size via a new -DCC_STACK_SIZE CMake option.

Changed

  • L3 tiling DMA: single-buffer→blocking, double-buffer→async (PULPL3Tiling optional dbDma, defaults to dma).
  • Temporary cluster fork/closure argument structs emitted static (off-stack) so the fork descriptor can't clobber them.

Fixed

  • GAP9 L3 board tests: no longer memcpy/CPU-deref HyperRAM pointers (IS_L1/IS_L2 gating) — was a board-only Invalid fetch.
  • GAP9 cluster (mchan) DMA: PerTensorWaitingStrategyDirectionWaitingStrategy leaked channels on multi-input tiles → mchan_transfer_wait hang.
  • L3 DMA future .size reset after wait (avoid double-wait).

PR Merge Checklist

  1. The PR is rebased on the latest devel commit and pointing to devel.
  2. Your PR reviewed and approved.
  3. All checks are passing.
  4. The CHANGELOG.md file has been updated.
  5. If the docker was modified, change back its link after review. (docker not modified)

deeploytest.c classified memory by `ptr >= 0x10000000` (inputs) / `< 0x10000000`
(outputs). HyperRAM/L3 addresses (cl_ram_malloc) are also >= 0x10000000 but are
NOT CPU-addressable, so for `--defaultMemLevel L3` tests on real silicon main did
a raw memcpy / CPU-deref of an L3 pointer -> 'Invalid fetch' fault in main (e.g.
MatMul L3 on board: fault at the cl_ram_malloc'd input address). GVSoC models
HyperRAM as flat RAM so it passed there, masking the bug.

Add IS_L1/IS_L2 on-chip-window macros and use them:
- Inputs: only memcpy on-chip (IS_L2) inputs with a non-NULL testInputVector;
  L3 inputs are loaded from the readfs hex in InitNetwork (testInputVector is
  NULL) and already live in HyperRAM, so skip them.
- Outputs: ram_read L3 outputs into an L2 scratch before the compare (and free
  it); on-chip outputs compared in place. Paired malloc/free kept in sync.

Verified MatMul --defaultMemLevel L3 on GVSoC: 0/256 (unchanged). On-chip (L2)
tests behave identically; only L3 paths change.
The L3<->L2 tiling used one DMA backend for both single- and double-buffering.
Async DMA only helps double-buffering (it overlaps the next-tile prefetch with
compute); single-buffering waits on each tile before computing, so async gives
SB no benefit but all the risk — strided 2D L3 transfers (pi_cl_ram_copy_2d) can
corrupt under deferred waits.

- PULPL3Tiling: add optional `dbDma` (defaults to `dma`) so SB and DB can use
  different backends. Backward compatible.
- GAP9 bindings: SB keeps the blocking gap9L3DmaHack; DB uses async GAP9L3Dma for
  real L3<->L2 prefetch overlap.
- GAP9L3Dma: reset future `.size`=0 after copy-wait (so a completed future isn't
  waited twice) and cast `${ext}` to uint32_t in the 2D transfer.

Verified MatMul --defaultMemLevel L3 on GVSoC: 0/256.
@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

PULPL3Tiling gains an optional dbDma parameter so DB tiling can use a different async DMA backend than SB tiling. Two code-generation bugs in GAP9L3Dma are fixed (descriptor size reset and uint32_t cast). Both GAP9 transformers are wired to pass GAP9L3Dma() as dbDma. The GAP9 test harness replaces hardcoded address thresholds with IS_L1/IS_L2 macros.

Changes

Async DB DMA engine split in PULPL3Tiling and GAP9 wiring

Layer / File(s) Summary
PULPL3Tiling optional dbDma parameter and wiring
Deeploy/Targets/PULPOpen/CodeTransformationPasses/PULPL3Tiling.py
PULPL3Tiling.__init__ now accepts dbDma: Optional[AsyncDma] = None, defaulting to dma. SB/profilingSB use dma; DB/profilingDB use dbDma.
GAP9L3Dma codegen bug fixes
Deeploy/Targets/GAP9/DMA/L3Dma.py
_waitTemplate emits ${name}.size = 0 after pi_cl_ram_copy_wait. The direction-2 transfer template casts ${ext} to uint32_t.
GAP9Transformer and GAP9ClusterTransformer wiring
Deeploy/Targets/GAP9/Bindings.py
Imports GAP9L3Dma; both transformers now pass dbDma=GAP9L3Dma() to PULPL3Tiling, separating async DB transfers from blocking SB transfers.

IS_L1/IS_L2 macro-based address detection in GAP9 test harness

Layer / File(s) Summary
IS_L1/IS_L2 macros and updated input/output handling
DeeployTest/Platforms/GAP9/src/deeploytest.c
Defines IS_L1/IS_L2 macros as on-chip address window checks and replaces all hardcoded numeric address thresholds for input copy, output DMA-read scratch allocation, and compbuf deallocation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • pulp-platform/Deeploy#143: Introduced the GAP9 DMA/platform infrastructure (GAP9L3Dma, GAP9L3DmaFuture, PULPL3Tiling) that this PR directly extends with the async DB DMA split and codegen fixes.

Suggested reviewers

  • Victor-Jung
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Fix GAP9 L3 Performance and Memory Transfer' accurately captures both main components of the changeset: L3 memory handling fixes and L3 tiling DMA performance optimization.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The PR description is directly related to the changeset, covering L3 memory handling fixes, DMA optimizations, and test harness corrections that align with the code changes shown.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
DeeployTest/Platforms/GAP9/src/deeploytest.c (1)

178-184: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Missing NULL check on pi_l2_malloc may cause crash.

If L2 memory is exhausted, pi_l2_malloc returns NULL, and the subsequent ram_read and comparison will dereference NULL, causing a fault. Adding a check with an error message provides clearer diagnostics than an opaque crash.

🛡️ Proposed fix to handle allocation failure
     if (!IS_L2(DeeployNetwork_outputs[buf])) {
       compbuf = pi_l2_malloc(DeeployNetwork_outputs_bytes[buf]);
+      if (compbuf == NULL) {
+        printf("Error: failed to allocate L2 scratch for output %u (%u bytes)\r\n",
+               buf, DeeployNetwork_outputs_bytes[buf]);
+        return -1;
+      }
       ram_read(compbuf, DeeployNetwork_outputs[buf],
                DeeployNetwork_outputs_bytes[buf]);
     } else {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@DeeployTest/Platforms/GAP9/src/deeploytest.c` around lines 178 - 184, The
pi_l2_malloc call may return NULL when L2 memory is exhausted, but the code does
not check for this failure before passing compbuf to ram_read, which will cause
a NULL pointer dereference. Add a NULL check immediately after the pi_l2_malloc
call (when IS_L2(DeeployNetwork_outputs[buf]) is false) and handle the error
case with an appropriate error message or failure path to prevent the crash.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@DeeployTest/Platforms/GAP9/src/deeploytest.c`:
- Around line 178-184: The pi_l2_malloc call may return NULL when L2 memory is
exhausted, but the code does not check for this failure before passing compbuf
to ram_read, which will cause a NULL pointer dereference. Add a NULL check
immediately after the pi_l2_malloc call (when IS_L2(DeeployNetwork_outputs[buf])
is false) and handle the error case with an appropriate error message or failure
path to prevent the crash.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 31b0183f-da2c-45e2-99d4-f43e1f5d2bbe

📥 Commits

Reviewing files that changed from the base of the PR and between 15a7841 and db36d89.

📒 Files selected for processing (4)
  • Deeploy/Targets/GAP9/Bindings.py
  • Deeploy/Targets/GAP9/DMA/L3Dma.py
  • Deeploy/Targets/PULPOpen/CodeTransformationPasses/PULPL3Tiling.py
  • DeeployTest/Platforms/GAP9/src/deeploytest.c

@runwangdl runwangdl marked this pull request as draft June 19, 2026 00:05
…mple

- TargetLibraries/GAP9: compile the hot forward kernels (Convolution_fp32,
  DWConvolution_fp32, Gemm) at -O3, appended last so it wins over the SDK's
  default -Os. These dominate GAP9 inference cycles.
- deeploytest.c / CMake / sdk config: make the GAP9 example show the three
  L1-memory knobs that let conv-heavy nets fit, with explanatory comments:
    A. slave (PE) stacks -> L2: hand the cluster task a static L2 buffer
       (SET_SLAVE_STACK) so the SDK skips its L1 slave-stack alloc (~30 KB L1).
    B. shrink the SDK's L1 slave stacks via CONFIG_CL_SLAVE_CORE_STACK_SIZE
       (sdk_gvsoc.config) -- alternative to A.
    C. size the cluster-controller stack via conf.cc_stack_size, overridable
       from the build with -DCC_STACK_SIZE=<bytes> (new CMake option).

Verified MatMul --defaultMemLevel L3 -DCC_STACK_SIZE=8192 on GVSoC: 0/256.
…stack)

The tiling argument structs were stack-locals in the dispatching function. The
cluster fork runtime writes its descriptor near the top of the CC/master stack;
a stack-local arg struct placed there can be clobbered before the forked cores
read it (a GAP9 cluster-fork crash, e.g. MobileNetV1). Declare the struct
`static` and assign separately so it lives in static storage, stable across the
forked call.

Generic codegen (ArgumentStructGeneration); benign on other targets. Verified
MatMul --defaultMemLevel L3 on GVSoC: 0/256.
GAP9 mchan allocates a fresh channel on every descriptor enqueue. The previous
DirectionWaitingStrategy shares one future (one mchan_transfer_get_id) across all
same-direction tensors of a tile, so a tile with >1 input emits one get_id but
multiple pushes -> the extra transfers run on channels that are never waited or
freed -> mchan_transfer_wait() hangs (e.g. the optimizer weight+grad stall).
Switch to PerTensorWaitingStrategy so each tensor gets its own
get_id : push : wait : free, matching the mchan contract.

Verified MatMul --defaultMemLevel L3 on GVSoC: 0/256.
Explain each GAP9 backend change in this branch — L3-aware board harness, SB/DB
L3 DMA split, -O3 forward kernels, the three L1-memory knobs (cc_stack /
slave-stack size / slave-stack->L2), static cluster fork/closure args, and the
per-tensor mchan DMA waiting strategy — with problem, fix, file, and takeaway,
plus a short GAP9 memory-model primer.
Add gap9_memcheck.py and run it from run_complete_test after the build, before
the simulation, on GAP9. It models every consumer of L1/L2 the tiler doesn't
(CC master stack, PE slave stacks, ELF sections, tile arena, promoted pool) and
scans InitNetwork for the pi_l2_malloc-after-cl_ram_malloc alloc-order race, so
over-subscription fails fast with the exact knob instead of a multi-minute GVSoC
hang. GAP9-only; bypass with DEPLOY_SKIP_MEMCHECK=1. Verified MatMul L3: gate
runs (PASS) and test is 0/256.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant