
fix: scan prefix-sum correctness, binding hazards, and GPU dispatch optimizations#21

Merged
LessUp merged 1 commit into master from fix/scan-correctness-and-optimizations
Jul 1, 2026

@LessUp LessUp commented Jul 1, 2026

Collaborator

See the commit message for full details. Fixes three critical scan bugs (add_block_prefixes processing only half the elements, a binding hazard in scan_block_sums, and the 512-block limit in scan_block_sums) and optimizes GPU dispatch.

…dispatch

Critical correctness fixes:
- Fix add_block_prefixes shader only processing half the elements (1 per
  thread instead of 2), which broke radix sort for arrays > 8192 elements.
  Now matches blelloch_scan's 2-elements-per-thread layout.
- Fix scan_block_sums WebGPU binding hazard: the same blockSumsBuffer was
  bound as both read-only-storage (binding 0) and read_write storage
  (binding 1/2), which is a validation error. Each pipeline now has a
  dedicated bind group layout with only the bindings it uses.
- Fix scan_block_sums 512-block limit: for arrays > ~4M elements the
  single-workgroup block-sum scan silently skipped excess blocks. Replaced
  with a recursive multi-level scan that handles arbitrarily large inputs.
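
To make the first and third fixes concrete, here is a minimal CPU reference model of a multi-level exclusive scan. It is a sketch, not the shader code from this PR: `scanBlock`, `multiLevelExclusiveScan`, and the `workgroupSize` parameter are hypothetical names, and the contiguous `tid * 2 + e` indexing in the add-prefixes step is an assumption (the real shaders may use a strided layout). What it illustrates is that each block covers two elements per thread in both the scan and the add-prefixes step, and that the block totals are scanned recursively rather than by a single fixed-size pass.

```typescript
// CPU reference model only; every name here is hypothetical, not the PR's code.
const ELEMENTS_PER_THREAD = 2;

// Exclusive scan of data[start, end) in place; returns the total of that range.
function scanBlock(data: Uint32Array, start: number, end: number): number {
  let running = 0;
  for (let i = start; i < end; i++) {
    const value = data[i];
    data[i] = running;
    running = (running + value) >>> 0; // wrap like 32-bit unsigned addition
  }
  return running;
}

function multiLevelExclusiveScan(data: Uint32Array, workgroupSize: number): void {
  if (!Number.isInteger(workgroupSize) || workgroupSize < 1) {
    throw new Error("workgroupSize must be a positive integer");
  }
  const blockSize = workgroupSize * ELEMENTS_PER_THREAD;
  const numBlocks = Math.ceil(data.length / blockSize);

  // Level 0: scan every block independently and record its total.
  const blockSums = new Uint32Array(numBlocks);
  for (let b = 0; b < numBlocks; b++) {
    const start = b * blockSize;
    blockSums[b] = scanBlock(data, start, Math.min(start + blockSize, data.length));
  }
  if (numBlocks <= 1) return;

  // Recurse on the block totals, which may themselves span many blocks.
  multiLevelExclusiveScan(blockSums, workgroupSize);

  // Add prefixes: each simulated thread updates ELEMENTS_PER_THREAD elements,
  // so the whole block is covered (block 0 has a prefix of 0 and is skipped).
  for (let b = 1; b < numBlocks; b++) {
    for (let tid = 0; tid < workgroupSize; tid++) {
      for (let e = 0; e < ELEMENTS_PER_THREAD; e++) {
        const i = b * blockSize + tid * ELEMENTS_PER_THREAD + e;
        if (i < data.length) data[i] = (data[i] + blockSums[b]) >>> 0;
      }
    }
  }
}
```

A model like this can serve as an oracle in tests: run the GPU scan and the CPU model on the same input and compare the outputs, choosing input sizes that force more than one level of recursion.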

Architecture improvements:
- ScanModule: three dedicated bind group layouts (scanLayout,
  blockSumsScanLayout, addPrefixesLayout) instead of one shared layout,
  eliminating all read-only/read-write binding conflicts.
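
As an illustration of the split, the sketch below builds entry lists for three separate layouts. Only the layout names come from the commit message; the binding numbers, the buffer roles in the comments, and the `storageEntries` helper are assumptions made for this example, and uniform bindings are left out. The entry shape (`binding`, `visibility`, `buffer.type`) follows the WebGPU bind group layout descriptor, with the value of `GPUShaderStage.COMPUTE` inlined so the code also runs where the WebGPU globals are not defined.

```typescript
// Numeric value of GPUShaderStage.COMPUTE in the WebGPU spec, inlined so this
// sketch also runs in environments without the WebGPU globals.
const COMPUTE_STAGE = 0x4;

type StorageBindingType = "storage" | "read-only-storage";

interface LayoutEntry {
  binding: number;
  visibility: number;
  buffer: { type: StorageBindingType };
}

// Hypothetical helper: assigns binding numbers 0..n-1 in argument order.
function storageEntries(...types: StorageBindingType[]): LayoutEntry[] {
  return types.map((type, binding) => ({
    binding,
    visibility: COMPUTE_STAGE,
    buffer: { type },
  }));
}

// Assumed roles: input (read), output (write), block sums (write).
const scanLayoutEntries = storageEntries("read-only-storage", "storage", "storage");

// Assumed role: block sums, scanned in place, bound exactly once.
const blockSumsScanLayoutEntries = storageEntries("storage");

// Assumed roles: data (read/write), scanned block sums (read).
const addPrefixesLayoutEntries = storageEntries("storage", "read-only-storage");
```

Each list would then be passed to `device.createBindGroupLayout({ entries })`. Because every pipeline declares only the bindings it uses, there is no need to bind the block-sums buffer a second time just to fill an unused slot of a shared layout, which appears to be how the original conflict arose.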

Performance optimizations:
- BitonicSorter: batch all bitonic passes into a single command encoder
  with copyBufferToBuffer for uniform updates, reducing queue submissions
  from 100+ to 1 for large arrays.
- RadixSorter: reuse zero-histogram buffer across passes instead of
  allocating a new Uint32Array per pass.
- Benchmark: preallocate GPU buffers per target size so iterations measure
  steady-state sort performance (with buffer reuse) rather than allocation overhead.
- fillRandomUint32Array: fill in-place via subarray instead of allocating
  and copying per-chunk temporary arrays.
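
The BitonicSorter batching can be sketched on the CPU side: every pass's parameters are computed up front and packed into one staging array, so a single command encoder can copy the right slice into the uniform buffer before each dispatch. The names `bitonicPasses`, `packPassUniforms`, and the two-word `(k, j)` uniform format are assumptions for illustration; the PR's actual uniform layout is not shown in the commit message.

```typescript
// Hypothetical uniform format: two u32 words per pass, (k, j).
const WORDS_PER_PASS = 2;

interface BitonicPass {
  k: number; // size of the bitonic sequences being merged
  j: number; // compare-exchange distance within this pass
}

// Enumerates every pass of a bitonic sort over n elements (n a power of two).
function bitonicPasses(n: number): BitonicPass[] {
  if (!Number.isInteger(n) || n < 2 || (n & (n - 1)) !== 0) {
    throw new Error("n must be a power of two >= 2");
  }
  const passes: BitonicPass[] = [];
  for (let k = 2; k <= n; k *= 2) {
    for (let j = k / 2; j >= 1; j /= 2) {
      passes.push({ k, j });
    }
  }
  return passes;
}

// Packs all pass parameters into one array destined for a staging buffer.
function packPassUniforms(passes: BitonicPass[]): Uint32Array {
  const staging = new Uint32Array(passes.length * WORDS_PER_PASS);
  passes.forEach((pass, i) => {
    staging[i * WORDS_PER_PASS] = pass.k;
    staging[i * WORDS_PER_PASS + 1] = pass.j;
  });
  return staging;
}
```

With n elements there are m(m+1)/2 passes for m = log2(n), so `bitonicPasses(16384)` yields 105 entries, which is consistent with the "100+" submissions mentioned above. Under these assumptions the staging array is uploaded once; then, within one encoder, each pass does a `copyBufferToBuffer` from byte offset `i * 8` of the staging buffer into the uniform buffer, followed by its own compute pass and dispatch, and the encoder is submitted once at the end.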

Generated with [Devin](https://devin.ai)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@LessUp LessUp merged commit fd98cb7 into master Jul 1, 2026
1 check failed
@LessUp LessUp deleted the fix/scan-correctness-and-optimizations branch July 1, 2026 10:54