[Common/PyTorch] Grouped-quantize kernels for 1D and 2D FP8 block-scaling #3135
denera wants to merge 5 commits into
Implements grouped-tensor quantize for the FP8 1D (1x128) and 2D (128x128)
block-scaling recipes. A single CUDA kernel launch walks 128x128 tiles
across every tensor in the group, with each CTA decoding its owning
tensor from the device-side GroupedTensor metadata.
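The per-CTA tensor decode described above can be illustrated on the host. This is only a sketch under stated assumptions: `find_owning_tensor` and the `offsets` vector are hypothetical names, the offsets are assumed to be cumulative row starts with one trailing total, and every tensor is assumed to hold a multiple of 128 rows. It is not the PR's device code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kTileDim = 128;  // tile height used by the block-scaling recipes

// Hypothetical helper: map a global tile-row index to the tensor that owns it.
// `offsets` holds num_tensors + 1 cumulative row starts (assumed layout):
// offsets[i] is the first row of tensor i and offsets.back() is the total row count.
std::size_t find_owning_tensor(const std::vector<std::size_t>& offsets, std::size_t tile_row) {
  assert(offsets.size() >= 2);
  const std::size_t row = tile_row * kTileDim;
  assert(row < offsets.back());
  std::size_t lo = 0;
  std::size_t hi = offsets.size() - 1;
  // Invariant: offsets[lo] <= row < offsets[hi].
  while (hi - lo > 1) {
    const std::size_t mid = lo + (hi - lo) / 2;
    if (offsets[mid] <= row) {
      lo = mid;
    } else {
      hi = mid;
    }
  }
  return lo;
}
```

With row counts of 256, 128 and 640, the offsets are {0, 256, 384, 1024}, so tile rows 0 and 1 map to tensor 0, tile row 2 to tensor 1, and tile rows 3 through 7 to tensor 2.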
Supported shape representations:
- SAME_BOTH_DIMS (all tensors identical)
- VARYING_FIRST_DIM (constant K, varying R - the common MoE topology)
Supported directions: rowwise-only, columnwise-only, and both.
These kernels are gated to Hopper (sm_90) at the host dispatcher because
the consumer cuBLAS FP8 block-scaling *grouped* GEMM is itself
Hopper-only (cuBLAS does not provide native FP8 block-scaling grouped
GEMM on Blackwell; the recommended quantization recipe on Blackwell is
MXFP8). The device-side kernel bodies are gated on __CUDA_ARCH__ >= 900
so the kernels compile and link as part of multi-arch builds, but the
host gate prevents launches on Blackwell.
Three kernels share the dispatcher in
group_quantize_blockwise_{1d,2d}:
| Kernel | Dispatched when | Threading | Smem |
|--------|-----------------|-----------|------|
| group_block_scaled_1d_rw_kernel | 1D RW-only | 8 threads/row x 32 row-warps x 4 iters; reads gmem directly into vec-16 registers | none |
| group_block_scaled_1d_tma_kernel | 1D CW or 1D BOTH | TMA bulk-load fills 32 KB input cache. BOTH runs RW pass first (8 t/row, vec-16) then CW pass (2 t/col, 64-row register stage); CW-only skips the RW pass. CW writes the transposed-FP8 tile to a 16.5 KB smem_T staging buffer, then drains to gmem. | 32 KB + 16.5 KB |
| group_block_scaled_2d_tma_kernel | 2D RW / CW / BOTH | TMA bulk-load fills 32 KB cache. Pass 1 stages 8 IVecs/thread in registers while computing the per-tile scalar amax. Pass 2 quantizes from registers, emits rowwise output, stages columnwise output to smem_T, then drains. | 32 KB + 16.5 KB |
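The shared-memory figures in the table can be sanity-checked with simple arithmetic. The sketch below is an assumption, not the PR's actual layout: it shows that a 128x128 tile of 2-byte inputs is exactly 32 KB, and that 16.5 KB is consistent with a 128-row FP8 staging buffer carrying 4 bytes of padding per row. The source does not state how smem_T is laid out, so the padding value and helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kTileDim = 128;

// Bytes needed to cache one kTileDim x kTileDim input tile.
constexpr std::size_t input_cache_bytes(std::size_t elem_bytes) {
  return kTileDim * kTileDim * elem_bytes;
}

// Bytes for a transposed FP8 (1 byte/element) staging tile with `pad_bytes`
// of per-row padding. The padding is an assumption made for this sketch.
constexpr std::size_t staging_bytes(std::size_t pad_bytes) {
  return kTileDim * (kTileDim + pad_bytes);
}

static_assert(input_cache_bytes(2) == 32 * 1024, "BF16/FP16 tile is 32 KB");
static_assert(staging_bytes(4) == 16896, "16896 bytes is 16.5 KB");
```

Under these assumptions the two TMA kernels would request 49664 bytes of dynamic shared memory per CTA, which is just above the 48 KB default limit and would therefore need an opt-in to a larger dynamic shared memory size.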
The RW-only 1D path bypasses TMA because a streaming read has no reuse:
the smem round-trip and mbarrier overhead would just add latency.
The C++ test tests/cpp/operator/test_cast_float8blockwise_grouped.cu
exercises 72 configurations covering RW/CW/BOTH x 1D/2D x SAME/VARYING
shape representations against a per-tensor split-quantize reference.
Signed-off-by: Alp Dener <adener@nvidia.com>
for more information, see https://pre-commit.ci
constexpr int kThreadsPerBlock = 256;
constexpr int kNumWarps = kThreadsPerBlock / kThreadsPerWarp;
// Align a dynamic-smem pointer to 128 bytes (TMA requirement).
Could we reuse the existing align_smem_ptr_per_TMA_requirements() helper from transformer_engine/cast/core/common.h here?
    size_t total_row_blocks) {
  using namespace transformer_engine::dispatch::mxfp8::swizzle;
  const size_t num_tiles_X =
      (total_row_blocks + GEMM_SWIZZLED_SCALE_TILE_DIM_X - 1) / GEMM_SWIZZLED_SCALE_TILE_DIM_X;
We can also reuse the existing DIVUP() helper here (defined in transformer_engine/common/common.h).
// ---- Tensor-lookup helpers ----------------------------------------------------
// Map a global tile-row index to its owning tensor by binary-searching
We can also reuse the existing get_current_tensor_id() helper defined in transformer_engine/cast/core/common.cuh
Greptile Summary
This PR adds single-launch grouped quantize kernels for FP8 1D (1×128) and 2D (128×128) block-scaling recipes, supporting row-wise, column-wise, and BOTH quantization directions on Hopper (SM90). It also promotes the shared
Confidence Score: 5/5
The change is self-contained new functionality on a Hopper-only code path with no modifications to existing quantize logic; the existing paths are unchanged. The three new kernels follow correct Hopper TMA patterns (mbarrier init → fence_proxy_async → arrive_expect_tx → cp_async_bulk_cta → wait_parity). Row/column bounds handling and the XOR swizzle write-read pairs are consistent. Lowering the PTX mbarrier guards from SM100 to SM90 is correct: these instructions are Hopper-native. The swizzle.cuh rename is purely mechanical and all callers are updated. Test coverage exercises SAME_BOTH_DIMS and VARYING_FIRST_DIM, all three quantization directions, and swizzled/non-swizzled scale layouts, comparing against per-tensor nvte_quantize_v2 as ground truth. No files require special attention.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[group_quantize PyTorch] --> B{Quantizer type?}
B -- Float8BlockwiseQuantizers --> C[FP8_BLOCKWISE_GROUPED_QUANTIZE]
C --> D[nvte_group_quantize]
D --> E{scaling_mode}
E -- NVTE_BLOCK_SCALING_1D --> F[group_quantize_blockwise_1d]
E -- NVTE_BLOCK_SCALING_2D --> G[group_quantize_blockwise_2d]
F --> H{use_rowwise only?}
H -- Yes --> I[group_block_scaled_1d_rw_kernel\nNo TMA · vec-16 gmem loads\n8 threads/row]
H -- No --> J[group_block_scaled_1d_tma_kernel\nTMA bulk-load → smem cache\nRW pass 8t/row + CW pass 2t/col]
G --> K[group_block_scaled_2d_tma_kernel\nTMA bulk-load → smem cache\nPass1 reg-stage amax\nPass2 quantize RW+CW]
J --> L{kSwizzledScales?}
K --> L
L -- Yes --> M[gemm_swizzled_scale_idx\nfor cuBLAS TN GEMM]
L -- No --> N[flat transposed layout]
Reviews (3): Last reviewed commit: "Move GEMM-swizzled scale helpers out of ..."
}
CType amax = compute_row_amax<IType, CType, kVec>(in_vec[it]);
amax = fmaxf(amax, __shfl_xor_sync(0xffffffff, amax, 1));
Could we reuse the existing amax warp-reduction helpers (warp_reduce_max() or reduce_max()) from transformer_engine/common/utils.cuh here?
amax = fmaxf(amax, __shfl_xor_sync(0xffffffff, amax, 1));
amax = fmaxf(amax, __shfl_xor_sync(0xffffffff, amax, 2));
amax = fmaxf(amax, __shfl_xor_sync(0xffffffff, amax, 4));
We can also reuse reduce_max() or warp_reduce_max() here.
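For readers following this thread, the three XOR shuffles (offsets 1, 2, 4) form a butterfly reduction over groups of 8 lanes, which matches the 8 threads/row layout. The host-side emulation below is a sketch only: `xor_max_step` and `subwarp_max` are hypothetical names that model the lane exchange with an array, and they are not the helpers in transformer_engine/common/utils.cuh.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

constexpr int kWarpSize = 32;
using WarpValues = std::array<float, kWarpSize>;

// Emulates one step of `amax = fmaxf(amax, __shfl_xor_sync(mask, amax, offset))`
// executed by all 32 lanes: each lane combines its value with lane ^ offset.
WarpValues xor_max_step(const WarpValues& v, int offset) {
  WarpValues out{};
  for (int lane = 0; lane < kWarpSize; ++lane) {
    out[lane] = std::max(v[lane], v[lane ^ offset]);
  }
  return out;
}

// Butterfly max over aligned groups of `group_size` lanes (power of two).
// After the loop every lane holds the maximum of its own group.
WarpValues subwarp_max(WarpValues v, int group_size) {
  assert(group_size > 0 && group_size <= kWarpSize);
  assert((group_size & (group_size - 1)) == 0);
  for (int offset = 1; offset < group_size; offset <<= 1) {
    v = xor_max_step(v, offset);
  }
  return v;
}
```

With `group_size = 8` the loop runs exactly the offsets 1, 2 and 4 from the diff, and the result is broadcast to all 8 lanes of the group, so no extra shuffle is needed before the scale is computed.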
// ----- Host-side dispatchers --------------------------------------------------------------------
inline size_t align_up_to(size_t x, size_t a) { return ((x + a - 1) / a) * a; }
We can reuse DIVUP_TO_MULTIPLE() defined in transformer_engine/common/common.h.
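As a reference for the helper swaps requested in this review, the semantics being relied on are ceiling division and round-up-to-multiple. The lowercase functions below are hypothetical stand-ins written for illustration; the actual DIVUP and DIVUP_TO_MULTIPLE definitions in transformer_engine/common/common.h may differ in form and types.

```cpp
#include <cassert>
#include <cstddef>

// Ceiling division: smallest q such that q * y >= x (y must be non-zero).
constexpr std::size_t divup(std::size_t x, std::size_t y) { return (x + y - 1) / y; }

// Round x up to the nearest multiple of m. Equivalent to the local
// align_up_to(x, m) helper shown in the diff above.
constexpr std::size_t divup_to_multiple(std::size_t x, std::size_t m) {
  return divup(x, m) * m;
}

static_assert(divup(129, 128) == 2, "a partial tile still counts as a tile");
static_assert(divup_to_multiple(5, 4) == 8, "scale strides are padded to a multiple of 4");
```

For example, a group with 129 total rows needs `divup(129, 128) == 2` row blocks, and a scale row with 5 entries is padded to a stride of 8.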
NVTE_CHECK(info.tensor_offsets_d != nullptr,
           "VARYING_FIRST_DIM requires tensor_offsets to be set on the GroupedTensor.");
}
info.total_row_blocks = (info.R_total + kTileDim - 1) / kTileDim;
Suggested change:
- info.total_row_blocks = (info.R_total + kTileDim - 1) / kTileDim;
+ info.total_row_blocks = DIVUP(info.R_total, kTileDim);
           "VARYING_FIRST_DIM requires tensor_offsets to be set on the GroupedTensor.");
}
info.total_row_blocks = (info.R_total + kTileDim - 1) / kTileDim;
info.blocks_X = (info.K + kTileDim - 1) / kTileDim;
Suggested change:
- info.blocks_X = (info.K + kTileDim - 1) / kTileDim;
+ info.blocks_X = DIVUP(info.K, kTileDim);
info.same_both_dims = same_both_dims;
info.num_tensors = output->num_tensors;
info.K = output->get_common_last_dim();
NVTE_CHECK(info.K % 16 == 0, "Last dim must be multiple of 16 (FP8 alignment).");
If this is a TMA requirement, we can use the TMA_GMEM_ALIGNMENT constant defined in transformer_engine/common/common.h
const float* noop_ptr =
    (noop != nullptr) ? reinterpret_cast<const float*>(noop->data.dptr) : nullptr;
const size_t scale_stride_y = align_up_to(info.blocks_X, 4);
Suggested change:
- const size_t scale_stride_y = align_up_to(info.blocks_X, 4);
+ const size_t scale_stride_y = DIVUP_TO_MULTIPLE(info.blocks_X, 4);
const size_t scale_stride_y = align_up_to(info.blocks_X, 4);
// CW scales are stored [blocks_X, align4(total_row_blocks)] -- transposed to
// match the physically-transposed columnwise data the TN cuBLAS GEMM consumes.
const size_t scale_t_stride_y = align_up_to(info.total_row_blocks, 4);
Suggested change:
- const size_t scale_t_stride_y = align_up_to(info.total_row_blocks, 4);
+ const size_t scale_t_stride_y = DIVUP_TO_MULTIPLE(info.total_row_blocks, 4);
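The two strides in this hunk imply a flat (non-swizzled) indexing scheme for the 2D per-tile scales. The sketch below is an assumption inferred from the stride names and the comment in the diff: `ScaleLayout` and both index functions are hypothetical, and the row-wise layout of [total_row_blocks, align4(blocks_X)] is not stated in the source.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of a grouped tensor's tile grid.
struct ScaleLayout {
  std::size_t blocks_X;          // number of 128-wide column blocks
  std::size_t total_row_blocks;  // number of 128-tall row blocks across the group
};

constexpr std::size_t round_up(std::size_t x, std::size_t m) { return ((x + m - 1) / m) * m; }

// Row-wise scales: assumed [total_row_blocks, align4(blocks_X)].
std::size_t rowwise_scale_idx(const ScaleLayout& l, std::size_t row_block, std::size_t col_block) {
  assert(row_block < l.total_row_blocks && col_block < l.blocks_X);
  return row_block * round_up(l.blocks_X, 4) + col_block;
}

// Column-wise scales: [blocks_X, align4(total_row_blocks)], i.e. transposed,
// as described by the comment in the diff.
std::size_t colwise_scale_idx(const ScaleLayout& l, std::size_t row_block, std::size_t col_block) {
  assert(row_block < l.total_row_blocks && col_block < l.blocks_X);
  return col_block * round_up(l.total_row_blocks, 4) + row_block;
}
```

For a 5 x 3 tile grid, the tile at row block 2, column block 1 lands at index 9 in the row-wise buffer (stride 4) and index 10 in the column-wise buffer (stride 8). The GEMM-swizzled layout uses a different index function and is not modeled here.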
const float* noop_ptr =
    (noop != nullptr) ? reinterpret_cast<const float*>(noop->data.dptr) : nullptr;
const size_t scale_stride_aligned_R = align_up_to(info.R_total, 4);
Suggested change:
- const size_t scale_stride_aligned_R = align_up_to(info.R_total, 4);
+ const size_t scale_stride_aligned_R = DIVUP_TO_MULTIPLE(info.R_total, 4);
    (noop != nullptr) ? reinterpret_cast<const float*>(noop->data.dptr) : nullptr;
const size_t scale_stride_aligned_R = align_up_to(info.R_total, 4);
const size_t scale_t_stride_aligned_K = align_up_to(info.K, 4);
Suggested change:
- const size_t scale_t_stride_aligned_K = align_up_to(info.K, 4);
+ const size_t scale_t_stride_aligned_K = DIVUP_TO_MULTIPLE(info.K, 4);
- Reuse shared helpers (DIVUP, DIVUP_TO_MULTIPLE, TMA_GMEM_ALIGNMENT, align_smem_ptr_per_TMA_requirements, get_current_tensor_id, subwarp_reduce_max_broadcast) in place of local equivalents.
- Add proxy-async fence after mbarrier_init in 2D + 1D TMA kernels.
- Enforce per-tensor first_dim % 128 device-side for VARYING_FIRST_DIM (matches MXFP8 grouped quantize behavior).
- Fix Hopper SM range wording in 1D dispatcher.
- Extend cpp tests to cover with_gemm_swizzled_scales path.
Signed-off-by: Alp Dener <adener@nvidia.com>
// num_tiles_X = DIVUP(total_row_blocks, TILE_DIM_X=4)
__device__ __forceinline__ size_t swizzled_colwise_scale_idx(size_t i, size_t j,
                                                             size_t total_row_blocks) {
  using namespace transformer_engine::dispatch::mxfp8::swizzle;
I think we should rename the swizzle namespace, given that we use the same constants for MXFP8, NVFP4, and FP8 block scaling.
The swizzle helpers are shared across MXFP8, NVFP4, and FP8 block scaling. Relocate swizzle.cuh from cast/mxfp8/ to cast/ and drop the mxfp8:: namespace layer so callers don't reach across precisions.
Signed-off-by: Alp Dener <adener@nvidia.com>
Description
Implements grouped-tensor quantize for the FP8 1D (1x128) and 2D (128x128) block-scaling recipes in row-wise (RW), column-wise (CW) and BOTH quantization directions. A single CUDA kernel launch walks 128x128 tiles across every tensor in the group, with each CTA decoding its owning tensor from the device-side GroupedTensor metadata with (N, R, K) shapes. Supports SAME_BOTH_DIMS (all tensors identical) and VARYING_FIRST_DIM (constant K, varying R) shape representations.
Three kernels share the dispatcher in group_quantize_blockwise_{1d,2d}:
- group_block_scaled_1d_rw_kernel: RW-only dispatch; 8 threads/row, reads global memory directly into vec-16 registers. It bypasses TMA because, without the reuse that the CW path provides, the shared memory round-trip and ptx::mbarrier synchronization buy nothing.
- group_block_scaled_1d_tma_kernel: CW-only and BOTH dispatch; TMA bulk-load fills the shared memory input cache. BOTH runs the RW pass first (8 threads/row, vec-16 read from shared memory), then the CW pass (2 threads/column, 64-row register stage); CW-only skips the RW pass. The CW path writes the transposed-FP8 tile to a shared memory transpose staging buffer, then drains to global memory.
- group_block_scaled_2d_tma_kernel: RW-only, CW-only and BOTH dispatch; TMA bulk-load fills the shared memory input cache. Pass 1 stages 8 IVecs/thread in registers while computing the per-tile scalar amax. Pass 2 quantizes from registers, emits row-wise output, stages column-wise output to the shared memory transpose staging buffer, then drains to global memory.
Kernels are gated to Hopper (sm_90) at the host dispatcher (cuBLASLt grouped GEMM supports FP8 block-scaling only on Hopper).
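For orientation, the arithmetic behind the 1D (1x128) recipe can be sketched on the host: one absolute maximum per 128-element block of a row, and one scale derived from it. This is a simplified illustration under assumptions. The function names are hypothetical, the scale is taken as the FP8 E4M3 maximum (448) divided by the block amax, and details such as epsilon clamping or power-of-two scales that the real kernels may apply are not modeled.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

constexpr std::size_t kBlockLen = 128;   // 1x128 block-scaling granularity
constexpr float kFp8E4M3Max = 448.0f;    // largest finite FP8 E4M3 magnitude

// Hypothetical reference: absolute maximum of each 128-element block in a row.
std::vector<float> rowwise_block_amax(const std::vector<float>& row) {
  assert(row.size() % kBlockLen == 0);
  std::vector<float> amax(row.size() / kBlockLen, 0.0f);
  for (std::size_t i = 0; i < row.size(); ++i) {
    float& block_amax = amax[i / kBlockLen];
    block_amax = std::max(block_amax, std::fabs(row[i]));
  }
  return amax;
}

// Assumed scale convention: map the block amax onto the FP8 maximum.
// An all-zero block keeps a scale of 1 to avoid dividing by zero.
float quant_scale(float amax) { return amax > 0.0f ? kFp8E4M3Max / amax : 1.0f; }
```

The 2D (128x128) recipe follows the same idea with a single amax, and therefore a single scale, per 128x128 tile, which is why the 2D kernel computes a per-tile scalar amax in its first pass.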
PR includes PyTorch integration.
JAX integration is intentionally left out-of-scope and deferred to a follow-up PR because it requires non-trivial new scaffolding on the framework side.
Resolves #2525
Performance
Table below measures performance on H200 with a sweep of grouped tensors in (N, M, K) shapes with:
The shapes are split into two buckets:
Reported kernel times and throughput ratios are bucket medians.
Speedup is measured relative to the split-quantized fallback that loops over the grouped tensor and sequentially quantizes each one.
% of "mono" throughput is measured relative to a single non-grouped FP8 block-scaling quantize kernel invoked on the equivalent monolithic (NxM, K) tensor, in which the number of experts (N) and the number of tokens per expert (M) are collapsed into one leading dimension.
Notes
Known Sub-Optimalities
- 1D CW has bank conflicts on ~35% of load wavefronts (reading from the shared memory input-cache)
  - CU_TENSOR_MAP_SWIZZLE_128B has the right pattern but caps FP16/BF16 at 64 elements; it does not fit the 128-element tile for FP8 block-scaling without doubling per-tile launch overhead (quadrupling for FP32).
- 1D BOTH reads the shared memory input-cache twice
- 2D CW/BOTH has bank conflicts on ~16% of store wavefronts (when writing to the shared memory transpose buffer)
- No TMA-store
Type of change
Checklist: