[Common] Support scaled & clamped swiglu, srelu for BF16 by zhongbozhu · Pull Request #3132 · NVIDIA/TransformerEngine

zhongbozhu · 2026-06-16T07:12:29Z

Description

Supports Mega-C++ with the cuBLAS BF16 Grouped GEMM backend: #3099

Signed-off-by: zhongboz <zhongboz@nvidia.com>

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

greptile-apps · 2026-06-16T07:37:40Z

Greptile Summary

This PR adds six new CUDA kernels (scaled_activation.cu) that fuse a per-row activation scale into SwiGLU, ClampedSwiGLU, and SReLU forward and backward passes, along with a corresponding public C API in activation.h and a parametrized GTest suite.

Kernel design: Forward kernels read act/gate streams in vectorized row segments (with GLU-interleave support), multiply the activation output by act_scales[row], and store in the target dtype in a single pass. Backward kernels have two code paths — a flat element-wise grid when grad_act_scales is null, and a one-block-per-row warp-reduction path when the per-row scale gradient must be accumulated.
Math correctness: The SiLU, ClampedSiLU, and SReLU forward/backward formulas in the kernels match their reference implementations in util/math.h and the test reference functions; the block reduction logic is correct.
Minor cleanup items: A redundant gated_unscaled call in the test reference and a dead Empty variable in nvte_scaled_swiglu are left in; FP16 is absent from the test dtype sweep despite being covered by the dispatch macro; and the one-block-per-row kernel launch casts rows (a size_t) to int for the grid dimension.
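The fused forward pass described above can be sketched on the host as a plain reference. This is a minimal sketch rather than the PR's kernel code: the function names `scaled_swiglu_ref` and `scaled_srelu_ref` are hypothetical, the contiguous `[rows, 2 * cols]` layout (activation half first, gate half second) is an assumption, and the clamped and interleaved variants are left out.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// SiLU (swish): x * sigmoid(x).
inline float silu(float x) { return x / (1.0f + std::exp(-x)); }

// Squared ReLU: max(x, 0)^2.
inline float srelu(float x) { return x > 0.0f ? x * x : 0.0f; }

// Assumed layout: row-major [rows, 2 * cols]; the first `cols` values of a row
// feed the activation, the last `cols` values are the linear gate.
std::vector<float> scaled_swiglu_ref(const std::vector<float>& input,
                                     const std::vector<float>& act_scales,
                                     std::size_t rows, std::size_t cols) {
  assert(input.size() == rows * 2 * cols);
  assert(act_scales.size() == rows);
  std::vector<float> out(rows * cols);
  for (std::size_t r = 0; r < rows; ++r) {
    const float* act = input.data() + r * 2 * cols;
    const float* gate = act + cols;
    for (std::size_t c = 0; c < cols; ++c) {
      // Activation, gate, and per-row scale are applied in one pass.
      out[r * cols + c] = silu(act[c]) * gate[c] * act_scales[r];
    }
  }
  return out;
}

// Non-gated case: row-major [rows, cols] input, one scale per row.
std::vector<float> scaled_srelu_ref(const std::vector<float>& input,
                                    const std::vector<float>& act_scales,
                                    std::size_t rows, std::size_t cols) {
  assert(input.size() == rows * cols);
  assert(act_scales.size() == rows);
  std::vector<float> out(rows * cols);
  for (std::size_t r = 0; r < rows; ++r) {
    for (std::size_t c = 0; c < cols; ++c) {
      out[r * cols + c] = srelu(input[r * cols + c]) * act_scales[r];
    }
  }
  return out;
}
```

A reference of this shape is what a unit test would typically compare the kernel output against, with a tolerance chosen per output dtype.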

Confidence Score: 4/5

Safe to merge; the new kernels are mathematically consistent with the existing utility functions and the test suite covers the primary code paths for both contiguous and interleaved GLU layouts.

The core kernel math, alignment dispatch, and block reduction are correct. The only items worth addressing before shipping are: FP16 is absent from the test dtype sweep even though the dispatch macro includes it, the one-block-per-row launch casts size_t rows to int (wrap-around for multi-billion-token batches), one dead call in the test reference, and one dead variable in the API wrapper. None of these affect correctness for typical workloads.

Files needing attention: scaled_activation.cu (the rows cast in the reduction kernel launch) and test_scaled_activation.cu (missing FP16 coverage and dead reference call).

Important Files Changed

Filename	Overview
transformer_engine/common/activation/scaled_activation.cu	New 781-line CUDA file implementing 6 kernels (scaled forward/backward for SwiGLU, ClampedSwiGLU, SReLU) with vectorized loads, interleaved GLU layout support, and optional per-row scale-gradient reduction; includes dead `Empty` variable and a `size_t`→`int` cast for the grid-launch block count.
tests/cpp/operator/test_scaled_activation.cu	New 321-line test file with a parametrized GTest suite covering forward+backward for all three activations and both interleave modes; has a redundant `gated_unscaled` call in the reference and is missing `kFloat16` from the dtype sweep.
transformer_engine/common/include/transformer_engine/activation.h	Adds public C API declarations for 6 new scaled-activation functions with well-documented Doxygen comments; no issues found.
transformer_engine/common/CMakeLists.txt	Registers `scaled_activation.cu` in both the standard and fast-math source lists; straightforward and correct.
tests/cpp/operator/CMakeLists.txt	Adds `test_scaled_activation.cu` to the `test_operator` executable; no issues.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    API_FWD["nvte_scaled_swiglu /\nnvte_scaled_clamped_swiglu /\nnvte_scaled_srelu"]
    API_BWD["nvte_scaled_dswiglu /\nnvte_scaled_clamped_dswiglu /\nnvte_scaled_dsrelu"]

    API_FWD --> CHK_GATED_FWD{Gated?}
    CHK_GATED_FWD -- "SwiGLU / ClampedSwiGLU" --> ALIGN_FWD[check alignment & segment layout]
    CHK_GATED_FWD -- "SReLU" --> ALIGN_SRELU_FWD[check alignment]

    ALIGN_FWD -- "aligned" --> KFG_VEC["scaled_gated_forward_kernel nvec>1"]
    ALIGN_FWD -- "unaligned" --> KFG_SCAL["scaled_gated_forward_kernel nvec=1"]
    ALIGN_SRELU_FWD -- "aligned" --> KSF_VEC["scaled_srelu_forward_kernel nvec>1"]
    ALIGN_SRELU_FWD -- "unaligned" --> KSF_SCAL["scaled_srelu_forward_kernel nvec=1"]

    API_BWD --> CHK_GATED_BWD{Gated?}
    CHK_GATED_BWD -- "SwiGLU / ClampedSwiGLU" --> CHK_SCALE_G[grad_act_scales?]
    CHK_GATED_BWD -- "SReLU" --> CHK_SCALE_S[grad_act_scales?]

    CHK_SCALE_G -- "null" --> KGB_FLAT["scaled_gated_backward_kernel flat grid"]
    CHK_SCALE_G -- "present" --> KGB_RED["scaled_gated_backward_with_scale_grad_kernel one block per row + warp reduction"]

    CHK_SCALE_S -- "null" --> KSB_FLAT["scaled_srelu_backward_kernel flat grid"]
    CHK_SCALE_S -- "present" --> KSB_RED["scaled_srelu_backward_with_scale_grad_kernel one block per row + warp reduction"]
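
The decision tree in the flowchart can be written down as a small host-side selector. This is only an illustration of the branching: `select_kernel` and the `Pass` enum are hypothetical, and the returned strings simply reuse the kernel labels from the flowchart.

```cpp
#include <string>

enum class Pass { kForward, kBackward };

// Returns the flowchart label of the kernel that would be chosen.
// `aligned` only matters for the forward pass; `has_grad_act_scales`
// only matters for the backward pass, mirroring the flowchart.
std::string select_kernel(Pass pass, bool gated, bool aligned,
                          bool has_grad_act_scales) {
  const std::string family = gated ? "scaled_gated" : "scaled_srelu";
  if (pass == Pass::kForward) {
    return family + "_forward_kernel " + (aligned ? "nvec>1" : "nvec=1");
  }
  return has_grad_act_scales ? family + "_backward_with_scale_grad_kernel"
                             : family + "_backward_kernel";
}
```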

Comments Outside Diff (3)

tests/cpp/operator/test_scaled_activation.cu, lines 326-327

Missing FP16 dtype in test matrix

The implementation dispatches through TRANSFORMER_ENGINE_TYPE_SWITCH_NON_FP8ONLY, which covers float32, float16, and bfloat16. The test matrix only exercises kFloat32 and kBFloat16, leaving kFloat16 untested for both data and scale tensors. An off-by-one in the vector-width calculation or a narrow-type saturation edge case specific to FP16 would pass the current suite undetected.

transformer_engine/common/activation/scaled_activation.cu, lines 1087-1093

Dead Empty variable in nvte_scaled_swiglu

Empty empty = {}; (void)empty; is never passed to any function in this API wrapper — the Empty type is used only inside the CUDA kernels via silu<float, float>(act_in, empty). The declaration and the suppression cast are both dead code and can be removed.
transformer_engine/common/activation/scaled_activation.cu, lines 994-1005

static_cast<int>(rows) may overflow for very large grid launches

The "with scale grad" kernels are launched with <<<static_cast<int>(rows), kReductionThreads, ...>>>. rows is size_t; casting it directly to int silently wraps around for values above INT_MAX (~2.1 billion), producing a negative or near-zero block count and silently writing garbage to grad_act_scales. Using dim3(rows) (which uses the unsigned grid-dimension type) or guarding with NVTE_CHECK(rows <= INT32_MAX, ...) before the launch avoids the truncation. The same pattern appears in launch_scaled_srelu_backward.
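
One way to make the guard concrete is a small host-side check before the launch. This is a sketch under assumptions: `checked_block_count` is a hypothetical helper, and it throws `std::out_of_range` as a stand-in for whatever error macro the project would use. Note that the x-dimension of a CUDA grid is itself limited to 2^31 - 1 blocks on devices of compute capability 3.0 and later, so a row count above `INT_MAX` cannot be launched as one block per row in any case.

```cpp
#include <climits>
#include <cstddef>
#include <stdexcept>

// Hypothetical helper: validate a row count before it is used as the
// block count of a one-block-per-row launch.
inline int checked_block_count(std::size_t rows) {
  if (rows > static_cast<std::size_t>(INT_MAX)) {
    // Fail loudly instead of letting the narrowing cast change the value.
    throw std::out_of_range("row count exceeds the 1-D grid launch limit");
  }
  return static_cast<int>(rows);
}
```

The launch site would then pass `checked_block_count(rows)` instead of `static_cast<int>(rows)`; larger problems would need a grid-stride loop over rows or a 2-D grid.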

Reviews (1): Last reviewed commit: "[pre-commit.ci] auto fixes from pre-comm..."

greptile-apps · 2026-06-16T07:37:44Z

+    }
+  }


Dead gated_unscaled call

gated_unscaled computes unscaled on line 170, but gated_grads unconditionally writes *unscaled on line 171, overwriting it. The first call is dead code — every gated_grads case sets *unscaled before returning, so the result of gated_unscaled is never observed. This should simply be removed.

zhongbozhu · 2026-06-22

Either outdated or imprecise comment.

greptile-apps · 2026-06-22

You're right that "dead code" was imprecise. The more accurate characterization: gated_unscaled is a redundant computation. Its return value is assigned to unscaled, but gated_grads immediately overwrites *unscaled in all three switch cases before returning, so the first assignment is discarded. The function itself is reachable and executes, but its result is never observed.

The substance still holds: the call to gated_unscaled on line 170 can be removed without any behavioral change, since gated_grads on line 171 already computes and sets unscaled as a side-output needed for the scale gradient. Keeping both creates a mild readability issue — a reader might assume the two calls serve different purposes.

A cleaner form would be:

Suggested change

}
}
gated_grads(activation, act_in, linear_in, &dact, &dlinear, &unscaled);
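
Why the unscaled value is needed for the scale gradient follows from the chain rule: with y = s * u, where u is the unscaled gated output, the gradient with respect to s is the row sum of grad * u. A minimal host-side sketch for plain SwiGLU, assuming a contiguous `[rows, 2 * cols]` input with the activation half first (the name `scaled_dswiglu_ref` and the struct are hypothetical, not the test's reference functions):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct ScaledSwiGLUGrads {
  std::vector<float> grad_input;       // [rows, 2 * cols]
  std::vector<float> grad_act_scales;  // [rows]
};

ScaledSwiGLUGrads scaled_dswiglu_ref(const std::vector<float>& grad,
                                     const std::vector<float>& input,
                                     const std::vector<float>& act_scales,
                                     std::size_t rows, std::size_t cols) {
  assert(grad.size() == rows * cols);
  assert(input.size() == rows * 2 * cols);
  assert(act_scales.size() == rows);
  ScaledSwiGLUGrads out{std::vector<float>(rows * 2 * cols),
                        std::vector<float>(rows)};
  for (std::size_t r = 0; r < rows; ++r) {
    const float s = act_scales[r];
    float scale_grad = 0.0f;
    for (std::size_t c = 0; c < cols; ++c) {
      const float a = input[r * 2 * cols + c];         // activation input
      const float g = input[r * 2 * cols + cols + c];  // linear gate
      const float dy = grad[r * cols + c];
      const float sig = 1.0f / (1.0f + std::exp(-a));
      const float silu = a * sig;
      const float dsilu = sig * (1.0f + a * (1.0f - sig));
      const float unscaled = silu * g;  // forward output before scaling
      out.grad_input[r * 2 * cols + c] = dy * s * g * dsilu;
      out.grad_input[r * 2 * cols + cols + c] = dy * s * silu;
      scale_grad += dy * unscaled;  // d(s * u)/ds = u, summed over the row
    }
    out.grad_act_scales[r] = scale_grad;
  }
  return out;
}
```

The unscaled value is computed once per element and feeds only the row sum, which is why a second, separate computation of it adds nothing.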


zhongbozhu · 2026-06-16T08:12:04Z

/te-ci pytorch

vthumbe1503

LGTM. Left minor comments

vthumbe1503 · 2026-06-23T19:59:21Z

+__device__ __forceinline__ float warp_reduce_sum(float value) {
+#pragma unroll
+  for (int offset = THREADS_PER_WARP / 2; offset > 0; offset >>= 1) {
+    value += __shfl_down_sync(0xffffffff, value, offset);
+  }
+  return value;
+}
+
+__device__ __forceinline__ float block_reduce_sum(float value, float *smem) {
+  const int lane = threadIdx.x % THREADS_PER_WARP;
+  const int warp = threadIdx.x / THREADS_PER_WARP;
+
+  value = warp_reduce_sum(value);


I believe we can reuse this from utils.cuh

TransformerEngine/transformer_engine/common/utils.cuh

Lines 491 to 492 in 77054fa

inline __device__ T reduce(T data, const Op &op) {

// only lane 0 holds the result!
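
The "only lane 0 holds the result" caveat applies to any shuffle-down tree reduction, including the `warp_reduce_sum` in the hunk above. A host-side simulation makes this visible; it is a sketch that models `__shfl_down_sync` on a 32-lane warp (an out-of-range source lane returns the caller's own value), and `simulate_warp_reduce_sum` is a hypothetical name.

```cpp
#include <array>

constexpr int kWarpSize = 32;

// Simulates the shuffle-down loop: in each step every lane adds the value
// held by the lane `offset` positions above it.
std::array<float, kWarpSize> simulate_warp_reduce_sum(
    std::array<float, kWarpSize> lanes) {
  for (int offset = kWarpSize / 2; offset > 0; offset >>= 1) {
    // All lanes read the values from before this step, as on the GPU.
    const std::array<float, kWarpSize> before = lanes;
    for (int lane = 0; lane < kWarpSize; ++lane) {
      const int src = lane + offset;
      // Out-of-range source: the shuffle returns the caller's own value.
      lanes[lane] += (src < kWarpSize) ? before[src] : before[lane];
    }
  }
  return lanes;
}
```

With lane values 1 through 32, lane 0 ends at 528 (the full sum) while lane 31 ends at 1024, so a block-level reduction has to read the warp result from lane 0 only.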

vthumbe1503 · 2026-06-23T20:06:16Z

+void nvte_scaled_swiglu(const NVTETensor input, const NVTETensor act_scales, NVTETensor output,
+                        int64_t glu_interleave_size, cudaStream_t stream) {
+  NVTE_API_CALL(nvte_scaled_swiglu);
+  using namespace transformer_engine;
+  Empty empty = {};
+  (void)empty;
+  ClampedSwiGLUParam param = {};
+  launch_scaled_gated_forward<ScaledActivation::kSwiGLU>(
+      input, act_scales, output, glu_interleave_size, param, stream, "nvte_scaled_swiglu");
+}
+
+void nvte_scaled_dswiglu(const NVTETensor grad, const NVTETensor input, const NVTETensor act_scales,
+                         NVTETensor grad_input, NVTETensor grad_act_scales,
+                         int64_t glu_interleave_size, cudaStream_t stream) {
+  NVTE_API_CALL(nvte_scaled_dswiglu);
+  using namespace transformer_engine;
+  ClampedSwiGLUParam param = {};
+  launch_scaled_gated_backward<ScaledActivation::kSwiGLU>(grad, input, act_scales, grad_input,
+                                                          grad_act_scales, glu_interleave_size,
+                                                          param, stream, "nvte_scaled_dswiglu");
+}
+
+void nvte_scaled_clamped_swiglu(const NVTETensor input, const NVTETensor act_scales,
+                                NVTETensor output, float limit, float alpha,
+                                float glu_linear_offset, int64_t glu_interleave_size,
+                                cudaStream_t stream) {
+  NVTE_API_CALL(nvte_scaled_clamped_swiglu);
+  using namespace transformer_engine;
+  ClampedSwiGLUParam param = {limit, alpha, glu_linear_offset};
+  launch_scaled_gated_forward<ScaledActivation::kClampedSwiGLU>(
+      input, act_scales, output, glu_interleave_size, param, stream, "nvte_scaled_clamped_swiglu");
+}
+
+void nvte_scaled_clamped_dswiglu(const NVTETensor grad, const NVTETensor input,
+                                 const NVTETensor act_scales, NVTETensor grad_input,
+                                 NVTETensor grad_act_scales, float limit, float alpha,
+                                 float glu_linear_offset, int64_t glu_interleave_size,
+                                 cudaStream_t stream) {
+  NVTE_API_CALL(nvte_scaled_clamped_dswiglu);
+  using namespace transformer_engine;
+  ClampedSwiGLUParam param = {limit, alpha, glu_linear_offset};
+  launch_scaled_gated_backward<ScaledActivation::kClampedSwiGLU>(
+      grad, input, act_scales, grad_input, grad_act_scales, glu_interleave_size, param, stream,
+      "nvte_scaled_clamped_dswiglu");
+}
+
+void nvte_scaled_srelu(const NVTETensor input, const NVTETensor act_scales, NVTETensor output,
+                       cudaStream_t stream) {
+  NVTE_API_CALL(nvte_scaled_srelu);
+  using namespace transformer_engine;
+  launch_scaled_srelu_forward(input, act_scales, output, stream, "nvte_scaled_srelu");
+}
+
+void nvte_scaled_dsrelu(const NVTETensor grad, const NVTETensor input, const NVTETensor act_scales,
+                        NVTETensor grad_input, NVTETensor grad_act_scales, cudaStream_t stream) {
+  NVTE_API_CALL(nvte_scaled_dsrelu);
+  using namespace transformer_engine;
+  launch_scaled_srelu_backward(grad, input, act_scales, grad_input, grad_act_scales, stream,
+                               "nvte_scaled_dsrelu");
+}


It might be good to move these NVTE API definitions into new files, scaled_swiglu.cu and scaled_srelu.cu, following the pattern of the other activation definitions.

vthumbe1503 · 2026-06-23T20:35:41Z

+  const auto compute_grad_scales = std::get<5>(GetParam());
+
+  if (activation == ScaledActivationCase::kSReLU && interleave != 0) {
+    GTEST_SKIP() << "SReLU is not a GLU activation.";


Nit:

Suggested change

-    GTEST_SKIP() << "SReLU is not a GLU activation.";
+    GTEST_SKIP() << "Interleave has no meaning for SReLU.";

zhongbozhu added 5 commits June 12, 2026 17:28

support scaled swiglu, scaled srelu and scaled clamp swiglu

18d4d2c

Signed-off-by: zhongboz <zhongboz@nvidia.com>

vectorized loading improvement

953c469

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

fix bug for backward kernel

e3ae293

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

optimize

84cbdec

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

fix unit test failure

c73c8ea

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

github-actions Bot added the community-contribution label (PRs from external contributors outside the core maintainers, representing community-driven work) on Jun 16, 2026

zhongbozhu mentioned this pull request Jun 16, 2026

Introduce Mega-C++ to reduce CPU overhead #3099

Open

17 tasks

[pre-commit.ci] auto fixes from pre-commit.com hooks

3eb18a6

for more information, see https://pre-commit.ci

zhongbozhu marked this pull request as ready for review June 16, 2026 07:32

zhongbozhu requested a review from ptrendx as a code owner June 16, 2026 07:32

greptile-apps Bot reviewed Jun 16, 2026

vthumbe1503 reviewed Jun 23, 2026

zhongbozhu wants to merge 6 commits into NVIDIA:main from zhongbozhu:add_support_fused_swiglu
