NVIDIA · shangxiaokang · Apr 15, 2026 · Apr 15, 2026 · Apr 15, 2026 · Apr 20, 2026
diff --git a/docs/debug/1_getting_started.rst b/docs/debug/1_getting_started.rst
@@ -149,10 +149,11 @@ Inspecting the logs
 -------------------
 
 
-Let's look at the files with the logs. Two files will be created:
+Let's look at the files with the logs. At least two files will be created:
 
 1. debug logs.
 2. statistics logs.
+3. optional feature-specific logs (for example, AutoswitchGemm metrics).
 
 Let's look inside them!
 
@@ -214,6 +215,86 @@ The second log file (``nvdlfw_inspect_statistics_logs/nvdlfw_inspect_globalrank-
     INFO - transformer_layer.self_attention.layernorm_qkv_activation_std                 iteration=000004                  value=0.9996
     INFO - transformer_layer.self_attention.layernorm_qkv_activation_l1_norm             iteration=000004                  value=130776.7969
 
+AutoswitchGemm quick guide
+--------------------------
+
+``AutoswitchGemm`` monitors quantization quality and can dynamically switch selected GEMMs
+to high precision when thresholds are exceeded. It supports the normal FP8 paths as well
+as block-scaled formats such as FP8 blockwise and MXFP8, as long as the selected TE module
+routes the GEMM through the AutoswitchGemm runtime hooks.
+
+Example config matching attention and MLP linears:
+
+.. code-block:: yaml
+
+    log_tensor_stats_all:
+      enabled: True
+      layers:
+        layer_types: [linear_qkv, linear_proj, linear_fc1, linear_fc2]
+      transformer_engine:
+        LogTensorStats:
+          enabled: True
+          stats: [max, min, mean, std, dynamic_range, cur_amax]
+          tensors: [activation, gradient, weight]
+          freq: 10
+          start_step: 10
+        AutoswitchGemm:
+          enabled: True
+          gemms: [fprop, dgrad, wgrad]
+          tensors: [activation, weight, gradient]
+          underflow_threshold_pct: 5
+          mse_threshold: 0.1
+          allow_fp8_model_params_dequantized_weight: True
+          direct_high_precision_in_hold_window: True
+          freq: 10
+          start_step: 10
+
+Behavior summary:
+
+1. For each ``(layer, gemm)``, AutoswitchGemm tracks the latest tensor metrics and applies
+   OR logic across monitored tensors: if any tensor exceeds either threshold, the GEMM switches.
+2. Sampling is controlled by ``start_step``, ``end_step`` / ``start_end_list``, and
+   ``freq``. For example, with ``start_step: 10`` and ``freq: 10``, sampling occurs at
+   steps 10, 20, 30, ...
+3. A threshold breach at sampling step ``n`` keeps the affected ``(layer, gemm)`` in
+   high precision through ``n + freq - 1``. The next sampling step refreshes the
+   decision; if thresholds are not breached, the GEMM returns to quantized execution.
+4. If model parameters are stored in a quantized format, set
+   ``allow_fp8_model_params_dequantized_weight: True`` to allow ``fprop`` and
+   ``dgrad`` to switch by using temporary dequantized weights.
+5. Set ``direct_high_precision_in_hold_window: True`` to directly select
+   high-precision tensor plans on non-sampling hold-window iterations. This
+   bypasses runtime quantize->dequantize conversion when high-precision source
+   tensors are available.
+6. When CUDA Graphs are used, sampling and high-precision windows must run in eager
+   mode. Quantized windows can continue using CUDA Graphs if the training framework
+   supports this routing. Megatron-LM support for this workflow depends on the
+   ``autogemm`` branch:
+   https://github.com/shangxiaokang/Megatron-LM/tree/autogemm
+
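+As a worked example of items 2 and 3, the annotated fragment below traces the schedule
+for ``start_step: 10`` and ``freq: 10``. The breach at step 10 and the clean sample at
+step 20 are hypothetical, chosen only to illustrate the hold window:
+
+.. code-block:: yaml
+
+    AutoswitchGemm:
+      enabled: True
+      gemms: [fprop]
+      freq: 10
+      start_step: 10
+      # Hypothetical timeline for one (layer, fprop) pair:
+      #   steps 0-9      : before start_step, not sampled
+      #   step 10        : sampled, a threshold is exceeded
+      #   through step 19: held in high precision (10 + 10 - 1 = 19)
+      #   step 20        : sampled again, no breach, returns to quantized execution
+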
+When AutoswitchGemm is enabled, an additional log directory with one file per rank is created under ``log_dir``:
+
+``nvdlfw_inspect_autoswitchgemm_logs/nvdlfw_inspect_globalrank-<rank>.log``
+
+It contains per-rank, per-iteration metrics such as:
+
+- ``<layer>_<gemm>_<tensor>_underflow_pct``
+- ``<layer>_<gemm>_<tensor>_mse``
+- ``<layer>_<gemm>_quantized_enabled``
+- ``<layer>_<gemm>_disable_until_iter``
+- ``<layer>_<gemm>_switch_blocked_fp8_model_params``
+- ``<layer>_<gemm>_fp8_model_params_dequantized_fallback``
+- ``<layer>_<gemm>_final_decision`` with fields such as
+  ``requested_precision``, ``precision``, ``lhs_quantized``, and ``rhs_quantized``.
+
+A typical Megatron-LM launch exports the debug config and log directory:
+
+.. code-block:: bash
+
+   export ENABLE_NVDFW_INSPECT=1
+   export NVDFW_CONFIG_FILE=/path/to/nvdlfw_inspect_30b.yaml
+   export NVDFW_LOG_DIR=/path/to/output/nvdlfw_logs
+
 Logging using TensorBoard
 -------------------------
 

diff --git a/docs/debug/2_config_file_structure.rst b/docs/debug/2_config_file_structure.rst
@@ -220,6 +220,64 @@ We can use both structs for tensors and GEMMs. The tensors_struct should be nest
             tensor_feature_param2: value
           gemm_feature_param1: value
 
+AutoswitchGemm notes
+--------------------
+
+``AutoswitchGemm`` supports both global and per-GEMM configuration.
+
+- Use ``gemms: [...]`` for one shared policy.
+- Use ``gemms_struct`` to set per-GEMM thresholds.
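+
+A minimal sketch of the per-GEMM form, assuming the threshold keys are accepted inside
+``gemms_struct`` entries in the generic layout shown earlier in this file. The section
+name and the ``wgrad`` values are illustrative, not recommendations:
+
+.. code-block:: yaml
+
+    autoswitch_per_gemm:
+      enabled: True
+      layers:
+        layer_types: [linear_qkv]
+      transformer_engine:
+        AutoswitchGemm:
+          enabled: True
+          gemms_struct:
+            - gemm: fprop
+              underflow_threshold_pct: 5
+              mse_threshold: 0.1
+            - gemm: wgrad
+              # Hypothetical stricter thresholds for weight-gradient GEMMs.
+              underflow_threshold_pct: 2
+              mse_threshold: 0.05
+          freq: 10
+          start_step: 10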
+
+If ``tensors``/``tensors_struct`` are omitted, monitored tensors are inferred from GEMMs:
+
+- ``fprop`` -> ``activation``, ``weight``
+- ``dgrad`` -> ``gradient``, ``weight``
+- ``wgrad`` -> ``activation``, ``gradient``
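+
+For instance, a fragment that omits ``tensors`` relies on this inference (a sketch
+showing only the relevant keys):
+
+.. code-block:: yaml
+
+    AutoswitchGemm:
+      enabled: True
+      # No tensors key: fprop monitors activation and weight,
+      # wgrad monitors activation and gradient.
+      gemms: [fprop, wgrad]
+      underflow_threshold_pct: 5
+      mse_threshold: 0.1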
+
+Other important keys:
+
+- ``underflow_threshold_pct``: switches the GEMM when a monitored tensor's underflow percentage exceeds this value.
+- ``mse_threshold``: switches the GEMM when a monitored tensor's quantization MSE exceeds this value.
+- ``freq``: sampling interval. A sampled threshold breach at iteration ``n`` keeps
+  that ``(layer, gemm)`` in high precision through ``n + freq - 1``.
+- ``start_step`` / ``end_step`` / ``start_end_list``: sampling windows. If ``end_step``
+  is omitted, sampling continues according to ``freq`` after ``start_step``.
+- ``allow_fp8_model_params_dequantized_weight``: allows ``fprop``/``dgrad`` switching
+  for layers with quantized model parameters by using temporary dequantized weights.
+- Schedule alignment: keep ``freq`` and the sampling window identical to those of companion
+  tensor-inspection features such as ``LogTensorStats`` when they share the same
+  layers and tensors.
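+
+The sampling-window keys might be combined as follows. This sketch assumes
+``start_end_list`` takes ``[start, end]`` pairs, as it does for other debug
+features such as ``LogTensorStats``; the step values are illustrative:
+
+.. code-block:: yaml
+
+    AutoswitchGemm:
+      enabled: True
+      gemms: [fprop, dgrad, wgrad]
+      freq: 10
+      # Either a single bounded window ...
+      start_step: 10
+      end_step: 200
+      # ... or several windows (assumed format), instead of start_step/end_step:
+      # start_end_list: [[10, 200], [1000, 1200]]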
+
+Example for attention and MLP linear layers:
+
+.. code-block:: yaml
+
+    log_tensor_stats_all:
+      enabled: True
+      layers:
+        layer_types: [linear_qkv, linear_proj, linear_fc1, linear_fc2]
+      transformer_engine:
+        LogTensorStats:
+          enabled: True
+          stats: [max, min, mean, std, dynamic_range, cur_amax]
+          tensors: [activation, gradient, weight]
+          freq: 10
+          start_step: 10
+        AutoswitchGemm:
+          enabled: True
+          gemms: [fprop, dgrad, wgrad]
+          tensors: [activation, weight, gradient]
+          underflow_threshold_pct: 5
+          mse_threshold: 0.1
+          allow_fp8_model_params_dequantized_weight: True
+          freq: 10
+          start_step: 10
+
+For CUDA Graph training, sampling and high-precision windows must be executed in eager
+mode. Quantized windows may continue to use CUDA Graphs if the training framework routes
+them separately. The Megatron-LM integration used by this example depends on:
+https://github.com/shangxiaokang/Megatron-LM/tree/autogemm
+
 Enabling or Disabling Sections and Features
 -------------------------------------------
 

diff --git a/docs/debug/3_api_features.rst b/docs/debug/3_api_features.rst
@@ -10,6 +10,7 @@ Debug features
 .. autoapiclass:: transformer_engine.debug.features.log_fp8_tensor_stats.LogFp8TensorStats
 .. autoapiclass:: transformer_engine.debug.features.log_nvfp4_tensor_stats.LogNvfp4TensorStats
 .. autoapiclass:: transformer_engine.debug.features.disable_quantization_gemm.DisableQuantizationGEMM
+.. autoapiclass:: transformer_engine.debug.features.autoswitch_gemm.AutoswitchGemm
 .. autoapiclass:: transformer_engine.debug.features.disable_quantization_layer.DisableQuantizationLayer
 .. autoapiclass:: transformer_engine.debug.features.per_tensor_scaling.PerTensorScaling
 .. autoapiclass:: transformer_engine.debug.features.fake_quant.FakeQuant

diff --git a/docs/debug/autoswitch_gemm_example.yaml b/docs/debug/autoswitch_gemm_example.yaml
@@ -0,0 +1,60 @@
+# Example config for transformer_engine.debug.features.autoswitch_gemm.AutoswitchGemm
+#
+# Usage:
+#   import nvdlfw_inspect.api as debug_api
+#   debug_api.initialize(
+#       config_file="docs/debug/autoswitch_gemm_example.yaml",
+#       feature_dirs=["transformer_engine/debug/features"],
+#       log_dir="./log",
+#   )
+#   ...
+#   debug_api.step()  # call once per training step
+
+log_tensor_stats_all:
+  enabled: True
+  layers:
+    # Names may be inferred by Megatron/TE. This matches attention linears and
+    # common MLP/MoE linears used by Qwen3-style models.
+    layer_types: [linear_qkv, linear_proj, linear_fc1, linear_fc2]
+  transformer_engine:
+    LogTensorStats:
+      enabled: True
+      stats: [max, min, mean, std, dynamic_range, cur_amax]
+      tensors: [activation, gradient, weight]
+      # Match AutoswitchGemm's schedule when both features share the same
+      # inspect_tensor_enabled API calls.
+      freq: 10
+      start_step: 10
+
+    AutoswitchGemm:
+      enabled: True
+
+      # Enable all GEMM paths. If tensors are omitted, AutoswitchGemm infers:
+      # fprop -> [activation, weight]
+      # dgrad -> [gradient, weight]
+      # wgrad -> [activation, gradient]
+      gemms: [fprop, dgrad, wgrad]
+      tensors: [activation, weight, gradient]
+
+      # Switch to high precision when any monitored tensor for the GEMM
+      # exceeds either threshold.
+      underflow_threshold_pct: 5
+      mse_threshold: 0.1
+
+      # If model parameters are stored in a quantized format, fprop/dgrad can
+      # switch to high precision by using temporary dequantized weights.
+      allow_fp8_model_params_dequantized_weight: True
+
+      # Optional: on non-sampling steps inside a hold window, route directly to
+      # high-precision plans when source tensors are available in bf16/fp16.
+      # This avoids the quantize->dequantize round trip in the runtime hooks.
+      direct_high_precision_in_hold_window: True
+
+      # Start sampling at step 10, then sample every 10 steps. A threshold
+      # breach at step N keeps that (layer, GEMM) in high precision through
+      # step N + freq - 1. The next sampling step refreshes the decision.
+      freq: 10
+      start_step: 10
+
+# Autoswitch per-rank metrics are written to:
+#   <log_dir>/nvdlfw_inspect_autoswitchgemm_logs/nvdlfw_inspect_globalrank-<rank>.log