83 changes: 82 additions & 1 deletion docs/debug/1_getting_started.rst
@@ -149,10 +149,11 @@ Inspecting the logs
-------------------


Let's look at the files with the logs. Two files will be created:
Let's look at the files with the logs. At least two files will be created:

1. debug logs.
2. statistics logs.
3. optional feature-specific logs (for example AutoswitchGemm metrics).

Let's look inside them!

@@ -214,6 +215,86 @@ The second log file (``nvdlfw_inspect_statistics_logs/nvdlfw_inspect_globalrank-
INFO - transformer_layer.self_attention.layernorm_qkv_activation_std iteration=000004 value=0.9996
INFO - transformer_layer.self_attention.layernorm_qkv_activation_l1_norm iteration=000004 value=130776.7969
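
To work with these statistics programmatically, the lines can be parsed with a small
script. The following is a minimal sketch, not part of Transformer Engine:
``parse_stat_lines`` and ``LINE_RE`` are hypothetical names, and the pattern assumes
that every record has the ``<name> iteration=<n> value=<x>`` shape shown above.

.. code-block:: python

    import re
    from collections import defaultdict

    # Assumed record shape, taken from the sample log lines:
    #   INFO - <stat_name> iteration=000004 value=0.9996
    LINE_RE = re.compile(
        r"(?P<name>\S+)\s+iteration=(?P<iteration>\d+)\s+value=(?P<value>\S+)"
    )


    def parse_stat_lines(lines):
        """Group values as {stat_name: {iteration: value}}; skip non-matching lines."""
        stats = defaultdict(dict)
        for line in lines:
            match = LINE_RE.search(line)
            if match is None:
                continue
            stats[match["name"]][int(match["iteration"])] = float(match["value"])
        return dict(stats)


    sample = [
        "INFO - transformer_layer.self_attention.layernorm_qkv_activation_std"
        " iteration=000004 value=0.9996",
        "INFO - transformer_layer.self_attention.layernorm_qkv_activation_l1_norm"
        " iteration=000004 value=130776.7969",
    ]
    parsed = parse_stat_lines(sample)

In practice the same function could be fed the lines of a per-rank log file instead of
the in-memory ``sample`` list.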

AutoswitchGemm quick guide
--------------------------

``AutoswitchGemm`` monitors quantization quality and can dynamically switch selected GEMMs
to high precision when thresholds are exceeded. It supports the normal FP8 paths as well
as block-scaled formats such as FP8 blockwise and MXFP8, as long as the selected TE module
routes the GEMM through the AutoswitchGemm runtime hooks.

Example config matching attention and MLP linears:

.. code-block:: yaml

    log_tensor_stats_all:
      enabled: True
      layers:
        layer_types: [linear_qkv, linear_proj, linear_fc1, linear_fc2]
      transformer_engine:
        LogTensorStats:
          enabled: True
          stats: [max, min, mean, std, dynamic_range, cur_amax]
          tensors: [activation, gradient, weight]
          freq: 10
          start_step: 10
        AutoswitchGemm:
          enabled: True
          gemms: [fprop, dgrad, wgrad]
          tensors: [activation, weight, gradient]
          underflow_threshold_pct: 5
          mse_threshold: 0.1
          allow_fp8_model_params_dequantized_weight: True
          direct_high_precision_in_hold_window: True
          freq: 10
          start_step: 10

Behavior summary:

1. For each ``(layer, gemm)``, AutoswitchGemm tracks the latest tensor metrics and applies
   OR logic across the monitored tensors: if any tensor exceeds a threshold, that GEMM
   switches to high precision.
2. Sampling is controlled by ``start_step``, ``end_step`` / ``start_end_list``, and
   ``freq``. For example, ``start_step: 10`` and ``freq: 10`` samples at steps
   10, 20, 30, ...
3. A threshold breach at sampling step ``n`` keeps the affected ``(layer, gemm)`` in
   high precision through step ``n + freq - 1``. The next sampling step refreshes the
   decision; if no threshold is breached there, the GEMM returns to quantized execution.
4. If model parameters are stored in a quantized format, set
   ``allow_fp8_model_params_dequantized_weight: True`` to allow ``fprop`` and
   ``dgrad`` to switch by using temporary dequantized weights.
5. Set ``direct_high_precision_in_hold_window: True`` to select high-precision tensor
   plans directly on non-sampling hold-window iterations. This bypasses the runtime
   quantize->dequantize conversion when high-precision source tensors are available.
6. When CUDA Graphs are used, sampling and high-precision windows must run in eager
   mode. Quantized windows can continue using CUDA Graphs if the training framework
   supports this routing. Megatron-LM support for this workflow depends on the
   ``autogemm`` branch:
   https://github.com/shangxiaokang/Megatron-LM/tree/autogemm
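
The sampling schedule, the OR logic, and the hold window described in items 1 to 3 can
be illustrated with a small simulation. This is only a sketch under stated assumptions,
not the AutoswitchGemm implementation: all function names are hypothetical, sampling
steps are assumed to be the multiples of ``freq`` at or after ``start_step``, a breach
is assumed to mean "strictly greater than the threshold", and the sampling step itself
is assumed to count as part of the high-precision window.

.. code-block:: python

    def is_sampling_step(step, start_step, freq):
        # Assumption: sampling happens on multiples of freq at or after start_step.
        return step >= start_step and step % freq == 0


    def breaches(metrics, underflow_threshold_pct, mse_threshold):
        # OR logic: one tensor exceeding either threshold is enough to switch.
        return any(
            m["underflow_pct"] > underflow_threshold_pct or m["mse"] > mse_threshold
            for m in metrics.values()
        )


    def simulate(metrics_by_step, start_step, freq, num_steps,
                 underflow_threshold_pct=5, mse_threshold=0.1):
        """Return the precision chosen for one (layer, gemm) at every step."""
        hold_until = -1  # last step of the current high-precision window
        plan = []
        for step in range(num_steps):
            if is_sampling_step(step, start_step, freq):
                sampled = metrics_by_step.get(step, {})
                if breaches(sampled, underflow_threshold_pct, mse_threshold):
                    hold_until = step + freq - 1
            plan.append("high_precision" if step <= hold_until else "quantized")
        return plan


    # Hypothetical metrics: the activation underflows at step 10 and recovers at step 20.
    metrics_by_step = {
        10: {"activation": {"underflow_pct": 7.5, "mse": 0.01},
             "weight": {"underflow_pct": 0.2, "mse": 0.02}},
        20: {"activation": {"underflow_pct": 1.0, "mse": 0.01},
             "weight": {"underflow_pct": 0.2, "mse": 0.02}},
    }
    plan = simulate(metrics_by_step, start_step=10, freq=10, num_steps=30)

With these inputs the simulated GEMM is quantized for steps 0 to 9, runs in high
precision for steps 10 to 19, and returns to quantized execution from step 20 onward.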

When AutoswitchGemm is enabled, an additional directory is created under ``log_dir``:

``nvdlfw_inspect_autoswitchgemm_logs/nvdlfw_inspect_globalrank-<rank>.log``

It contains per-rank, per-iteration metrics such as:

- ``<layer>_<gemm>_<tensor>_underflow_pct``
- ``<layer>_<gemm>_<tensor>_mse``
- ``<layer>_<gemm>_quantized_enabled``
- ``<layer>_<gemm>_disable_until_iter``
- ``<layer>_<gemm>_switch_blocked_fp8_model_params``
- ``<layer>_<gemm>_fp8_model_params_dequantized_fallback``
- ``<layer>_<gemm>_final_decision`` with fields such as
  ``requested_precision``, ``precision``, ``lhs_quantized``, and ``rhs_quantized``.

A typical Megatron-LM launch exports the debug config and log directory:

.. code-block:: bash

    export ENABLE_NVDFW_INSPECT=1
    export NVDFW_CONFIG_FILE=/path/to/nvdlfw_inspect_30b.yaml
    export NVDFW_LOG_DIR=/path/to/output/nvdlfw_logs

Logging using TensorBoard
-------------------------

58 changes: 58 additions & 0 deletions docs/debug/2_config_file_structure.rst
@@ -220,6 +220,64 @@ We can use both structs for tensors and GEMMs. The tensors_struct should be nest
tensor_feature_param2: value
gemm_feature_param1: value

AutoswitchGemm notes
--------------------

``AutoswitchGemm`` supports both global and per-GEMM configuration.

- Use ``gemms: [...]`` for one shared policy.
- Use ``gemms_struct`` to set per-GEMM thresholds.

If ``tensors``/``tensors_struct`` are omitted, monitored tensors are inferred from GEMMs:

- ``fprop`` -> ``activation``, ``weight``
- ``dgrad`` -> ``gradient``, ``weight``
- ``wgrad`` -> ``activation``, ``gradient``
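
The inference rule above amounts to a small lookup table. The snippet below is an
illustrative sketch only; ``GEMM_TO_TENSORS`` and ``infer_tensors`` are hypothetical
names and do not refer to the actual AutoswitchGemm code.

.. code-block:: python

    # Mapping copied from the list above: the operands of each GEMM.
    GEMM_TO_TENSORS = {
        "fprop": ("activation", "weight"),
        "dgrad": ("gradient", "weight"),
        "wgrad": ("activation", "gradient"),
    }


    def infer_tensors(gemms):
        """Return the monitored tensors for each configured GEMM."""
        return {gemm: list(GEMM_TO_TENSORS[gemm]) for gemm in gemms}


    inferred = infer_tensors(["fprop", "wgrad"])

Here ``inferred`` maps ``fprop`` to ``activation`` and ``weight``, and ``wgrad`` to
``activation`` and ``gradient``; ``dgrad`` is absent because it was not configured.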

Other important keys:

- ``underflow_threshold_pct``: switch trigger based on underflow percentage.
- ``mse_threshold``: switch trigger based on quantization MSE.
- ``freq``: sampling interval. A sampled threshold breach at iteration ``n`` keeps
  that ``(layer, gemm)`` in high precision through iteration ``n + freq - 1``.
- ``start_step`` / ``end_step`` / ``start_end_list``: sampling windows. If ``end_step``
  is omitted, sampling continues according to ``freq`` after ``start_step``.
- ``allow_fp8_model_params_dequantized_weight``: allows ``fprop``/``dgrad`` switching
  for layers with quantized model parameters by using temporary dequantized weights.

``AutoswitchGemm`` should use the same ``freq`` and sampling window as companion
tensor-inspection features such as ``LogTensorStats`` when they share the same
layers and tensors.
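
The recommendation to keep the schedules aligned can be checked before launching a
run. The sketch below is an assumption-laden illustration: ``schedule_mismatches`` is
a hypothetical helper, and the ``features`` dictionary stands in for the
``transformer_engine`` block of a config that has already been loaded into Python.

.. code-block:: python

    SCHEDULE_KEYS = ("freq", "start_step", "end_step", "start_end_list")


    def schedule_mismatches(features, first="LogTensorStats", second="AutoswitchGemm"):
        """Return {key: (first_value, second_value)} for schedule keys that differ."""
        a = features.get(first, {})
        b = features.get(second, {})
        return {
            key: (a.get(key), b.get(key))
            for key in SCHEDULE_KEYS
            if a.get(key) != b.get(key)
        }


    # Stand-in for the parsed ``transformer_engine`` section of a config file.
    features = {
        "LogTensorStats": {"enabled": True, "freq": 10, "start_step": 10},
        "AutoswitchGemm": {"enabled": True, "freq": 20, "start_step": 10},
    }
    mismatches = schedule_mismatches(features)

An empty result means both features sample on the same schedule; in this example the
differing ``freq`` values are reported.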

Example for attention and MLP linear layers:

.. code-block:: yaml

    log_tensor_stats_all:
      enabled: True
      layers:
        layer_types: [linear_qkv, linear_proj, linear_fc1, linear_fc2]
      transformer_engine:
        LogTensorStats:
          enabled: True
          stats: [max, min, mean, std, dynamic_range, cur_amax]
          tensors: [activation, gradient, weight]
          freq: 10
          start_step: 10
        AutoswitchGemm:
          enabled: True
          gemms: [fprop, dgrad, wgrad]
          tensors: [activation, weight, gradient]
          underflow_threshold_pct: 5
          mse_threshold: 0.1
          allow_fp8_model_params_dequantized_weight: True
          freq: 10
          start_step: 10

For CUDA Graph training, sampling and high-precision windows must be executed in eager
mode. Quantized windows may continue to use CUDA Graphs if the training framework routes
them separately. The Megatron-LM integration used by this example depends on:
https://github.com/shangxiaokang/Megatron-LM/tree/autogemm

Enabling or Disabling Sections and Features
-------------------------------------------

1 change: 1 addition & 0 deletions docs/debug/3_api_features.rst
@@ -10,6 +10,7 @@ Debug features
.. autoapiclass:: transformer_engine.debug.features.log_fp8_tensor_stats.LogFp8TensorStats
.. autoapiclass:: transformer_engine.debug.features.log_nvfp4_tensor_stats.LogNvfp4TensorStats
.. autoapiclass:: transformer_engine.debug.features.disable_quantization_gemm.DisableQuantizationGEMM
.. autoapiclass:: transformer_engine.debug.features.autoswitch_gemm.AutoswitchGemm
.. autoapiclass:: transformer_engine.debug.features.disable_quantization_layer.DisableQuantizationLayer
.. autoapiclass:: transformer_engine.debug.features.per_tensor_scaling.PerTensorScaling
.. autoapiclass:: transformer_engine.debug.features.fake_quant.FakeQuant
60 changes: 60 additions & 0 deletions docs/debug/autoswitch_gemm_example.yaml
@@ -0,0 +1,60 @@
# Example config for transformer_engine.debug.features.autoswitch_gemm.AutoswitchGemm
#
# Usage:
# import nvdlfw_inspect.api as debug_api
# debug_api.initialize(
# config_file="docs/debug/autoswitch_gemm_example.yaml",
# feature_dirs=["transformer_engine/debug/features"],
# log_dir="./log",
# )
# ...
# debug_api.step() # call once per training step

log_tensor_stats_all:
  enabled: True
  layers:
    # Names may be inferred by Megatron/TE. This matches attention linears and
    # common MLP/MoE linears used by Qwen3-style models.
    layer_types: [linear_qkv, linear_proj, linear_fc1, linear_fc2]
  transformer_engine:
    LogTensorStats:
      enabled: True
      stats: [max, min, mean, std, dynamic_range, cur_amax]
      tensors: [activation, gradient, weight]
      # Match AutoswitchGemm's schedule when both features share the same
      # inspect_tensor_enabled API calls.
      freq: 10
      start_step: 10

    AutoswitchGemm:
      enabled: True

      # Enable all GEMM paths. If tensors are omitted, AutoswitchGemm infers:
      #   fprop -> [activation, weight]
      #   dgrad -> [gradient, weight]
      #   wgrad -> [activation, gradient]
      gemms: [fprop, dgrad, wgrad]
      tensors: [activation, weight, gradient]

      # Switch to high precision when any monitored tensor for the GEMM
      # exceeds either threshold.
      underflow_threshold_pct: 5
      mse_threshold: 0.1

      # If model parameters are stored in a quantized format, fprop/dgrad can
      # switch to high precision by using temporary dequantized weights.
      allow_fp8_model_params_dequantized_weight: True

      # Optional: in hold-window non-sampling steps, route directly to
      # high-precision plans when source tensors are available in bf16/fp16.
      # This avoids quantize->dequantize conversion in runtime hooks.
      direct_high_precision_in_hold_window: True

      # Start sampling at step 10, then sample every 10 steps. A threshold
      # breach at step N keeps that (layer, GEMM) in high precision through
      # step N + freq - 1. The next sampling step refreshes the decision.
      freq: 10
      start_step: 10

# Autoswitch per-rank metrics are written to:
# <log_dir>/nvdlfw_inspect_autoswitchgemm_logs/nvdlfw_inspect_globalrank-<rank>.log