QuantConfig
class max.nn.QuantConfig(input_scale, weight_scale, mlp_quantized_layers, attn_quantized_layers, format, embedding_output_dtype=None, bias_dtype=None, can_use_fused_mlp=False, scales_pre_interleaved=False)
Bases: object
Configures scaled quantization settings for a layer or model section.
Parameters:
- input_scale (InputScaleSpec)
- weight_scale (WeightScaleSpec)
- mlp_quantized_layers (set[int])
- attn_quantized_layers (set[int])
- format (QuantFormat)
- embedding_output_dtype (DType | None)
- bias_dtype (DType | None)
- can_use_fused_mlp (bool)
- scales_pre_interleaved (bool)
attn_quantized_layers
Set of layer indices with quantized attention QKV projections.
QKV projections are treated as all-or-nothing per layer: either every one of {q,k,v,o}_proj is quantized, or all remain bfloat16.
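A minimal sketch (not the MAX API) of how a set of layer indices can drive this all-or-nothing rule, with a hypothetical helper that picks one dtype for an entire layer's QKV projections:

```python
# Hypothetical helper, not part of max.nn: choose a dtype per layer from a
# set of quantized layer indices, mirroring the all-or-nothing rule above.
def qkv_dtype(layer_idx: int, attn_quantized_layers: set[int]) -> str:
    # Every one of {q,k,v,o}_proj in a layer gets the same treatment.
    return "fp4" if layer_idx in attn_quantized_layers else "bfloat16"

quantized = {0, 1, 2, 5}
assert qkv_dtype(5, quantized) == "fp4"
assert qkv_dtype(4, quantized) == "bfloat16"
```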
bias_dtype
The DType of bias weights.
can_use_fused_mlp
can_use_fused_mlp: bool = False
Whether the quantization scales can be used with fused MLP operations.
embedding_output_dtype
The DType of the output from the embedding layer.
format
format: QuantFormat
The QuantFormat identifying the quantization format.
input_scale
input_scale: InputScaleSpec
InputScaleSpec for input activation scaling.
is_dynamic
property is_dynamic: bool
True if this input scale is dynamic.
is_fp4
property is_fp4: bool
True if this config represents any FP4 variant (NVFP4 or MXFP4).
is_mxfp4
property is_mxfp4: bool
True if this config represents MXFP4 quantization.
is_nvfp4
property is_nvfp4: bool
True if this config represents ModelOpt NVFP4 quantization.
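The relationship between these properties can be sketched with a stand-in enum. This is illustrative only: the real QuantFormat members are not listed on this page, so FP8, NVFP4, and MXFP4 below are assumed names.

```python
from enum import Enum, auto

# Stand-in for QuantFormat; member names are assumptions for illustration.
class QuantFormat(Enum):
    FP8 = auto()
    NVFP4 = auto()
    MXFP4 = auto()

def is_fp4(fmt: QuantFormat) -> bool:
    # True for any FP4 variant, matching the is_fp4 property's contract.
    return fmt in (QuantFormat.NVFP4, QuantFormat.MXFP4)

assert is_fp4(QuantFormat.NVFP4) and is_fp4(QuantFormat.MXFP4)
assert not is_fp4(QuantFormat.FP8)
```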
is_static
property is_static: bool
True if this input scale is static.
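The static/dynamic distinction can be sketched as follows. This is a conceptual example, not MAX code: a static scale is frozen at calibration time, while a dynamic scale is recomputed from the live activations on every call. The qmax value and granularity are illustrative only.

```python
# Dynamic scaling: recomputed per call from the current tensor's abs-max.
def dynamic_scale(activations: list[float], qmax: float = 448.0) -> float:
    return max(abs(v) for v in activations) / qmax

# Static scaling: a constant chosen during offline calibration (illustrative).
STATIC_SCALE = 0.015

xs = [0.5, -2.24, 1.1]
assert abs(dynamic_scale(xs) - 2.24 / 448.0) < 1e-12
```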
mlp_quantized_layers
Set of layer indices with quantized MLPs.
MLPs are treated as all-or-nothing per layer: either gate_proj, down_proj, and up_proj are all quantized, or all remain bfloat16.
quantized_scales_type()
quantized_scales_type(quantized_shape, device_ref)
The TensorType of the scales tensor after dynamic quantization.
Parameters:

- quantized_shape
- device_ref

Return type:

TensorType
scales_granularity_mnk
The weight and input scale granularities on the M, N, and K axes.
scales_pre_interleaved
scales_pre_interleaved: bool = False
Whether weight scales in the checkpoint are already stored in the 5D TCGEN-interleaved layout expected by the FP4 matmul kernel (NVFP4 only). Note that scales in the 5D TCGEN-interleaved layout are typically flattened to 2D [M, K//16] in the checkpoint.
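A sketch of the 2D scales shape mentioned above, under the assumption (consistent with the [M, K//16] checkpoint layout) that NVFP4 assigns one scale per block of 16 weights along K; the helper name is hypothetical:

```python
# Hypothetical helper: shape of the flattened 2D scales tensor for an
# [M, K] weight matrix, assuming one scale per block of 16 weights along K.
def nvfp4_scales_shape(m: int, k: int, block: int = 16) -> tuple[int, int]:
    assert k % block == 0, "K must be a multiple of the scale block size"
    return (m, k // block)

# An [4096, 14336] weight matrix carries a [4096, 896] scales tensor.
assert nvfp4_scales_shape(4096, 14336) == (4096, 896)
```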
weight_scale
weight_scale: WeightScaleSpec
WeightScaleSpec for weight scaling.