Python class
QuantConfig
class max.nn.QuantConfig(input_scale, weight_scale, mlp_quantized_layers, attn_quantized_layers, format, embedding_output_dtype=None, bias_dtype=None, can_use_fused_mlp=False, scales_pre_interleaved=False)
Bases: object
Configures scaled quantization settings for a layer or model section.
For example, to configure NVFP4 block-scaled quantization for all layers in a 19-layer model:
from max.dtype import DType
from max.nn import QuantConfig, QuantFormat
from max.nn.quant_config import (
    InputScaleSpec,
    ScaleGranularity,
    ScaleOrigin,
    WeightScaleSpec,
)

all_layers = set(range(19))
input_spec = InputScaleSpec(
    granularity=ScaleGranularity.BLOCK,
    origin=ScaleOrigin.STATIC,
    dtype=DType.float32,
    block_size=(1, 16),
)
weight_spec = WeightScaleSpec(
    granularity=ScaleGranularity.BLOCK,
    dtype=DType.float8_e4m3fn,
    block_size=(1, 8),
)
config = QuantConfig(
    input_scale=input_spec,
    weight_scale=weight_spec,
    mlp_quantized_layers=all_layers,
    attn_quantized_layers=all_layers,
    format=QuantFormat.NVFP4,
)
Parameters:
- input_scale (InputScaleSpec)
- weight_scale (WeightScaleSpec)
- mlp_quantized_layers (set[int])
- attn_quantized_layers (set[int])
- format (QuantFormat)
- embedding_output_dtype (DType | None)
- bias_dtype (DType | None)
- can_use_fused_mlp (bool)
- scales_pre_interleaved (bool)
attn_quantized_layers
Set of layer indices with quantized attention projections.
Attention projections are quantized on an all-or-nothing basis per layer:
either all of q_proj, k_proj, v_proj, and o_proj are
quantized, or all four remain in bfloat16.
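Because quantization is controlled per layer through the index set, partial quantization is just set arithmetic. For example, a common accuracy heuristic (not a MAX requirement) is to keep the first and last layers in bfloat16; a plain-Python sketch of building such a set:

```python
num_layers = 19

# Quantize attention in all layers except the first and last, which stay
# in bfloat16 (a common accuracy heuristic, not a MAX requirement).
attn_layers = set(range(num_layers)) - {0, num_layers - 1}

print(len(attn_layers))  # 17 layers quantized
```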
bias_dtype
The DType of bias weights.
can_use_fused_mlp
can_use_fused_mlp: bool = False
Whether the quantization scales can be used with fused MLP operations.
embedding_output_dtype
The DType of the output from the embedding layer.
format
format: QuantFormat
The QuantFormat identifying the quantization format.
input_scale
input_scale: InputScaleSpec
InputScaleSpec for input activation scaling.
is_dynamic
property is_dynamic: bool
True if this config's input scale is dynamic.
is_fp4
property is_fp4: bool
True if this config represents any FP4 variant (NVFP4 or MXFP4).
is_mxfp4
property is_mxfp4: bool
True if this config represents MXFP4 quantization.
is_nvfp4
property is_nvfp4: bool
True if this config represents modelopt NVFP4.
is_static
property is_static: bool
True if this config's input scale is static.
mlp_quantized_layers
Set of layer indices with quantized MLPs.
MLPs are quantized on an all-or-nothing basis per layer: either all of
gate_proj, down_proj, and up_proj are quantized, or all three
remain in bfloat16.
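The all-or-nothing rule means a single set-membership test determines the dtype of every MLP projection in a layer. A minimal sketch (the helper name and dtype strings are illustrative, not MAX API):

```python
mlp_quantized_layers = {0, 1, 2, 5}

def mlp_proj_dtype(layer_idx: int) -> str:
    # gate_proj, down_proj, and up_proj always share the same dtype,
    # so one membership check covers all three projections.
    return "quantized" if layer_idx in mlp_quantized_layers else "bfloat16"

print(mlp_proj_dtype(1), mlp_proj_dtype(3))  # quantized bfloat16
```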
quantized_scales_type()
quantized_scales_type(quantized_shape, device_ref)
The TensorType of the scales tensor after dynamic quantization.
Parameters:
- quantized_shape
- device_ref
Return type:
TensorType
scales_granularity_mnk
The weight and input scale granularities along the M, N, and K matmul axes (M and N index the output; K is the reduction axis).
scales_pre_interleaved
scales_pre_interleaved: bool = False
Whether weight scales in the checkpoint are already stored in the 5D
TCGEN-interleaved layout expected by the FP4 matmul kernel (NVFP4 only).
Note that scales in the 5D TCGEN-interleaved layout are typically flattened
to 2D [M, K//16] in the checkpoint.
weight_scale
weight_scale: WeightScaleSpec
WeightScaleSpec for weight scaling.