Python class

QuantConfig

class max.nn.QuantConfig(input_scale, weight_scale, mlp_quantized_layers, attn_quantized_layers, format, embedding_output_dtype=None, bias_dtype=None, can_use_fused_mlp=False, scales_pre_interleaved=False)

Bases: object

Configures scaled quantization settings for a layer or model section.
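
For orientation, here is a minimal construction sketch. It assumes that InputScaleSpec, WeightScaleSpec, and QuantFormat instances have already been built elsewhere (their constructors are not documented on this page) and that QuantConfig is importable from max.nn as shown in the signature above; the bfloat16 choices for the embedding output and bias dtypes are purely illustrative.

```python
from max.dtype import DType
from max.nn import QuantConfig


def build_quant_config(input_scale, weight_scale, quant_format, num_layers):
    # Hypothetical helper: quantize the MLP and attention projections of the
    # first `num_layers` layers, keeping embedding output and biases in bfloat16.
    quantized_layers = set(range(num_layers))
    return QuantConfig(
        input_scale=input_scale,        # an InputScaleSpec
        weight_scale=weight_scale,      # a WeightScaleSpec
        mlp_quantized_layers=quantized_layers,
        attn_quantized_layers=quantized_layers,
        format=quant_format,            # a QuantFormat
        embedding_output_dtype=DType.bfloat16,  # illustrative choice
        bias_dtype=DType.bfloat16,              # illustrative choice
    )
```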

Parameters:

attn_quantized_layers

attn_quantized_layers: set[int]

Set of layer indices with quantized attention QKV projections.

QKV projections are quantized on an all-or-nothing basis per layer: either all of {q,k,v,o}_proj are quantized, or all of them are bfloat16.
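
Downstream code therefore only needs a single membership test per layer. A minimal sketch (attention_is_quantized is a hypothetical helper, not part of max.nn):

```python
from max.nn import QuantConfig


def attention_is_quantized(config: QuantConfig, layer_idx: int) -> bool:
    # QKV/O projections are all-or-nothing per layer, so one membership
    # test covers every projection in that layer.
    return layer_idx in config.attn_quantized_layers
```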

bias_dtype

bias_dtype: DType | None = None

The DType of bias weights.

can_use_fused_mlp

can_use_fused_mlp: bool = False

Whether the quantization scales can be used with fused MLP operations.

embedding_output_dtype

embedding_output_dtype: DType | None = None

The DType of the output from the embedding layer.

format

format: QuantFormat

The QuantFormat identifying the quantization format.

input_scale

input_scale: InputScaleSpec

InputScaleSpec for input activation scaling.

is_dynamic

property is_dynamic: bool

True if this config's input scale is dynamic.

is_fp4

property is_fp4: bool

True if this config represents any FP4 variant (NVFP4 or MXFP4).

is_mxfp4

property is_mxfp4: bool

True if this config represents MXFP4 quantization.

is_nvfp4

property is_nvfp4: bool

True if this config represents modelopt NVFP4.
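
The three FP4 properties are related: is_fp4 holds whenever either specific variant does. A hypothetical dispatch helper (not part of max.nn) could use them like this:

```python
from max.nn import QuantConfig


def fp4_variant_name(config: QuantConfig) -> str | None:
    # Check the specific variants first; is_fp4 is True for either of them.
    if config.is_nvfp4:
        return "nvfp4"
    if config.is_mxfp4:
        return "mxfp4"
    return None  # not an FP4 config (is_fp4 is False)
```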

is_static

property is_static: bool

True if this config's input scale is static.

mlp_quantized_layers

mlp_quantized_layers: set[int]

Set of layer indices with quantized MLPs.

MLPs are quantized on an all-or-nothing basis per layer: either all of gate_proj, down_proj, and up_proj are quantized, or all of them are bfloat16.
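
A loader that keeps some layers unquantized can express that policy entirely through this set. A sketch with a hypothetical helper; the "skip the first and last layer" choice below is only an illustration, not a recommendation:

```python
def mlp_layers_to_quantize(num_layers: int, skip: set[int]) -> set[int]:
    # Every index in the returned set gets a fully quantized MLP
    # (gate_proj, down_proj, and up_proj); all other layers stay bfloat16.
    return {i for i in range(num_layers) if i not in skip}


# Example: quantize the MLPs of all layers except the first and last.
mlp_quantized = mlp_layers_to_quantize(num_layers=32, skip={0, 31})
```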

quantized_scales_type()

quantized_scales_type(quantized_shape, device_ref)

The TensorType of the scales tensor after dynamic quantization.

Parameters:

Return type:

TensorType

scales_granularity_mnk

property scales_granularity_mnk: tuple[int, int, int]

The weight and input scale granularities on the M, N, and K axes.

scales_pre_interleaved

scales_pre_interleaved: bool = False

Whether weight scales in the checkpoint are already stored in the 5D TCGEN-interleaved layout expected by the FP4 matmul kernel (NVFP4 only). Scales in this layout are typically flattened to 2D [M, K//16] in the checkpoint.
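
As a quick sanity check of that flattened shape, the sketch below applies only the [M, K//16] relationship stated above; the helper name is hypothetical:

```python
def flattened_nvfp4_scale_shape(m: int, k: int) -> tuple[int, int]:
    # One scale per 16-element block along K, flattened from the 5D
    # TCGEN-interleaved layout to 2D [M, K // 16] in the checkpoint.
    if k % 16 != 0:
        raise ValueError("K must be a multiple of the 16-element scale block")
    return (m, k // 16)


# Example: a [4096, 4096] weight stores a [4096, 256] scales tensor.
assert flattened_nvfp4_scale_shape(4096, 4096) == (4096, 256)
```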

weight_scale

weight_scale: WeightScaleSpec

WeightScaleSpec for weight scaling.