Python class

QuantConfig

class max.nn.QuantConfig(input_scale, weight_scale, mlp_quantized_layers, attn_quantized_layers, format, embedding_output_dtype=None, bias_dtype=None, can_use_fused_mlp=False, scales_pre_interleaved=False)

Bases: object

Configures scaled quantization settings for a layer or model section.

For example, to configure NVFP4 block-scaled quantization for all layers in a 19-layer model:

from max.dtype import DType
from max.nn import QuantConfig, QuantFormat
from max.nn.quant_config import (
    InputScaleSpec,
    ScaleGranularity,
    ScaleOrigin,
    WeightScaleSpec,
)

all_layers = set(range(19))

input_spec = InputScaleSpec(
    granularity=ScaleGranularity.BLOCK,
    origin=ScaleOrigin.STATIC,
    dtype=DType.float32,
    block_size=(1, 16),
)
weight_spec = WeightScaleSpec(
    granularity=ScaleGranularity.BLOCK,
    dtype=DType.float8_e4m3fn,
    block_size=(1, 8),
)
config = QuantConfig(
    input_scale=input_spec,
    weight_scale=weight_spec,
    mlp_quantized_layers=all_layers,
    attn_quantized_layers=all_layers,
    format=QuantFormat.NVFP4,
)

Parameters:

attn_quantized_layers

attn_quantized_layers: set[int]

Set of layer indices with quantized attention projections.

Attention projections are quantized on an all-or-nothing basis per layer: either all of q_proj, k_proj, v_proj, and o_proj are quantized, or all four remain in bfloat16.
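
For example, to keep attention in the first and last layers of a 19-layer model in bfloat16 while quantizing everything else (a minimal sketch reusing input_spec and weight_spec from the class example above):

num_layers = 19
config = QuantConfig(
    input_scale=input_spec,
    weight_scale=weight_spec,
    # MLPs quantized in every layer.
    mlp_quantized_layers=set(range(num_layers)),
    # Attention stays in bfloat16 for the first and last layers.
    attn_quantized_layers=set(range(1, num_layers - 1)),
    format=QuantFormat.NVFP4,
)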

bias_dtype

bias_dtype: DType | None = None

The DType of bias weights.

can_use_fused_mlp

can_use_fused_mlp: bool = False

Whether the quantization scales can be used with fused MLP operations.

embedding_output_dtype

embedding_output_dtype: DType | None = None

The DType of the output from the embedding layer.

format

format: QuantFormat

The QuantFormat identifying the quantization format.

input_scale

input_scale: InputScaleSpec

InputScaleSpec for input activation scaling.

is_dynamic

property is_dynamic: bool

True if the config's input scale is dynamic (computed from the activations at runtime).
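
For instance, a weight-loading path might branch on this property (a minimal sketch; pick_scales_source is a hypothetical helper):

from max.nn import QuantConfig

def pick_scales_source(config: QuantConfig) -> str:
    # Static input scales are precomputed (e.g., during calibration)
    # and loaded from the checkpoint; dynamic scales are derived from
    # the activations at runtime.
    return "checkpoint" if config.is_static else "runtime"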

is_fp4

property is_fp4: bool

True if this config represents any FP4 variant (NVFP4 or MXFP4).

is_mxfp4

property is_mxfp4: bool

True if this config represents MXFP4 quantization.

is_nvfp4

property is_nvfp4: bool

True if this config represents NVFP4 quantization in the NVIDIA ModelOpt style.
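
These predicates can be used to dispatch on the quantization format (a minimal sketch; describe_format is a hypothetical helper):

from max.nn import QuantConfig

def describe_format(config: QuantConfig) -> str:
    # is_fp4 covers both FP4 variants; the specific predicates
    # distinguish between them.
    if not config.is_fp4:
        return "not FP4"
    return "nvfp4" if config.is_nvfp4 else "mxfp4"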

is_static

property is_static: bool

True if the config's input scale is static (precomputed rather than derived at runtime).

mlp_quantized_layers

mlp_quantized_layers: set[int]

Set of layer indices with quantized MLPs.

MLPs are quantized on an all-or-nothing basis per layer: either all of gate_proj, down_proj, and up_proj are quantized, or all three remain in bfloat16.
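
Whether a given decoder layer should build quantized MLP projections then reduces to set membership (a minimal sketch; use_quantized_mlp is a hypothetical helper):

from max.nn import QuantConfig

def use_quantized_mlp(config: QuantConfig, layer_idx: int) -> bool:
    # gate_proj, down_proj, and up_proj for this layer are all
    # quantized if and only if the layer index is in the set.
    return layer_idx in config.mlp_quantized_layers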

quantized_scales_type()

quantized_scales_type(quantized_shape, device_ref)

The TensorType of the scales tensor after dynamic quantization.

Parameters:

quantized_shape – The shape of the quantized tensor.

device_ref – The device on which the resulting scales tensor resides.

Return type:

TensorType
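
A hedged sketch of calling this method, assuming quantized_shape accepts a plain shape tuple and that the tensor lives on the default GPU:

from max.graph import DeviceRef

# Type of the scales produced when dynamically quantizing a
# [1024, 4096] activation tensor (shape-tuple form assumed here).
scales_type = config.quantized_scales_type((1024, 4096), DeviceRef.GPU())
print(scales_type.dtype, scales_type.shape)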

scales_granularity_mnk

property scales_granularity_mnk: tuple[int, int, int]

The weight and input scale granularities on the M, N, and K axes.
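
For example, a kernel-selection path might unpack the granularities (a minimal sketch):

# Elements covered by a single scale along the M, N, and K axes
# of the matmul.
m_gran, n_gran, k_gran = config.scales_granularity_mnk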

scales_pre_interleaved

scales_pre_interleaved: bool = False

Whether weight scales in the checkpoint are already stored in the 5D TCGEN-interleaved layout expected by the FP4 matmul kernel (NVFP4 only). Even when pre-interleaved, the scales are typically stored flattened to 2D [M, K//16] in the checkpoint.

weight_scale

weight_scale: WeightScaleSpec

WeightScaleSpec for weight scaling.