Python class

QuantConfig

class max.nn.QuantConfig(input_scale, weight_scale, mlp_quantized_layers, attn_quantized_layers, format, embedding_output_dtype=None, bias_dtype=None, can_use_fused_mlp=False, scales_pre_interleaved=False)

Bases: object

Configures scaled quantization settings for a layer or model section.

For example, to configure NVFP4 block-scaled quantization for all layers in a 19-layer model:

from max.dtype import DType
from max.nn import QuantConfig, QuantFormat
from max.nn.quant_config import (
    InputScaleSpec,
    ScaleGranularity,
    ScaleOrigin,
    WeightScaleSpec,
)

all_layers = set(range(19))

input_spec = InputScaleSpec(
    granularity=ScaleGranularity.BLOCK,
    origin=ScaleOrigin.STATIC,
    dtype=DType.float32,
    block_size=(1, 16),
)
weight_spec = WeightScaleSpec(
    granularity=ScaleGranularity.BLOCK,
    dtype=DType.float8_e4m3fn,
    block_size=(1, 8),
)
config = QuantConfig(
    input_scale=input_spec,
    weight_scale=weight_spec,
    mlp_quantized_layers=all_layers,
    attn_quantized_layers=all_layers,
    format=QuantFormat.NVFP4,
)

Parameters:

attn_quantized_layers

attn_quantized_layers: set[int]

Set of layer indices with quantized attention projections.

Attention projections are quantized on an all-or-nothing basis per layer: either all of q_proj, k_proj, v_proj, and o_proj are quantized, or all four remain in bfloat16.
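
For example, to keep attention in the first and last layers of a 19-layer model in bfloat16 while quantizing everything else (a minimal sketch reusing input_spec and weight_spec from the class example above):

num_layers = 19
config = QuantConfig(
    input_scale=input_spec,
    weight_scale=weight_spec,
    # MLPs quantized in every layer.
    mlp_quantized_layers=set(range(num_layers)),
    # Attention stays in bfloat16 for the first and last layers.
    attn_quantized_layers=set(range(1, num_layers - 1)),
    format=QuantFormat.NVFP4,
)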

bias_dtype

bias_dtype: DType | None = None

The DType of bias weights.

can_use_fused_mlp

can_use_fused_mlp: bool = False

Whether the quantization scales can be used with fused MLP operations.

embedding_output_dtype

embedding_output_dtype: DType | None = None

The DType of the output from the embedding layer.

format

format: QuantFormat

The QuantFormat identifying the quantization format.

input_scale

input_scale: InputScaleSpec

InputScaleSpec for input activation scaling.

is_dynamic

property is_dynamic: bool

True if the config's input scale is dynamic (computed from the activations at runtime).
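
For instance, a weight-loading path might branch on this property (a minimal sketch; pick_scales_source is a hypothetical helper):

from max.nn import QuantConfig

def pick_scales_source(config: QuantConfig) -> str:
    # Static input scales are precomputed (e.g., during calibration)
    # and loaded from the checkpoint; dynamic scales are derived from
    # the activations at runtime.
    return "checkpoint" if config.is_static else "runtime"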

is_fp4

property is_fp4: bool

True if this config represents any FP4 variant (NVFP4 or MXFP4).

is_mxfp4

property is_mxfp4: bool

True if this config represents MXFP4 quantization.

is_nvfp4

property is_nvfp4: bool

True if this config represents NVFP4 quantization in the NVIDIA ModelOpt style.
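
These predicates can be used to dispatch on the quantization format (a minimal sketch; describe_format is a hypothetical helper):

from max.nn import QuantConfig

def describe_format(config: QuantConfig) -> str:
    # is_fp4 covers both FP4 variants; the specific predicates
    # distinguish between them.
    if not config.is_fp4:
        return "not FP4"
    return "nvfp4" if config.is_nvfp4 else "mxfp4"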

is_static

property is_static: bool

True if the config's input scale is static (precomputed rather than derived at runtime).

mlp_quantized_layers

mlp_quantized_layers: set[int]

Set of layer indices with quantized MLPs.

MLPs are quantized on an all-or-nothing basis per layer: either all of gate_proj, down_proj, and up_proj are quantized, or all three remain in bfloat16.
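
Whether a given decoder layer should build quantized MLP projections then reduces to set membership (a minimal sketch; use_quantized_mlp is a hypothetical helper):

from max.nn import QuantConfig

def use_quantized_mlp(config: QuantConfig, layer_idx: int) -> bool:
    # gate_proj, down_proj, and up_proj for this layer are all
    # quantized if and only if the layer index is in the set.
    return layer_idx in config.mlp_quantized_layers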

quantized_scales_type()

quantized_scales_type(quantized_shape, device_ref)

The TensorType of the scales tensor after dynamic quantization.

Parameters:

quantized_shape – The shape of the quantized tensor.

device_ref – The device on which the resulting scales tensor resides.

Return type:

TensorType
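
A hedged sketch of calling this method, assuming quantized_shape accepts a plain shape tuple and that the tensor lives on the default GPU:

from max.graph import DeviceRef

# Type of the scales produced when dynamically quantizing a
# [1024, 4096] activation tensor (shape-tuple form assumed here).
scales_type = config.quantized_scales_type((1024, 4096), DeviceRef.GPU())
print(scales_type.dtype, scales_type.shape)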

scales_granularity_mnk

property scales_granularity_mnk: tuple[int, int, int]

The weight and input scale granularities on the M, N, and K axes.
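
For example, a kernel-selection path might unpack the granularities (a minimal sketch):

# Elements covered by a single scale along the M, N, and K axes
# of the matmul.
m_gran, n_gran, k_gran = config.scales_granularity_mnk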

scales_pre_interleaved

scales_pre_interleaved: bool = False

Whether weight scales in the checkpoint are already stored in the 5D TCGEN-interleaved layout expected by the FP4 matmul kernel (NVFP4 only). Even when pre-interleaved, the scales are typically stored flattened to 2D [M, K//16] in the checkpoint.

weight_scale

weight_scale: WeightScaleSpec

WeightScaleSpec for weight scaling.