Python class

QuantizationEncoding

class max.graph.quantization.QuantizationEncoding(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

Quantization encodings supported by MAX Graph.

Quantization reduces the precision of neural network weights to decrease memory usage and potentially improve inference speed. Each encoding represents a different compression method with specific trade-offs between model size, accuracy, and computational efficiency. These encodings are commonly used with pre-quantized model checkpoints (especially GGUF format) or can be applied during weight allocation.

The following example shows how to create a quantized weight using the Q4_K encoding:

from max.dtype import DType
from max.graph import DeviceRef, Weight
from max.graph.quantization import QuantizationEncoding

# Select the Q4_K encoding and attach it to a weight declaration.
encoding = QuantizationEncoding.Q4_K
quantized_weight = Weight(
    name="linear.weight",
    dtype=DType.uint8,
    shape=[4096, 4096],
    device=DeviceRef.GPU(0),
    quantization_encoding=encoding,
)

MAX supports several quantization formats optimized for different use cases.

Q4_0

Q4_0 = 'Q4_0'

Basic 4-bit quantization with 32 elements per block.

Q4_K

Q4_K = 'Q4_K'

4-bit K-quantization with 256 elements per block.

Q5_K

Q5_K = 'Q5_K'

5-bit K-quantization with 256 elements per block.

Q6_K

Q6_K = 'Q6_K'

6-bit K-quantization with 256 elements per block.

GPTQ

GPTQ = 'GPTQ'

Group-wise Post-Training Quantization for large language models.
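
Because each member is a standard Python Enum whose value is its name string, an encoding can also be looked up dynamically. A minimal sketch of that pattern:

from max.graph.quantization import QuantizationEncoding

# Look up an encoding from its string value (ordinary Enum lookup).
encoding = QuantizationEncoding("Q4_K")
assert encoding is QuantizationEncoding.Q4_K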

block_parameters

property block_parameters: BlockParameters

Gets the block parameters for this quantization encoding.

Returns:

The parameters describing how elements are organized and encoded in blocks for this quantization encoding.

Return type:

BlockParameters
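
For example, the block layout for an encoding can be read straight off the member. A minimal sketch; the printed representation is assumed to describe the element grouping and encoded byte size, which are also exposed directly through the elements_per_block and block_size properties below:

from max.graph.quantization import QuantizationEncoding

# Inspect how Q4_K groups elements and encodes each block.
params = QuantizationEncoding.Q4_K.block_parameters
print(params)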

block_size

property block_size: int

Number of bytes in encoded representation of block.

All quantization types currently supported by MAX Graph are block-based: groups of a fixed number of elements are formed, and each group is quantized together into a fixed-size output block. This value is the number of bytes resulting after encoding a single block.

Returns:

Size in bytes of each encoded quantization block.

Return type:

int
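
For example, the encoded block size can be compared across encodings. A minimal sketch (the exact byte counts are implementation-defined and not assumed here):

from max.graph.quantization import QuantizationEncoding

# Print the encoded size in bytes of one block for several encodings.
for encoding in (
    QuantizationEncoding.Q4_0,
    QuantizationEncoding.Q4_K,
    QuantizationEncoding.Q6_K,
):
    print(encoding.name, encoding.block_size)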

elements_per_block

property elements_per_block: int

Number of elements per block.

All quantization types currently supported by MAX Graph are block-based: groups of a fixed number of elements are formed, and each group is quantized together into a fixed-size output block. This value is the number of elements gathered into a block.

Returns:

Number of original tensor elements in each quantized block.

Return type:

int
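
Together with block_size, this makes it easy to estimate the encoded footprint of a weight. A minimal sketch, assuming the element count divides evenly into blocks:

from max.graph.quantization import QuantizationEncoding

encoding = QuantizationEncoding.Q4_K
num_elements = 4096 * 4096  # elements in a [4096, 4096] weight

# Number of blocks needed, then the total encoded size in bytes.
num_blocks = num_elements // encoding.elements_per_block
total_bytes = num_blocks * encoding.block_size
print(f"{encoding.name}: {total_bytes} bytes for {num_elements} elements")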

is_gguf

property is_gguf: bool

Checks if this quantization encoding is compatible with GGUF format.

GGUF is a file format for storing large language models together with their quantized weights.

Returns:

True if this encoding is compatible with GGUF, False otherwise.

Return type:

bool
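
For example, the check can be applied across all members to see which encodings can be loaded from GGUF checkpoints. A minimal sketch:

from max.graph.quantization import QuantizationEncoding

# Report GGUF compatibility for every encoding.
for encoding in QuantizationEncoding:
    print(encoding.name, encoding.is_gguf)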

name

property name: str

Gets the lowercase name of the quantization encoding.

Returns:

Lowercase string representation of the quantization encoding.

Return type:

str
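
Note that this differs from the default Enum behavior of returning the member name verbatim. A minimal sketch; the output is assumed to be the lowercased member name:

from max.graph.quantization import QuantizationEncoding

# Assumed to print the lowercase form of the member name, e.g. "q4_k".
print(QuantizationEncoding.Q4_K.name)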