Python class

QuantizationEncoding

class max.graph.quantization.QuantizationEncoding(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

Quantization encodings supported by MAX Graph.

Quantization reduces the precision of neural network weights to decrease memory usage and potentially improve inference speed. Each encoding represents a different compression method with specific trade-offs between model size, accuracy, and computational efficiency. These encodings are commonly used with pre-quantized model checkpoints (especially GGUF format) or can be applied during weight allocation.

The following example shows how to create a quantized weight using the Q4_K encoding:

from max.dtype import DType
from max.graph import DeviceRef, Weight
from max.graph.quantization import QuantizationEncoding

# Select the Q4_K encoding and attach it to a weight declaration.
encoding = QuantizationEncoding.Q4_K
quantized_weight = Weight(
    name="linear.weight",
    dtype=DType.uint8,
    shape=[4096, 4096],
    device=DeviceRef.GPU(0),
    quantization_encoding=encoding,
)

MAX supports several quantization formats optimized for different use cases.

Q4_0

Q4_0 = 'Q4_0'

Basic 4-bit quantization with 32 elements per block.

Q4_K

Q4_K = 'Q4_K'

4-bit K-quantization with 256 elements per block.

Q5_K

Q5_K = 'Q5_K'

5-bit K-quantization with 256 elements per block.

Q6_K

Q6_K = 'Q6_K'

6-bit K-quantization with 256 elements per block.

GPTQ

GPTQ = 'GPTQ'

Group-wise Post-Training Quantization for large language models.
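
Because each member is a standard Python Enum whose value is its name string, an encoding can also be looked up dynamically. A minimal sketch of that pattern:

from max.graph.quantization import QuantizationEncoding

# Look up an encoding from its string value (ordinary Enum lookup).
encoding = QuantizationEncoding("Q4_K")
assert encoding is QuantizationEncoding.Q4_K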

block_parameters

property block_parameters: BlockParameters

Gets the block parameters for this quantization encoding.

Returns:

The parameters describing how elements are organized and encoded in blocks for this quantization encoding.

Return type:

BlockParameters
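
For example, the block layout for an encoding can be read straight off the member. A minimal sketch; the printed representation is assumed to describe the element grouping and encoded byte size, which are also exposed directly through the elements_per_block and block_size properties below:

from max.graph.quantization import QuantizationEncoding

# Inspect how Q4_K groups elements and encodes each block.
params = QuantizationEncoding.Q4_K.block_parameters
print(params)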

block_size

property block_size: int

Number of bytes in encoded representation of block.

All quantization types currently supported by MAX Graph are block-based: groups of a fixed number of elements are formed, and each group is quantized together into a fixed-size output block. This value is the number of bytes resulting after encoding a single block.

Returns:

Size in bytes of each encoded quantization block.

Return type:

int
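
For example, the encoded block size can be compared across encodings. A minimal sketch (the exact byte counts are implementation-defined and not assumed here):

from max.graph.quantization import QuantizationEncoding

# Print the encoded size in bytes of one block for several encodings.
for encoding in (
    QuantizationEncoding.Q4_0,
    QuantizationEncoding.Q4_K,
    QuantizationEncoding.Q6_K,
):
    print(encoding.name, encoding.block_size)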

elements_per_block

property elements_per_block: int

Number of elements per block.

All quantization types currently supported by MAX Graph are block-based: groups of a fixed number of elements are formed, and each group is quantized together into a fixed-size output block. This value is the number of elements gathered into a block.

Returns:

Number of original tensor elements in each quantized block.

Return type:

int
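
Together with block_size, this makes it easy to estimate the encoded footprint of a weight. A minimal sketch, assuming the element count divides evenly into blocks:

from max.graph.quantization import QuantizationEncoding

encoding = QuantizationEncoding.Q4_K
num_elements = 4096 * 4096  # elements in a [4096, 4096] weight

# Number of blocks needed, then the total encoded size in bytes.
num_blocks = num_elements // encoding.elements_per_block
total_bytes = num_blocks * encoding.block_size
print(f"{encoding.name}: {total_bytes} bytes for {num_elements} elements")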

is_gguf

property is_gguf: bool

Checks if this quantization encoding is compatible with GGUF format.

GGUF is a file format for storing large language models together with their quantized weights.

Returns:

True if this encoding is compatible with GGUF, False otherwise.

Return type:

bool
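
For example, the check can be applied across all members to see which encodings can be loaded from GGUF checkpoints. A minimal sketch:

from max.graph.quantization import QuantizationEncoding

# Report GGUF compatibility for every encoding.
for encoding in QuantizationEncoding:
    print(encoding.name, encoding.is_gguf)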

name

property name: str

Gets the lowercase name of the quantization encoding.

Returns:

Lowercase string representation of the quantization encoding.

Return type:

str
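
Note that this differs from the default Enum behavior of returning the member name verbatim. A minimal sketch; the output is assumed to be the lowercased member name:

from max.graph.quantization import QuantizationEncoding

# Assumed to print the lowercase form of the member name, e.g. "q4_k".
print(QuantizationEncoding.Q4_K.name)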