QuantizationEncoding
class max.graph.quantization.QuantizationEncoding(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
Bases: Enum
Quantization encodings supported by MAX Graph.
Quantization reduces the precision of neural network weights to decrease memory usage and potentially improve inference speed. Each encoding represents a different compression method with specific trade-offs between model size, accuracy, and computational efficiency. These encodings are commonly used with pre-quantized model checkpoints (especially GGUF format) or can be applied during weight allocation.
The following example shows how to create a quantized weight using the Q4_K encoding:
from max.dtype import DType
from max.graph import DeviceRef, Weight
from max.graph.quantization import QuantizationEncoding

encoding = QuantizationEncoding.Q4_K
quantized_weight = Weight(
    name="linear.weight",
    dtype=DType.uint8,
    shape=[4096, 4096],
    device=DeviceRef.GPU(0),
    quantization_encoding=encoding,
)

MAX supports several quantization formats optimized for different use cases.
Q4_0
Basic 4-bit quantization with 32 elements per block.
Q4_K
4-bit K-quantization with 256 elements per block.
Q5_K
5-bit K-quantization with 256 elements per block.
Q6_K
6-bit K-quantization with 256 elements per block.
GPTQ
Group-wise Post-Training Quantization for large language models.
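To make the trade-offs concrete, the following sketch computes the effective bits per weight for each GGUF block format. The block layouts (elements per block and encoded bytes per block) are assumed values taken from the GGUF/GGML specification, not queried from MAX:

```python
# Assumed block layouts per the GGUF/GGML spec: (elements_per_block, block_size_bytes).
GGUF_BLOCKS = {
    "Q4_0": (32, 18),    # f16 scale + 32 packed 4-bit values
    "Q4_K": (256, 144),
    "Q5_K": (256, 176),
    "Q6_K": (256, 210),
}

for name, (elems, nbytes) in GGUF_BLOCKS.items():
    bits_per_weight = nbytes * 8 / elems
    ratio = 32 / bits_per_weight  # relative to float32
    print(f"{name}: {bits_per_weight:.2f} bits/weight, {ratio:.1f}x smaller than float32")
```

Note that the K-quantization formats spend a few extra bits per block on per-sub-block scales, which is why Q4_K costs 4.5 bits per weight rather than a flat 4.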
GPTQ = 'GPTQ'
Q4_0 = 'Q4_0'
Q4_K = 'Q4_K'
Q5_K = 'Q5_K'
Q6_K = 'Q6_K'
block_parameters
property block_parameters: BlockParameters
Gets the block parameters for this quantization encoding.
Returns:
The parameters describing how elements are organized and encoded in blocks for this quantization encoding.
Return type:
BlockParameters
block_size
property block_size: int
Number of bytes in encoded representation of block.
All quantization types currently supported by MAX Graph are block-based: groups of a fixed number of elements are formed, and each group is quantized together into a fixed-size output block. This value is the number of bytes resulting after encoding a single block.
Returns:
Size in bytes of each encoded quantization block.
Return type:
int
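Together, block_size and elements_per_block determine how much memory a quantized tensor occupies. The helper below is a hypothetical sketch (not part of the MAX API) that computes the encoded size of a tensor; the Q4_K values used in the example (256 elements per 144-byte block) are assumed from the GGUF spec:

```python
# Hypothetical helper: estimates the encoded size of a quantized tensor from
# the values a QuantizationEncoding would report via elements_per_block and
# block_size. Not part of the MAX API.
def quantized_nbytes(shape, elements_per_block, block_size):
    n = 1
    for dim in shape:
        n *= dim
    num_blocks = -(-n // elements_per_block)  # ceiling division
    return num_blocks * block_size

# A [4096, 4096] weight under Q4_K (256 elements -> 144 bytes, per GGUF):
nbytes = quantized_nbytes([4096, 4096], elements_per_block=256, block_size=144)
print(nbytes)  # 9437184 bytes (~9 MiB, vs. 64 MiB for the same tensor in float32)
```

The roughly 7x reduction relative to float32 matches the 4.5 bits per weight that Q4_K's block layout implies.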
elements_per_block
property elements_per_block: int
Number of elements per block.
All quantization types currently supported by MAX Graph are block-based: groups of a fixed number of elements are formed, and each group is quantized together into a fixed-size output block. This value is the number of elements gathered into a block.
Returns:
Number of original tensor elements in each quantized block.
Return type:
int
is_gguf
property is_gguf: bool
Checks if this quantization encoding is compatible with GGUF format.
GGUF is a format for storing large language models and compatible quantized weights.
Returns:
True if this encoding is compatible with GGUF, False otherwise.
Return type:
bool
name
property name: str
Gets the lowercase name of the quantization encoding.
Returns:
Lowercase string representation of the quantization encoding.
Return type:
str