
Python module

quantization

APIs to quantize graph tensors.

This package includes a comprehensive set of tools for working with quantized models in MAX Graph. It defines supported quantization encodings, configuration parameters that control the quantization process, and block parameter specifications for different quantization formats.

The module supports quantization formats at 4-bit, 5-bit, and 6-bit precision with different encoding schemes. It also supports GGUF-compatible formats for interoperability with other frameworks.

BlockParameters

class max.graph.quantization.BlockParameters(elements_per_block: int, block_size: int)

Parameters describing the structure of a quantization block.

Block-based quantization stores elements in fixed-size blocks. Each block contains a specific number of elements in a compressed format.

block_size

block_size: int

elements_per_block

elements_per_block: int
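To make the two fields concrete, here is a standalone sketch that mirrors the class (it does not import `max` itself) and derives the average encoded bytes per element. The Q4_0 numbers follow the standard GGML layout (32 elements per 18-byte block); treat them as illustrative rather than authoritative for MAX.

```python
from dataclasses import dataclass

@dataclass
class BlockParameters:
    """Illustrative stand-in for max.graph.quantization.BlockParameters."""
    elements_per_block: int
    block_size: int  # bytes per encoded block

def bytes_per_element(params: BlockParameters) -> float:
    """Average encoded bytes per original tensor element."""
    return params.block_size / params.elements_per_block

# Standard GGML Q4_0 layout: 32 elements packed into an 18-byte block
# (a 2-byte fp16 scale plus 32 four-bit values).
q4_0 = BlockParameters(elements_per_block=32, block_size=18)
print(bytes_per_element(q4_0))      # 0.5625 bytes per element
print(4 / bytes_per_element(q4_0))  # roughly 7.1x smaller than float32
```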

QuantizationConfig

class max.graph.quantization.QuantizationConfig(quant_method: str, bits: int, group_size: int, desc_act: bool = False, sym: bool = False)

Configuration for specifying quantization parameters that affect inference.

These parameters control how tensor values are quantized, including the method, bit precision, grouping, and other characteristics that affect the trade-off between model size, inference speed, and accuracy.

bits

bits: int

desc_act

desc_act: bool = False

group_size

group_size: int

quant_method

quant_method: str

sym

sym: bool = False
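A brief sketch of how such a configuration might be constructed, using a standalone dataclass that mirrors the documented signature (not the `max` import itself). The specific values shown (4-bit GPTQ with 128-element groups) are a common choice in practice but are illustrative, not prescriptive.

```python
from dataclasses import dataclass

@dataclass
class QuantizationConfig:
    """Illustrative stand-in for max.graph.quantization.QuantizationConfig."""
    quant_method: str
    bits: int
    group_size: int
    desc_act: bool = False  # whether activation order descends by importance
    sym: bool = False       # symmetric vs. asymmetric quantization

# A typical GPTQ-style setup: 4-bit weights quantized in groups of 128.
config = QuantizationConfig(quant_method="gptq", bits=4, group_size=128)
print(config.bits, config.group_size)  # 4 128
print(config.desc_act, config.sym)     # False False (the defaults)
```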

QuantizationEncoding

class max.graph.quantization.QuantizationEncoding(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Quantization encodings supported by MAX Graph.

Each encoding represents a different method of quantizing model weights with specific trade-offs between compression ratio, accuracy, and computational efficiency.

GPTQ

GPTQ = 'GPTQ'

Q4_0

Q4_0 = 'Q4_0'

Q4_K

Q4_K = 'Q4_K'

Q5_K

Q5_K = 'Q5_K'

Q6_K

Q6_K = 'Q6_K'
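Because each member's value is its own name, encodings can be recovered from metadata strings by value lookup. The sketch below uses a standalone `Enum` mirroring the documented members, rather than importing `max` directly.

```python
from enum import Enum

class QuantizationEncoding(Enum):
    """Illustrative stand-in for max.graph.quantization.QuantizationEncoding."""
    GPTQ = "GPTQ"
    Q4_0 = "Q4_0"
    Q4_K = "Q4_K"
    Q5_K = "Q5_K"
    Q6_K = "Q6_K"

# Look a member up by value, e.g. when parsing a checkpoint's metadata:
encoding = QuantizationEncoding("Q4_K")
print(encoding)        # QuantizationEncoding.Q4_K
print(encoding.value)  # 'Q4_K'
```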

block_parameters

property block_parameters: BlockParameters

Gets the block parameters for this quantization encoding.

  • Returns:

    The parameters describing how elements are organized and encoded in blocks for this quantization encoding.

  • Return type:

    BlockParameters

block_size

property block_size: int

Number of bytes in the encoded representation of a block.

All quantization types currently supported by MAX Graph are block-based: groups of a fixed number of elements are formed, and each group is quantized together into a fixed-size output block. This value is the number of bytes resulting after encoding a single block.

  • Returns:

    Size in bytes of each encoded quantization block.

  • Return type:

    int

elements_per_block

property elements_per_block: int

Number of elements per block.

All quantization types currently supported by MAX Graph are block-based: groups of a fixed number of elements are formed, and each group is quantized together into a fixed-size output block. This value is the number of elements gathered into a block.

  • Returns:

    Number of original tensor elements in each quantized block.

  • Return type:

    int
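Together, `elements_per_block` and `block_size` determine the encoded size of a tensor: partial blocks round up to a whole block. The sketch below computes this for a hypothetical 4096 x 4096 weight matrix, assuming Q4_K's standard GGUF layout of 256-element super-blocks encoded in 144 bytes each (an assumption for illustration; read the properties at runtime for authoritative values).

```python
import math

def quantized_size_bytes(num_elements: int,
                         elements_per_block: int,
                         block_size: int) -> int:
    """Encoded size of a tensor, rounding up to whole blocks."""
    num_blocks = math.ceil(num_elements / elements_per_block)
    return num_blocks * block_size

# Hypothetical 4096 x 4096 weight matrix under the standard GGUF Q4_K
# layout: 256 elements per super-block, 144 bytes per encoded block.
n = 4096 * 4096
print(quantized_size_bytes(n, 256, 144))  # 9437184 bytes (9 MiB)
print(n * 4)                              # 67108864 bytes as float32
```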

is_gguf

property is_gguf: bool

Checks if this quantization encoding is compatible with GGUF format.

GGUF is a format for storing large language models and compatible quantized weights.

  • Returns:

    True if this encoding is compatible with GGUF, False otherwise.

  • Return type:

    bool
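As a rough mental model, the Q*-family encodings originate in the GGML/GGUF ecosystem while GPTQ does not; the hypothetical helper below encodes that assumption for illustration only. In real code, query the `is_gguf` property itself rather than hard-coding a set.

```python
# Assumed GGUF-compatible encodings (an illustration, not the property's
# actual implementation): the Q* block formats come from GGML/GGUF.
GGUF_ENCODINGS = {"Q4_0", "Q4_K", "Q5_K", "Q6_K"}

def is_gguf(encoding_name: str) -> bool:
    """Hypothetical mirror of the is_gguf property, keyed by name."""
    return encoding_name in GGUF_ENCODINGS

print(is_gguf("Q4_K"))  # True
print(is_gguf("GPTQ"))  # False
```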