Quantization
MAX supports both executing pre-quantized models and quantizing models through the Graph API. Quantization is an optimization technique that reduces the numeric precision of weights in a model using various quantization encodings. This API does not quantize an entire model. Like the MAX Graph API itself, it is a low-level API for engineers who want to quantize specific weights while building high-performance graphs in a systems programming language, specifically Mojo.
For example, a model trained with float32 weights can often use lower-precision types such as int8 or int4. That is, instead of storing each scalar value with 32 bits, you can use just 8 or 4 bits. This reduces the computational and memory demands during inference, which makes the model faster and compatible with more systems.
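To make the float32-to-int8 idea concrete, here is a minimal sketch in plain Python of symmetric per-tensor quantization. It is a conceptual illustration only and does not use the MAX Graph API; the function names `quantize_int8` and `dequantize_int8` and the sample weights are made up for this example.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One shared scale for the whole tensor; fall back to 1.0 for all-zero input.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 1.0, -0.98]  # hypothetical float32 weights
quantized, scale = quantize_int8(weights)
restored = dequantize_int8(quantized, scale)

# Storage: 4 bytes per float32 value versus 1 byte per int8 value,
# plus 4 bytes for the single float32 scale.
float32_bytes = 4 * len(weights)
int8_bytes = 1 * len(weights) + 4

# The rounding error per value is at most half of one quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(quantized, float32_bytes, int8_bytes, max_error)
```

The trade-off is visible in `max_error`: each restored value can differ from the original by up to half the scale, in exchange for roughly a 4x reduction in storage for large tensors.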
Overview
When used properly, quantization does not significantly affect the model accuracy. There are several different quantization encodings that provide different levels of precision and encoding formats, each with its own trade-offs that may work well for some models or graph operations ("ops") but not others. Some models also work great with a mixture of quantization types, so that only certain ops perform low-precision calculations while others retain high precision.
To support this mixed-precision strategy, the quantization API in MAX Graph is declarative. That means you can quantize the weights in your model explicitly as you see fit, rather than pick one quantization format for the whole model. You can quantize different weights with different encodings, write custom ops that understand your quantizations, and even implement your own quantization encodings.
The API provides several options for implementing your quantization strategy:
- Mix quantization formats within the same model
- Apply custom quantization encodings
- Use pre-built encodings (Q4_0, Q4_K, Q6_K)
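Encodings such as Q4_0 belong to the GGML family of block-based formats, where weights are split into small blocks and each block stores its own scale alongside low-bit integer codes. The sketch below shows that general idea in plain Python. It is a simplified illustration, not the exact bit layout of Q4_0, Q4_K, or Q6_K, and not the MAX implementation; the block size of 32, the 16-bit scale used in the size estimate, and all function names are assumptions made for this example.

```python
BLOCK_SIZE = 32  # assumption: weights are grouped into blocks of 32 values


def quantize_block_4bit(block):
    """Encode one block as (scale, codes), where each code fits in 4 bits (0..15)."""
    max_abs = max(abs(x) for x in block)
    # Map the largest magnitude in the block to +/-7 steps.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    # Shift by 8 so that signed steps become unsigned 4-bit codes.
    codes = [max(0, min(15, round(x / scale) + 8)) for x in block]
    return scale, codes


def dequantize_block_4bit(scale, codes):
    """Recover approximate floats from a block's scale and codes."""
    return [(c - 8) * scale for c in codes]


def pack_nibbles(codes):
    """Pack two 4-bit codes into each byte (expects an even number of codes)."""
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


# Hypothetical block of 32 weights spread across [-1.0, 1.0).
block = [(i - 16) / 16 for i in range(BLOCK_SIZE)]
scale, codes = quantize_block_4bit(block)
packed = pack_nibbles(codes)
restored = dequantize_block_4bit(scale, codes)

# Size estimate: 16 bytes of packed codes plus an assumed 2-byte scale,
# versus 4 bytes per value when stored as float32.
quantized_bytes = len(packed) + 2
float32_bytes = 4 * BLOCK_SIZE
print(quantized_bytes, float32_bytes)
```

Because every block carries its own scale, a block of small weights keeps fine resolution even when other blocks in the same tensor contain large values.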
For example, to quantize your model weights, follow these steps:
- Import your trained model weights
- Choose your quantization encodings
- Apply quantization using Graph.quantize()
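The steps above can be sketched conceptually in plain Python as a per-weight encoding plan. This does not call `Graph.quantize()` or reproduce its signature, which is documented in the API reference; the weight names, the `encode_int8` and `keep_float32` helpers, and the dictionary layout are all hypothetical.

```python
def encode_int8(values):
    """Hypothetical encoder: symmetric int8 codes with one scale per weight."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    data = [max(-127, min(127, round(v / scale))) for v in values]
    return {"encoding": "int8", "scale": scale, "data": data}


def keep_float32(values):
    """Hypothetical pass-through for weights that should stay in high precision."""
    return {"encoding": "float32", "data": list(values)}


# Step 1: import trained weights (a tiny stand-in for a real checkpoint).
weights = {
    "embedding": [0.25, -0.75, 0.5, 0.1],
    "attention.wq": [1.5, -0.3, 0.9, -1.2],
    "norm.scale": [1.0, 0.98, 1.02, 0.99],
}

# Step 2: choose an encoding for each weight explicitly.
plan = {
    "embedding": encode_int8,
    "attention.wq": encode_int8,
    "norm.scale": keep_float32,  # small, sensitive weight left unquantized
}

# Step 3: apply the chosen encoding to each weight.
quantized = {name: plan[name](values) for name, values in weights.items()}

for name, entry in quantized.items():
    print(name, entry["encoding"])
```

Keeping the plan as explicit data reflects the declarative, mixed-precision approach: each weight gets the encoding you pick for it, rather than one format for the whole model.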
Learn More
To learn more about the MAX Graph quantization API, read the API reference, or see the sample implementation, which shows how we quantized a 15M-parameter Llama 2 model.