Mojo struct
Q4_0Encoding
The Q4_0 quantization encoding.
Q4_0 is a block quantization scheme originally designed for GGML in which each element (number) is reduced to an unsigned, fixed-point, 4-bit value. Multiple quantized elements are packed together in a block, all using the same float16 scale.
The packing scheme requires that the size of the innermost dimension is a multiple of 32. When the tensor is quantized to Q4_0, each block of 32 scalar values is packed into 18 bytes. The first two bytes hold the float16 quantization scale, and the remaining 16 bytes hold the 32 values (each byte holds two 4-bit values).
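The block layout described above can be sketched in plain Python. This is a minimal illustration, not the MAX implementation: the function names are hypothetical, and the rounding rule and nibble order follow the GGML reference convention (scale chosen so the largest-magnitude element maps to -8, element i in the low nibble and element i + 16 in the high nibble of byte i), which is an assumption about what this encoding does internally.

```python
import struct

BLOCK_SIZE = 32  # scalars per Q4_0 block (from the description above)


def quantize_block(values):
    """Pack 32 floats into 18 bytes: a float16 scale plus 16 bytes of nibbles.

    Hypothetical helper for illustration; rounding and nibble order are
    assumed to follow the GGML reference convention.
    """
    assert len(values) == BLOCK_SIZE
    # Assumption: the element with the largest magnitude is mapped to -8.
    extreme = max(values, key=abs)
    scale = extreme / -8.0
    inv = 1.0 / scale if scale != 0.0 else 0.0
    # Offset by 8 so each value is stored as an unsigned integer in 0..15.
    q = [min(15, max(0, int(v * inv + 8.5))) for v in values]
    # Assumption: byte i holds element i (low nibble) and element i + 16 (high nibble).
    packed = bytes(q[i] | (q[i + 16] << 4) for i in range(16))
    # "<e" is a little-endian IEEE 754 half-precision float (2 bytes).
    return struct.pack("<e", scale) + packed


def dequantize_block(block):
    """Recover 32 approximate floats from one 18-byte block."""
    assert len(block) == 18
    (scale,) = struct.unpack("<e", block[:2])
    low = [(b & 0x0F) - 8 for b in block[2:]]
    high = [(b >> 4) - 8 for b in block[2:]]
    return [scale * q for q in low + high]
```

Round-tripping a block through these two helpers returns values that differ from the originals by at most roughly one quantization step (the scale), which is the precision cost of storing each element in 4 bits.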
Because the quantized data is stored in this packed format, a Q4_0 tensor does not currently print as float values at runtime; it is just a bag of bits in uint8 format.
Implemented traits
AnyType, QuantizationEncoding
Methods
quantize
static quantize(tensor: Tensor[float32]) -> Tensor[uint8]
Quantizes the full-precision tensor to Q4_0.
Args:
- tensor (Tensor[float32]): Full-precision tensor to quantize. The size of the innermost dimension of the tensor must be a multiple of 32.
Returns:
Quantized Q4_0 tensor. The tensor datatype is uint8 because this is simply a bytes buffer. Each scalar is actually stored with 4 bits.
Raises:
If the last dimension size is not a multiple of 32.
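As a rough aid for reasoning about the size of the returned byte buffer, the following Python sketch applies the shape check and the 18-bytes-per-32-values ratio described on this page. The function name is hypothetical, and the sketch only computes a total byte count; it makes no claim about the exact shape of the uint8 tensor that quantize returns.

```python
import math


def q4_0_num_bytes(shape):
    """Total bytes needed to hold a tensor of the given shape in Q4_0.

    Hypothetical helper for illustration; it mirrors the documented
    constraint that the last dimension must be evenly divisible by 32.
    """
    if shape[-1] % 32 != 0:
        raise ValueError(
            f"last dimension {shape[-1]} is not evenly divisible by 32"
        )
    num_blocks = math.prod(shape) // 32
    # Each block: 2 bytes of float16 scale + 16 bytes of packed 4-bit values.
    return num_blocks * 18


print(q4_0_num_bytes((4, 64)))  # 8 blocks * 18 bytes = 144
```

For comparison, the same (4, 64) tensor in float32 occupies 1024 bytes, so the packed form is about 7 times smaller.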
id
static id() -> String
Identifier for the Q4_0 quantized encoding.