Mojo struct

Q4_0Encoding

The Q4_0 quantization encoding.

Q4_0 is a block quantization scheme originally designed for GGML in which each element (number) is reduced to an unsigned, fixed-point, 4-bit value. Multiple quantized elements are packed together in a block, all using the same float16 scale.

The packing scheme requires that the innermost dimension is a multiple of 32. When the tensor is quantized to Q4_0, each block of 32 scalar values is packed into 18 bytes. The first two bytes hold the float16 quantization scale, and the remaining 16 bytes hold the 32 quantized values (each byte holds two 4-bit values).
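The 18-byte block layout described above can be sketched in plain Python. This is an illustrative sketch, not this library's implementation: the helper name `quantize_block` is hypothetical, and the scale choice, rounding, nibble order, and little-endian float16 are assumptions borrowed from the GGML reference Q4_0 quantizer.

```python
import struct

BLOCK_SIZE = 32                      # scalars per block
BLOCK_BYTES = 2 + BLOCK_SIZE // 2    # float16 scale + 16 bytes of nibbles = 18


def quantize_block(values):
    """Pack 32 floats into one 18-byte Q4_0 block (hypothetical helper)."""
    if len(values) != BLOCK_SIZE:
        raise ValueError("a Q4_0 block holds exactly 32 values")

    # Assumption (GGML convention): the value with the largest magnitude
    # maps to the lowest 4-bit level, so the scale is extreme / -8.
    extreme = max(values, key=abs)
    scale = extreme / -8.0
    inv = 1.0 / scale if scale != 0.0 else 0.0

    half = BLOCK_SIZE // 2
    packed = bytearray(struct.pack("<e", scale))  # 2-byte float16 scale
    for j in range(half):
        # Shift by 8 so the stored 4-bit value is unsigned (0..15).
        lo = min(15, int(values[j] * inv + 8.5))
        hi = min(15, int(values[half + j] * inv + 8.5))
        # Assumption: element j goes in the low nibble, element j + 16
        # in the high nibble of the same byte.
        packed.append(lo | (hi << 4))
    return bytes(packed)


block = quantize_block([float(i - 16) for i in range(32)])
print(len(block))  # 18
```

Each block spends 18 × 8 = 144 bits on 32 values, so the effective cost is 4.5 bits per element rather than exactly 4.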

Because the quantized data is held in a special packed format, printing the tensor at runtime currently shows raw uint8 bytes rather than float values.
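To inspect the values behind such a byte buffer, the blocks have to be decoded by hand. The following plain-Python sketch shows one way to do that. The function name `dequantize_q4_0` is hypothetical, and the nibble order, the offset of 8, and the little-endian float16 scale are assumptions based on the GGML Q4_0 convention, not a documented guarantee of this API.

```python
import struct

BLOCK_SIZE = 32    # scalars per block
BLOCK_BYTES = 18   # 2-byte float16 scale + 16 bytes of packed nibbles


def dequantize_q4_0(buffer):
    """Decode a flat Q4_0 byte buffer into a list of floats (hypothetical helper)."""
    if len(buffer) % BLOCK_BYTES != 0:
        raise ValueError("buffer length must be a multiple of 18 bytes")

    half = BLOCK_SIZE // 2
    out = []
    for start in range(0, len(buffer), BLOCK_BYTES):
        # First two bytes of each block: the shared float16 scale.
        (scale,) = struct.unpack_from("<e", buffer, start)
        values = [0.0] * BLOCK_SIZE
        for j in range(half):
            byte = buffer[start + 2 + j]
            # Assumption: low nibble is element j, high nibble is element j + 16;
            # stored values are unsigned, so subtract 8 to recenter them.
            values[j] = ((byte & 0x0F) - 8) * scale
            values[half + j] = ((byte >> 4) - 8) * scale
        out.extend(values)
    return out


# One hand-built block: scale 2.0, first byte 0x80, the rest 0x88.
raw = struct.pack("<e", 2.0) + bytes([0x80] + [0x88] * 15)
decoded = dequantize_q4_0(raw)
print(decoded[0], decoded[16])  # -16.0 0.0
```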

Implemented traits

AnyType, QuantizationEncoding

Methods

quantize

static quantize(tensor: Tensor[float32]) -> Tensor[uint8]

Quantizes the full-precision tensor to Q4_0.

Args:

  • tensor (Tensor[float32]): Full-precision tensor to quantize. The innermost dimension of the tensor must be a multiple of 32.

Returns:

Quantized Q4_0 tensor. The tensor datatype is uint8 because the result is simply a byte buffer. Each scalar is actually stored in 4 bits.

Raises:

If the last dimension size is not a multiple of 32.
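The shape requirement and the resulting buffer size can be checked before calling quantize. The sketch below is plain Python with a hypothetical helper, `q4_0_packed_bytes`. It assumes only the figures stated on this page (blocks of 32 values, 18 bytes per block) and does not call the Mojo API.

```python
from math import prod

BLOCK_SIZE = 32    # scalars per block
BLOCK_BYTES = 18   # bytes per packed block


def q4_0_packed_bytes(shape):
    """Return the byte count of a Q4_0 buffer for a float32 tensor shape."""
    if not shape or shape[-1] % BLOCK_SIZE != 0:
        raise ValueError("innermost dimension must be a multiple of 32")
    # Blocks run along the innermost dimension only.
    blocks = prod(shape[:-1]) * (shape[-1] // BLOCK_SIZE)
    return blocks * BLOCK_BYTES


print(q4_0_packed_bytes((4096, 4096)))  # 9437184
```

For comparison, the same 4096 × 4096 tensor in float32 occupies 67,108,864 bytes, so the packed form is roughly 7.1 times smaller.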

id

static id() -> String

Identifier for the Q4_0 quantized encoding.
