Mojo struct
BFloat16Encoding
The bfloat16 quantization encoding.
Like float32, the bfloat16 encoding uses 8 bits to store the exponent, so it has the same numeric range as float32. However, it has just 7 bits for the mantissa (compared to float32's 23 bits), so it has less precision for the fractional part. For ML applications, this is often a better trade-off than traditional float16, which has a narrower numeric range because it uses only 5 bits for the exponent (though it offers better precision with 10 bits for the mantissa).
Because this holds the quantized data in a special packing format, it currently does not print float values at runtime; it's just a bag of bits in uint8 format.
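To make the bit layout concrete, here is a minimal sketch (not part of this API) that extracts the bfloat16 bit pattern from a float32 by keeping its top 16 bits. The helper name is hypothetical, and memory.bitcast usage may vary slightly across Mojo versions; note also that a production cast typically rounds to nearest even rather than truncating.

```mojo
from memory import bitcast

fn float32_to_bfloat16_bits(x: Float32) -> UInt16:
    # Reinterpret the float32 as raw bits, then keep the top 16 bits:
    # 1 sign bit, 8 exponent bits, and the top 7 of float32's 23 mantissa bits.
    var bits = bitcast[DType.uint32](x)
    # Truncation shown for illustration; real casts usually round to nearest even.
    return (bits >> 16).cast[DType.uint16]()

fn main():
    var x: Float32 = 3.5
    print(float32_to_bfloat16_bits(x))  # the two bytes stored per scalar
```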
Implemented traits
AnyType, QuantizationEncoding
Methods
quantize
static quantize(tensor: Tensor[float32]) -> Tensor[uint8]
Quantizes the full-precision input tensor to bfloat16.
Only supports quantizing from float16 and float32, using a direct elementwise cast.
Args:
- tensor (Tensor[float32]): Full-precision tensor to quantize to bfloat16.
Returns:
Quantized bfloat16 tensor. The tensor datatype is uint8 because this is simply a byte buffer. Each scalar is encoded into two bytes (16 bits).
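As a rough usage sketch (the import paths and tensor construction below are assumptions that may differ across MAX versions):

```mojo
# Import paths are assumptions; check the quantization docs for your MAX version.
from max.graph.quantization import BFloat16Encoding
from max.tensor import Tensor, TensorShape

fn main() raises:
    # Build a small float32 tensor to quantize.
    var weights = Tensor[DType.float32](TensorShape(4))
    for i in range(weights.num_elements()):
        weights[i] = Float32(i) * 0.5

    # Quantize to bfloat16; the result is a uint8 byte buffer holding
    # two bytes per original scalar (here, 8 bytes for 4 floats).
    var packed = BFloat16Encoding.quantize(weights)
    print(packed.num_elements())  # expect 2 * weights.num_elements()
```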
id
static id() -> String
Identifier for the bfloat16 quantized encoding.
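A quick sketch of reading the identifier, e.g. for dispatching on encoding names (same assumed import path as above; the exact string value is version-dependent):

```mojo
from max.graph.quantization import BFloat16Encoding  # assumed path

fn main():
    # Prints the encoding's string identifier.
    print(BFloat16Encoding.id())
```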