Mojo module
fp8_quantization
comptime values
logger
comptime logger = Logger(stdout, prefix=String(""), source_location=False)
Functions
- batched_quantize_dynamic_scaled_fp8: TileTensor primary implementation of batched dynamic scaled FP8 quantization.
- batched_quantize_fp8_kernel:
- blockwise_scaled_fp8_with_epilogue: Our sm100 blockwise scaled fp8 matmul kernel does not yet support fusion of elementwise operations. This is a temporary implementation that uses the sm100 blockwise scaled fp8 matmul kernel and dispatches a separate epilogue kernel to apply the elementwise operations. For non-B200 GPUs, we use the naive blockwise scaled fp8 matmul, which supports an epilogue natively. Callers must allocate `c`; when an `elementwise_lambda_fn` is supplied, the matmul result is written into `c` and then read back by the lambda (see the matmul-plus-epilogue sketch after this list).
- compute_scales_fp8_kernel: Compute per-group FP8 scale factors without quantizing.
- convert_e4m3fn_to_e4m3fnuz: Convert E4M3FN weights to E4M3FNUZ format for AMD GPU compatibility.
- convert_kernel_unified:
- matmul_dynamic_scaled_fp8: TileTensor primary implementation of dynamic scaled FP8 matmul.
- max_reduction_scale_kernel: Per-row strided max-|x| reduction into a global FP8 scale.
- naive_blockwise_scaled_fp8_grouped_matmul:
- naive_blockwise_scaled_fp8_grouped_matmul_kernel:
- naive_blockwise_scaled_fp8_matmul:
- naive_blockwise_scaled_fp8_matmul_kernel:
- quantize_dynamic_scaled_fp8: TileTensor primary implementation of dynamic scaled FP8 quantization (a conceptual sketch of the scaling arithmetic follows this list).
- quantize_fp8_kernel:
- quantize_fp8_kernel_per_tensor: Per-tensor FP8 quantize kernel.
- quantize_static_scaled_fp8: TileTensor implementation of static scaled FP8 quantization.
- quantize_tensor_dynamic_scaled_fp8: TileTensor primary implementation of dynamic scaled FP8 quantization.
- scaled_fp8_quant_unified:
- zero_scale_global_kernel:
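Several of the entries above (for example `quantize_dynamic_scaled_fp8`, `max_reduction_scale_kernel`, and `quantize_fp8_kernel_per_tensor`) revolve around the same idea: derive a scale from the max-|x| of the data so that the largest magnitude maps to the largest finite FP8 value, then divide by that scale before casting to FP8. The following is a minimal per-tensor sketch of that arithmetic, not this module's API; the helper name, the use of the E4M3 maximum of 448.0, and the example values are assumptions made only for illustration.

```mojo
fn per_tensor_scale(values: List[Float32]) -> Float32:
    # scale = max(|x|) / FP8_MAX, so the largest element lands on FP8_MAX.
    var fp8_max: Float32 = 448.0  # largest finite E4M3 value (assumption)
    var max_abs: Float32 = 1e-12  # avoid a zero scale for all-zero input
    for i in range(len(values)):
        var a = abs(values[i])
        if a > max_abs:
            max_abs = a
    return max_abs / fp8_max

fn main():
    var xs = List[Float32](0.5, -2.0, 3.25, -0.125)
    var scale = per_tensor_scale(xs)
    # Quantization would clamp x / scale and cast to an FP8 dtype;
    # here we just print the scaled values to show the range mapping.
    for i in range(len(xs)):
        print(xs[i] / scale)
```

Static scaled quantization follows the same division step but takes the scale as an input instead of reducing it from the data.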
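The separate-epilogue workaround described for `blockwise_scaled_fp8_with_epilogue` can be pictured as two passes over a caller-allocated output: the matmul first writes its result into `c`, and a second elementwise pass then reads `c` back and applies the fused operation. The sketch below is a plain CPU illustration of that contract, not the kernel itself; the shapes, the ReLU epilogue standing in for `elementwise_lambda_fn`, and every name in it are assumptions chosen for the example.

```mojo
fn main():
    # Hypothetical tiny problem size, row-major storage (illustration only).
    var m = 2
    var n = 2
    var k = 2
    var a = List[Float32](1.0, 2.0, 3.0, 4.0)  # m x k input
    var b = List[Float32](0.5, 0.0, 0.0, 0.5)  # k x n input
    var c = List[Float32]()                    # caller-allocated output buffer
    for _ in range(m * n):
        c.append(0.0)

    # Pass 1: the matmul result is written into c.
    for i in range(m):
        for j in range(n):
            var acc: Float32 = 0.0
            for p in range(k):
                acc += a[i * k + p] * b[p * n + j]
            c[i * n + j] = acc

    # Pass 2: a separate elementwise epilogue reads c back and updates it
    # in place (ReLU here, standing in for the supplied lambda).
    for idx in range(m * n):
        if c[idx] < 0.0:
            c[idx] = 0.0

    for idx in range(m * n):
        print(c[idx])
```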