Mojo struct

BlockwiseFP8TokenFormat

struct BlockwiseFP8TokenFormat[fp8_dtype: DType, scales_dtype: DType, output_layout: TensorLayout, scales_layout: TensorLayout, //, _hid_dim: Int, _top_k: Int, _alignment: Int = Int(0)]

Token format that quantizes the hidden state to FP8 with block-wise scales.

Each token is packed as its FP8 quantized values followed by per-group scale factors (one scale per group of 128 elements). This roughly halves wire bandwidth compared to BF16TokenFormat, with minimal accuracy loss.
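The "roughly half" figure can be checked with back-of-the-envelope arithmetic. The Python sketch below is only a model of the packing described above, not the struct's implementation: it assumes 1-byte FP8 values, 4-byte (float32) scales, and no alignment padding, and the helper names and the hidden dimension of 4096 are hypothetical. The real `token_size()` may differ, for example by rounding up to `alignment`.

```python
# Rough per-token payload arithmetic for block-wise FP8 vs. BF16.
# Assumptions (not taken from the API): 1-byte FP8 values, 4-byte float32
# scales, and no alignment padding of the wire buffer.

GROUP_SIZE = 128  # one scale per 128 elements, as documented above


def fp8_payload_bytes(hid_dim: int, scale_bytes: int = 4) -> int:
    """Bytes for the FP8 values plus one scale per group (unpadded)."""
    if hid_dim % GROUP_SIZE != 0:
        raise ValueError("hid_dim must be divisible by the group size (128)")
    n_groups = hid_dim // GROUP_SIZE
    return hid_dim + n_groups * scale_bytes


def bf16_payload_bytes(hid_dim: int) -> int:
    """Bytes for the same hidden state sent as BF16 (2 bytes per element)."""
    return hid_dim * 2


hid_dim = 4096                      # hypothetical hidden dimension
fp8 = fp8_payload_bytes(hid_dim)    # 4096 + 32 * 4 = 4224
bf16 = bf16_payload_bytes(hid_dim)  # 4096 * 2 = 8192
ratio = bf16 / fp8                  # a little under 2
```

Under these assumptions the scales add about 3% on top of the FP8 values, which is why the saving is slightly less than a full 2x.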

Parameters

fp8_dtype (DType): FP8 data type used for quantized values (e.g. DType.float8_e4m3fn).
scales_dtype (DType): Data type for the block-wise scale factors.
output_layout (TensorLayout): Layout of the FP8 output TileTensor.
scales_layout (TensorLayout): Layout of the scales output TileTensor.
_hid_dim (Int): Hidden dimension; must be divisible by the group size (128).
_top_k (Int): Number of experts each token is routed to.
_alignment (Int): Override for the byte alignment of the wire buffer; 0 selects get_device_alignment().

Fields

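output_tokens (BlockwiseFP8TokenFormat[_hid_dim, _top_k, _alignment].TensorType): Output TileTensor holding the FP8 quantized values.
output_scales (BlockwiseFP8TokenFormat[_hid_dim, _top_k, _alignment].ScalesTensorType): Output TileTensor holding the block-wise scale factors.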

Implemented traits

AnyType, Copyable, DevicePassable, ImplicitlyCopyable, ImplicitlyDeletable, Movable, RegisterPassable, TokenFormat, TrivialRegisterPassable

`comptime` members

`alignment`

comptime alignment = _alignment if _alignment.__bool__() else get_device_alignment()

`device_type`

comptime device_type = BlockwiseFP8TokenFormat[_hid_dim, _top_k, _alignment]

`dispatch_smem_size`

comptime dispatch_smem_size = 0

`dispatch_wait_tile_shape`

comptime dispatch_wait_tile_shape = Tuple(Int(128), Int(1))

`expert_m_padding`

comptime expert_m_padding = (Int(16) // size_of[scales_dtype]())
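As a worked example of this expression, the Python sketch below evaluates `16 // size_of[scales_dtype]()` for a few common scale types. The byte widths are the standard sizes of those types; the lookup table and function name are illustrative and not part of the Mojo API.

```python
# Models `16 // size_of[scales_dtype]()` for a few possible scale dtypes.
# The table is illustrative; only the byte widths themselves are standard.
SCALE_BYTES = {"float32": 4, "bfloat16": 2, "float16": 2}


def expert_m_padding(scales_dtype: str) -> int:
    """Number of scale elements that fit in 16 bytes."""
    return 16 // SCALE_BYTES[scales_dtype]


padding_f32 = expert_m_padding("float32")    # 16 // 4 = 4
padding_bf16 = expert_m_padding("bfloat16")  # 16 // 2 = 8
```

With float32 scales the padding is 4, which is the value used in the `pad_expert_offsets` example.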

`group_size`

comptime group_size = 128

`hid_dim`

comptime hid_dim = _hid_dim

`ScalesTensorType`

comptime ScalesTensorType = TileTensor[scales_dtype, scales_layout, MutUntrackedOrigin]

`TensorType`

comptime TensorType = TileTensor[fp8_dtype, output_layout, MutUntrackedOrigin]

`top_k`

comptime top_k = _top_k

Methods

`__init__`

def __init__(output_tokens: TileTensor[fp8_dtype, output_layout, address_space=output_tokens.address_space, linear_idx_type=output_tokens.linear_idx_type], output_scales: TileTensor[scales_dtype, scales_layout, address_space=output_scales.address_space, linear_idx_type=output_scales.linear_idx_type]) -> Self

`get_type_name`

static def get_type_name() -> String

Returns:

String

`fp8_quant_size`

static def fp8_quant_size() -> Int

Returns:

Int

`scales_size`

static def scales_size() -> Int

Returns:

Int

`token_size`

static def token_size() -> Int

Returns:

Int

`scales_offset`

static def scales_offset() -> Int

Returns:

Int

`pad_expert_offsets`

def pad_expert_offsets[n_groups: Int](self, row_offsets: Pointer[UInt32, address_space=row_offsets.address_space, _safe=False])

The Mojo block-wise FP8 grouped matmul requires each group's M dimension (its row count) to be a multiple of expert_m_padding. This function updates the row_offsets tensor in place to satisfy that requirement.

For example, if expert_m_padding is 4 and row_offsets is [0, 10, 20, 30, 40], each group of 10 rows is padded to 12, so the function updates row_offsets to [0, 12, 24, 36, 48].
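The padding rule in this example can be modeled in a few lines of Python. This is a sketch of the arithmetic only, written to reproduce the documented example: it assumes each group's row count is rounded up independently and the offsets are then re-accumulated. Unlike the Mojo method, which updates `row_offsets` in place on the device, this hypothetical helper returns a new list.

```python
def pad_expert_offsets(row_offsets: list[int], padding: int) -> list[int]:
    """Round each group's row count up to a multiple of `padding`
    and rebuild the cumulative offsets."""
    padded = [row_offsets[0]]
    for start, end in zip(row_offsets, row_offsets[1:]):
        m = end - start                         # rows in this group
        padded_m = -(-m // padding) * padding   # ceiling to a multiple of padding
        padded.append(padded[-1] + padded_m)
    return padded


padded_even = pad_expert_offsets([0, 10, 20, 30, 40], 4)  # [0, 12, 24, 36, 48]
padded_uneven = pad_expert_offsets([0, 3, 8, 8], 4)       # [0, 4, 12, 12]
```

Note that rounding each offset on its own would give a different result (20 is already a multiple of 4), so the example implies that padding is applied per group and carried forward into later offsets.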

`copy_token_to_send_buf`

static def copy_token_to_send_buf[src_type: DType, block_size: Int, buf_addr_space: AddressSpace = AddressSpace.GENERIC](buf_p: Pointer[UInt8, address_space=buf_addr_space, _safe=False], src_p: Pointer[Scalar[src_type], address_space=src_p.address_space, _safe=False], input_scale: Float32)

`copy_msg_to_output_tensor`

def copy_msg_to_output_tensor[buf_addr_space: AddressSpace = AddressSpace.GENERIC](self, buf_p: Pointer[UInt8, address_space=buf_addr_space, _safe=False], token_index: Int, expert_slot: Int = Int(0), expert_start: Int = Int(0))
