
Mojo struct

PreshuffledBLoader

struct PreshuffledBLoader[N: Int, K_BYTES: Int, cache_policy: CacheOperation = CacheOperation.ALWAYS]

Per-lane B fragment loader from preshuffled GMEM (DRAM -> VGPR direct).

The 5D layout places each lane's 16-byte fragment at a contiguous DRAM offset, so a single buffer_load_dwordx4 per lane delivers the MFMA's B operand with no LDS staging. Out-of-bounds (OOB) lanes receive zeros, because the buffer resource's bounds check zero-fills loads that fall outside the buffer.
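The zero-fill behavior can be illustrated with a small host-side Python model. This is only a sketch under stated assumptions: `load_fragment_model` and its flat `byte_offset` argument are hypothetical stand-ins, the real 5D address mapping is not reproduced, and a fragment that only partly overlaps the buffer is treated as fully out of bounds for simplicity, which may differ from the hardware's exact range-checking rules.

```python
FRAGMENT_BYTES = 16  # one buffer_load_dwordx4 = 4 dwords = 16 bytes


def load_fragment_model(buf: bytes, byte_offset: int) -> bytes:
    """Hypothetical model of a bounds-checked 16-byte load.

    Assumption: any fragment not fully inside `buf` reads as all zeros.
    """
    if byte_offset < 0 or byte_offset + FRAGMENT_BYTES > len(buf):
        return bytes(FRAGMENT_BYTES)  # OOB lane: zero-filled fragment
    return buf[byte_offset:byte_offset + FRAGMENT_BYTES]


# A 32-byte buffer holds exactly two fragments.
buf = bytes(range(32))
in_bounds = load_fragment_model(buf, 16)
out_of_bounds = load_fragment_model(buf, 32)
```

Because an OOB lane contributes zeros to the MFMA, a kernel modeled this way would need no explicit per-lane branch around the load.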

Parameters

N (Int): Per-expert N dimension (rows of the logical [N, K_BYTES] tile).
K_BYTES (Int): Per-expert FP4-packed K dimension (= K // 2).
cache_policy (CacheOperation): Cache hint for the B load. Defaults to ALWAYS (normal caching, flydsl b_nt=0); set it to STREAMING (NT=1, flydsl b_nt=2) to skip caching B fragments that are streamed once and never reused.

Fields

bc (AMDBufferResource): Buffer resource descriptor (V#) for the preshuffled B buffer, built by `__init__`.

Implemented traits

AnyType, Copyable, ImplicitlyCopyable, ImplicitlyDeletable, Movable, RegisterPassable, TrivialRegisterPassable

Methods

`__init__`

def __init__(b_gmem_tile: TileTensor[DType.uint8, Storage=b_gmem_tile.Storage, address_space=b_gmem_tile.address_space, linear_idx_type=b_gmem_tile.linear_idx_type]) -> Self

Builds the buffer resource descriptor (V#) from a preshuffled per-expert B byte buffer.

Args:

b_gmem_tile (TileTensor[DType.uint8, Storage=b_gmem_tile.Storage, address_space=b_gmem_tile.address_space, linear_idx_type=b_gmem_tile.linear_idx_type]): Preshuffled per-expert B byte buffer holding the [N, K_BYTES] logical tile, as produced by mxfp4_preshuffle_layouts.

`load_fragment`

def load_fragment(self, n: Int, k_byte: Int) -> SIMD[DType.uint8, SIMDLength(16)]

Loads the 16-byte B fragment at logical (n, k_byte).

For one MFMA dispatch, each lane calls this with n = warp_n_off + n_mma * 16 + lane % 16 and k_byte = k_tile * 64 + (lane // 16) * 16.
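The per-lane arithmetic above can be checked with a short host-side Python sketch. It is an illustration, not the kernel code: `lane_fragment_coords` is a hypothetical helper, and the 64-lane wavefront size is an assumption inferred from the formula (four groups of 16 lanes, each covering 16 of the 64 bytes in a K tile).

```python
WAVE_SIZE = 64  # assumption: one 64-lane wavefront


def lane_fragment_coords(lane: int, warp_n_off: int, n_mma: int, k_tile: int):
    """Return the logical (n, k_byte) a lane would pass to load_fragment."""
    n = warp_n_off + n_mma * 16 + lane % 16         # row within the 16-row MFMA block
    k_byte = k_tile * 64 + (lane // 16) * 16        # 16-byte slice of the 64-byte K tile
    return n, k_byte


# Coordinates for every lane of the first MFMA block of the first K tile.
coords = [
    lane_fragment_coords(lane, warp_n_off=0, n_mma=0, k_tile=0)
    for lane in range(WAVE_SIZE)
]
```

Under these assumptions the 64 lanes cover a 16-row by 64-byte region with no overlap: lanes 0 to 15 take rows 0 to 15 at k_byte 0, lanes 16 to 31 take the same rows at k_byte 16, and so on.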

Args:

n (Int): Logical N row index into the [N, K_BYTES] tile.
k_byte (Int): Logical K byte index into the [N, K_BYTES] tile.

Returns:

SIMD[DType.uint8, SIMDLength(16)]: The 16-byte B fragment at logical (n, k_byte).
