Mojo function

mla_decode_branch_bf16

def mla_decode_branch_bf16[
    collection_t: KVCollectionT, //,
    mask_str: StringSlice[StaticConstantOrigin],
    kv_input_fn: def[width: Int](IndexList[Int(2)]) capturing -> SIMD[DType.bfloat16, width],
    target: StringSlice[StaticConstantOrigin] = StringSlice("cpu")
](
    output: TileTensor[DType.bfloat16, Storage=output.Storage, linear_idx_type=output.linear_idx_type, element_size=output.element_size],
    q: TileTensor[DType.bfloat16, Storage=q.Storage, linear_idx_type=q.linear_idx_type, element_size=q.element_size],
    input_row_offsets: TileTensor[DType.uint32, Storage=input_row_offsets.Storage, linear_idx_type=input_row_offsets.linear_idx_type, element_size=input_row_offsets.element_size],
    freqs_cis: TileTensor[Storage=freqs_cis.Storage, linear_idx_type=freqs_cis.linear_idx_type, element_size=freqs_cis.element_size],
    kv_norm_gamma: TileTensor[Storage=kv_norm_gamma.Storage, linear_idx_type=kv_norm_gamma.linear_idx_type, element_size=kv_norm_gamma.element_size],
    kv_collection: collection_t,
    layer_idx: UInt32,
    scale: Float32,
    epsilon: Float32,
    w_uk: TileTensor[DType.bfloat16, Storage=w_uk.Storage, linear_idx_type=w_uk.linear_idx_type, element_size=w_uk.element_size],
    w_uv: TileTensor[DType.bfloat16, Storage=w_uv.Storage, linear_idx_type=w_uv.linear_idx_type, element_size=w_uv.element_size],
    scalar_args_buf: TileTensor[DType.int64, Storage=scalar_args_buf.Storage, linear_idx_type=scalar_args_buf.linear_idx_type, element_size=scalar_args_buf.element_size],
    ctx: DeviceContext,
    num_partitions_in: Optional[Int] = None
)

BF16 multi-head latent attention (MLA) decode path.

Applies RoPE and RMSNorm, projects q_nope into the latent space, concatenates the result with q_rope, and runs MLA decode.
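
The steps named above can be illustrated with a small reference sketch in plain Python. This is not the kernel's implementation and does not call the Mojo API: it handles one head and one new token, uses Python floats rather than bfloat16, and replaces the KV cache collection with a plain list. Several details are assumptions made for illustration only: that RMSNorm is applied to the new token's latent KV vector, that RoPE rotates interleaved pairs with base 10000 (the real function takes precomputed freqs_cis instead), that w_uk maps q_nope into the latent space, and that w_uv maps the latent attention output back to the head's value space. The helper names `rms_norm`, `rope`, `vecmat`, and `mla_decode_one_head` are hypothetical.

```python
import math


def rms_norm(x, gamma, eps):
    """Divide x by sqrt(mean(x^2) + eps), then scale elementwise by gamma."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * g for v, g in zip(x, gamma)]


def rope(x, pos, theta=10000.0):
    """Rotate each interleaved pair (x[i], x[i+1]) by pos * theta^(-i/d).

    Assumption: interleaved pairing and base 10000; the real kernel
    reads its rotation factors from freqs_cis.
    """
    d = len(x)
    out = []
    for i in range(0, d, 2):
        angle = pos * theta ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out


def vecmat(x, w):
    """Row vector times matrix: x has len(w) entries, w is [len(x)][n]."""
    return [sum(x[r] * w[r][c] for r in range(len(x))) for c in range(len(w[0]))]


def mla_decode_one_head(q_nope, q_rope, kv_new, k_rope_new, cache, pos,
                        gamma, eps, scale, w_uk, w_uv):
    """Single-head, single-token sketch of the decode steps.

    `cache` is a list of previous entries (latent ++ rope part) and is
    extended in place with the entry for the new token.
    """
    d_latent = len(kv_new)

    # 1. RMSNorm on the new latent KV, RoPE on the new rope key; append to cache.
    cache.append(rms_norm(kv_new, gamma, eps) + rope(k_rope_new, pos))

    # 2. Project q_nope into the latent space and concatenate with rotated q_rope.
    q = vecmat(q_nope, w_uk) + rope(q_rope, pos)

    # 3. Softmax attention of the query against every cached entry.
    scores = [scale * sum(a * b for a, b in zip(q, entry)) for entry in cache]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)

    # The value of each entry is its latent part only.
    out_latent = [
        sum(w * entry[i] for w, entry in zip(weights, cache)) / total
        for i in range(d_latent)
    ]

    # 4. Assumed output step: map the latent result back with w_uv.
    return vecmat(out_latent, w_uv)
```

Projecting the query rather than expanding every cached key lets attention run directly against the compact latent cache, which is why the sketch stores only the normalized latent vector and the rotated rope key per token.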