Mojo function

gemv_and_partial_norm

gemv_and_partial_norm[c_type: DType, a_type: DType, //, *, transpose_b: Bool = True, fused: Bool = True, tile_n: Int = 4, num_threads: Int = 256, pdl_level: PDLLevel = PDLLevel()](normed_output: TileTensor[c_type, address_space=normed_output.address_space, linear_idx_type=normed_output.linear_idx_type, element_size=normed_output.element_size], unnormed_output: TileTensor[c_type, address_space=unnormed_output.address_space, linear_idx_type=unnormed_output.linear_idx_type, element_size=unnormed_output.element_size], act: TileTensor[a_type, address_space=act.address_space, linear_idx_type=act.linear_idx_type, element_size=act.element_size], weight: TileTensor[a_type, address_space=weight.address_space, linear_idx_type=weight.linear_idx_type, element_size=weight.element_size], gamma: TileTensor[a_type, address_space=gamma.address_space, linear_idx_type=gamma.linear_idx_type, element_size=gamma.element_size], eps: Scalar[a_type], ctx: DeviceContext)

Computes y = act @ weight.T, then partitions y into a normed front and an unnormed tail.

Parameters:

  • c_type (DType): Output dtype.
  • a_type (DType): Activation / weight / gamma dtype.
  • transpose_b (Bool): If True, weight is row-major [N, K] and used as weight.T.
  • fused (Bool): Compile-time flag. True (default) selects the single-kernel fused path (M=1 only). False selects the two-launch baseline (matmul + rms_norm_gpu); in that case the unnormed tail is a view into the matmul output, so unnormed_output is left untouched.
  • tile_n (Int): Compile-time tile width in columns (fused path only).
  • num_threads (Int): Compile-time threads per block (fused path only).
  • pdl_level (PDLLevel): Programmatic Dependent Launch level.

Args:

  • normed_output (TileTensor): Output buffer for the RMS-normed front slice of y.
  • unnormed_output (TileTensor): Output buffer for the unnormed tail of y.
  • act (TileTensor): Activation tensor (M=1 for the fused path).
  • weight (TileTensor): Weight matrix; row-major [N, K] when transpose_b is True.
  • gamma (TileTensor): RMSNorm scale vector applied to the normed front slice.
  • eps (Scalar): Epsilon added inside the RMSNorm denominator for numerical stability.
  • ctx (DeviceContext): Device context used to launch the kernel(s).

Raises:

Error: If _matmul_gpu or rms_norm_gpu fails to launch, or if internal scratch allocation fails.
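The computation above can be sketched as a NumPy reference model. This is a hedged illustration of the semantics, not the Mojo implementation: the helper name `gemv_and_partial_norm_ref` is hypothetical, the RMSNorm formulation `x / sqrt(mean(x²) + eps) * gamma` is assumed, and the width of the normed front is assumed to come from gamma's length.

```python
import numpy as np

def gemv_and_partial_norm_ref(act, weight, gamma, eps):
    """Hypothetical reference model of the kernel's semantics:
    y = act @ weight.T; the first len(gamma) columns of y are
    RMS-normed, the remaining columns are returned unchanged."""
    y = act @ weight.T                      # GEMV: act is [1, K], weight is [N, K]
    n = gamma.shape[0]                      # assumed size of the normed front slice
    front, tail = y[:, :n], y[:, n:]
    # Assumed RMSNorm formulation: x / sqrt(mean(x^2) + eps) * gamma
    rms = np.sqrt(np.mean(front ** 2, axis=-1, keepdims=True) + eps)
    normed = (front / rms) * gamma
    return normed, tail

# M=1 example: K=8 inputs, N=6 outputs, first 4 columns normed
rng = np.random.default_rng(0)
act = rng.standard_normal((1, 8)).astype(np.float32)
weight = rng.standard_normal((6, 8)).astype(np.float32)
gamma = np.ones(4, dtype=np.float32)
normed, tail = gemv_and_partial_norm_ref(act, weight, gamma, eps=1e-6)
assert normed.shape == (1, 4) and tail.shape == (1, 2)
```

With gamma set to all ones, the normed slice comes out with RMS close to 1, which is a quick sanity check on the split between the normed front and the untouched tail.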