
Mojo function

single_group_router_kernel

def single_group_router_kernel[scores_type: DType, bias_type: DType, ExpertIndicesLayoutType: TensorLayout, ExpertWeightsLayoutType: TensorLayout, ExpertScoresLayoutType: TensorLayout, ExpertBiasLayoutType: TensorLayout, n_routed_experts: Int, n_experts_per_tok: Int, norm_weights: Bool, num_threads: Int, scores_input_fn: OptionalReg[def[width: Int](IndexList[2]) capturing -> SIMD[scores_type, width]] = None](expert_indices: TileTensor[DType.int32, ExpertIndicesLayoutType, MutAnyOrigin], expert_weights: TileTensor[scores_type, ExpertWeightsLayoutType, MutAnyOrigin], expert_scores: TileTensor[scores_type, ExpertScoresLayoutType, ImmutAnyOrigin], expert_bias: TileTensor[bias_type, ExpertBiasLayoutType, ImmutAnyOrigin], routed_scaling_factor: Float32)

Single-group MoE router kernel. One block per token, one thread per expert.

Fuses: corrected = scores + bias → top-k selection → weight = corrected - bias → optional normalize → scale. Uses warp-bitonic sort across 2 or 3 phases depending on WARP_SIZE. NVIDIA (WARP_SIZE=32): 3-phase. AMD (WARP_SIZE=64): 2-phase (phase 2 eliminated at compile time when phase1_candidates fits in one wavefront).
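The fused steps above can be sketched for a single token as a plain CPU reference in Python. This is an illustration of the arithmetic only, not the kernel's implementation: it uses an ordinary sort rather than the warp-bitonic sort. The function name `route_token_reference` is hypothetical. The sketch also assumes that "normalize" divides each selected weight by the sum of the selected weights and that ties are broken toward the lower expert index. Neither assumption is confirmed by this page.

```python
def route_token_reference(scores, bias, n_experts_per_tok,
                          norm_weights, routed_scaling_factor):
    """CPU reference for one token's routing (hypothetical helper).

    scores and bias are per-expert sequences of equal length
    (n_routed_experts entries each).
    """
    # Step 1: bias-corrected scores are used only for selection.
    corrected = [s + b for s, b in zip(scores, bias)]

    # Step 2: top-k selection on the corrected scores.
    # Assumption: ties go to the lower expert index.
    order = sorted(range(len(corrected)), key=lambda i: (-corrected[i], i))
    expert_indices = order[:n_experts_per_tok]

    # Step 3: the routing weight removes the bias again.
    expert_weights = [corrected[i] - bias[i] for i in expert_indices]

    # Step 4: optional normalization.
    # Assumption: divide by the sum of the selected weights.
    if norm_weights:
        total = sum(expert_weights)
        expert_weights = [w / total for w in expert_weights]

    # Step 5: apply the routed scaling factor.
    expert_weights = [w * routed_scaling_factor for w in expert_weights]
    return expert_indices, expert_weights


if __name__ == "__main__":
    idx, w = route_token_reference(
        scores=[0.125, 0.5, 0.25, 0.375],
        bias=[0.5, 0.0, 0.0, 0.0],
        n_experts_per_tok=2,
        norm_weights=True,
        routed_scaling_factor=2.0,
    )
    print(idx, w)
```

In this example the bias lifts expert 0 into the top-k even though its raw score is the lowest, but the weight it receives is derived from its raw score (0.125), since the bias is subtracted again after selection.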