Mojo function
single_group_router_kernel
single_group_router_kernel[scores_type: DType, bias_type: DType, ExpertIndicesLayoutType: TensorLayout, ExpertWeightsLayoutType: TensorLayout, ExpertScoresLayoutType: TensorLayout, ExpertBiasLayoutType: TensorLayout, n_routed_experts: Int, n_experts_per_tok: Int, norm_weights: Bool, num_threads: Int, scores_input_fn: OptionalReg[def[width: Int](IndexList[2]) capturing -> SIMD[scores_type, width]] = None](expert_indices: TileTensor[DType.int32, ExpertIndicesLayoutType, MutAnyOrigin], expert_weights: TileTensor[scores_type, ExpertWeightsLayoutType, MutAnyOrigin], expert_scores: TileTensor[scores_type, ExpertScoresLayoutType, ImmutAnyOrigin], expert_bias: TileTensor[bias_type, ExpertBiasLayoutType, ImmutAnyOrigin], routed_scaling_factor: Float32)
Single-group MoE router kernel. One block per token, one thread per expert.
Fuses: corrected = scores + bias → top-k selection → weight = corrected - bias → optional normalize → scale. Uses warp-bitonic sort across 2 or 3 phases depending on WARP_SIZE. NVIDIA (WARP_SIZE=32): 3-phase. AMD (WARP_SIZE=64): 2-phase (phase 2 eliminated at compile time when phase1_candidates fits in one wavefront).
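The fused routing math can be illustrated with a scalar reference sketch in Python. This is a hypothetical host-side model of the per-token computation, not the GPU kernel: it uses a plain sort in place of the warp-bitonic sort, and the function name and argument names are illustrative (mirroring the kernel parameters, but not part of the API).

```python
def route_single_group(scores, bias, k, routed_scaling_factor, norm_weights=True):
    """Reference sketch of the fused router math for one token.

    scores: per-expert gating scores (length n_routed_experts).
    bias:   per-expert bias; steers expert *selection* only.
    """
    # corrected = scores + bias: bias influences which experts are chosen.
    corrected = [s + b for s, b in zip(scores, bias)]
    # Top-k experts by corrected score (device code uses warp-bitonic sort).
    topk = sorted(range(len(scores)), key=lambda i: corrected[i], reverse=True)[:k]
    # weight = corrected - bias, i.e. the original unbiased score.
    weights = [corrected[i] - bias[i] for i in topk]
    # Optional normalization so the selected weights sum to 1.
    if norm_weights:
        total = sum(weights)
        weights = [w / total for w in weights]
    # Final scaling by routed_scaling_factor.
    weights = [w * routed_scaling_factor for w in weights]
    return topk, weights
```

With a zero bias and `norm_weights=True`, the selected weights are simply the top-k scores renormalized to sum to 1, then multiplied by `routed_scaling_factor`.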