Mojo module

dispatch

Dispatch logic for grouped 1D-1D block-scaled SM100 NVFP4 matmul.

Selects optimal kernel configuration based on (N, K) shape and workload type (decode vs prefill), with parameters tuned via ablation on B200.

When override=True, uses the caller's AB_swapped/mma_bn/cta_group/num_pipeline_stages directly (for ablation studies and benchmarking). When override=False (default), ignores those parameters and selects from the tuning table based on (N, K) and is_decode.

Tuning table (keyed on N, K, is_decode):

Decode (is_decode=True):

N=4096, K=7168: AB_swapped=True, mma_bn=8, cta_group=1, stages=6
N=7168, K=2048: AB_swapped=True, mma_bn=8, cta_group=1, stages=4
Default: AB_swapped=True, mma_bn=8, cta_group=1, stages=auto

Prefill (is_decode=False):

N=4096, K=7168: AB_swapped=True, mma_bn=128, cta_group=2, stages=7
N=7168, K=2048: AB_swapped=True, mma_bn=128, cta_group=2, stages=6
Default: AB_swapped=True, mma_bn=128, cta_group=2, stages=auto
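
The selection rule described above can be sketched as a plain lookup. This is an illustrative Python model of the documented behavior, not the Mojo implementation: the names KernelConfig, select_config, and caller_config are hypothetical, and representing stages=auto as None is an assumption made only for this sketch.

```python
from typing import NamedTuple, Optional


class KernelConfig(NamedTuple):
    """Hypothetical container for the four tunable parameters."""
    AB_swapped: bool
    mma_bn: int
    cta_group: int
    num_pipeline_stages: Optional[int]  # None stands in for "auto" (assumption)


# (N, K, is_decode) -> tuned configuration, copied from the tuning table.
TUNED = {
    (4096, 7168, True): KernelConfig(True, 8, 1, 6),
    (7168, 2048, True): KernelConfig(True, 8, 1, 4),
    (4096, 7168, False): KernelConfig(True, 128, 2, 7),
    (7168, 2048, False): KernelConfig(True, 128, 2, 6),
}

# Fallback rows ("Default") for shapes that are not in the table.
DEFAULTS = {
    True: KernelConfig(True, 8, 1, None),     # decode
    False: KernelConfig(True, 128, 2, None),  # prefill
}


def select_config(n, k, is_decode, override=False, caller_config=None):
    """Return the configuration the dispatch rule would pick."""
    if override:
        # override=True: use the caller's parameters as given.
        if caller_config is None:
            raise ValueError("override=True requires caller_config")
        return caller_config
    # override=False: caller parameters are ignored; consult the table.
    return TUNED.get((n, k, is_decode), DEFAULTS[is_decode])
```

For example, select_config(4096, 7168, is_decode=True) yields the tuned decode entry with six pipeline stages, while an untuned shape such as (1024, 1024) falls back to the decode or prefill default row.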

Functions

grouped_matmul_nvfp4_dispatch: Dispatch grouped NVFP4 matmul with shape-tuned configuration.
