Mojo module

im2col_matmul_2d

Explicit im2col + _matmul_gpu dispatch for 2D convolution.

Materialises an im2col [M, K] scratch buffer in global memory and calls the generic _matmul_gpu on it. _matmul_gpu auto-routes to SM100 UMMA on Blackwell for bf16, so 2D convolutions whose channel counts are not 128-aligned can use tensor cores without the TMA im2col descriptor layer.

  • M = batch * H_out * W_out (linearized output pixel)
  • K = R * S * C_in (filter-flattened reduction axis)
  • N = C_out (output channels)
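
The dimension mapping above can be illustrated with a small pure-Python sketch. It is not the module's implementation: the helper names (`gemm_shape`, `im2col_nhwc`, `matmul`) are hypothetical, and the NHWC input layout, the (r, s, c) ordering of the K axis, unit stride, and zero padding are assumptions chosen for the illustration.

```python
def gemm_shape(batch, h_out, w_out, r, s, c_in, c_out):
    """Return (M, K, N) for the im2col GEMM, following the mapping above."""
    return batch * h_out * w_out, r * s * c_in, c_out


def im2col_nhwc(x, r, s):
    """Build the [M, K] scratch from an NHWC nested list.

    Assumptions (not confirmed by the source): stride 1, no padding,
    dilation 1, and K flattened in (r, s, c) order.
    """
    batch, h, w, c_in = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    h_out, w_out = h - r + 1, w - s + 1
    rows = []
    for n in range(batch):
        for oh in range(h_out):
            for ow in range(w_out):
                # One row per linearized output pixel; one column per (r, s, c).
                rows.append([
                    x[n][oh + i][ow + j][c]
                    for i in range(r)
                    for j in range(s)
                    for c in range(c_in)
                ])
    return rows


def matmul(a, b):
    """Plain [M, K] x [K, N] product, standing in for the GPU matmul."""
    n_cols = len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(len(b))) for j in range(n_cols)]
        for row in a
    ]


# 1x3x3x1 input holding 0..8, convolved with a 2x2 all-ones filter, C_out = 1.
x = [[[[float(3 * i + j)] for j in range(3)] for i in range(3)]]
scratch = im2col_nhwc(x, r=2, s=2)   # [M, K] = [4, 4]
weights = [[1.0] for _ in range(4)]  # [K, N] = [4, 1]
out = matmul(scratch, weights)       # [M, N] = [4, 1]
```

Each row of `out` corresponds to one output pixel and each column to one output channel, so reshaping it to [batch, H_out, W_out, C_out] would recover the convolution result under the layout assumed here.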

Gate: bf16, groups=1, dilation=1, kernel larger than 1×1 (the vectorized naive kernel wins on 1×1), and K >= 16 (a smaller K would fall below MMA_K).
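
As a rough illustration, the gate can be written as a single predicate. This is a sketch only: the function name `use_im2col_matmul`, its parameters, and the string dtype tag are hypothetical, `MMA_K = 16` is inferred from the `K >= 16` condition, and "kernel > 1×1" is read here as "not a 1×1 filter".

```python
MMA_K = 16  # assumed value, inferred from the "K >= 16" gate condition


def use_im2col_matmul(dtype, groups, dilation, r, s, c_in):
    """Return True when a conv shape passes the gate described above.

    Hypothetical helper: argument names and the dtype string are
    illustrative and do not mirror the module's real signature.
    """
    k = r * s * c_in  # filter-flattened reduction axis
    return (
        dtype == "bfloat16"
        and groups == 1
        and dilation == (1, 1)
        and (r, s) != (1, 1)  # 1x1 goes to the vectorized naive kernel
        and k >= MMA_K
    )


# A 3x3 filter over 3 input channels gives K = 27, so it passes the gate.
passes = use_im2col_matmul("bfloat16", 1, (1, 1), 3, 3, 3)
# A 3x3 filter over 1 input channel gives K = 9, below MMA_K.
too_small = use_im2col_matmul("bfloat16", 1, (1, 1), 3, 3, 1)
```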

Functions