
Mojo module

warp_spec_matmul

AMD Warp-Specialized Matrix Multiplication

Architecture Overview:

  • Producer warps: Load tiles from global to shared memory
    • A producers: Load M×K tiles from matrix A
    • B producers: Load N×K tiles from matrix B
  • Consumer warps: Perform matrix multiplication using shared memory tiles
  • Ring buffer: Coordinates producer-consumer synchronization with barriers
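
The tile shapes above can be illustrated with a small CPU-side sketch. It is not the module's implementation. It assumes that B is stored as N×K (so the product is C = A·Bᵀ), which is only inferred from the "N×K tiles" wording, and the function name and the tile sizes `bm`, `bn`, `bk` are hypothetical.

```python
def tiled_matmul_nt(a, b, bm=2, bn=2, bk=2):
    """Compute C = A @ B^T tile by tile.

    a is M x K, b is N x K (assumed layout), both as lists of rows.
    """
    m, k = len(a), len(a[0])
    n = len(b)
    c = [[0.0] * n for _ in range(m)]
    for m0 in range(0, m, bm):
        for n0 in range(0, n, bn):
            for k0 in range(0, k, bk):
                # Role of the A producers: fetch a (bm x bk) tile of A.
                a_tile = [row[k0:k0 + bk] for row in a[m0:m0 + bm]]
                # Role of the B producers: fetch a (bn x bk) tile of B.
                b_tile = [row[k0:k0 + bk] for row in b[n0:n0 + bn]]
                # Role of the consumers: accumulate the partial product.
                for i, a_row in enumerate(a_tile):
                    for j, b_row in enumerate(b_tile):
                        c[m0 + i][n0 + j] += sum(
                            x * y for x, y in zip(a_row, b_row)
                        )
    return c
```

On the GPU the three roles run concurrently in different warps; here they run in sequence only to show which data each role touches.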

Data Flow:

  1. Producers load tiles into shared memory stages
  2. Barriers ensure data is ready before consumers access it
  3. Consumers compute partial results and accumulate
  4. Final results are written back to global memory
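
The data flow above can be modeled on the CPU as a two-thread ring buffer. This is a minimal sketch under stated assumptions: Python semaphores stand in for the GPU barriers, a list stands in for the shared memory stages, and `run_pipeline` and `num_stages` are hypothetical names, not part of this module.

```python
import threading


def run_pipeline(tiles, num_stages=2):
    """Stream (a_tile, b_tile) pairs through a ring of num_stages slots."""
    stages = [None] * num_stages  # stand-in for shared memory stages
    # empty[s] is signaled when slot s may be overwritten by the producer.
    empty = [threading.Semaphore(1) for _ in range(num_stages)]
    # full[s] is signaled when slot s holds data ready for the consumer.
    full = [threading.Semaphore(0) for _ in range(num_stages)]
    result = []

    def producer():
        for i, tile in enumerate(tiles):
            s = i % num_stages
            empty[s].acquire()   # wait until the consumer released the slot
            stages[s] = tile     # step 1: load the tile into the stage
            full[s].release()    # step 2: mark the data as ready

    def consumer():
        acc = 0
        for i in range(len(tiles)):
            s = i % num_stages
            full[s].acquire()    # step 2: wait for the data
            a_tile, b_tile = stages[s]
            acc += sum(x * y for x, y in zip(a_tile, b_tile))  # step 3
            empty[s].release()   # hand the slot back to the producer
        result.append(acc)       # step 4: write back the final result

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]
```

With more than one stage the producer can fill the next slot while the consumer is still working on the current one, which is the overlap the ring buffer exists to provide.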

Memory Layout:

  • Shared memory is divided into pipeline stages so that producer loads into one stage overlap with consumer compute on another
  • Each stage contains block tiles that are further divided into warp tiles
  • Swizzling may be applied to avoid bank conflicts
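
To see why swizzling helps, the following sketch models bank assignment for a row-major tile. The page does not state which swizzle pattern the module uses, so the XOR scheme here is an assumption chosen as a common example, and `row_words`, `num_banks`, and both function names are hypothetical.

```python
from collections import Counter


def bank(row, col, row_words=8, num_banks=8, swizzle=False):
    """Return the bank that element (row, col) of a row-major tile maps to.

    row_words must be a power of two for the XOR swizzle to stay in range.
    """
    if swizzle:
        # XOR the column with the row index so that each row is permuted
        # differently (assumed pattern, not taken from the module).
        col ^= row % row_words
    return (row * row_words + col) % num_banks


def max_conflict(col, rows=8, **kwargs):
    """Worst-case number of rows hitting one bank when reading a column."""
    counts = Counter(bank(r, col, **kwargs) for r in range(rows))
    return max(counts.values())
```

Without the swizzle, reading one column across 8 rows maps every access to the same bank in this model, while with the swizzle the 8 accesses land in 8 different banks.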

Aliases

GlobalTensor

alias GlobalTensor[dtype: DType, layout: Layout] = LayoutTensor[dtype, layout, MutAnyOrigin, address_space=AddressSpace.GLOBAL]

Parameters

  • dtype (DType)
  • layout (Layout)

Functions
