Mojo module

warp_spec_matmul

AMD Warp-Specialized Matrix Multiplication

Architecture Overview:

  • Producer warps: Load tiles from global memory into shared memory
    • A producers: Load M×K tiles from matrix A
    • B producers: Load N×K tiles from matrix B
  • Consumer warps: Perform matrix multiplication using shared memory tiles
  • Ring buffer: Coordinates producer-consumer synchronization with barriers
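
The role split above can be pictured with a small host-side model. This is a sketch in Python, not the module's code: the `Role` enum, the `warp_role` function, and the ordering of roles by warp index are all assumptions made for illustration.

```python
from enum import Enum


class Role(Enum):
    A_PRODUCER = "a_producer"   # loads M×K tiles of A
    B_PRODUCER = "b_producer"   # loads N×K tiles of B
    CONSUMER = "consumer"       # multiplies tiles held in shared memory


def warp_role(warp_id: int, a_producers: int, b_producers: int, consumers: int) -> Role:
    """Map a warp index within a block to a role.

    Hypothetical ordering: A producers first, then B producers, then consumers.
    """
    total = a_producers + b_producers + consumers
    if not 0 <= warp_id < total:
        raise ValueError(f"warp_id {warp_id} outside block of {total} warps")
    if warp_id < a_producers:
        return Role.A_PRODUCER
    if warp_id < a_producers + b_producers:
        return Role.B_PRODUCER
    return Role.CONSUMER


# A block with 1 A producer, 1 B producer, and 2 consumer warps.
roles = [warp_role(w, 1, 1, 2) for w in range(4)]
```

Each warp branches once on its role and then runs only that role's loop, which is what lets loads and math proceed concurrently.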

Data Flow:

  1. Producers load tiles into shared memory stages
  2. Barriers ensure data is ready before consumers access it
  3. Consumers compute partial results and accumulate
  4. Final results are written back to global memory
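
The four steps above can be modeled on the host with two threads and a small ring of stages. This is a simplified sketch, not the kernel: `matmul_pipelined` is a hypothetical name, Python semaphores stand in for the GPU barriers, one producer thread loads both the A and B tiles, and B is taken as N×K so the result is A·Bᵀ.

```python
import threading


def matmul_pipelined(a, b, bk, num_stages=2):
    """Compute C = A @ B^T with a as M x K and b as N x K, tiling K in chunks of bk."""
    m, k, n = len(a), len(a[0]), len(b)
    num_tiles = (k + bk - 1) // bk
    stages = [None] * num_stages                      # stand-in for shared memory stages
    empty = [threading.Semaphore(1) for _ in range(num_stages)]
    full = [threading.Semaphore(0) for _ in range(num_stages)]
    c = [[0] * n for _ in range(m)]                   # accumulators

    def producer():
        for t in range(num_tiles):
            s = t % num_stages
            empty[s].acquire()                        # wait until the stage is free
            lo, hi = t * bk, min((t + 1) * bk, k)
            # Step 1: load one K-slice of A and of B into the stage.
            stages[s] = ([row[lo:hi] for row in a], [row[lo:hi] for row in b])
            full[s].release()                         # Step 2: signal data is ready

    def consumer():
        for t in range(num_tiles):
            s = t % num_stages
            full[s].acquire()                         # Step 2: wait for data
            a_tile, b_tile = stages[s]
            # Step 3: accumulate the partial product for this K-slice.
            for i in range(m):
                for j in range(n):
                    c[i][j] += sum(x * y for x, y in zip(a_tile[i], b_tile[j]))
            empty[s].release()                        # hand the stage back

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return c                                          # Step 4: final result


result = matmul_pipelined([[1, 2, 3], [4, 5, 6]], [[1, 0, 1], [0, 1, 0]], bk=2)
```

With more than one stage, the producer can fill the next stage while the consumer is still working on the current one.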

Memory Layout:

  • Shared memory is divided into pipeline stages so that loads into one stage overlap with compute on another
  • Each stage contains block tiles that are further divided into warp tiles
  • Swizzling may be applied to avoid bank conflicts
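
To see why swizzling helps, here is a minimal model of one common scheme, an XOR swizzle on the column index. The page does not say which swizzle the module uses, so the scheme, the bank count of 32, and the one-element-per-bank addressing are all assumptions for illustration.

```python
from collections import Counter

NUM_BANKS = 32  # assumption: 32 banks, one element per bank slot


def bank(row: int, col: int, row_len: int, swizzle: bool) -> int:
    """Bank hit by element (row, col) of a row-major tile with row_len columns.

    row_len must be a power of two so the XOR stays inside the row.
    """
    if swizzle:
        col ^= row % row_len          # XOR swizzle: permute columns per row
    return (row * row_len + col) % NUM_BANKS


def max_conflict(rows: int, row_len: int, col: int, swizzle: bool) -> int:
    """Worst-case number of lanes hitting one bank when each lane reads
    the same column from a different row."""
    counts = Counter(bank(r, col, row_len, swizzle) for r in range(rows))
    return max(counts.values())


plain = max_conflict(rows=32, row_len=32, col=0, swizzle=False)     # 32-way conflict
swizzled = max_conflict(rows=32, row_len=32, col=0, swizzle=True)   # conflict-free: 1
```

In the unswizzled layout every row of a 32-wide tile starts in the same bank, so a column read serializes. The XOR spreads the same column across all banks while leaving each row's contents a permutation of itself.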

Ring Buffer Configuration:

  • Uses the SingleCounterSync strategy by default (a single atomic counter per tile)
  • Can be switched to SplitCounterSync by editing the RingBuffer type aliases, which reduces contention
  • The trait-based design allows easy experimentation with different sync strategies
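
The idea of swapping sync strategies behind a common interface can be sketched as follows. This is a conceptual Python model only: the page gives no detail on how SingleCounterSync or SplitCounterSync work internally, so the `SyncStrategy` protocol, both model classes, and their counter semantics are assumptions, and `typing.Protocol` merely stands in for a Mojo trait.

```python
from typing import Protocol


class SyncStrategy(Protocol):
    """Hypothetical interface a ring buffer needs from its sync strategy."""
    def can_produce(self, stage: int) -> bool: ...
    def can_consume(self, stage: int) -> bool: ...
    def producer_done(self, stage: int) -> None: ...
    def consumer_done(self, stage: int) -> None: ...


class SingleCounterModel:
    """Assumed model: one counter per stage; even means empty, odd means full."""
    def __init__(self, num_stages: int) -> None:
        self.count = [0] * num_stages
    def can_produce(self, stage: int) -> bool:
        return self.count[stage] % 2 == 0
    def can_consume(self, stage: int) -> bool:
        return self.count[stage] % 2 == 1
    def producer_done(self, stage: int) -> None:
        self.count[stage] += 1
    def consumer_done(self, stage: int) -> None:
        self.count[stage] += 1


class SplitCounterModel:
    """Assumed model: producers and consumers each bump their own counter,
    so the two sides never write the same location."""
    def __init__(self, num_stages: int) -> None:
        self.produced = [0] * num_stages
        self.consumed = [0] * num_stages
    def can_produce(self, stage: int) -> bool:
        return self.produced[stage] == self.consumed[stage]
    def can_consume(self, stage: int) -> bool:
        return self.produced[stage] == self.consumed[stage] + 1
    def producer_done(self, stage: int) -> None:
        self.produced[stage] += 1
    def consumer_done(self, stage: int) -> None:
        self.consumed[stage] += 1


def run(sync: SyncStrategy, num_stages: int, num_tiles: int) -> list[int]:
    """Push tile ids through the ring and return the order they are consumed in."""
    slots = [None] * num_stages
    out = []
    p = c = 0                                   # next tile to produce / consume
    while c < num_tiles:
        if p < num_tiles and sync.can_produce(p % num_stages):
            slots[p % num_stages] = p
            sync.producer_done(p % num_stages)
            p += 1
        elif sync.can_consume(c % num_stages):
            out.append(slots[c % num_stages])
            sync.consumer_done(c % num_stages)
            c += 1
    return out


single_order = run(SingleCounterModel(2), num_stages=2, num_tiles=5)
split_order = run(SplitCounterModel(2), num_stages=2, num_tiles=5)
```

Because `run` depends only on the four-method interface, either strategy can be dropped in without touching the ring logic, which is the kind of experimentation the trait-based design is meant to allow.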

Aliases

GlobalTensor

comptime GlobalTensor[dtype: DType, layout: Layout] = LayoutTensor[dtype, layout, MutAnyOrigin, address_space=AddressSpace.GLOBAL]

Parameters

Functions
