Mojo module
warp_spec_matmul
AMD Warp-Specialized Matrix Multiplication
Architecture Overview:
- Producer warps: Load tiles from global to shared memory
  - A producers: Load M×K tiles from matrix A
  - B producers: Load N×K tiles from matrix B
- Consumer warps: Perform matrix multiplication using shared memory tiles
- Ring buffer: Coordinates producer-consumer synchronization with barriers
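The role split above can be illustrated with a small host-side sketch in Python. It is not the module's `determine_thread_role`; the `Role` enum, the `assign_role` helper, and the ordering (A producers first, then B producers, then consumers) are all assumptions made for illustration.

```python
from enum import Enum


class Role(Enum):
    PRODUCER_A = "producer_a"
    PRODUCER_B = "producer_b"
    CONSUMER = "consumer"


def assign_role(warp_id, a_producers, b_producers, consumers):
    """Map a flat warp index to (role, index within that role's group).

    Hypothetical ordering: A producers, then B producers, then consumers.
    """
    total = a_producers + b_producers + consumers
    if not 0 <= warp_id < total:
        raise ValueError(f"warp_id {warp_id} is outside [0, {total})")
    if warp_id < a_producers:
        return Role.PRODUCER_A, warp_id
    if warp_id < a_producers + b_producers:
        return Role.PRODUCER_B, warp_id - a_producers
    return Role.CONSUMER, warp_id - a_producers - b_producers


if __name__ == "__main__":
    # One A producer, one B producer, two consumers.
    for w in range(4):
        print(w, assign_role(w, 1, 1, 2))
```

Returning the index within the role group, rather than the flat warp index, lets each group address its own tiles starting from zero.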
Data Flow:
- Producers load tiles into shared memory stages
- Barriers ensure data is ready before consumers access it
- Consumers compute partial results and accumulate
- Final results are written back to global memory
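The data flow above can be mimicked on the CPU with two Python threads and a small ring of stages. This is only a conceptual sketch: semaphores stand in for the GPU barriers, a Python list stands in for shared memory, B is stored as K×N for simplicity, and the name `pipelined_matmul` and its parameters are hypothetical.

```python
import threading


def pipelined_matmul(a, b, tile_k=2, stages=2):
    """Compute C = A @ B, streaming K-tiles through a ring of stages."""
    m, k, n = len(a), len(a[0]), len(b[0])
    num_tiles = (k + tile_k - 1) // tile_k
    ring = [None] * stages  # stand-in for shared-memory stages
    # empty[s]: stage s may be overwritten; full[s]: stage s holds ready data.
    empty = [threading.Semaphore(1) for _ in range(stages)]
    full = [threading.Semaphore(0) for _ in range(stages)]
    c = [[0] * n for _ in range(m)]

    def producer():
        for t in range(num_tiles):
            s = t % stages
            empty[s].acquire()  # wait until the consumer has released the stage
            lo, hi = t * tile_k, min((t + 1) * tile_k, k)
            a_tile = [row[lo:hi] for row in a]  # M x tile_k slice of A
            b_tile = b[lo:hi]                   # tile_k x N slice of B
            ring[s] = (a_tile, b_tile)
            full[s].release()   # signal that the stage is ready

    def consumer():
        for t in range(num_tiles):
            s = t % stages
            full[s].acquire()   # wait until the producer has filled the stage
            a_tile, b_tile = ring[s]
            for i in range(m):
                for j in range(n):
                    # Accumulate this tile's partial product into C.
                    c[i][j] += sum(
                        a_tile[i][p] * b_tile[p][j] for p in range(len(b_tile))
                    )
            empty[s].release()  # hand the stage back to the producer

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return c


if __name__ == "__main__":
    print(pipelined_matmul([[1, 2, 3], [4, 5, 6]], [[1, 0], [0, 1], [1, 1]]))
```

With more than one stage, the producer can fill the next stage while the consumer is still working on the current one, which is the overlap the pipeline is meant to provide.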
Memory Layout:
- Shared memory is divided into pipeline stages so that loads can overlap with computation
- Each stage contains block tiles that are further divided into warp tiles
- Swizzling may be applied to avoid bank conflicts
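To see why swizzling can help, the sketch below counts how many banks a column read touches with and without an XOR swizzle. The bank count, the row width, and the XOR scheme itself are assumptions chosen for illustration; the module's actual swizzle pattern is not described here.

```python
NUM_BANKS = 32  # assumed bank count, one word per bank
ROW_WORDS = 32  # assumed tile row width in words


def bank(row, col, swizzle):
    """Bank index of logical element (row, col) in a row-major tile."""
    # XOR swizzle: permute columns within a row by the row index.
    phys_col = col ^ (row % ROW_WORDS) if swizzle else col
    return (row * ROW_WORDS + phys_col) % NUM_BANKS


def banks_hit_by_column_read(col, swizzle):
    """Set of banks touched when 32 threads read one column, one row each."""
    return {bank(r, col, swizzle) for r in range(32)}


if __name__ == "__main__":
    print(len(banks_hit_by_column_read(5, swizzle=False)))
    print(len(banks_hit_by_column_read(5, swizzle=True)))
```

Without the swizzle every row of a column maps to the same bank, so the reads serialize; with the XOR permutation each row lands in a different bank.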
Aliases
GlobalTensor
alias GlobalTensor[dtype: DType, layout: Layout] = LayoutTensor[dtype, layout, MutAnyOrigin, address_space=AddressSpace.GLOBAL]
Parameters
- dtype (DType)
- layout (Layout)
Functions
- determine_thread_role: Returns (role, consumer_warp_id within role group).
- get_producer_warp_thread_layout
- lgkm_wait
- run_producer: Generic producer function for loading matrix tiles from global to shared memory.
- smem_tile_layout
- store_c
- validate_config: Validates the configuration parameters for the matrix multiplication kernel.
- warp_specialized_matmul