Mojo module

warp_spec_matmul

AMD Warp-Specialized Matrix Multiplication

Architecture Overview:

Producer warps: Load tiles from global memory into shared memory
- A producers: Load M×K tiles from matrix A
- B producers: Load N×K tiles from matrix B
Consumer warps: Perform matrix multiplication using shared memory tiles
Ring buffer: Coordinates producer-consumer synchronization with barriers
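
The role split above can be sketched on the host in plain Python. This is only an illustration of the idea behind `determine_thread_role` (a role plus an index within that role's group), not the module's implementation: the `assign_role` name, the `Role` enum, and the ordering of warps (A producers first, then B producers, then consumers) are all assumptions made for this example.

```python
from enum import Enum


class Role(Enum):
    # Hypothetical role names for illustration only.
    PRODUCER_A = "producer_a"
    PRODUCER_B = "producer_b"
    CONSUMER = "consumer"


def assign_role(warp_id, a_producers, b_producers, consumers):
    """Map a flat warp index to (role, index within that role's group).

    Assumed ordering: A producers, then B producers, then consumers.
    """
    total = a_producers + b_producers + consumers
    if not 0 <= warp_id < total:
        raise ValueError(f"warp_id {warp_id} outside [0, {total})")
    if warp_id < a_producers:
        return Role.PRODUCER_A, warp_id
    if warp_id < a_producers + b_producers:
        return Role.PRODUCER_B, warp_id - a_producers
    return Role.CONSUMER, warp_id - a_producers - b_producers
```

With one A producer, one B producer, and two consumers, warp 3 maps to the second consumer: `assign_role(3, 1, 1, 2)` returns `(Role.CONSUMER, 1)`.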

Data Flow:

Producers load tiles into shared memory stages
Barriers ensure data is ready before consumers access it
Consumers compute partial results and accumulate
Final results are written back to global memory
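
The four steps above can be mimicked with two CPU threads. The sketch below is a host-side analogy in Python, not the kernel: semaphores stand in for the barriers, a list of slots stands in for the shared memory stages, and `pipelined_matmul` is a hypothetical name. It also assumes B is stored as N×K (matching the N×K tiles listed above), so the result is `C[i][j] = sum over k of A[i][k] * B[j][k]`.

```python
import threading


def pipelined_matmul(a, b, tile_k=2, stages=2):
    """Compute C = A @ B^T, streaming K-tiles through a ring of `stages` slots."""
    m, k, n = len(a), len(a[0]), len(b)
    num_tiles = (k + tile_k - 1) // tile_k
    slots = [None] * stages  # stand-in for shared memory stages
    # empty[s]: slot s may be overwritten; full[s]: slot s holds fresh data.
    empty = [threading.Semaphore(1) for _ in range(stages)]
    full = [threading.Semaphore(0) for _ in range(stages)]
    c = [[0] * n for _ in range(m)]

    def producer():
        for t in range(num_tiles):
            s = t % stages
            empty[s].acquire()  # wait until the consumer released this slot
            lo, hi = t * tile_k, min((t + 1) * tile_k, k)
            slots[s] = ([row[lo:hi] for row in a], [row[lo:hi] for row in b])
            full[s].release()  # signal that the tile is ready

    def consumer():
        for t in range(num_tiles):
            s = t % stages
            full[s].acquire()  # wait until the tile is ready
            a_tile, b_tile = slots[s]
            for i in range(m):
                for j in range(n):
                    c[i][j] += sum(x * y for x, y in zip(a_tile[i], b_tile[j]))
            empty[s].release()  # hand the slot back to the producer

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return c  # stand-in for the final write-back to global memory
```

Because the producer only blocks when every slot is still in use, it can load tile `t + 1` while the consumer is still accumulating tile `t`, which is the overlap the pipeline stages exist to provide.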

Memory Layout:

Shared memory is divided into pipeline stages so that loads into one stage can overlap with computation on another
Each stage contains block tiles that are further divided into warp tiles
Swizzling may be applied to avoid bank conflicts
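
To see why swizzling helps, the following Python sketch counts bank conflicts for a column read from a row-major tile. It is a generic XOR swizzle, not necessarily the pattern this module applies, and the 32-bank, one-word-per-bank model (`NUM_BANKS`) is an assumption chosen for illustration. `row_words` is assumed to be a power of two so the XOR stays inside the row.

```python
NUM_BANKS = 32  # assumed bank count, for illustration only


def bank(row, col, row_words, swizzle):
    """Return the bank that element (row, col) of a row-major tile lands in."""
    # XOR swizzle: permute the column index by the row index within each row.
    phys_col = col ^ (row % row_words) if swizzle else col
    return (row * row_words + phys_col) % NUM_BANKS


def max_conflict(col, rows=32, row_words=32, swizzle=False):
    """Worst-case number of threads hitting one bank when `rows` threads read one column."""
    counts = {}
    for row in range(rows):
        b = bank(row, col, row_words, swizzle)
        counts[b] = counts.get(b, 0) + 1
    return max(counts.values())
```

Under these assumptions, `max_conflict(0)` is 32 (every row's column 0 falls in the same bank), while `max_conflict(0, swizzle=True)` is 1 because the XOR spreads the 32 rows across 32 distinct banks.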

Ring Buffer Configuration:

Uses the SingleCounterSync strategy by default (a single atomic counter per tile)
To reduce contention, switch to SplitCounterSync in the RingBuffer type aliases
The trait-based design allows easy experimentation with different sync strategies
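
The idea of swapping sync strategies behind a common interface can be sketched in Python with a `Protocol`. Everything here is an assumption made for illustration: the method names, the counting scheme, and the reading of "split" as separate producer and consumer counters are not taken from the module's source. The counters are plain integers updated sequentially, so this shows only the bookkeeping, not real atomics.

```python
from typing import Protocol


class SyncStrategy(Protocol):
    # Hypothetical interface; the real trait may differ.
    def can_produce(self, phase: int) -> bool: ...
    def can_consume(self, phase: int) -> bool: ...
    def producer_arrive(self) -> None: ...
    def consumer_arrive(self) -> None: ...


class SingleCounter:
    """One counter per tile, bumped by producers and consumers alike."""

    def __init__(self, producers, consumers):
        self.p, self.c, self.count = producers, consumers, 0

    def can_produce(self, phase):
        return self.count >= phase * (self.p + self.c)

    def can_consume(self, phase):
        return self.count >= phase * (self.p + self.c) + self.p

    def producer_arrive(self):
        self.count += 1

    def consumer_arrive(self):
        self.count += 1


class SplitCounter:
    """Separate counters, so producers and consumers never update the same one."""

    def __init__(self, producers, consumers):
        self.p, self.c = producers, consumers
        self.produced = self.consumed = 0

    def can_produce(self, phase):
        return self.consumed >= phase * self.c

    def can_consume(self, phase):
        return self.produced >= (phase + 1) * self.p

    def producer_arrive(self):
        self.produced += 1

    def consumer_arrive(self):
        self.consumed += 1


def drive(sync: SyncStrategy, producers, consumers, phases):
    """Reuse one tile slot `phases` times and record what the strategy allows."""
    trace = []
    for phase in range(phases):
        trace.append((phase, "free", sync.can_produce(phase), sync.can_consume(phase)))
        for _ in range(producers):
            sync.producer_arrive()
        trace.append((phase, "filled", sync.can_consume(phase), sync.can_produce(phase + 1)))
        for _ in range(consumers):
            sync.consumer_arrive()
    return trace
```

Both strategies yield the same trace from `drive`, which is the point of the shared interface: the ring buffer logic stays unchanged while the counting scheme underneath it is swapped.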

`comptime` values

`GlobalTensor`

comptime GlobalTensor[dtype: DType, layout: Layout] = LayoutTensor[dtype, layout, MutAnyOrigin, address_space=AddressSpace.GLOBAL]

Parameters

dtype (DType):
layout (Layout):

Functions

determine_thread_role: Returns (role, consumer_warp_id within role group).
get_producer_warp_thread_layout:
lgkm_wait:
run_producer: Generic producer function for loading matrix tiles from global to shared memory.
smem_tile_layout:
validate_config: Validates the configuration parameters for the matrix multiplication kernel.
warp_specialized_matmul:
warp_specialized_matmul_kernel:
