Mojo module
matmul_output
Aliases
elementwise_lambda_type
alias elementwise_lambda_type = fn[dtype: DType, width: Int, *, alignment: Int = 1](IndexList[2], mut SIMD[dtype, width]) capturing -> None
Structs
- MMATileCoords: Coordinates for an MMA tile in the output.
- ThreadInfo: Thread identification within the warp group.
Functions
- apply_epilogue_to_output_tile: Apply the epilogue lambda function to the output data.
- calculate_output_tile_bounds: Calculate the output bounds for the current thread block.
- handle_optimized_bfloat16_output: Handle output using st.matrix instructions for optimized bf16 output.
- store_accumulator_fragments_to_shared_memory: Store accumulator fragments from registers to shared memory using st.matrix instructions.
- store_output_tile_direct
- store_output_tile_via_tma: Store output tile to global memory using Tensor Memory Accelerator (TMA).
- write_gemm_output_aligned: Simplified aligned output - N divisible by BN, no column bounds check needed.
- write_gemm_output_to_global_memory: Write matrix multiplication output from registers to global memory.
- write_gemm_output_with_bounds_check: Simplified bounds-checked output - handles arbitrary matrix dimensions.
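
The split between the aligned and bounds-checked output paths comes down to simple tile arithmetic. The sketch below is written in plain Python purely to illustrate that arithmetic; it is not this module's code or API. The helper names (`tile_column_bounds`, `needs_column_bounds_check`, `choose_output_path`) are hypothetical, and it assumes that BN is the tile width along the N (column) dimension of the output.

```python
def tile_column_bounds(block_col: int, bn: int, n: int) -> tuple[int, int]:
    """Return the half-open column range [start, end) covered by one tile.

    Hypothetical helper: block_col is the tile index along N, bn is the
    assumed tile width (BN), and n is the number of output columns (N).
    """
    start = block_col * bn
    end = min(start + bn, n)  # clamp the last tile if it is only partly in range
    return start, end


def needs_column_bounds_check(n: int, bn: int) -> bool:
    """A column check is only needed when N is not a multiple of BN."""
    return n % bn != 0


def choose_output_path(n: int, bn: int) -> str:
    """Pick a label for the output path (labels are illustrative only)."""
    return "bounds_checked" if needs_column_bounds_check(n, bn) else "aligned"


if __name__ == "__main__":
    print(choose_output_path(128, 64))     # N divisible by BN
    print(choose_output_path(100, 64))     # last tile is partial
    print(tile_column_bounds(1, 64, 100))  # clamped range for the partial tile
```

When N is a multiple of BN, every tile is full and the clamp in `tile_column_bounds` never takes effect, which is why an aligned path can skip the per-column check altogether.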