Mojo function

load_tmem_fragments

load_tmem_fragments[accum_type: DType, epilogue_type: DType, frag_size: Int, is_lower_required: Bool, data_paths: Int = 16, bits: Int = 256, repeat: Int = 1](tmem_addr: UInt32) -> Tuple[SIMD[epilogue_type, (frag_size * repeat)], SIMD[epilogue_type, (frag_size * repeat)]]

Loads the upper and lower fragments from tensor memory (TMEM) and casts them to the epilogue type.

This encapsulates a common pattern: load accumulator data from tensor memory, wait for the load to complete, and cast the result to the output type.

Parameters:

accum_type (DType): Accumulator data type (e.g., float32).

epilogue_type (DType): Output data type after casting (e.g., bfloat16).

frag_size (Int): Base fragment size per warp.

is_lower_required (Bool): Whether the lower fragment is needed.

data_paths (Int): Number of TMEM data paths (default 16).

bits (Int): TMEM bit width (default 256).

repeat (Int): Repeat factor for larger fragments (default 1).

Args:

tmem_addr (UInt32): Tensor memory address for this stage.

Returns:

Tuple: A tuple (upper_casted, lower_casted) of SIMD fragments, each of type SIMD[epilogue_type, frag_size * repeat].
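
Example:

The following is a minimal CPU-side sketch of the return shape and the cast step only. It does not touch TMEM and is not the library's implementation. The helper name cast_fragments and its single width parameter (standing in for frag_size * repeat) are hypothetical and exist purely for illustration; only SIMD and its cast method are assumed from the Mojo standard library.

```mojo
# Hypothetical stand-in for illustration; NOT the real load_tmem_fragments.
# It models only the final step: casting two accumulator fragments to the
# epilogue type and returning them as an (upper, lower) tuple.
fn cast_fragments[
    accum_type: DType,
    epilogue_type: DType,
    width: Int,  # assumed to play the role of frag_size * repeat
](
    upper: SIMD[accum_type, width],
    lower: SIMD[accum_type, width],
) -> Tuple[SIMD[epilogue_type, width], SIMD[epilogue_type, width]]:
    # Element-wise cast of each fragment to the output type.
    return (upper.cast[epilogue_type](), lower.cast[epilogue_type]())


fn main():
    # Plain values stand in for accumulator data that the real function
    # would load from tensor memory. Here width = 4 (frag_size 4, repeat 1).
    var upper = SIMD[DType.float32, 4](1.0, 2.0, 3.0, 4.0)
    var lower = SIMD[DType.float32, 4](5.0, 6.0, 7.0, 8.0)

    var result = cast_fragments[DType.float32, DType.bfloat16, 4](upper, lower)

    # result[0] is the upper fragment, result[1] the lower fragment.
    print(result[0])
    print(result[1])
```

In the real function the two fragments come from tmem_addr rather than from arguments, and the width of each returned SIMD value is frag_size * repeat, as stated in the signature.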