Mojo function

group_norm_gpu_multi_block_norm

group_norm_gpu_multi_block_norm[
    OutputLayoutType: TensorLayout,
    output_origin: MutOrigin,
    StatsLayoutType: TensorLayout,
    stats_origin: MutOrigin,
    //,
    dtype: DType,
    simd_width: UInt,
    input_fn: fn[width: Int](row: Int, col: Int) capturing -> SIMD[dtype, width],
    gamma_fn: fn[width: Int](IndexList[1]) capturing -> SIMD[dtype, width],
    beta_fn: fn[width: Int](IndexList[1]) capturing -> SIMD[dtype, width]
](
    output: TileTensor[dtype, OutputLayoutType, output_origin],
    stats: TileTensor[get_accum_type[dtype](), StatsLayoutType, stats_origin],
    epsilon: Scalar[dtype],
    num_groups: Int,
    channels_per_group: Int,
    spatial: Int,
    num_splits: Int,
    group_size: Int
)

Multi-block normalize kernel: reduces partial stats and normalizes.

Grid: num_rows * num_splits blocks. Each block reads all partial statistics for its group, reduces them to the final mean and variance, then normalizes its own chunk of that group's elements.
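The scheme described above can be modelled on the CPU to make the block-to-work mapping concrete. The following is a minimal Python sketch, not the Mojo kernel itself, and it rests on several assumptions that the signature does not confirm: one row corresponds to one (batch, group) pair, group_size equals channels_per_group * spatial, each partial statistic is a (sum, sum of squares) pair written by an earlier pass, and each split covers a contiguous chunk of ceil(group_size / num_splits) elements. The helper names `partial_stats` and `normalize_multi_block` are hypothetical.

```python
import math


def partial_stats(x, num_splits):
    """Stand-in for the earlier pass (assumed): one (sum, sum of squares)
    pair per row and per split. Rows are assumed to be non-empty."""
    stats = []
    for row in x:
        chunk = math.ceil(len(row) / num_splits)
        stats.append([
            (sum(row[s:s + chunk]), sum(v * v for v in row[s:s + chunk]))
            for s in range(0, chunk * num_splits, chunk)
        ])
    return stats


def normalize_multi_block(x, stats, gamma, beta, eps,
                          num_groups, channels_per_group, spatial, num_splits):
    """CPU model of the normalize pass: one loop iteration per GPU block."""
    group_size = channels_per_group * spatial  # assumed relationship
    num_rows = len(x)
    chunk = math.ceil(group_size / num_splits)
    out = [[0.0] * group_size for _ in range(num_rows)]

    for block in range(num_rows * num_splits):
        row, split = divmod(block, num_splits)

        # Every block of a row redundantly reduces all of that row's partials.
        total = sum(s for s, _ in stats[row])
        total_sq = sum(q for _, q in stats[row])
        mean = total / group_size
        var = max(total_sq / group_size - mean * mean, 0.0)  # biased variance
        inv_std = 1.0 / math.sqrt(var + eps)

        # The block then normalizes only its own chunk of the row.
        group = row % num_groups
        start = split * chunk
        end = min(start + chunk, group_size)
        for col in range(start, end):
            channel = group * channels_per_group + col // spatial
            out[row][col] = ((x[row][col] - mean) * inv_std
                             * gamma[channel] + beta[channel])
    return out
```

Repeating the small reduction in every block trades a little redundant work for not needing a separate finalize launch or cross-block synchronization; whether the real kernel stores sums or Welford-style partials is not stated here, so treat the statistic layout as illustrative only.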