Mojo function

causal_conv1d_channel_last_fwd_gpu

def causal_conv1d_channel_last_fwd_gpu[x_dtype: DType, weight_dtype: DType, output_dtype: DType, kNThreads: Int, kWidth: Int, kNElts: Int, bias_dtype: DType, x_LT: TensorLayout, weight_LT: TensorLayout, output_LT: TensorLayout, bias_LT: TensorLayout](batch: Int, dim: Int, seqlen: Int, width: Int, x: TileTensor[x_dtype, x_LT, MutUntrackedOrigin], weight: TileTensor[weight_dtype, weight_LT, MutUntrackedOrigin], output: TileTensor[output_dtype, output_LT, MutUntrackedOrigin], bias: TileTensor[bias_dtype, bias_LT, MutUntrackedOrigin], x_batch_stride: UInt32, x_c_stride: UInt32, x_l_stride: UInt32, weight_c_stride: UInt32, weight_width_stride: UInt32, out_batch_stride: UInt32, out_c_stride: UInt32, out_l_stride: UInt32, silu_activation: Int8)

Optimized causal conv1d implementation for channel-last data layout, using SIMD operations.

Key optimizations:

- SIMD vectorization for input/output operations across channels
- Efficient memory access patterns with coalesced loads using vectorized tensor views
- Vectorized weight loading and computation
- Chunked processing of multiple sequence positions per thread
- Optimized activation function with SIMD operations
- Better thread utilization and memory bandwidth usage
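
As a point of reference for what such a kernel computes, the following is a minimal, unoptimized sketch of a causal 1D convolution over channel-last data, written in plain Python rather than Mojo. It is not the kernel's implementation. The function name `causal_conv1d_channel_last_ref`, the nested-list tensors, and the tap ordering (the last weight tap multiplies the current position, earlier taps multiply earlier positions, and positions before the start of the sequence count as zero) are assumptions based on the common depthwise causal conv1d convention; this page does not state them. The optional SiLU step corresponds loosely to the `silu_activation` flag.

```python
import math


def silu(v):
    # SiLU (swish): v * sigmoid(v), written to avoid overflow for large |v|.
    if v >= 0:
        return v / (1.0 + math.exp(-v))
    e = math.exp(v)
    return v * e / (1.0 + e)


def causal_conv1d_channel_last_ref(x, weight, bias=None, silu_activation=False):
    """Hypothetical reference: x is [B][L][C], weight is [C][W], bias is [C] or None."""
    B, L, C = len(x), len(x[0]), len(x[0][0])
    W = len(weight[0])
    out = [[[0.0] * C for _ in range(L)] for _ in range(B)]
    for b in range(B):
        for l in range(L):
            for c in range(C):
                acc = bias[c] if bias is not None else 0.0
                for w in range(W):
                    # Assumed tap ordering: tap W-1 reads position l,
                    # tap 0 reads position l - (W - 1).
                    src = l - (W - 1) + w
                    if src >= 0:  # causal: positions before 0 contribute nothing
                        acc += weight[c][w] * x[b][src][c]
                out[b][l][c] = silu(acc) if silu_activation else acc
    return out
```

Each output position depends only on the current and previous `W - 1` positions of the same channel, which is what makes the convolution causal and depthwise.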

For channel-last layout (B, L, C), we reshape to (B*L, C) to enable vectorized operations along channels, and process multiple sequence positions per thread.
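
The reshape described above can be illustrated with flat offsets. This is a sketch in plain Python, assuming a contiguous (B, L, C) buffer; the helper names are hypothetical, and the stride names only mirror the `x_batch_stride`, `x_l_stride`, and `x_c_stride` parameters in the signature. Under that assumption, the strided offset of element (b, l, c) equals its offset in a row-major (B*L, C) view with row index `b * L + l`, so a run of adjacent channels in one row is a single contiguous slice, which is what allows a vectorized load.

```python
def channel_last_strides(seqlen, dim):
    # Element strides of a contiguous (B, L, C) tensor (an assumption;
    # the kernel takes strides as arguments and may accept other layouts).
    batch_stride = seqlen * dim
    l_stride = dim
    c_stride = 1
    return batch_stride, l_stride, c_stride


def flat_offset(b, l, c, batch_stride, l_stride, c_stride):
    # Offset of element (b, l, c) in the flat buffer.
    return b * batch_stride + l * l_stride + c * c_stride


def load_channel_vector(buf, row, c0, n_elts, dim):
    # Treat buf as a (B*L, C) matrix and read n_elts adjacent channels
    # starting at channel c0 of the given row as one contiguous slice.
    start = row * dim + c0
    return buf[start:start + n_elts]
```

For example, with `B = 2`, `L = 3`, `C = 4`, element `(1, 2, 0)` sits in row `1 * 3 + 2 = 5` of the reshaped view, and `load_channel_vector(buf, 5, 0, 4, 4)` returns all four channels of that position. The equivalence holds only while `l_stride == C` and `batch_stride == L * C`; other strides would need the general `flat_offset` form.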