Mojo function

causal_conv1d_channel_last_fwd_gpu

causal_conv1d_channel_last_fwd_gpu[x_dtype: DType, weight_dtype: DType, output_dtype: DType, kNThreads: Int, kWidth: Int, kNElts: Int, bias_dtype: DType, x_LT: TensorLayout, weight_LT: TensorLayout, output_LT: TensorLayout, bias_LT: TensorLayout](batch: Int, dim: Int, seqlen: Int, width: Int, x: TileTensor[x_dtype, x_LT, MutExternalOrigin], weight: TileTensor[weight_dtype, weight_LT, MutExternalOrigin], output: TileTensor[output_dtype, output_LT, MutExternalOrigin], bias: TileTensor[bias_dtype, bias_LT, MutExternalOrigin], x_batch_stride: UInt32, x_c_stride: UInt32, x_l_stride: UInt32, weight_c_stride: UInt32, weight_width_stride: UInt32, out_batch_stride: UInt32, out_c_stride: UInt32, out_l_stride: UInt32, silu_activation: Int8)

Optimized GPU implementation of the causal conv1d forward pass for channel-last data layout, using SIMD operations.
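To make the operation concrete, the following is a plain-Python reference sketch of what a causal conv1d over a channel-last tensor computes. It is not the Mojo kernel and makes no claim about its internals. The function name `causal_conv1d_channel_last_ref` is hypothetical, and two points are assumptions rather than documented behavior: the convolution is taken to be depthwise (one filter of length `width` per channel, as the `weight_c_stride` and `weight_width_stride` parameters suggest), and the last filter tap is taken to multiply the current position, with positions before the start of the sequence treated as zero.

```python
import math


def silu(v: float) -> float:
    """SiLU activation, v * sigmoid(v), written to avoid overflow in exp()."""
    if v >= 0.0:
        return v / (1.0 + math.exp(-v))
    e = math.exp(v)
    return v * e / (1.0 + e)


def causal_conv1d_channel_last_ref(x, weight, bias=None, silu_activation=False):
    """Reference causal depthwise conv1d (hypothetical helper, not the kernel).

    x:      nested list shaped [batch][seqlen][dim]  (channel-last)
    weight: nested list shaped [dim][width]
    bias:   list of length dim, or None

    Assumption: weight[c][width - 1] multiplies the current position and
    earlier taps reach back in time; positions before index 0 count as zero.
    """
    batch, seqlen, dim = len(x), len(x[0]), len(x[0][0])
    width = len(weight[0])
    out = [[[0.0] * dim for _ in range(seqlen)] for _ in range(batch)]
    for b in range(batch):
        for l in range(seqlen):
            for c in range(dim):
                acc = bias[c] if bias is not None else 0.0
                for k in range(width):
                    src = l - (width - 1) + k  # never reads past position l
                    if src >= 0:
                        acc += weight[c][k] * x[b][src][c]
                out[b][l][c] = silu(acc) if silu_activation else acc
    return out


if __name__ == "__main__":
    x = [[[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]]  # batch=1, seqlen=3, dim=2
    weight = [[1.0, 2.0], [0.5, 0.0]]              # dim=2, width=2
    print(causal_conv1d_channel_last_ref(x, weight, bias=[0.0, 1.0]))
```

A reference like this is mainly useful for checking an optimized kernel's output on small inputs, since each output element depends only on the current and earlier positions of the same channel.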

Key optimizations:

  1. SIMD vectorization for input/output operations across channels
  2. Efficient memory access patterns with coalesced loads using vectorized tensor views
  3. Vectorized weight loading and computation
  4. Chunked processing of multiple sequence positions per thread
  5. Optimized activation function with SIMD operations
  6. Better thread utilization and memory bandwidth usage
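The chunked, channel-vectorized work split described in the list above can be pictured with a small Python sketch. It only illustrates the general tiling idea and is not taken from the kernel source: `chunk_len` and `vec_width` are hypothetical names, and how they relate to the `kNThreads` and `kNElts` parameters is not documented on this page. The one property the sketch relies on is general to any causal convolution: a tile that produces outputs for positions `l_start` to `l_end - 1` must also read the `width - 1` positions before `l_start`.

```python
def tiles(seqlen, dim, chunk_len, vec_width):
    """Yield (l_start, l_end, c_start, c_end) work tiles for one batch entry.

    Each tile covers up to chunk_len consecutive sequence positions and up to
    vec_width adjacent channels (both names are hypothetical).
    """
    for l_start in range(0, seqlen, chunk_len):
        l_end = min(l_start + chunk_len, seqlen)
        for c_start in range(0, dim, vec_width):
            c_end = min(c_start + vec_width, dim)
            yield (l_start, l_end, c_start, c_end)


def input_range(l_start, l_end, width):
    """Sequence positions a tile must read to produce outputs [l_start, l_end).

    A causal filter of length `width` looks back width - 1 positions, so the
    read range starts earlier than the write range (clamped at position 0).
    """
    return (max(0, l_start - (width - 1)), l_end)


if __name__ == "__main__":
    for tile in tiles(seqlen=5, dim=8, chunk_len=2, vec_width=4):
        print(tile, "reads positions", input_range(tile[0], tile[1], width=4))
```

Processing several positions per tile lets neighboring outputs share most of their input window, which is one plausible reason chunking reduces memory traffic.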

For the channel-last layout (B, L, C), the tensor is reshaped to (B*L, C) so that operations can be vectorized along the channel dimension, and each thread processes multiple sequence positions.
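The reshape above amounts to an index calculation, sketched here in plain Python. The sketch assumes a densely packed (B, L, C) buffer with strides counted in elements. The kernel itself takes `x_batch_stride`, `x_l_stride`, and `x_c_stride` as explicit arguments, so a caller's actual values may differ from the contiguous ones computed here. All helper names are hypothetical.

```python
def contiguous_channel_last_strides(seqlen, dim):
    """Element strides (batch, l, c) for a densely packed (B, L, C) buffer."""
    return (seqlen * dim, dim, 1)


def flat_offset(b, l, c, batch_stride, l_stride, c_stride):
    """Offset of element (b, l, c) in the underlying flat buffer."""
    return b * batch_stride + l * l_stride + c * c_stride


def row_of(b, l, seqlen):
    """Row index of (b, l) in the reshaped (B*L, C) view."""
    return b * seqlen + l


if __name__ == "__main__":
    seqlen, dim = 3, 4
    strides = contiguous_channel_last_strides(seqlen, dim)
    b, l, c = 1, 2, 3
    # In the packed case, both ways of addressing the element agree.
    print(flat_offset(b, l, c, *strides), row_of(b, l, seqlen) * dim + c)
```

In the packed case the channels of one (b, l) position are adjacent in memory, which is what makes a single vector load along C possible.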