Mojo function
causal_conv1d_channel_last_fwd_gpu
causal_conv1d_channel_last_fwd_gpu[x_dtype: DType, weight_dtype: DType, output_dtype: DType, kNThreads: Int, kWidth: Int, kNElts: Int, bias_dtype: DType, x_LT: TensorLayout, weight_LT: TensorLayout, output_LT: TensorLayout, bias_LT: TensorLayout](batch: Int, dim: Int, seqlen: Int, width: Int, x: TileTensor[x_dtype, x_LT, MutExternalOrigin], weight: TileTensor[weight_dtype, weight_LT, MutExternalOrigin], output: TileTensor[output_dtype, output_LT, MutExternalOrigin], bias: TileTensor[bias_dtype, bias_LT, MutExternalOrigin], x_batch_stride: UInt32, x_c_stride: UInt32, x_l_stride: UInt32, weight_c_stride: UInt32, weight_width_stride: UInt32, out_batch_stride: UInt32, out_c_stride: UInt32, out_l_stride: UInt32, silu_activation: Int8)
Optimized causal conv1d implementation for the channel-last data layout, using SIMD operations.
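As a point of reference, the computation can be sketched in plain Python. This is not the Mojo kernel. It assumes the convention used by common causal conv1d implementations: the convolution is depthwise (one filter per channel), output position l depends only on inputs at positions l - (width - 1) through l, positions before the start of the sequence count as zero, and SiLU is applied when `silu_activation` is non-zero. The function name and the nested-list layout are illustrative only.

```python
import math


def causal_conv1d_channel_last_ref(x, weight, bias=None, silu_activation=False):
    """Scalar reference for a depthwise causal conv1d on channel-last data.

    x:      nested lists shaped (batch, seqlen, dim)
    weight: nested lists shaped (dim, width)
    bias:   list of length dim, or None
    Assumed convention: weight[c][width - 1] multiplies the current position.
    """
    batch, seqlen, dim = len(x), len(x[0]), len(x[0][0])
    width = len(weight[0])
    out = [[[0.0] * dim for _ in range(seqlen)] for _ in range(batch)]
    for b in range(batch):
        for l in range(seqlen):
            for c in range(dim):
                acc = bias[c] if bias is not None else 0.0
                for k in range(width):
                    src = l - (width - 1) + k  # never reads past position l
                    if src >= 0:  # positions before the sequence start are zero
                        acc += weight[c][k] * x[b][src][c]
                if silu_activation:
                    acc = acc / (1.0 + math.exp(-acc))  # SiLU: v * sigmoid(v)
                out[b][l][c] = acc
    return out


# Example: batch=1, seqlen=3, dim=2, width=2.
x = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]
weight = [[1.0, 10.0], [0.5, 2.0]]
bias = [0.0, 1.0]
out = causal_conv1d_channel_last_ref(x, weight, bias)
```

A slow reference like this is mainly useful for checking the output of an optimized kernel on small inputs.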
Key optimizations:
- SIMD vectorization for input/output operations across channels
- Efficient memory access patterns with coalesced loads using vectorized tensor views
- Vectorized weight loading and computation
- Chunked processing of multiple sequence positions per thread
- Optimized activation function with SIMD operations
- Better thread utilization and memory bandwidth usage
For the channel-last layout (B, L, C), where B is `batch`, L is `seqlen`, and C is `dim`, the input is reshaped to (B*L, C) to enable vectorized operations along channels, and each thread processes multiple sequence positions.
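The reshape described above can be illustrated with a small Python sketch. It assumes a contiguous channel-last buffer, so element (b, l, c) sits at flat offset (b*L + l)*C + c. The parameter `n_elts` is a hypothetical stand-in for the `kNElts` vector width, and the helper names are illustrative, not part of the Mojo API.

```python
def flat_row(b, l, seqlen):
    """Row index of (b, l) in the (B*L, C) view of a (B, L, C) tensor."""
    return b * seqlen + l


def channel_chunks(dim, n_elts):
    """Split the channel axis into contiguous groups of up to n_elts channels.

    Each group stands in for one vectorized load or store along channels.
    """
    return [(start, min(start + n_elts, dim)) for start in range(0, dim, n_elts)]


def load_chunk(flat, row, start, stop, dim):
    """Read channels [start, stop) of one row from a contiguous flat buffer."""
    base = row * dim
    return flat[base + start : base + stop]


# Example: B=2, L=3, C=4 stored contiguously in channel-last order.
B, L, C = 2, 3, 4
flat = list(range(B * L * C))
row = flat_row(1, 2, L)        # last position of the second batch item
chunks = channel_chunks(C, 2)  # two groups of two channels each
values = [load_chunk(flat, row, s, e, C) for s, e in chunks]
```

Because channels are the innermost dimension, the channels in one group are adjacent in memory, which is what makes loads along the channel axis contiguous. Note that row - 1 is the previous sequence position only when it belongs to the same batch item, so a causal window must not cross a batch boundary.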