
Mojo module

causal_conv1d

Core Causal Conv1D Kernel Implementations.

This module provides CPU and GPU kernel implementations for causal 1D convolution, supporting both channel-first and channel-last memory layouts.

Causal Convolution: In causal convolution, the output at position i depends only on inputs at positions [i - width + 1, i]. This ensures no information leakage from future positions, making it suitable for autoregressive sequence modeling.

Mathematical form for width=4:
    out[i] = sum(x[i-3:i+1] * w[0:4]) + bias  (with boundary handling)
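
The formula above can be illustrated with a minimal single-channel reference sketch in Python. It is not the Mojo kernel: the function name `causal_conv1d_ref` is hypothetical, and it assumes that "boundary handling" means inputs before position 0 are treated as zero.

```python
def causal_conv1d_ref(x, w, bias=0.0):
    """Reference causal conv for one channel (illustrative, not the Mojo kernel).

    x: list of floats of length L; w: list of floats of length `width`.
    Assumption: positions before 0 contribute zero (left zero-padding).
    """
    width = len(w)
    out = []
    for i in range(len(x)):
        acc = bias
        for k in range(width):
            j = i - (width - 1) + k  # input position paired with w[k]
            if j >= 0:               # skip positions before the sequence start
                acc += x[j] * w[k]
        out.append(acc)
    return out


x = [1.0, 2.0, 3.0, 4.0, 5.0]
w = [0.5, 0.25, 0.125, 1.0]
y = causal_conv1d_ref(x, w, bias=0.1)

# Causality check: changing the last input leaves earlier outputs untouched.
y_changed = causal_conv1d_ref(x[:4] + [99.0], w, bias=0.1)
```

Under these assumptions, `y[i]` reads only `x[i-3]` through `x[i]`, so `y_changed[:4]` equals `y[:4]`.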

Kernel Categories:

1. Forward Kernels (CPU & GPU):
    - `causal_conv1d_channel_first_fwd_cpu[_no_bias]`
    - `causal_conv1d_channel_last_fwd_cpu[_no_bias][_with_seq_idx]`
    - `causal_conv1d_channel_first_fwd_gpu[_no_bias][_with_seq_idx]`
    - `causal_conv1d_channel_last_fwd_gpu[_no_bias][_with_seq_idx]`

    SIMD-vectorized implementations with compile-time width specialization.
    Supported widths: 1, 2, 3, 4.

2. Update Kernels (for autoregressive decode):
    - `causal_conv1d_update_cpu[_no_bias]`
    - `causal_conv1d_update_gpu[_no_bias]`

    Incremental update operations that maintain conv state for efficient
    autoregressive token generation.
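
As a rough illustration of the incremental update idea, the Python sketch below processes one token at a time for a single channel. The name `causal_conv1d_update_ref` and the state layout (a list holding the last `width` inputs, oldest first) are assumptions made for this example; the actual kernels may store the state differently.

```python
def causal_conv1d_update_ref(state, x_new, w, bias=0.0):
    """One decode step for a single channel (illustrative sketch).

    state: the last `width` inputs, oldest first (assumed layout);
    it is updated in place. Returns the output for the new token.
    """
    state.pop(0)         # drop the oldest input
    state.append(x_new)  # append the newest input
    return bias + sum(s * wk for s, wk in zip(state, w))


w = [0.5, 0.25, 0.125, 1.0]
state = [0.0] * len(w)  # a zero state corresponds to zero left-padding
outs = [causal_conv1d_update_ref(state, x, w, bias=0.1)
        for x in [1.0, 2.0, 3.0, 4.0]]
```

Each step costs `width` multiply-adds per channel regardless of how many tokens have been generated, which is the motivation for keeping a conv state instead of recomputing the full forward pass.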

Memory Layouts:

- Channel-first (B, C, L): Standard layout; positions are contiguous within each channel.
- Channel-last (B, L, C): Channels are contiguous at each position; used by some frameworks.
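
To make the two layouts concrete, the sketch below computes flat element offsets, assuming dense row-major storage with no padding between rows. The helper names are hypothetical and are not part of the module.

```python
def offset_channel_first(b, c, l, C, L):
    """Flat offset of element (b, c, l) in a dense row-major (B, C, L) buffer."""
    return (b * C + c) * L + l


def offset_channel_last(b, c, l, C, L):
    """Flat offset of element (b, c, l) in a dense row-major (B, L, C) buffer."""
    return (b * L + l) * C + c


C, L = 3, 5
# Channel-first: moving one step along the sequence moves one element.
step_l_first = offset_channel_first(0, 1, 2, C, L) - offset_channel_first(0, 1, 1, C, L)
# Channel-last: moving one step along the channels moves one element,
# while one step along the sequence moves C elements.
step_c_last = offset_channel_last(0, 2, 1, C, L) - offset_channel_last(0, 1, 1, C, L)
step_l_last = offset_channel_last(0, 1, 2, C, L) - offset_channel_last(0, 1, 1, C, L)
```

Under this assumption, the unit-stride dimension is L for (B, C, L) and C for (B, L, C), which determines which axis a kernel can load with contiguous vector reads.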

GPU Optimization Parameters:

- kNThreads=128: Threads per block for sequence processing.
- kNElts=4: Elements processed per thread, for better instruction-level parallelism (ILP).
- SIMD width 4: Vectorized weight loading for width=4 kernels.
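
The arithmetic implied by these parameters can be sketched as follows. The two constants come from the description above; the assumption that each block covers `kNThreads * kNElts` consecutive sequence positions, and the helper name `blocks_for_sequence`, are illustrative guesses rather than documented behavior.

```python
K_N_THREADS = 128  # threads per block (from the module description)
K_N_ELTS = 4       # elements processed per thread (from the module description)


def blocks_for_sequence(seq_len):
    """Blocks needed to cover seq_len positions.

    Assumption: one block handles K_N_THREADS * K_N_ELTS consecutive positions.
    """
    chunk = K_N_THREADS * K_N_ELTS          # 512 positions per block
    return (seq_len + chunk - 1) // chunk   # ceiling division


blocks_short = blocks_for_sequence(512)
blocks_long = blocks_for_sequence(513)
```

With these values a block would cover 512 positions, so a 513-position sequence would need a second, mostly idle block under this tiling assumption.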

Activation Support:

- None: Direct convolution output.
- SiLU: Sigmoid Linear Unit activation (x * sigmoid(x)).
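
For reference, SiLU as defined above can be written in a few lines of Python. This is a scalar sketch for checking values, not the vectorized kernel code; the branch on the sign of `x` is an implementation choice made here to avoid overflow in `math.exp`.

```python
import math


def silu(x):
    """SiLU: x * sigmoid(x), written to avoid overflow for large |x|."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)  # safe: x < 0, so e is in (0, 1)
    return x * e / (1.0 + e)


conv_out = [-2.0, 0.0, 1.0, 3.0]
activated = [silu(v) for v in conv_out]
```

SiLU passes large positive values through almost unchanged and pushes large negative values toward zero, with a smooth transition around the origin.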

Functions

causal_conv1d_channel_first_fwd_cpu: CPU implementation of causal conv1d for channel-first layout with bias.
causal_conv1d_channel_first_fwd_cpu_no_bias: CPU implementation of causal conv1d for channel-first layout without bias.
causal_conv1d_channel_first_fwd_gpu: Optimized GPU implementation of causal conv1d for channel-first layout with bias.
causal_conv1d_channel_first_fwd_gpu_no_bias: Optimized causal conv1d implementation for channel-first data layout using SIMD operations (no bias).
causal_conv1d_channel_first_fwd_gpu_no_bias_with_seq_idx: Optimized causal conv1d implementation for channel-first data layout using SIMD operations (no bias) with seq_idx support.
causal_conv1d_channel_first_fwd_gpu_with_seq_idx: Optimized causal conv1d implementation for channel-first data layout using SIMD operations with seq_idx support.
causal_conv1d_channel_last_fwd_cpu: Optimized CPU implementation of causal conv1d for channel-last layout with bias.
causal_conv1d_channel_last_fwd_cpu_no_bias: Optimized CPU implementation of causal conv1d for channel-last layout without bias.
causal_conv1d_channel_last_fwd_cpu_no_bias_with_seq_idx: Optimized CPU implementation of causal conv1d for channel-last layout without bias, with seq_idx support.
causal_conv1d_channel_last_fwd_cpu_with_seq_idx: Optimized CPU implementation of causal conv1d for channel-last layout with seq_idx support.
causal_conv1d_channel_last_fwd_gpu: Optimized causal conv1d implementation for channel-last data layout using SIMD operations.
causal_conv1d_channel_last_fwd_gpu_no_bias: Optimized causal conv1d implementation for channel-last data layout using SIMD operations (no bias).
causal_conv1d_channel_last_fwd_gpu_no_bias_with_seq_idx: Optimized causal conv1d implementation for channel-last data layout using SIMD operations (no bias) with seq_idx support.
causal_conv1d_channel_last_fwd_gpu_with_seq_idx: Optimized causal conv1d implementation for channel-last data layout using SIMD operations with seq_idx support.
causal_conv1d_update_cpu: CPU implementation of causal conv1d update for incremental inference.
causal_conv1d_update_cpu_no_bias: CPU implementation of causal conv1d update without bias.
causal_conv1d_update_gpu: GPU kernel for causal conv1d update operation (for autoregressive decode).
causal_conv1d_update_gpu_no_bias: GPU kernel for causal conv1d update operation without bias (for autoregressive decode).
