Attention mask
An attention mask specifies which tokens in a sequence a model can attend to during attention score computation. This prevents the model from attending to tokens it should ignore. For example, when sequences in a batch are padded to the same length, an attention mask prevents the model from attending to padding tokens, which carry no meaningful information.
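A padding mask can be sketched in a few lines of NumPy. The token ids and the pad id below are illustrative, not tied to any particular tokenizer; the convention shown (1 = attend, 0 = ignore) is the one commonly used by transformer libraries:

```python
import numpy as np

# Toy batch: two sequences padded to length 5 with a hypothetical pad id of 0.
batch = np.array([
    [7, 2, 9, 0, 0],   # real length 3, two padding tokens
    [4, 5, 6, 8, 1],   # real length 5, no padding
])
pad_id = 0

# 1 where the model may attend, 0 at padding positions.
attention_mask = (batch != pad_id).astype(np.int64)
print(attention_mask)
# [[1 1 1 0 0]
#  [1 1 1 1 1]]
```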
Causal mask
In transformer models, self-attention (a specific form of attention where a sequence attends to itself) allows every token to attend to all other tokens simultaneously, with no inherent notion of order. Autoregressive language models, however, must generate tokens sequentially, meaning each token is conditioned only on preceding tokens. The causal mask (also called a look-ahead mask) resolves this tension by preventing the self-attention layer from attending to future tokens, ensuring that each token's representation incorporates information only from tokens at previous positions.
Concretely, the causal mask is a matrix, added to the raw attention scores, that sets the entries at future positions to negative infinity. After the softmax operation, those positions receive an attention weight of exactly zero, blocking information flow from later tokens to earlier ones.
The causal mask is essential during training, where the model processes entire sequences in parallel and must be prevented from attending to tokens it should be predicting. The same constraint applies during inference at the context encoding (also called prefill) phase, where all input tokens are likewise processed in parallel. Without the causal mask, information from later tokens would corrupt the representations of earlier tokens, producing attention scores that differ from what the model learned during training.
Note that during the decode phase of inference, the causal mask is effectively redundant: the model generates one token at a time and attends only to the KV cache of previously-seen tokens, so there are no future tokens to mask.
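The decode-phase situation can be illustrated with a small sketch. The shapes and values below are illustrative assumptions (a single query vector attending over a toy KV cache), showing why no mask is required when every cached position lies in the past:

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)  # numerical stability
    e = np.exp(x)
    return e / e.sum()

d = 8
rng = np.random.default_rng(0)
k_cache = rng.standard_normal((5, d))  # cached keys for 5 previously seen tokens
v_cache = rng.standard_normal((5, d))  # cached values for the same tokens
q = rng.standard_normal(d)             # query for the one new token being decoded

# The single query row attends over the whole cache: every position is a past
# token, so there is nothing to mask out.
weights = softmax(k_cache @ q / np.sqrt(d))
out = weights @ v_cache
print(weights.shape, out.shape)  # (5,) (8,)
```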