Context encoding
Context encoding is the first phase of inference in a transformer model (also known as the "prefill" stage). During context encoding, the model processes the tokenized input sequence in parallel, computing attention scores for every token. As a byproduct of this computation, the model populates the KV cache with the key and value vectors for each input token, so they don't need to be recomputed during subsequent token generation.
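As a rough illustration of the prefill step, here is a minimal single-head attention sketch in NumPy (not any particular framework's API): all input tokens are projected to queries, keys, and values in one batch, causal attention is computed over the whole sequence, and the keys and values are kept as the KV cache. The weight matrices and dimensions are hypothetical placeholders.

```python
import numpy as np

def prefill(x, w_q, w_k, w_v):
    """Process all prompt tokens at once; return outputs and the KV cache.

    x: (seq_len, d_model) embedded input tokens.
    """
    # Project every input token to queries, keys, and values in parallel.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d = q.shape[-1]
    # Causal mask: token i may only attend to tokens 0..i.
    scores = q @ k.T / np.sqrt(d)
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores[mask] = -np.inf
    # Numerically stable softmax over each row.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    out = weights @ v
    # The keys and values become the KV cache for the decode phase.
    return out, (k, v)

# Hypothetical toy dimensions, for illustration only.
rng = np.random.default_rng(0)
d_model = 8
x = rng.standard_normal((5, d_model))
w_q, w_k, w_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))
out, (k_cache, v_cache) = prefill(x, w_q, w_k, w_v)
print(out.shape, k_cache.shape)  # (5, 8) (5, 8)
```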
After context encoding, the model enters the autoregressive decode phase, generating one token at a time. Each new token only needs to compute attention against the existing KV cache rather than reprocessing the entire input, which is what makes generation after the first token comparatively fast.
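A decode step, by contrast, only projects the single newest token and attends against the cached keys and values. The sketch below continues the same toy single-head setup (names and dimensions are hypothetical, not a real framework's API):

```python
import numpy as np

def decode_step(x_new, kv_cache, w_q, w_k, w_v):
    """Generate attention output for one new token using the KV cache.

    x_new: (d_model,) embedding of the single newest token.
    """
    k_cache, v_cache = kv_cache
    q = x_new @ w_q
    # Only the new token's key/value are computed; prior ones come from the cache.
    k_cache = np.vstack([k_cache, x_new @ w_k])
    v_cache = np.vstack([v_cache, x_new @ w_v])
    # One row of attention scores: new token vs. every cached position.
    scores = k_cache @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    out = weights @ v_cache
    return out, (k_cache, v_cache)

# Hypothetical toy setup: a 5-token cache, then one decode step.
rng = np.random.default_rng(0)
d_model = 8
w_q, w_k, w_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))
kv_cache = (rng.standard_normal((5, d_model)), rng.standard_normal((5, d_model)))
x_new = rng.standard_normal(d_model)
out, (k_cache, v_cache) = decode_step(x_new, kv_cache, w_q, w_k, w_v)
print(out.shape, k_cache.shape)  # (8,) (6, 8)
```

Note that the cache grows by one row per generated token, which is why KV-cache memory scales with sequence length.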
Context encoding is typically the most computationally expensive phase because attention must be computed between every pair of input tokens, so its cost grows roughly quadratically with prompt length. Although this work parallelizes well across thousands of GPU threads, it is still the primary contributor to time-to-first-token (TTFT) latency.
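A back-of-the-envelope comparison makes the asymmetry concrete. Assuming a hypothetical 2,048-token prompt, prefill computes a full n-by-n matrix of attention scores, while each decode step computes only one row against the cache:

```python
# Hypothetical prompt length; numbers are illustrative only.
n = 2048
prefill_scores = n * n   # attention score pairs computed during prefill
decode_scores = n + 1    # scores for one new token against the cache
print(prefill_scores, decode_scores)  # 4194304 2049
```

That factor of roughly n between the two phases is why prefill dominates TTFT even though per-token decoding is fast.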