Context encoding
Also known as "prefill," context encoding is the first phase of inference in a transformer model: it converts the input into a cached numerical representation (the KV cache) and predicts the first output token. It occurs after the input has already been tokenized (preprocessed).
Context encoding is followed by the autoregressive token-generation phase, which produces one token at a time. Without the KV cache built during context encoding, the model would have to recompute the self-attention keys and values for every token in the original input each time it predicts a new token.
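The toy sketch below illustrates this idea with a single attention head in NumPy (the dimensions, weight matrices, and variable names are hypothetical, and the causal mask is omitted for brevity): prefill computes keys and values for the whole prompt once, and each decode step only appends one new entry to the cache and scores a single query against it.

```python
# Minimal sketch of a KV cache for one attention head (toy example, not a real library API).
import numpy as np

d = 8                                         # head dimension (hypothetical)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attend(q, K, V):
    """Scaled dot-product attention for query rows q against cached K, V."""
    scores = q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# --- Prefill: process the whole prompt at once and build the KV cache ---
prompt = rng.normal(size=(5, d))              # 5 already-embedded prompt tokens
K_cache, V_cache = prompt @ Wk, prompt @ Wv
prefill_out = attend(prompt @ Wq, K_cache, V_cache)   # attention over all 5 tokens

# --- Decode: one new token reuses the cache instead of recomputing it ---
new_tok = rng.normal(size=(1, d))             # embedding of the newly generated token
K_cache = np.vstack([K_cache, new_tok @ Wk])
V_cache = np.vstack([V_cache, new_tok @ Wv])
decode_out = attend(new_tok @ Wq, K_cache, V_cache)   # scores computed for 1 query only
print(prefill_out.shape, decode_out.shape)            # (5, 8) (1, 8)
```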
Context encoding is usually the most computationally expensive phase of LLM inference, because it must calculate attention scores for every token in the input sequence. Although this work can be parallelized across thousands of GPU threads (all input tokens are known up front and can be processed at once), it is still a significant contributor to time-to-first-token (TTFT) latency. The model can usually produce subsequent tokens much faster than the first one, because each round of token generation calculates attention scores for only one token (the new one) against the cached keys and values.
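A rough back-of-the-envelope comparison (with hypothetical model and prompt sizes, counting only attention-score multiply-adds) shows why prefill dominates TTFT: its cost grows quadratically with prompt length, while each decode step grows only linearly.

```python
# Rough arithmetic sketch with assumed numbers; projections and the MLP are ignored.
prompt_len, d_head, n_heads = 2048, 128, 32

# Prefill: every one of the prompt_len queries scores every key -> quadratic in length.
prefill_score_flops = 2 * prompt_len * prompt_len * d_head * n_heads

# Decode: the single new query scores only the cached keys -> linear in length.
decode_score_flops = 2 * 1 * prompt_len * d_head * n_heads

print(f"prefill ~ {prefill_score_flops / 1e9:.1f} GFLOPs per layer, "
      f"decode ~ {decode_score_flops / 1e6:.1f} MFLOPs per layer")
# prefill ~ 34.4 GFLOPs per layer, decode ~ 16.8 MFLOPs per layer
```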