Continuous batching

A batching technique that continuously dispatches inference requests to the GPU for token generation, dramatically improving GPU utilization. Continuous batching can start executing a new batch even before the previous batch finishes its pass through the model, because it schedules work at the "token level."

That is, because large language models (LLMs) generate responses one token at a time, inference (the token generation phase) consists of a repeated cycle of forward passes, and a new batch can jump in to utilize the GPU between passes, even before earlier requests have finished generating their responses. That is what it means to operate at the "token level": the batch scheduler focuses on keeping the GPU busy with token generation at all times, instead of waiting for the previous batch to finish completely.
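
To make the idea concrete, here is a minimal sketch of a token-level scheduling loop. It is an illustration only, not the API of any particular serving framework; names such as `Request`, `fake_forward_pass`, `MAX_BATCH_SIZE`, and `MAX_NEW_TOKENS` are placeholders invented for this example.

```python
from collections import deque

MAX_BATCH_SIZE = 8    # illustrative limit on concurrent requests
MAX_NEW_TOKENS = 16   # illustrative per-request generation cap

class Request:
    def __init__(self, prompt):
        self.prompt = prompt
        self.generated = []   # tokens produced so far
        self.finished = False

def fake_forward_pass(batch):
    """Stand-in for one forward pass of the model: every request in
    the batch receives exactly one new token per iteration."""
    for req in batch:
        req.generated.append("<tok>")
        if len(req.generated) >= MAX_NEW_TOKENS:
            req.finished = True

def continuous_batching_loop(waiting: deque):
    running = []
    while waiting or running:
        # Token-level scheduling: before *every* forward pass, fill any
        # free batch slots with waiting requests instead of waiting for
        # the whole running batch to finish its responses.
        while waiting and len(running) < MAX_BATCH_SIZE:
            running.append(waiting.popleft())

        fake_forward_pass(running)   # one token per running request

        # Finished requests leave immediately, freeing their slots for
        # the next iteration; the others keep generating uninterrupted.
        running = [r for r in running if not r.finished]

# Usage: requests arriving at different times are continuously merged
# into the same running batch.
queue = deque(Request(f"prompt {i}") for i in range(20))
continuous_batching_loop(queue)
```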

This is sometimes called "in-flight batching" when context encoding (prefill) and token generation (decode) requests are combined into the same batch.
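
As a rough sketch of that distinction, an in-flight batching step packs both kinds of work into a single batch. The function below reuses the placeholder `fake_forward_pass` from the sketch above and is likewise hypothetical rather than the API of any real framework.

```python
def in_flight_step(prefill_requests, decode_requests):
    """Hypothetical in-flight batching step: context-encoding (prefill)
    work for newly admitted requests and token-generation (decode) work
    for already-running requests are packed into one batch and executed
    together, rather than in separate prefill-only and decode-only batches."""
    batch = prefill_requests + decode_requests
    fake_forward_pass(batch)  # one pass serves both phases
    return [r for r in batch if not r.finished]
```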