Batching
The process of combining multiple inference requests into a single forward pass through the model, thus executing multiple requests simultaneously and improving computational efficiency. To account for requests with varying sequence lengths, it's common to use techniques such as padding (to standardize lengths) or ragged tensors (to handle variable lengths directly), as in the sketch below.
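The following is a minimal sketch of the padding approach, assuming PyTorch; the token IDs and the commented-out model call are hypothetical placeholders, and the point is simply how variable-length requests become one rectangular batch plus an attention mask.

```python
# Minimal sketch: pad variable-length requests into a single batch tensor.
import torch
from torch.nn.utils.rnn import pad_sequence

def make_padded_batch(token_id_lists, pad_id=0):
    """Combine variable-length token sequences into one padded batch."""
    seqs = [torch.tensor(ids, dtype=torch.long) for ids in token_id_lists]
    # Pad every sequence to the length of the longest one in the batch.
    input_ids = pad_sequence(seqs, batch_first=True, padding_value=pad_id)
    # Mask marks real tokens (1) vs. padding (0) so the model can ignore padding.
    attention_mask = (input_ids != pad_id).long()
    return input_ids, attention_mask

# Example: three requests of different lengths become one (3, 5) batch.
batch, mask = make_padded_batch([[5, 8, 2], [7, 1], [9, 3, 4, 6, 2]])
# A single forward pass now serves all three requests, e.g.:
# logits = model(input_ids=batch, attention_mask=mask)
```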
Batch sizes can be either static or dynamic. Whereas static batching uses a fixed batch size and thus waits until the system receives a specific number of inference requests before sending them into the model, dynamic batching uses a flexible batch size. For example, dynamic batching may send a batch of requests to the model as soon as the batch either reaches a certain number of requests (the batch size limit) or hits a timeout threshold, as in the sketch below.
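Here is a rough sketch of that dynamic batching policy, assuming a simple thread-safe queue of incoming requests; `run_model` and the request format are hypothetical stand-ins, and the limits are illustrative.

```python
# Sketch of a dynamic batcher: flush when the batch fills up or when the
# oldest waiting request has been held for too long.
import time
from queue import Queue, Empty

MAX_BATCH_SIZE = 8        # batch size limit
MAX_WAIT_SECONDS = 0.01   # timeout threshold

def batching_loop(request_queue: Queue, run_model):
    while True:
        batch = [request_queue.get()]            # block until at least one request arrives
        deadline = time.monotonic() + MAX_WAIT_SECONDS
        # Keep adding requests until the batch is full or the timeout expires.
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=remaining))
            except Empty:
                break
        run_model(batch)                         # one forward pass serves the whole batch
```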
Dynamic batching can become considerably more sophisticated than that, with additional techniques that keep GPUs busy rather than waiting for one batch to finish before starting another. One such strategy for large language models (LLMs) is continuous batching.
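The sketch below illustrates the general idea behind continuous batching, not any particular serving framework's implementation: the scheduler works one decoding step at a time, evicting finished sequences and admitting waiting requests between steps. `decode_step`, `is_finished`, and the request objects are hypothetical placeholders.

```python
# Rough sketch of continuous batching: the active batch membership changes
# between decoding steps instead of staying fixed until every request finishes.
MAX_ACTIVE = 16  # maximum number of requests decoded together

def continuous_batching_loop(waiting, decode_step, is_finished):
    active = []
    while waiting or active:
        # Admit new requests into any free slots before the next step.
        while waiting and len(active) < MAX_ACTIVE:
            active.append(waiting.pop(0))
        # One forward pass generates the next token for every active request.
        decode_step(active)
        # Completed requests leave immediately, freeing slots for new arrivals.
        active = [req for req in active if not is_finished(req)]
```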