Python module
pipeline
HF Token Generation Pipeline
ModelOutputs
class max.pipelines.pipeline.ModelOutputs(next_token_logits: Tensor, logits: Tensor | None = None)
logits
logits: Tensor | None
Logits for the entire token sequence.
next_token_logits
next_token_logits: Tensor
Logits for just the next token.
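A minimal construction sketch follows; the batch size, vocabulary size, and the use of Tensor.from_numpy are illustrative assumptions, not requirements of ModelOutputs.

```python
import numpy as np
from max.driver import Tensor
from max.pipelines.pipeline import ModelOutputs

# Hypothetical next-token logits for a batch of 2 over a 4-token vocabulary.
next_logits = Tensor.from_numpy(np.zeros((2, 4), dtype=np.float32))
outputs = ModelOutputs(next_token_logits=next_logits)  # logits defaults to None
```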
PipelineModel
class max.pipelines.pipeline.PipelineModel(pipeline_config: PipelineConfig, session: InferenceSession)
A pipeline model with setup, input preparation and execution methods.
compute_log_probabilities()
compute_log_probabilities(model_inputs: Sequence[Tensor], model_outputs: ModelOutputs, next_tokens: Tensor, batch_top_n: list[int], batch_echo: list[bool]) → list[max.pipelines.response.LogProbabilities | None] | None
Optional method that can be overridden to compute log probabilities.
Parameters:
- model_inputs – Inputs to the model returned by prepare_*_token_inputs().
- model_outputs – Outputs returned by execute().
- next_tokens – Sampled tokens. Should have shape=[batch size].
- batch_top_n – Number of top log probabilities to return per input in the batch. For any element where top_n == 0, the LogProbabilities output is skipped.
- batch_echo – Whether to include input tokens in the returned log probabilities.
Returns:
List of log probabilities, one entry per input in the batch, or None if none were computed.
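As a rough illustration of the math an override might perform, the NumPy helper below computes top-n log probabilities from next-token logits. It is deliberately not tied to the MAX API; a real override would wrap each per-input result in max.pipelines.response.LogProbabilities.

```python
import numpy as np

def top_n_log_probs(next_token_logits: np.ndarray, batch_top_n: list[int]):
    """Per batch element, return a {token_id: log_prob} map for the top-n
    tokens, or None where top_n == 0 (mirroring the skipped entries)."""
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = next_token_logits - next_token_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

    results = []
    for row, top_n in zip(log_probs, batch_top_n):
        if top_n == 0:
            results.append(None)
            continue
        top_ids = np.argsort(row)[::-1][:top_n]
        results.append({int(t): float(row[t]) for t in top_ids})
    return results
```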
estimate_kv_cache_size()
abstract estimate_kv_cache_size() → int
Estimates the size of the kv cache in bytes.
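A hedged sketch of such an estimate is shown below; the geometry fields read from the config (layer count, head count, head size, batch and sequence limits) are illustrative placeholders, not the real PipelineConfig attributes.

```python
def estimate_kv_cache_size(self) -> int:
    cfg = self.pipeline_config            # assumed to be stored by __init__
    bytes_per_value = 2                   # assuming float16/bfloat16 cache entries
    return (
        2                                 # one tensor each for keys and values
        * cfg.num_layers                  # placeholder attribute names
        * cfg.max_cache_batch_size
        * cfg.max_length
        * cfg.num_kv_heads
        * cfg.head_dim
        * bytes_per_value
    )
```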
estimate_memory_footprint()
estimate_memory_footprint()
execute()
abstract execute(*model_inputs: Tensor) → ModelOutputs
Runs the graph.
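A minimal sketch of an override, assuming the Model returned by load_model() is stored on self.model and returns its output tensors positionally with the next-token logits first; both assumptions should be checked against the actual model graph.

```python
def execute(self, *model_inputs: Tensor) -> ModelOutputs:
    outputs = self.model.execute(*model_inputs)        # assumed attribute and call style
    return ModelOutputs(next_token_logits=outputs[0])  # assumed output ordering
```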
load_kv_manager()
abstract load_kv_manager(session: InferenceSession) → KVCacheManager
Provided a PipelineConfig and InferenceSession, load the kv manager.
load_model()
abstract load_model(session: InferenceSession) → Model
Provided a PipelineConfig and InferenceSession, build and load the model graph.
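A minimal sketch, assuming the graph-building step lives in a hypothetical helper and that InferenceSession.load() compiles the graph into a Model.

```python
def load_model(self, session: InferenceSession) -> Model:
    graph = self._build_graph(self.pipeline_config)  # hypothetical helper, not part of the API
    return session.load(graph)
```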
prepare_initial_token_inputs()
abstract prepare_initial_token_inputs(context_batch: Sequence[T]) → tuple[max.driver.tensor.Tensor, ...]
Prepares the initial inputs to be passed to .execute().
The inputs and functionality of this method can vary per model. For example, the model inputs could include:
- Encoded tensors
- A unique ID for each tensor if this model uses a KV cache manager.
This function would batch the encoded tensors, claim a slot in the kv cache if the ID hasn’t been seen before, and return the inputs and caches as a list of tensors.
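A hedged sketch of that flow: the context attributes (cache_seq_id, next_tokens) and the kv_manager claim/fetch calls are illustrative placeholders, not the documented MAX interfaces.

```python
import numpy as np
from max.driver import Tensor

def prepare_initial_token_inputs(self, context_batch):
    # Batch the encoded prompt tokens (placeholder attribute on the context).
    tokens = np.concatenate([ctx.next_tokens for ctx in context_batch])

    # Claim a KV-cache slot for sequences not seen before (placeholder calls).
    for ctx in context_batch:
        if ctx.cache_seq_id is None:
            ctx.cache_seq_id = self.kv_manager.claim(1)[0]

    kv_inputs = self.kv_manager.fetch([ctx.cache_seq_id for ctx in context_batch])
    return (Tensor.from_numpy(tokens), *kv_inputs)
```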
prepare_next_token_inputs()
abstract prepare_next_token_inputs(next_tokens: Tensor, prev_model_inputs: tuple[max.driver.tensor.Tensor, ...]) → tuple[max.driver.tensor.Tensor, ...]
Prepares the secondary inputs to be passed to .execute().
While prepare_initial_token_inputs is responsible for managing the initial inputs, this function is responsible for updating those inputs for each subsequent step in a multi-step execution pattern.
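A minimal sketch: reuse the cache-related inputs from the previous step and swap in the freshly sampled tokens. Treating the token tensor as the first element of the input tuple is an assumption for illustration, not part of the documented contract.

```python
def prepare_next_token_inputs(self, next_tokens, prev_model_inputs):
    _, *cache_inputs = prev_model_inputs   # assumed layout: (tokens, *kv inputs)
    return (next_tokens, *cache_inputs)
```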
TextGenerationPipeline
class max.pipelines.pipeline.TextGenerationPipeline(pipeline_config: PipelineConfig, pipeline_model: Type[PipelineModel], eos_token_id: int)
Generalized token generator pipeline.
next_token()
next_token(batch: dict[str, T], num_steps: int = 1) → list[dict[str, Any]]
Provided a batch, processes the inputs, executes the graph for num_steps in a multi-step scenario, then decodes the tokens and returns the list of decoded tokens.
release()
release(context: T) → None
Mark the context as complete, releasing the cache slot from the KV manager.
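A hedged end-to-end usage sketch: the config, pipeline-model class, tokenizer, and the structure of the request contexts in batch are placeholders for whatever your deployment provides.

```python
from max.pipelines.pipeline import TextGenerationPipeline

pipeline = TextGenerationPipeline(
    pipeline_config=my_config,          # a PipelineConfig built elsewhere
    pipeline_model=MyPipelineModel,     # your PipelineModel subclass
    eos_token_id=tokenizer.eos_token_id,
)

# Run up to 8 decoding steps for every request in the batch.
responses = pipeline.next_token(batch, num_steps=8)

# Once a request has finished, free its KV-cache slot.
for context in batch.values():
    pipeline.release(context)
```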