pipeline
HF Token Generation Pipeline
KVCacheMixin
class max.pipelines.pipeline.KVCacheMixin(*args, **kwargs)
estimate_kv_cache_size()
Estimates the size of the KV cache in bytes.
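For intuition, here is a minimal, self-contained sketch of the kind of arithmetic an estimate_kv_cache_size() implementation might perform. The helper name and all model dimensions below are illustrative assumptions, not part of the MAX API:

```python
def estimate_kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                            max_seq_len: int, max_batch_size: int,
                            dtype_bytes: int = 2) -> int:
    # Keys and values (factor of 2) for every layer, head, and token slot.
    return (2 * n_layers * n_kv_heads * head_dim
            * max_seq_len * max_batch_size * dtype_bytes)

# A Llama-3-8B-like configuration at bfloat16 (illustrative numbers):
size = estimate_kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                               max_seq_len=4096, max_batch_size=16)
print(f"{size / 2**30:.1f} GiB")  # 8.0 GiB
```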
load_kv_manager()
load_kv_manager(session: InferenceSession, available_cache_memory: int) → KVCacheManager
Given a PipelineConfig and an InferenceSession, loads the KV manager.
Parameters:
- session – Inference session used to compile and initialize the KV cache.
- available_cache_memory – Amount of memory available to the KV cache, in bytes.
Returns:
Either a single KV cache manager or a tuple of KV cache managers, one per input modality.
Return type:
KVCacheManager
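The following structural sketch shows how a model might combine PipelineModel with this mixin. The class body is hypothetical, the import path for InferenceSession is assumed, and the actual KVCacheManager construction is model-specific:

```python
from max.engine import InferenceSession  # assumed import path
from max.pipelines.pipeline import KVCacheMixin, PipelineModel


class MyPipelineModel(PipelineModel, KVCacheMixin):
    """Hypothetical model wiring in its own KV cache management."""

    def estimate_kv_cache_size(self) -> int:
        # Keys + values for every layer/head/token slot; placeholder sizes
        # (see the arithmetic sketched under estimate_kv_cache_size() above).
        return 2 * 32 * 8 * 128 * 4096 * 16 * 2

    def load_kv_manager(self, session: InferenceSession,
                        available_cache_memory: int):
        # Construct whichever KVCacheManager subclass this model needs,
        # sized to fit within available_cache_memory and compiled
        # against the provided session.
        raise NotImplementedError("manager construction is model-specific")
```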
ModelOutputs
class max.pipelines.pipeline.ModelOutputs(next_token_logits: 'Tensor | None' = None, logits: 'Tensor | None' = None)
logits
Logits for the entire token sequence.
next_token_logits
Logits for just the next token.
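To make the relationship between the two fields concrete, here is a small runnable sketch using NumPy arrays as stand-ins for Tensor; the [batch, seq, vocab] shapes are assumptions:

```python
import numpy as np

from max.pipelines.pipeline import ModelOutputs

batch, seq_len, vocab = 2, 5, 100
full = np.random.rand(batch, seq_len, vocab)  # stand-in for a Tensor of logits

# `logits` covers every position; `next_token_logits` is just the last one.
outputs = ModelOutputs(next_token_logits=full[:, -1, :], logits=full)
assert np.array_equal(outputs.next_token_logits, outputs.logits[:, -1, :])
```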
PipelineModel
class max.pipelines.pipeline.PipelineModel(pipeline_config: PipelineConfig, session: InferenceSession)
A pipeline model with setup, input preparation, and execution methods.
compute_log_probabilities()
compute_log_probabilities(model_inputs: Iterable[Any], model_outputs: ModelOutputs, next_tokens: Tensor, batch_top_n: list[int], batch_echo: list[bool]) → list[max.pipelines.response.LogProbabilities | None] | None
Optional method that can be overridden to compute log probabilities.
Parameters:
- model_inputs – Inputs to the model returned by prepare_*_token_inputs().
- model_outputs – Outputs returned by execute().
- next_tokens – Sampled tokens. Should have shape [batch size].
- batch_top_n – Number of top log probabilities to return per input in the batch. For any element where top_n == 0, the corresponding LogProbabilities output is skipped.
- batch_echo – Whether to include input tokens in the returned log probabilities.
Returns:
List of log probabilities, one per batch element (None where skipped), or None if the model does not support computing log probabilities.
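The per-element contract is easiest to see in a self-contained sketch. Below, plain dicts stand in for max.pipelines.response.LogProbabilities (whose exact fields are not shown here), and the shapes follow the parameter descriptions above:

```python
import numpy as np

def log_softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def log_probabilities_sketch(next_token_logits: np.ndarray,  # [batch, vocab]
                             next_tokens: np.ndarray,        # [batch]
                             batch_top_n: list[int]) -> list[dict | None]:
    logprobs = log_softmax(next_token_logits)
    results: list[dict | None] = []
    for i, top_n in enumerate(batch_top_n):
        if top_n == 0:
            results.append(None)  # skipped, per batch_top_n above
            continue
        top_ids = np.argsort(logprobs[i])[::-1][:top_n]
        results.append({
            "sampled": float(logprobs[i, next_tokens[i]]),
            "top": {int(t): float(logprobs[i, t]) for t in top_ids},
        })
    return results

out = log_probabilities_sketch(np.random.rand(3, 100),
                               np.array([7, 42, 99]),
                               batch_top_n=[2, 0, 1])
assert len(out) == 3 and out[1] is None
```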
estimate_memory_footprint()
estimate_memory_footprint() → int
Estimates the memory consumption of the engine and returns the remaining space available, in bytes, for storing the KV cache.
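As a rough illustration of the accounting this method describes, here is a self-contained sketch; the safety margin and all byte counts are assumptions, not values used by MAX:

```python
def available_kv_cache_memory(total_device_bytes: int,
                              weights_bytes: int,
                              activation_bytes: int,
                              safety_fraction: float = 0.9) -> int:
    # Whatever the engine does not need for weights and activations,
    # less a safety margin, is left over for the KV cache.
    budget = int(total_device_bytes * safety_fraction)
    return max(0, budget - weights_bytes - activation_bytes)

# e.g. a 24 GiB device holding 16 GiB of weights and ~1 GiB of activations:
free = available_kv_cache_memory(24 * 2**30, 16 * 2**30, 1 * 2**30)
print(f"{free / 2**30:.1f} GiB for KV cache")  # ~4.6 GiB
```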