
Python module

pipeline

HF Token Generation Pipeline

ModelOutputs

class max.pipelines.pipeline.ModelOutputs(next_token_logits: 'Tensor', logits: 'Tensor | None' = None)

logits

logits: Tensor | None = None

Logits for the entire token sequence.

next_token_logits

next_token_logits: Tensor

Logits for just the next token.
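
The sketch below illustrates how these two fields are meant to be read; the helper name and its argument are placeholders, not part of this API.

    from max.pipelines.pipeline import ModelOutputs

    def pick_sampler_input(outputs: ModelOutputs):
        # next_token_logits is always populated.
        sampler_input = outputs.next_token_logits
        # logits is optional: it is only set when the model also returned
        # logits for the entire token sequence.
        if outputs.logits is not None:
            full_sequence_logits = outputs.logits  # e.g. for echoed log probabilities
        return sampler_input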

PipelineModel

class max.pipelines.pipeline.PipelineModel(pipeline_config: PipelineConfig, session: InferenceSession)

A pipeline model with setup, input preparation and execution methods.

compute_log_probabilities()

compute_log_probabilities(model_inputs: Sequence[Tensor], model_outputs: ModelOutputs, next_tokens: Tensor, batch_top_n: list[int], batch_echo: list[bool]) → list[max.pipelines.response.LogProbabilities | None] | None

Optional method that can be overridden to compute log probabilities.

  • Parameters:

    • model_inputs – Inputs to the model returned by prepare_*_token_inputs().
    • model_outputs – Outputs returned by execute().
    • next_tokens – Sampled tokens. Should have shape [batch_size].
    • batch_top_n – Number of top log probabilities to return per input in the batch. For any element where top_n == 0, no LogProbabilities output is produced for that input.
    • batch_echo – Whether to include input tokens in the returned log probabilities.
  • Returns:

    List of log probabilities.
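
A subclass that wants log probabilities can override this hook. The sketch below is hypothetical: it assumes next_token_logits can be viewed as a [batch, vocab] NumPy array, and it leaves the construction of LogProbabilities objects as a comment because that class's fields are documented elsewhere.

    import numpy as np

    def compute_log_probabilities(self, model_inputs, model_outputs,
                                  next_tokens, batch_top_n, batch_echo):
        # Assumption: next_token_logits is viewable as a [batch, vocab] array.
        logits = np.asarray(model_outputs.next_token_logits)
        results = []
        for i, (top_n, echo) in enumerate(zip(batch_top_n, batch_echo)):
            if top_n == 0:
                # Per the parameter docs, this batch element is skipped.
                results.append(None)
                continue
            # Numerically stable log-softmax over the vocabulary.
            shifted = logits[i] - logits[i].max()
            log_probs = shifted - np.log(np.exp(shifted).sum())
            top_ids = np.argsort(log_probs)[-top_n:][::-1]
            # A real override would build a max.pipelines.response.LogProbabilities
            # from log_probs, top_ids, and (when echo is True) the prompt tokens;
            # a placeholder is appended here instead.
            results.append(None)
        return results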

estimate_kv_cache_size()

abstract estimate_kv_cache_size(available_cache_memory: int) → int

Estimates the size of the KV cache in bytes.

estimate_memory_footprint()

estimate_memory_footprint() → int

Calculates the estimated memory consumption of the engine and returns the estimated space available for storing the KV cache.

execute()

abstract execute(*model_inputs: Tensor) → ModelOutputs

Runs the graph.
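
A hypothetical shape for an execute override; it assumes the subclass stored the object returned by load_model on self.model and that the graph's first output holds the next-token logits, both of which are illustrative assumptions rather than documented behavior.

    def execute(self, *model_inputs):
        # Assumption: self.model is the loaded Model and exposes an
        # execute-style call; the exact signature may differ.
        graph_outputs = self.model.execute(*model_inputs)
        # Illustrative convention: output 0 = next-token logits, optional
        # output 1 = logits for the entire sequence.
        full_logits = graph_outputs[1] if len(graph_outputs) > 1 else None
        return ModelOutputs(next_token_logits=graph_outputs[0], logits=full_logits)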

load_kv_manager()

abstract load_kv_manager(session: InferenceSession, available_cache_memory: int) → KVCacheManager

Provided a PipelineConfig and InferenceSession, load the KV cache manager.
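
Taken together with the estimation hooks above, a setup flow might wire these methods as in the hypothetical sketch below; the pipeline's actual initialization code is not shown on this page.

    # Hypothetical setup flow: the footprint estimate yields the memory left
    # for the KV cache, which then feeds the cache-size estimate and manager.
    available_cache_memory = pipeline_model.estimate_memory_footprint()
    kv_cache_bytes = pipeline_model.estimate_kv_cache_size(available_cache_memory)
    kv_manager = pipeline_model.load_kv_manager(session, available_cache_memory)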

load_model()

abstract load_model(session: InferenceSession) → Model

Provided a PipelineConfig and InferenceSession, build and load the model graph.

prepare_initial_token_inputs()

abstract prepare_initial_token_inputs(context_batch: Sequence[T]) → tuple[max.driver.tensor.Tensor, ...]

Prepares the initial inputs to be passed to .execute().

The inputs and functionality of this method can vary per model. For example, the model inputs could include:

  • Encoded tensors
  • A unique ID for each tensor, if this model uses a KV cache manager.

This function would batch the encoded tensors, claim a slot in the KV cache if the ID hasn’t been seen before, and return the inputs and caches as a list of tensors.
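
That description amounts to a batch-and-claim loop. The skeleton below is purely illustrative; the context attributes (tokens, cache_seq_id) and the helpers (self.kv_manager.claim, self._claimed, self._batch) are invented for the example and are not part of this API.

    def prepare_initial_token_inputs(self, context_batch):
        token_batches = []
        cache_ids = []
        for context in context_batch:
            # Hypothetical attributes: encoded tokens plus a unique ID used
            # by the KV cache manager.
            token_batches.append(context.tokens)
            if context.cache_seq_id not in self._claimed:
                # Claim a cache slot the first time this ID is seen.
                self.kv_manager.claim(context.cache_seq_id)  # hypothetical call
                self._claimed.add(context.cache_seq_id)
            cache_ids.append(context.cache_seq_id)
        # Batch the encoded tensors and return inputs plus cache state as a
        # tuple of tensors (stacking/padding details elided).
        return self._batch(token_batches, cache_ids)  # hypothetical helper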

prepare_next_token_inputs()

abstract prepare_next_token_inputs(next_tokens: Tensor, prev_model_inputs: tuple[max.driver.tensor.Tensor, ...]) → tuple[max.driver.tensor.Tensor, ...]

Prepares the secondary inputs to be passed to .execute().

While prepare_initial_token_inputs is responsible for managing the initial inputs, this function is responsible for updating the inputs for each step in a multi-step execution pattern.
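
The hypothetical driver loop below shows how the two prepare methods hand off to each other across steps; sample() is a stand-in for whatever sampling the pipeline applies to the logits.

    # Hypothetical multi-step loop over a PipelineModel instance.
    model_inputs = pipeline_model.prepare_initial_token_inputs(context_batch)
    for _ in range(num_steps):
        outputs = pipeline_model.execute(*model_inputs)
        next_tokens = sample(outputs.next_token_logits)  # stand-in sampler
        model_inputs = pipeline_model.prepare_next_token_inputs(next_tokens, model_inputs)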

TextGenerationPipeline

class max.pipelines.pipeline.TextGenerationPipeline(pipeline_config: PipelineConfig, pipeline_model: Type[PipelineModel], eos_token_id: int)

Generalized token generator pipeline.

next_token()

next_token(batch: dict[str, T], num_steps: int = 1) → list[dict[str, Any]]

Provided a batch, processes the batch inputs, executes the graph for num_steps in a multi-step scenario, then decodes the tokens holistically and returns the list of decoded tokens.

release()

release(context: T) → None

Mark the context as complete, releasing the cache slot from the KV manager.
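
A hedged end-to-end usage sketch. The PipelineConfig, the PipelineModel subclass, the context object, and the batch key are placeholders; how contexts are created and tokenized is outside this page.

    pipeline = TextGenerationPipeline(
        pipeline_config=config,          # a PipelineConfig built elsewhere
        pipeline_model=MyPipelineModel,  # a PipelineModel subclass
        eos_token_id=eos_token_id,
    )
    # Run four decoding steps for one request; keys are request identifiers
    # and values are the pipeline's context objects (placeholders here).
    responses = pipeline.next_token({"request-0": context}, num_steps=4)
    # responses is a list of per-step dicts of decoded results.
    pipeline.release(context)            # free the request's KV cache slot when done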