Python module

pipeline

HF Token Generation Pipeline

KVCacheMixin

class max.pipelines.pipeline.KVCacheMixin(*args, **kwargs)

estimate_kv_cache_size()

estimate_kv_cache_size(available_cache_memory: int) → int

Estimates the size of the KV cache in bytes, given the memory available to it.
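
As a rough illustration of what such an estimate can involve, the sketch below computes the bytes a transformer KV cache would need and caps the result at the available memory. The function name and every parameter other than available_cache_memory are hypothetical and not part of this API; the actual estimate depends on the model architecture and cache strategy.

```python
def estimate_kv_cache_bytes(
    available_cache_memory: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    max_seq_len: int,
    max_batch_size: int,
    bytes_per_element: int = 2,  # assumption: a 16-bit dtype
) -> int:
    """Hypothetical estimate: bytes for keys and values, capped at the budget."""
    # One key vector and one value vector per token, per layer, per KV head.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    requested = bytes_per_token * max_seq_len * max_batch_size
    # Never claim more than the memory that is actually available.
    return min(requested, available_cache_memory)
```

With 2 layers, 4 KV heads, a head dimension of 8 and 2-byte elements, each token costs 256 bytes, so a single 16-token sequence needs 4096 bytes.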

load_kv_manager()

load_kv_manager(session: InferenceSession, available_cache_memory: int) → KVCacheManager

Provided a PipelineConfig and InferenceSession, loads the KV manager.

  • Parameters:

    • session – Inference session to compile and init the KV cache.
    • available_cache_memory – Amount of memory available to the KV cache, in bytes.
  • Returns:

    Either a single KV cache manager or a tuple of KV cache managers, one per input modality.

  • Return type:

    KVCacheManager

ModelOutputs

class max.pipelines.pipeline.ModelOutputs(next_token_logits: 'Tensor | None' = None, logits: 'Tensor | None' = None)

logits

logits: Tensor | None = None

Logits for the entire token sequence.

next_token_logits

next_token_logits: Tensor | None = None

Logits for just the next token.

PipelineModel

class max.pipelines.pipeline.PipelineModel(pipeline_config: PipelineConfig, session: InferenceSession)

A pipeline model with setup, input preparation and execution methods.

compute_log_probabilities()

compute_log_probabilities(model_inputs: Iterable[Any], model_outputs: ModelOutputs, next_tokens: Tensor, batch_top_n: list[int], batch_echo: list[bool]) → list[max.pipelines.response.LogProbabilities | None] | None

Optional method that can be overridden to compute log probabilities.

  • Parameters:

    • model_inputs – Inputs to the model returned by prepare_*_token_inputs().
    • model_outputs – Outputs returned by execute().
    • next_tokens – Sampled tokens. Should have shape [batch size].
    • batch_top_n – Number of top log probabilities to return per input in the batch. For any element where top_n == 0, the LogProbabilities is skipped.
    • batch_echo – Whether to include input tokens in the returned log probabilities.
  • Returns:

    List of log probabilities.
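
To make the top-n behavior concrete, here is a minimal sketch in plain Python, using lists of floats in place of tensors. The helper names (log_softmax, top_n_log_probs) and the dict-based result are assumptions for illustration only; the real method returns LogProbabilities objects, and echo handling is omitted here.

```python
import math
from typing import Optional


def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-softmax over one row of logits."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_sum for x in logits]


def top_n_log_probs(
    batch_logits: list[list[float]],
    next_tokens: list[int],
    batch_top_n: list[int],
) -> list[Optional[dict[int, float]]]:
    """Hypothetical helper: per batch element, map token id -> log probability."""
    results: list[Optional[dict[int, float]]] = []
    for logits, token, top_n in zip(batch_logits, next_tokens, batch_top_n):
        if top_n == 0:
            # Assumption: a skipped element is represented as None.
            results.append(None)
            continue
        log_probs = log_softmax(logits)
        ranked = sorted(range(len(log_probs)), key=lambda i: log_probs[i], reverse=True)
        picked = {i: log_probs[i] for i in ranked[:top_n]}
        # Always include the sampled token, even if it falls outside the top n.
        picked[token] = log_probs[token]
        results.append(picked)
    return results
```

Including the sampled token alongside the top n is a design choice of this sketch, not something the reference above specifies.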

estimate_memory_footprint()

estimate_memory_footprint() → int

Estimates the memory consumption of the engine and returns the estimated space available to store the KV cache.

execute()

abstract execute(model_inputs: Any) → ModelOutputs

Runs the graph.

prepare_initial_token_inputs()

abstract prepare_initial_token_inputs(context_batch: Sequence[T]) → Iterable[Any]

Prepares the initial inputs to be passed to .execute().

The inputs and functionality of this method can vary per model. For example, the model inputs could include:

  • Encoded tensors
  • A unique ID for each tensor, if this model uses a KV cache manager.

In that case, this function would batch the encoded tensors, claim a slot in the KV cache for each ID that hasn’t been seen before, and return the inputs and caches as a list of tensors.
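
The batching and slot-claiming steps described above could look roughly like the following sketch. ToyCacheSlots and prepare_initial_inputs are hypothetical stand-ins written for this example, with token lists in place of tensors; they are not the KVCacheManager API.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCacheSlots:
    """Hypothetical stand-in for a KV cache manager: maps context IDs to slots."""

    max_slots: int
    slots: dict[str, int] = field(default_factory=dict)

    def claim(self, context_id: str) -> int:
        # Reuse the slot if this ID has been seen before.
        if context_id in self.slots:
            return self.slots[context_id]
        free = set(range(self.max_slots)) - set(self.slots.values())
        if not free:
            raise RuntimeError("no free cache slots")
        self.slots[context_id] = min(free)
        return self.slots[context_id]


def prepare_initial_inputs(
    batch: dict[str, list[int]], cache: ToyCacheSlots, pad_id: int = 0
) -> tuple[list[list[int]], list[int]]:
    """Pad token lists to a common length and claim one cache slot per context."""
    longest = max(len(tokens) for tokens in batch.values())
    padded = [tokens + [pad_id] * (longest - len(tokens)) for tokens in batch.values()]
    slot_ids = [cache.claim(context_id) for context_id in batch]
    return padded, slot_ids
```

Right-padding with a pad ID is only one way to batch ragged sequences; a real model may use a different layout.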

prepare_next_token_inputs()

abstract prepare_next_token_inputs(next_tokens: Tensor, prev_model_inputs: Any) → Any

Prepares the secondary inputs to be passed to .execute().

While prepare_initial_token_inputs is responsible for managing the initial inputs, this function is responsible for updating the inputs for each subsequent step in a multi-step execution pattern.

TextGenerationPipeline

class max.pipelines.pipeline.TextGenerationPipeline(pipeline_config: PipelineConfig, pipeline_model: Type[PipelineModel], eos_token_id: int)

Generalized token generator pipeline.

next_token()

next_token(batch: dict[str, T], num_steps: int = 1) → list[dict[str, Any]]

Given a batch, processes the batch inputs, executes the graph for num_steps in a multi-step scenario, then decodes all generated tokens together and returns the list of decoded tokens.
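
The multi-step flow can be sketched as a loop over prepare, execute and sample. Everything below is a toy under stated assumptions: run_multi_step and toy_execute are hypothetical names, sampling is greedy, token lists stand in for tensors, and the result holds one dict of context ID to token per step, which may not match the actual layout of the returned list.

```python
from typing import Callable


def run_multi_step(
    prompts: dict[str, list[int]],
    execute: Callable[[list[list[int]]], list[list[float]]],
    num_steps: int = 1,
) -> list[dict[str, int]]:
    """Hypothetical multi-step loop: one {context id: token} dict per step."""
    ids = list(prompts)
    # Initial inputs: the full token sequence for each context.
    inputs = [list(prompts[i]) for i in ids]
    steps: list[dict[str, int]] = []
    for _ in range(num_steps):
        logits = execute(inputs)
        # Greedy sampling (an assumption): take the highest-scoring token.
        next_tokens = [max(range(len(row)), key=row.__getitem__) for row in logits]
        steps.append(dict(zip(ids, next_tokens)))
        # Next-step inputs: only the newly sampled token per context,
        # on the assumption that earlier tokens live in the KV cache.
        inputs = [[token] for token in next_tokens]
    return steps


def toy_execute(inputs: list[list[int]]) -> list[list[float]]:
    """Toy stand-in for a model: favors (last token + 1) mod 4."""
    vocab = 4
    return [
        [1.0 if t == (seq[-1] + 1) % vocab else 0.0 for t in range(vocab)]
        for seq in inputs
    ]
```

Running several steps per call amortizes the per-call overhead across tokens, which is the usual motivation for a multi-step pattern.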

release()

release(context: T) → None

Mark the context as complete, releasing the cache slot from the KV manager.