
Python module

pipeline

MAX pipeline for model inference and generation (Text Generation variant).

BatchInfo

class max.pipelines.lib.pipeline_variants.text_generation.BatchInfo(past_seq_lens, seq_lens, num_steps)

Information about a batch of requests passed to the pipeline.

Parameters:

num_steps

num_steps: int

Number of steps to run in the pipeline.

past_seq_lens

past_seq_lens: list[int]

Coordinated list of past sequence lengths (i.e., context lengths).

seq_lens

seq_lens: list[int]

Coordinated list of sequence lengths (i.e., prompt_len or 1).
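
A minimal construction sketch, assuming BatchInfo accepts its fields as keyword arguments; the lengths shown are illustrative:

```python
from max.pipelines.lib.pipeline_variants.text_generation import BatchInfo

# Two requests: the first is in prefill (7 prompt tokens, nothing cached yet),
# the second is in decode (1 new token on top of 12 cached context tokens).
info = BatchInfo(past_seq_lens=[0, 12], seq_lens=[7, 1], num_steps=4)
```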

TextGenerationPipeline

class max.pipelines.lib.pipeline_variants.text_generation.TextGenerationPipeline(pipeline_config, pipeline_model, eos_token_id, weight_adapters, tokenizer)

Generalized token generator pipeline.

Initialize a text generation pipeline instance.

This sets up devices, the inference session, tokenizer, KV-cache manager, sampling kernel, and loads model weights and adapters.

Parameters:

  • pipeline_config (PipelineConfig) – Configuration for the pipeline and runtime behavior.
  • pipeline_model (type[PipelineModel[TextGenerationContextType]]) – Concrete model implementation to use for execution.
  • eos_token_id (int) – Default EOS token ID, used to seed the EOS set or as a fallback when the Hugging Face config does not supply one.
  • weight_adapters (dict[WeightsFormat, WeightsAdapter]) – Mapping from weights format to adapter implementation.
  • tokenizer (PipelineTokenizer[TextGenerationContextType, npt.NDArray[np.integer[Any]], TextGenerationRequest]) – Tokenizer implementation used to build contexts and decode.

Raises:

ValueError – If quantization_encoding is not configured in pipeline_config.model_config or if structured output is requested without a valid tokenizer delegate.
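
A hedged construction sketch; the placeholder values below (config object, model class, adapters, tokenizer) depend on your deployment and are not defined by this API:

```python
from max.pipelines.lib.pipeline_variants.text_generation import TextGenerationPipeline

# Placeholders: in a real deployment these come from your model configuration.
pipeline_config = ...   # PipelineConfig (must set quantization_encoding)
MyPipelineModel = ...   # hypothetical concrete PipelineModel subclass
weight_adapters = ...   # dict[WeightsFormat, WeightsAdapter]
tokenizer = ...         # PipelineTokenizer implementation

pipeline = TextGenerationPipeline(
    pipeline_config=pipeline_config,
    pipeline_model=MyPipelineModel,
    eos_token_id=2,  # fallback EOS token id
    weight_adapters=weight_adapters,
    tokenizer=tokenizer,
)
```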

calculate_num_steps()

calculate_num_steps(num_steps, context)

Compute the number of generation steps allowed for a context.

The value is clamped by the remaining capacity with respect to the model’s configured max_seq_len.

Parameters:

  • num_steps (int) – Desired number of steps to attempt.
  • context (TextGenerationContextType) – The context whose sequence length constraints apply.

Returns:

The number of steps to execute for this context (>= 1).

Raises:

ValueError – If the current request length is already >= max_seq_len.

Return type:

int
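
The clamping rule can be illustrated with a small stand-alone sketch (hypothetical helper; the real method reads the current length from the context and the limit from the model config):

```python
def clamp_num_steps(num_steps: int, current_length: int, max_seq_len: int) -> int:
    """Illustrative stand-in for calculate_num_steps's clamping rule."""
    if current_length >= max_seq_len:
        raise ValueError(
            f"request length {current_length} is already >= max_seq_len {max_seq_len}"
        )
    # Never schedule more steps than the remaining capacity allows.
    return min(num_steps, max_seq_len - current_length)

assert clamp_num_steps(8, current_length=1020, max_seq_len=1024) == 4
```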

execute()

execute(inputs)

Given a batch, process the batch inputs, execute the graph for num_steps in a multi-step scenario, then decode the generated tokens and return the output for each request.

Parameters:

inputs (TextGenerationInputs[TextGenerationContextType])

Return type:

dict[RequestID, TextGenerationOutput]
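
A hedged usage sketch: `pipeline` is an initialized TextGenerationPipeline and `inputs` a prepared TextGenerationInputs for the current batch; construction of both is deployment-specific, and `handle_response` is a hypothetical downstream handler:

```python
outputs = pipeline.execute(inputs)  # dict[RequestID, TextGenerationOutput]
for request_id, result in outputs.items():
    # Each result carries the tokens generated for that request over the
    # executed steps (plus log probabilities if enabled).
    handle_response(request_id, result)
```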

initialize_bitmask()

initialize_bitmask(batch)

Allocate a per-request token bitmask for structured decoding.

Parameters:

  • batch (list[TextGenerationContextType]) – Requests in the batch; its length determines the bitmask's batch dimension.

Returns:

A bitmask array of shape [batch_size, vocab_size] if structured output is enabled; otherwise None.

Return type:

ndarray[tuple[int, …], dtype[int32]] | None
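
For illustration, the allocation has the documented shape (the vocabulary size below is a placeholder; the real value comes from the model configuration):

```python
import numpy as np

batch_size, vocab_size = 4, 32000  # illustrative values
bitmask = np.zeros((batch_size, vocab_size), dtype=np.int32)
print(bitmask.shape)  # (4, 32000)
```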

kv_managers

property kv_managers: list[Any]

Return the list of KV cache managers backing this pipeline.

pipeline_config

property pipeline_config: PipelineConfig

Return the pipeline configuration.

prepare_batch()

prepare_batch(batches, num_steps)

Prepare model inputs and ancillary state for multi-step execution.

This flattens replica batches, optionally initializes constrained decoding bitmasks, ensures KV-cache reservations, clamps num_steps per context, and builds initial model inputs.

Parameters:

  • batches (list[dict[RequestID, TextGenerationContextType]]) – Per-replica mapping of RequestID to context.
  • num_steps (int) – Desired number of steps to run.

Returns:

  • ModelInputs: Prepared inputs for the first step.
  • int: The clamped number of steps to run.
  • Optional[np.ndarray]: The structured decoding bitmask or None.
  • list[TextGenerationContextType]: The flattened context batch.

Return type:

tuple of (ModelInputs, int, np.ndarray | None, list[TextGenerationContextType])
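
A hedged sketch of consuming the return value (`pipeline` and `batches` are assumed to exist; names are illustrative):

```python
model_inputs, num_steps, bitmask, flat_batch = pipeline.prepare_batch(
    batches, num_steps=8
)
# model_inputs feeds the first execution step; num_steps is the clamped count;
# bitmask is None unless structured output is enabled.
```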

release()

release(request_id)

Mark the context as complete, releasing the cache slot from the KV manager.

Parameters:

request_id (RequestID)

Return type:

None

tokenizer

property tokenizer: PipelineTokenizer[TextGenerationContextType, ndarray[tuple[int, ...], dtype[integer[Any]]], TextGenerationRequest]

Return the tokenizer used for building contexts and decoding.

update_context_and_prepare_responses()

update_context_and_prepare_responses(generated_tokens_host, batch_log_probabilities, flat_batch, num_steps, enable_log_probs)

Update the context objects and prepare the response objects for each context in the batch after generation.

Parameters:

  • generated_tokens_host (ndarray[tuple[int, ...], dtype[int32]]) – Array of generated tokens on the host, indexed as [batch, step].
  • batch_log_probabilities (list[list[LogProbabilities | None]]) – Per-step log probability outputs (or None); each entry is a per-batch list for that step.
  • flat_batch (list[TextGenerationContextType]) – List of generation contexts, one per request, matching batch dimension.
  • num_steps (int) – Number of generation steps to process for each context.
  • enable_log_probs (bool) – Whether to include log probability data in outputs.

Returns:

A dictionary mapping request IDs to their respective generation outputs.

Return type:

dict[RequestID, TextGenerationOutput]
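
The [batch, step] indexing convention can be illustrated as follows (assumes `generated_tokens_host`, `flat_batch`, and `num_steps` as described above):

```python
for i, context in enumerate(flat_batch):
    # Row i holds the tokens generated for request i, one column per step.
    new_tokens = generated_tokens_host[i, :num_steps]
```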

update_for_structured_output()

update_for_structured_output(context, bitmask, index)

Update context and logits bitmask for structured output.

If a json_schema is present and no matcher is set, this compiles a grammar matcher and installs it on the context. It may also jump ahead in generation and fill the per-request token bitmask used to constrain the next-token distribution.

Parameters:

  • context (TextGenerationContextType) – Request context to update.
  • bitmask (ndarray[tuple[int, ...], dtype[int32]]) – Optional preallocated bitmask buffer; updated in-place.
  • index (int) – Global position into the bitmask for this request.

Raises:

ValueError – If a JSON schema is provided but structured output is not enabled via sampling configuration.

Return type:

None
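
A hedged sketch of filling each request's row of a preallocated bitmask before sampling (the loop structure is illustrative, not the pipeline's exact internal code; `flat_batch` and `bitmask` are assumed to come from prepare_batch / initialize_bitmask):

```python
if bitmask is not None:
    for i, context in enumerate(flat_batch):
        pipeline.update_for_structured_output(context, bitmask, i)
```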

StandaloneSpeculativeDecodingPipeline

final class max.pipelines.lib.speculative_decoding.StandaloneSpeculativeDecodingPipeline(pipeline_config, pipeline_model, eos_token_id, weight_adapters, tokenizer, draft_pipeline_model=None, draft_weight_adapters=None)

Bases: SpeculativeDecodingPipelineBase

Standalone speculative decoding where draft model runs independently.

In this approach, the draft model generates tokens without any information from the target model, then the target model verifies these tokens.

Parameters:

  • pipeline_config (PipelineConfig) – Configuration for the pipeline and runtime behavior.
  • pipeline_model – Concrete model implementation for the target model.
  • eos_token_id (int) – Default EOS token ID used to seed the EOS set or when the Hugging Face config does not supply one.
  • weight_adapters (dict[WeightsFormat, WeightsAdapter]) – Mapping from weights format to adapter implementation for the target model.
  • tokenizer – Tokenizer implementation used to build contexts and decode.
  • draft_pipeline_model – Optional concrete model implementation for the draft model.
  • draft_weight_adapters – Optional mapping from weights format to adapter implementation for the draft model.

execute()

execute(inputs)

Execute standalone speculative decoding.

In standalone mode:

  1. Draft model generates tokens independently
  2. Target model verifies draft tokens
  3. Apply rejection sampling to accept/reject tokens

Parameters:

inputs (TextGenerationInputs[TextContext])

Return type:

dict[RequestID, TextGenerationOutput]
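
The standard rejection-sampling acceptance rule for speculative decoding can be sketched as follows (illustrative only; the exact kernel used by MAX may differ):

```python
import numpy as np

def count_accepted_draft_tokens(
    p_draft: np.ndarray, p_target: np.ndarray, rng: np.random.Generator
) -> int:
    """Return how many leading draft tokens are accepted.

    p_draft[i] and p_target[i] are the probabilities the draft and target
    models assign to the i-th proposed token.
    """
    accept_prob = np.minimum(1.0, p_target / p_draft)
    for i, prob in enumerate(accept_prob):
        if rng.random() >= prob:
            return i  # token i rejected; keep tokens [0, i)
    return len(accept_prob)  # every draft token accepted
```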

generate_draft_tokens()

generate_draft_tokens(batch, num_steps, model_inputs)

Parameters:

Return type:

tuple[int, Tensor, Tensor, ModelInputs, Tensor]

prepare_batch()

prepare_batch(model, batch, num_steps, return_n_logits, is_draft=False, draft_inputs=None, merged_draft_tokens=None, merged_draft_offsets=None)

Parameters:

Return type:

tuple[ModelInputs, int]

verify_draft_tokens_with_target_model()

verify_draft_tokens_with_target_model(draft_inputs, context_batch, num_draft_tokens_generated, draft_tokens, draft_logits, merged_draft_tokens, merged_draft_offsets, all_draft_logits)

Parameters:

Return type:

tuple[Tensor, Tensor, Tensor]

EmbeddingsPipeline

final class max.pipelines.lib.embeddings_pipeline.EmbeddingsPipeline(pipeline_config, pipeline_model, eos_token_id, weight_adapters, tokenizer)

Bases: Pipeline[EmbeddingsGenerationInputs, EmbeddingsGenerationOutput]

Generalized embeddings generation pipeline.

Parameters:

execute()

execute(inputs)

Given a batch, process the batch inputs, execute the graph, and return the computed embeddings for each request.

Parameters:

inputs (EmbeddingsGenerationInputs)

Return type:

dict[RequestID, EmbeddingsGenerationOutput]

release()

release(request_id)

Release any resources or state associated with a specific request.

This method should be implemented by concrete pipeline classes to perform cleanup or resource deallocation for the given request ID. It is typically called when a request has completed processing and its associated resources (such as memory, cache, or temporary files) are no longer needed.

Parameters:

request_id (RequestID) – The unique identifier of the request to release resources for.

Returns:

None

Raises:

NotImplementedError – If not implemented by a concrete subclass.

Return type:

None
