pipeline
HF Token Generation Pipeline
KVCacheMixin
class max.pipelines.pipeline.KVCacheMixin(*args, **kwargs)
estimate_kv_cache_size()
Estimates the size of the KV cache in bytes.
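For intuition, here is a minimal, self-contained sketch of the kind of arithmetic an estimate_kv_cache_size() implementation might perform. The helper name and all model dimensions below are illustrative assumptions, not part of the MAX API:

```python
def estimate_kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                            max_seq_len: int, max_batch_size: int,
                            dtype_bytes: int = 2) -> int:
    # Keys and values (factor of 2) for every layer, head, and token slot.
    return (2 * n_layers * n_kv_heads * head_dim
            * max_seq_len * max_batch_size * dtype_bytes)

# A Llama-3-8B-like configuration at bfloat16 (illustrative numbers):
size = estimate_kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                               max_seq_len=4096, max_batch_size=16)
print(f"{size / 2**30:.1f} GiB")  # 8.0 GiB
```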
load_kv_manager()
load_kv_manager(session: InferenceSession, available_cache_memory: int) → KVCacheManager
Given a PipelineConfig and an InferenceSession, loads the KV manager.
Parameters:
- session – Inference session used to compile and initialize the KV cache.
- available_cache_memory – Amount of memory available to the KV cache, in bytes.
Returns:
Either a single KV cache manager or a tuple of KV cache managers, one per input modality.
Return type:
KVCacheManager
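The following structural sketch shows how a model might combine PipelineModel with this mixin. The class body is hypothetical, the import path for InferenceSession is assumed, and the actual KVCacheManager construction is model-specific:

```python
from max.engine import InferenceSession  # assumed import path
from max.pipelines.pipeline import KVCacheMixin, PipelineModel


class MyPipelineModel(PipelineModel, KVCacheMixin):
    """Hypothetical model wiring in its own KV cache management."""

    def estimate_kv_cache_size(self) -> int:
        # Keys + values for every layer/head/token slot; placeholder sizes
        # (see the arithmetic sketched under estimate_kv_cache_size() above).
        return 2 * 32 * 8 * 128 * 4096 * 16 * 2

    def load_kv_manager(self, session: InferenceSession,
                        available_cache_memory: int):
        # Construct whichever KVCacheManager subclass this model needs,
        # sized to fit within available_cache_memory and compiled
        # against the provided session.
        raise NotImplementedError("manager construction is model-specific")
```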
ModelOutputs
class max.pipelines.pipeline.ModelOutputs(next_token_logits: 'Tensor | None' = None, logits: 'Tensor | None' = None)
logits
Logits for the entire token sequence.
next_token_logits
Logits for just the next token.
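To make the relationship between the two fields concrete, here is a small runnable sketch using NumPy arrays as stand-ins for Tensor; the [batch, seq, vocab] shapes are assumptions:

```python
import numpy as np

from max.pipelines.pipeline import ModelOutputs

batch, seq_len, vocab = 2, 5, 100
full = np.random.rand(batch, seq_len, vocab)  # stand-in for a Tensor of logits

# `logits` covers every position; `next_token_logits` is just the last one.
outputs = ModelOutputs(next_token_logits=full[:, -1, :], logits=full)
assert np.array_equal(outputs.next_token_logits, outputs.logits[:, -1, :])
```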
PipelineModel
class max.pipelines.pipeline.PipelineModel(pipeline_config: PipelineConfig, session: InferenceSession)
A pipeline model with setup, input preparation, and execution methods.
compute_log_probabilities()
compute_log_probabilities(model_inputs: Iterable[Any], model_outputs: ModelOutputs, next_tokens: Tensor, batch_top_n: list[int], batch_echo: list[bool]) → list[max.pipelines.response.LogProbabilities | None] | None
Optional method that can be overridden to compute log probabilities.
Parameters:
- model_inputs – Inputs to the model returned by prepare_*_token_inputs().
- model_outputs – Outputs returned by execute().
- next_tokens – Sampled tokens. Should have shape [batch size].
- batch_top_n – Number of top log probabilities to return per input in the batch. For any element where top_n == 0, the corresponding LogProbabilities output is skipped.
- batch_echo – Whether to include input tokens in the returned log probabilities.
Returns:
List of log probabilities, one per batch element (None where skipped), or None if the model does not support computing log probabilities.
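The per-element contract is easiest to see in a self-contained sketch. Below, plain dicts stand in for max.pipelines.response.LogProbabilities (whose exact fields are not shown here), and the shapes follow the parameter descriptions above:

```python
import numpy as np

def log_softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def log_probabilities_sketch(next_token_logits: np.ndarray,  # [batch, vocab]
                             next_tokens: np.ndarray,        # [batch]
                             batch_top_n: list[int]) -> list[dict | None]:
    logprobs = log_softmax(next_token_logits)
    results: list[dict | None] = []
    for i, top_n in enumerate(batch_top_n):
        if top_n == 0:
            results.append(None)  # skipped, per batch_top_n above
            continue
        top_ids = np.argsort(logprobs[i])[::-1][:top_n]
        results.append({
            "sampled": float(logprobs[i, next_tokens[i]]),
            "top": {int(t): float(logprobs[i, t]) for t in top_ids},
        })
    return results

out = log_probabilities_sketch(np.random.rand(3, 100),
                               np.array([7, 42, 99]),
                               batch_top_n=[2, 0, 1])
assert len(out) == 3 and out[1] is None
```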
estimate_memory_footprint()
estimate_memory_footprint() → int
Estimates the memory consumption of the engine and returns the remaining space available, in bytes, for storing the KV cache.
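As a rough illustration of the accounting this method describes, here is a self-contained sketch; the safety margin and all byte counts are assumptions, not values used by MAX:

```python
def available_kv_cache_memory(total_device_bytes: int,
                              weights_bytes: int,
                              activation_bytes: int,
                              safety_fraction: float = 0.9) -> int:
    # Whatever the engine does not need for weights and activations,
    # less a safety margin, is left over for the KV cache.
    budget = int(total_device_bytes * safety_fraction)
    return max(0, budget - weights_bytes - activation_bytes)

# e.g. a 24 GiB device holding 16 GiB of weights and ~1 GiB of activations:
free = available_kv_cache_memory(24 * 2**30, 16 * 2**30, 1 * 2**30)
print(f"{free / 2**30:.1f} GiB for KV cache")  # ~4.6 GiB
```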