Python class
KVTransferEngine
class max.pipelines.kv_cache.KVTransferEngine(name, tensors, *, total_num_pages, replicate_kv_across_tp=False, extra_tensor_groups=None)
Bases: object
KVCache Transfer Engine with support for Data Parallelism (DP) and Tensor Parallelism (TP).
The engine accepts a 2D list of tensors: list[list[Buffer]] where the outer list represents DP replicas and the inner list represents TP shards within each replica.
The transfer engine communicates with other transfer engines in other threads or processes. Individual engines, however, are not thread-safe; the class is intended to be used by MAX's single-threaded scheduler.
Initialize the transfer engine.
Parameters:
- name (str): Unique name for this engine.
- tensors (Sequence[Sequence[Buffer]]): Main group tensors as [replica][tp_shard].
- total_num_pages (int): Total KV cache pages per tensor.
- replicate_kv_across_tp (bool): Whether KV is replicated across TP ranks.
- extra_tensor_groups (Sequence[Sequence[Sequence[Buffer]]] | None): Additional tensor groups (e.g., draft KV for speculative decoding). Each entry has the same [replica][tp_shard] structure as tensors. All tensors in a group must have the same shape, but groups may differ in shape from one another.
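The [replica][tp_shard] layout can be illustrated without MAX installed. The following is a minimal sketch, assuming stand-in objects instead of real Buffer instances; FakeBuffer, build_layout, and layout_dims are hypothetical names invented for this example, and the check that every replica holds the same number of shards is an assumption rather than documented behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeBuffer:
    """Hypothetical stand-in for a device buffer; not part of MAX."""

    replica: int
    tp_shard: int


def build_layout(dp: int, tp: int) -> list[list[FakeBuffer]]:
    """Build a 2D list: outer index = DP replica, inner index = TP shard."""
    return [[FakeBuffer(r, s) for s in range(tp)] for r in range(dp)]


def layout_dims(tensors: list[list[FakeBuffer]]) -> tuple[int, int]:
    """Derive (dp, tp) from the nested list.

    Assumption: every replica has the same number of TP shards.
    """
    if not tensors:
        raise ValueError("at least one replica is required")
    tp = len(tensors[0])
    if any(len(replica) != tp for replica in tensors):
        raise ValueError("every replica must have the same number of TP shards")
    return len(tensors), tp


tensors = build_layout(dp=2, tp=4)
dp, tp = layout_dims(tensors)
print(dp, tp)  # 2 4
```

With this layout, tensors[1][3] is the fourth TP shard of the second DP replica.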
bytes_per_group
Bytes per page for each group. bytes_per_group[0] is the main group; subsequent entries are extra groups (e.g., draft KV in speculative decoding).
bytes_per_page
bytes_per_page: int
Total bytes per page across all groups. For single-group engines this equals the main group's bytes per page; for multi-group engines it is sum(bytes_per_group).
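As a worked example of the relationship between the two attributes, the sketch below sums per-group page sizes. The byte counts are made-up illustrative values, and total_bytes_per_page is a hypothetical helper, not a MAX function.

```python
def total_bytes_per_page(bytes_per_group: list[int]) -> int:
    """Total bytes per page is the sum over all groups (main group first)."""
    return sum(bytes_per_group)


# Hypothetical sizes: a main KV group only.
single_group = [4096]
# Hypothetical sizes: a main KV group plus a smaller draft KV group.
multi_group = [4096, 512]

print(total_bytes_per_page(single_group))  # 4096
print(total_bytes_per_page(multi_group))   # 4608
```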
cleanup()
cleanup()
Release all resources associated with the transfer engine.
This should be called before the transfer engine is garbage collected. Moving this logic into the __del__ destructor causes a UCX error for unknown reasons.
Return type:
None
cleanup_transfer()
cleanup_transfer(transfer_req)
Clean up a transfer. Call this after the transfer is complete.
Parameters:
transfer_req (TransferReqData): The transfer request to clean up.
Return type:
None
completed_recv_transfers
Map of agent names to completed recv transfers.
connect()
connect(remote)
Connect to a remote engine (all replicas).
Parameters:
remote (KVTransferEngineMetadata): Metadata for the remote engine (all replicas).
Return type:
None
disconnect()
disconnect(name)
Tear down a single remote connection.
Releases inflight transfer handles referencing this remote, invalidates NIXL metadata, and removes bookkeeping entries. After disconnect, connect() will accept the same name again.
Parameters:
name (str): The name of the remote engine to disconnect.
Raises:
ValueError: If the named remote is not currently connected.
Return type:
None
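The connect/disconnect bookkeeping described above can be pictured with a toy registry. This is only a sketch of the documented semantics, not the engine's implementation: ConnectionRegistry is a hypothetical class, its connect() takes an explicit name for simplicity (the real method takes a metadata object), and rejecting a duplicate name with ValueError is an assumption inferred from the statement that connect() accepts the name again after disconnect.

```python
class ConnectionRegistry:
    """Hypothetical toy model of remote-connection bookkeeping."""

    def __init__(self) -> None:
        # Maps remote engine names to their metadata.
        self.remote_connections: dict[str, object] = {}

    def connect(self, name: str, metadata: object) -> None:
        # Assumption: a name that is already connected is rejected.
        if name in self.remote_connections:
            raise ValueError(f"remote {name!r} is already connected")
        self.remote_connections[name] = metadata

    def disconnect(self, name: str) -> None:
        # Documented behavior: ValueError if the remote is not connected.
        if name not in self.remote_connections:
            raise ValueError(f"remote {name!r} is not currently connected")
        del self.remote_connections[name]


registry = ConnectionRegistry()
registry.connect("decode-0", metadata={"replicas": 2})
registry.disconnect("decode-0")
# After disconnect, the same name can be connected again.
registry.connect("decode-0", metadata={"replicas": 2})
```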
dp
dp: int
Number of DP replicas.
from_paged_kv_cache()
classmethod from_paged_kv_cache(name, kv_cache)
Construct an engine wired to a PagedKVCacheManager.
Pulls the per-replica device buffers, sets total_num_pages, and derives replicate_kv_across_tp from the cache params. Equivalent to constructing the engine manually but consolidates the boilerplate that prefill/decode schedulers share.
For models with multiple KV caches (e.g., speculative decoding with a separate target and draft KV), each child cache is registered as its own NIXL group so that heterogeneous buffer shapes (e.g., 61-layer MLA target vs. 1-layer Eagle draft) are validated and transferred independently.
Parameters:
- name (str)
- kv_cache (PagedKVCacheManager)
Return type:
inflight_send_transfers
inflight_send_transfers: dict[str, TransferReqData]
Map of transfer names to send transfer request data.
initiate_read_transfer()
initiate_read_transfer(remote_metadata, src_idxs, dst_idxs, src_replica_idx, dst_replica_idx, tp_shard_limit=None)
Initiate a READ transfer from remote engine to current engine.
The current engine pulls data from the remote. Used by DKVConnector to read KV blocks from BlockStore DRAM into GPU VRAM.
Parameters:
- remote_metadata (KVTransferEngineMetadata): Metadata for the remote engine (source).
- src_idxs (list[int]): Page indices in the remote engine (source).
- dst_idxs (list[int]): Page indices in the current engine (destination).
- src_replica_idx (int): Replica index in the remote engine.
- dst_replica_idx (int): Replica index in the current engine.
- tp_shard_limit (int | None): If set, only the first N TP shards transfer.
Return type:
initiate_send_transfer()
initiate_send_transfer(remote_metadata, src_idxs, dst_idxs, src_replica_idx, dst_replica_idx, tp_shard_limit=None)
Initiate a transfer from current engine to remote engine.
The same page indices are broadcast to all TP shards within the source and destination replicas.
Parameters:
- remote_metadata (KVTransferEngineMetadata): Metadata for the remote engine.
- src_idxs (list[int]): List of indices of the source pages in the current engine.
- dst_idxs (list[int]): List of indices of the destination pages in the remote engine.
- src_replica_idx (int): Index of the source replica to transfer from.
- dst_replica_idx (int): Index of the destination replica to transfer to.
- tp_shard_limit (int | None): Maximum number of TP shards to transfer. When set, only the first tp_shard_limit shards participate in the transfer. Useful for MLA models where KV data is identical across shards.
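To make the broadcast and tp_shard_limit behavior concrete, here is a minimal sketch that enumerates which (shard, source page, destination page) copies a request would cover. plan_shard_transfers is a hypothetical helper, not part of the engine. It assumes that src_idxs and dst_idxs are paired element by element, that shard i on the source maps to shard i on the destination, and that a limit larger than tp is clamped to tp; none of these details are confirmed by the reference above.

```python
from typing import Optional


def plan_shard_transfers(
    src_idxs: list[int],
    dst_idxs: list[int],
    tp: int,
    tp_shard_limit: Optional[int] = None,
) -> list[tuple[int, int, int]]:
    """Return (tp_shard, src_page, dst_page) triples for one replica pair."""
    # Assumption: source and destination pages are paired one-to-one.
    if len(src_idxs) != len(dst_idxs):
        raise ValueError("src_idxs and dst_idxs must have the same length")
    # Only the first tp_shard_limit shards participate when a limit is set.
    num_shards = tp if tp_shard_limit is None else min(tp, tp_shard_limit)
    pairs = list(zip(src_idxs, dst_idxs))
    # The same page indices are broadcast to every participating shard.
    return [(shard, src, dst) for shard in range(num_shards) for src, dst in pairs]


full = plan_shard_transfers([0, 1], [5, 6], tp=2)
limited = plan_shard_transfers([0, 1], [5, 6], tp=2, tp_shard_limit=1)
print(len(full), len(limited))  # 4 2
```

With replicated KV (as in MLA models), limiting the transfer to one shard halves the copies in this example, since the remaining shards would carry identical data.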
Return type:
is_complete()
is_complete(transfer_req)
Checks if a given send, recv, or read transfer is completed.
Parameters:
transfer_req (TransferReqData): The transfer request.
Returns:
True if all transfers have completed; False otherwise.
Return type:
memory_type
memory_type: MemoryType
Type of memory being managed (e.g. DRAM).
metadata
property metadata: KVTransferEngineMetadata
Get metadata for all replicas.
Returns:
Metadata for the entire engine (all replicas).
name
name: str
Name of the transfer engine and its NIXL agent.
remote_agent_to_engine
Map of remote agent names to their engine names.
remote_connections
remote_connections: dict[str, KVTransferEngineMetadata]
Map of remote engine names to their metadata.
replicate_kv_across_tp
replicate_kv_across_tp: bool
Whether KV is replicated across TP ranks (MLA).
sync_and_release()
sync_and_release(transfer_req, timeout_s=30.0)
Waits for a transfer to complete and releases it.
Parameters:
- transfer_req (TransferReqData): The transfer request to wait on.
- timeout_s (float): Maximum seconds to wait before raising TimeoutError.
Raises:
TimeoutError: If the transfer does not complete within timeout_s.
Return type:
None
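The wait-with-timeout contract can be sketched as a polling loop. This is an illustration of the general pattern under stated assumptions, not the engine's actual implementation: wait_until_complete, poll_interval_s, and FakeTransfer are hypothetical, and whether the real method polls is_complete() or blocks some other way is not stated in this reference.

```python
import time
from typing import Callable


def wait_until_complete(
    is_complete: Callable[[], bool],
    timeout_s: float = 30.0,
    poll_interval_s: float = 0.001,
) -> None:
    """Poll a completion check until it succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while not is_complete():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"transfer did not complete within {timeout_s} s")
        time.sleep(poll_interval_s)


class FakeTransfer:
    """Hypothetical transfer that reports completion on the Nth poll."""

    def __init__(self, polls_needed: int) -> None:
        self.polls_left = polls_needed

    def done(self) -> bool:
        self.polls_left -= 1
        return self.polls_left <= 0


# Completes on the third poll, well inside the timeout.
wait_until_complete(FakeTransfer(polls_needed=3).done, timeout_s=1.0)
```

Using time.monotonic() rather than time.time() keeps the deadline unaffected by system clock adjustments.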
tensor_agents
2D list of TensorAgent objects, indexed as [replica][tp_shard].
total_num_pages
total_num_pages: int
Total number of pages in each tensor (same across all replicas).
tp
tp: int
Number of TP shards per replica.