Mojo function

lmcache_offload

```mojo
lmcache_offload[
    dtype: DType,
    page_size: Int,
    num_kv_heads: Int,
    head_dim: Int,
    kv_dim: Int,
    target: StringSlice[StaticConstantOrigin] = "gpu",
](
    output: LayoutTensor[dtype, Layout.row_major[4](), MutAnyOrigin],
    paged_cache: LayoutTensor[dtype, Layout.row_major[6](), MutAnyOrigin],
    slot_mapping: LayoutTensor[DType.int64, Layout.row_major[1](), MutAnyOrigin],
    start_token: Int,
    end_token: Int,
    ctx: DeviceContext,
)
```

Offload KV cache data from MAX paged format to external contiguous format.

Parameters:

  • dtype (DType): Data type of the cache.
  • page_size (Int): Number of tokens per page in the paged cache.
  • num_kv_heads (Int): Number of KV attention heads.
  • head_dim (Int): Dimension of each attention head.
  • kv_dim (Int): KV dimension (2 for standard K/V, 1 for MLA).
  • target (StringSlice): Target device ("gpu" or "cpu").

Args:

  • output (LayoutTensor): Destination tensor with shape [kv_dim, num_layers, num_tokens, hidden_dim], where hidden_dim = num_kv_heads * head_dim.
  • paged_cache (LayoutTensor): Source tensor with shape [total_num_blocks, kv_dim, num_layers, page_size, num_kv_heads, head_dim].
  • slot_mapping (LayoutTensor): Token-to-slot mapping with shape [total_tokens].
  • start_token (Int): Starting token index in slot_mapping.
  • end_token (Int): Ending token index (exclusive) in slot_mapping.
  • ctx (DeviceContext): Device context for kernel launch.
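
Semantically the kernel is a gather: each token index in [start_token, end_token) is looked up in slot_mapping, the resulting slot is split into a block index and an in-page offset, and that token's per-layer K/V vectors are copied into the contiguous output. Below is a minimal CPU sketch of that indexing over flattened row-major buffers, assuming slot = block * page_size + page_offset and hidden_dim = num_kv_heads * head_dim; the function name and the flattened-buffer layout are illustrative, not part of the MAX implementation.

```mojo
# Illustrative scalar reference of the offload gather; not the MAX kernel.
# Assumes slot = block * page_size + page_offset and that each slot stores a
# contiguous hidden_dim = num_kv_heads * head_dim vector per (kv, layer).
fn offload_reference(
    mut output: List[Float32],   # [kv_dim, num_layers, num_tokens, hidden_dim]
    paged_cache: List[Float32],  # [num_blocks, kv_dim, num_layers, page_size, hidden_dim]
    slot_mapping: List[Int],     # [total_tokens]
    start_token: Int,
    end_token: Int,
    kv_dim: Int,
    num_layers: Int,
    page_size: Int,
    hidden_dim: Int,
):
    var num_tokens = end_token - start_token
    for token in range(start_token, end_token):
        # Map the logical token index to its physical slot in the paged cache.
        var slot = slot_mapping[token]
        var block = slot // page_size
        var page_offset = slot % page_size
        for kv in range(kv_dim):
            for layer in range(num_layers):
                # Row-major offsets into the flattened source and destination.
                var src = (((block * kv_dim + kv) * num_layers + layer)
                    * page_size + page_offset) * hidden_dim
                var dst = ((kv * num_layers + layer) * num_tokens
                    + (token - start_token)) * hidden_dim
                for d in range(hidden_dim):
                    output[dst + d] = paged_cache[src + d]
```

The slot_mapping indirection is what lets tokens that are logically contiguous in a sequence live in non-contiguous pages of the cache; on GPU the same copy would typically be parallelized across tokens and the hidden dimension rather than looped as above.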
