
Python module

max.pipelines.architectures.lfm2

ConvStateCache

class max.pipelines.architectures.lfm2.ConvStateCache(num_conv_layers, hidden_size, conv_kernel, dtype, max_slots, device)

Bases: object

Parameters:

  • num_conv_layers
  • hidden_size
  • conv_kernel
  • dtype
  • max_slots
  • device
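
A rough sizing sketch, assuming the cache preallocates one [1, hidden, kernel] buffer per conv layer for each of max_slots slots (the shape matches the [N, hidden, kernel] layout documented for get_states()). The helper conv_cache_bytes and the example values are hypothetical, not read from a real LFM2 checkpoint.

```python
def conv_cache_bytes(
    num_conv_layers: int,
    hidden_size: int,
    conv_kernel: int,
    max_slots: int,
    dtype_bytes: int,
) -> int:
    """Bytes for one [1, hidden, kernel] state buffer per conv layer per slot.

    Hypothetical helper: assumes every slot is preallocated up front.
    """
    per_layer_per_slot = hidden_size * conv_kernel * dtype_bytes
    return num_conv_layers * max_slots * per_layer_per_slot


# Illustrative values only: 10 conv layers, hidden 1024, kernel 3,
# 256 slots, 2-byte elements (e.g. bfloat16).
total = conv_cache_bytes(10, 1024, 3, 256, 2)
print(total, total / 2**20)  # 15728640 15.0
```

In this example one slot costs 1024 * 3 * 2 bytes * 10 layers = 60 KiB, which fits the note under get_states() that the conv state is small.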

claim()

claim(request_id)

Parameters:

request_id (RequestID)

Return type:

None

get_states()

get_states(request_ids)

Return one [N, hidden, kernel] buffer per conv layer.

For N == 1, this is zero-copy: the slot’s buffer is returned directly. For N > 1, the per-slot buffers are concatenated along the leading batch dimension via a NumPy round-trip; this is acceptable because the conv state is small (hidden * kernel elements per slot and layer).

Parameters:

request_ids (list[RequestID])

Return type:

list[Buffer]
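
The gather path described above can be sketched in plain Python. This is an illustration under stated assumptions, not the MAX implementation: nested lists stand in for Buffer, string keys stand in for RequestID, and gather_states is a hypothetical name. With one request, the slot's own per-layer buffers come back unchanged; with several, each request's [1, hidden, kernel] buffer is copied into a new [N, hidden, kernel] buffer per layer.

```python
import copy

# Nested lists stand in for Buffer: one [1, hidden, kernel] list per conv layer.
Buffer3D = list[list[list[float]]]


def gather_states(
    slots: dict[str, list[Buffer3D]], request_ids: list[str]
) -> list[Buffer3D]:
    """Return one [N, hidden, kernel] buffer per conv layer (hypothetical sketch)."""
    per_request = [slots[rid] for rid in request_ids]
    if len(per_request) == 1:
        # N == 1: hand back the slot's own per-layer buffers (no copy).
        return per_request[0]
    num_layers = len(per_request[0])
    # N > 1: copy each request's single [hidden, kernel] row into a new
    # [N, hidden, kernel] buffer, one per conv layer.
    return [
        [copy.deepcopy(layers[layer][0]) for layers in per_request]
        for layer in range(num_layers)
    ]


# One conv layer, hidden=1, kernel=3, two active requests.
slots = {"a": [[[[1.0, 2.0, 3.0]]]], "b": [[[[4.0, 5.0, 6.0]]]]}
print(gather_states(slots, ["a", "b"]))  # [[[[1.0, 2.0, 3.0]], [[4.0, 5.0, 6.0]]]]
```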

release()

release(request_id)

Parameters:

request_id (RequestID)

Return type:

None
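
claim() and release() suggest a fixed pool of max_slots state slots keyed by request ID. The allocator below is a hypothetical sketch of that idea; unlike ConvStateCache.claim(), which returns None, its claim() returns the slot index so that slot reuse is visible. Whether the real cache resets a slot's state on claim is not documented here.

```python
class SlotPool:
    """Hypothetical fixed-size pool of state slots keyed by request ID."""

    def __init__(self, max_slots: int) -> None:
        # Reversed so that pop() hands out slot 0 first.
        self._free = list(range(max_slots - 1, -1, -1))
        self._owner: dict[str, int] = {}

    def claim(self, request_id: str) -> int:
        if request_id in self._owner:
            raise ValueError(f"{request_id!r} already holds a slot")
        if not self._free:
            raise RuntimeError("all state slots are in use")
        slot = self._free.pop()
        self._owner[request_id] = slot
        return slot

    def release(self, request_id: str) -> None:
        # Return the slot to the free list; raises KeyError if never claimed.
        self._free.append(self._owner.pop(request_id))


pool = SlotPool(max_slots=2)
print(pool.claim("a"), pool.claim("b"))  # 0 1
pool.release("a")
print(pool.claim("c"))  # 0 (the slot freed by "a" is reused)
```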

update_states()

update_states(request_ids, new_states)

Store updated per-layer states back into their request slots.

For N == 1, the buffer reference is stored directly. For N > 1, the leading batch dimension is split and each slice is copied into the matching slot.

Parameters:

  • request_ids (list[RequestID])
  • new_states

Return type:

None
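
A plain-Python sketch of the scatter described above, assuming nested lists in place of Buffer and strings in place of RequestID; scatter_states is a hypothetical name. A batch of one is stored by reference, and for N > 1 each request's [hidden, kernel] slice is copied into its slot as a [1, hidden, kernel] buffer.

```python
import copy

# Nested lists stand in for Buffer.
Buffer3D = list[list[list[float]]]


def scatter_states(
    slots: dict[str, list[Buffer3D]],
    request_ids: list[str],
    new_states: list[Buffer3D],
) -> None:
    """Store one [N, hidden, kernel] buffer per conv layer back into N slots."""
    if len(request_ids) == 1:
        # N == 1: keep references to the per-layer buffers (no copy).
        slots[request_ids[0]] = list(new_states)
        return
    for i, rid in enumerate(request_ids):
        # N > 1: slice row i out of every layer and copy it into the
        # request's slot as a [1, hidden, kernel] buffer.
        slots[rid] = [[copy.deepcopy(layer[i])] for layer in new_states]


slots: dict[str, list[Buffer3D]] = {}
scatter_states(slots, ["a", "b"], [[[[1.0, 2.0, 3.0]], [[4.0, 5.0, 6.0]]]])
print(slots["b"])  # [[[[4.0, 5.0, 6.0]]]]
```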

LFM2Config

class max.pipelines.architectures.lfm2.LFM2Config(*, hidden_size, num_attention_heads, num_key_value_heads, num_hidden_layers, rope_theta, rope_scaling_params, max_seq_len, intermediate_size, interleaved_rope_weights, vocab_size, dtype, model_quantization_encoding, quantization_config, kv_params, return_logits=ReturnLogits.LAST_TOKEN, norm_method='rms_norm', norm_dtype=None, attention_bias=False, rms_norm_eps=None, tie_word_embeddings=False, stacked_mlp=False, stacked_qkv=False, attention_multiplier, embedding_multiplier, residual_multiplier, devices, clip_qkv, quant_config=None, lora_config=None, longrope_scaling_params=None, logits_scaling=1.0, return_hidden_states=ReturnHiddenStates.NONE, target_layer_ids=None, use_subgraphs=True, data_parallel_degree=1, sliding_window=None, layer_types=<factory>, conv_L_cache=3, conv_bias=False, norm_eps=1e-05)

Bases: Llama3Config

Model configuration for LFM2 graph construction/execution.

Parameters:

conv_L_cache

conv_L_cache: int = 3
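
Assuming conv_L_cache is the kernel length of a depthwise causal 1-D convolution, as in the Hugging Face transformers LFM2 implementation, each conv layer only needs the last conv_L_cache inputs per channel to produce the next output, which is consistent with a per-slot state of shape [hidden, kernel]. The decode step below is a hypothetical sketch: it shifts each channel's window, appends the new input, and takes a dot product with that channel's filter. It omits any gating around the convolution, and the optional bias corresponds to conv_bias=True.

```python
from typing import Optional


def conv_decode_step(
    state: list[list[float]],
    x: list[float],
    weight: list[list[float]],
    bias: Optional[list[float]] = None,
) -> list[float]:
    """Advance a depthwise causal conv by one token (hypothetical sketch).

    state and weight are [hidden, kernel]; x is [hidden]. state is updated
    in place so it always holds the last `kernel` inputs per channel.
    """
    out = []
    for c, window in enumerate(state):
        window.pop(0)        # drop the oldest input for this channel
        window.append(x[c])  # the newest input goes in the last position
        acc = sum(w * v for w, v in zip(weight[c], window))
        if bias is not None:  # corresponds to conv_bias=True
            acc += bias[c]
        out.append(acc)
    return out


# hidden=1, kernel=3 (the conv_L_cache default), zero-initialized state.
state = [[0.0, 0.0, 0.0]]
weight = [[1.0, 2.0, 3.0]]
outputs = [conv_decode_step(state, [t], weight)[0] for t in (1.0, 2.0, 3.0)]
print(outputs, state)  # [3.0, 8.0, 14.0] [[1.0, 2.0, 3.0]]
```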

conv_bias

conv_bias: bool = False

finalize()

finalize(huggingface_config, state_dict, return_logits, return_hidden_states=ReturnHiddenStates.NONE, norm_method='rms_norm', attention_bias=False)

Define parameters that can’t be determined just from the pipeline config.

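Parameters:

  • huggingface_config
  • state_dict
  • return_logits
  • return_hidden_states
  • norm_method
  • attention_bias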

Return type:

None

initialize_from_config()

classmethod initialize_from_config(pipeline_config, huggingface_config, model_config=None)

Parameters:

  • pipeline_config (Any)
  • huggingface_config (AutoConfig)
  • model_config (Any)

Return type:

LFM2Config

layer_types

layer_types: list[str]
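
The <factory> default shown for layer_types in the LFM2Config signature is how Python dataclasses render a field(default_factory=...) default. A minimal sketch of that pattern follows, using a hypothetical HybridLayout class and assuming the entries are strings such as "conv" and "full_attention" (the values used by the Hugging Face LFM2 config). It also counts conv layers, which is the kind of number ConvStateCache takes as num_conv_layers.

```python
import inspect
from dataclasses import dataclass, field


@dataclass
class HybridLayout:
    """Hypothetical stand-in for the layer_types field of LFM2Config."""

    # default_factory gives every instance its own fresh list; the generated
    # __init__ signature shows this default as <factory>.
    layer_types: list[str] = field(default_factory=list)

    @property
    def num_conv_layers(self) -> int:
        return sum(1 for kind in self.layer_types if kind == "conv")


layout = HybridLayout(["conv", "conv", "full_attention", "conv", "full_attention"])
print(layout.num_conv_layers)  # 3
print(inspect.signature(HybridLayout))
```

Using default_factory rather than a literal [] default avoids sharing one mutable list across every config instance.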

norm_eps

norm_eps: float = 1e-05
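
Assuming norm_eps plays the usual role of the epsilon added under the square root in RMSNorm (norm_method defaults to 'rms_norm'), the computation looks like the sketch below. rms_norm is a hypothetical helper, and it applies the weight as a plain multiplier, which may differ from LFM2's exact parameterization.

```python
import math


def rms_norm(x: list[float], weight: list[float], eps: float = 1e-5) -> list[float]:
    """y_i = x_i / sqrt(mean(x^2) + eps) * weight_i (hypothetical sketch)."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


print(rms_norm([2.0, 2.0], [3.0, 0.5], eps=0.0))  # [3.0, 0.5]
print(rms_norm([0.0, 0.0], [1.0, 1.0]))  # [0.0, 0.0]; eps avoids division by zero
```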

LFM2Inputs

class max.pipelines.architectures.lfm2.LFM2Inputs(tokens: 'Buffer', input_row_offsets: 'Buffer', signal_buffers: 'list[Buffer]', return_n_logits: 'Buffer', data_parallel_splits: 'Buffer | Sequence[Sequence[int]] | None' = None, conv_states: 'list[Buffer]' = <factory>, request_ids: 'list[RequestID]' = <factory>, *, kv_cache_inputs: 'KVCacheInputs[Buffer, Buffer] | None' = None, lora: 'LoRAInputs | None' = None, hidden_states: 'Buffer | list[Buffer] | None' = None)

Bases: Llama3Inputs

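Parameters:

  • tokens (Buffer)
  • input_row_offsets (Buffer)
  • signal_buffers (list[Buffer])
  • return_n_logits (Buffer)
  • data_parallel_splits (Buffer | Sequence[Sequence[int]] | None)
  • conv_states (list[Buffer])
  • request_ids (list[RequestID])
  • kv_cache_inputs (KVCacheInputs[Buffer, Buffer] | None)
  • lora (LoRAInputs | None)
  • hidden_states (Buffer | list[Buffer] | None)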

buffers

property buffers: tuple[Buffer, ...]

Returns positional Buffer inputs for model ABI calls.

conv_states

conv_states: list[Buffer]

request_ids

request_ids: list[RequestID]
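
LFM2Inputs carries the batch as one flat tokens buffer plus input_row_offsets. Assuming the usual MAX ragged layout, in which the offsets are an exclusive prefix sum of per-request lengths with batch_size + 1 entries, the hypothetical helper flatten_batch below builds both from per-request token lists. Plain lists stand in for Buffer.

```python
from itertools import accumulate


def flatten_batch(prompts: list[list[int]]) -> tuple[list[int], list[int]]:
    """Concatenate per-request tokens and build exclusive prefix-sum row offsets."""
    tokens = [tok for prompt in prompts for tok in prompt]
    offsets = [0, *accumulate(len(prompt) for prompt in prompts)]
    return tokens, offsets


tokens, offsets = flatten_batch([[5, 6, 7], [8, 9]])
print(tokens, offsets)  # [5, 6, 7, 8, 9] [0, 3, 5]
# Request i's tokens are the slice between consecutive offsets.
print(tokens[offsets[1]:offsets[2]])  # [8, 9]
```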

LFM2Model

class max.pipelines.architectures.lfm2.LFM2Model(pipeline_config, session, devices, kv_cache_config, weights, adapter=None, return_logits=ReturnLogits.LAST_TOKEN, return_hidden_states=ReturnHiddenStates.NONE)

Bases: LlamaModelBase

LFM2 hybrid (full-attention + conv) pipeline model.
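
ConvStateCache hands back one state buffer per conv layer, not per model layer, so a hybrid model presumably needs to map each conv layer's global index to its position in that list, while full-attention layers use the KV cache. A hypothetical sketch, assuming conv states are ordered by layer index and that layer_types marks convolution layers with "conv":

```python
def conv_state_index(layer_types: list[str]) -> dict[int, int]:
    """Map each conv layer's global index to its slot in a per-conv-layer list."""
    mapping: dict[int, int] = {}
    for layer_idx, kind in enumerate(layer_types):
        if kind == "conv":
            mapping[layer_idx] = len(mapping)  # next free position
    return mapping


layer_types = ["conv", "conv", "full_attention", "conv", "full_attention"]
print(conv_state_index(layer_types))  # {0: 0, 1: 1, 3: 2}
# Layers 2 and 4 are attention layers, which would read the KV cache instead.
```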

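Parameters:

  • pipeline_config
  • session
  • devices
  • kv_cache_config
  • weights
  • adapter
  • return_logits
  • return_hidden_states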

attention_bias

attention_bias: bool = False

Whether to use attention bias.

execute()

execute(model_inputs)

Executes the graph with the given inputs.

Parameters:

model_inputs (ModelInputs) – The model inputs to execute, containing tensors and any other required data for model execution.

Returns:

ModelOutputs containing the pipeline’s output tensors.

Return type:

ModelOutputs

In the base PipelineModel class this method is abstract; concrete PipelineModels such as LFM2Model implement it to define their specific execution logic.

model_config_cls

model_config_cls

alias of LFM2Config

norm_method

norm_method: Literal['rms_norm'] | Literal['layer_norm'] = 'rms_norm'

Normalization layer.

prepare_initial_token_inputs()

prepare_initial_token_inputs(replica_batches, kv_cache_inputs=None, return_n_logits=1)

Prepare the inputs for the first pass in multistep execution.

Parameters:

  • replica_batches
  • kv_cache_inputs
  • return_n_logits

Return type:

LFM2Inputs

release()

release(request_id)

Parameters:

request_id (RequestID)

Return type:

None