Read a Hugging Face model config
When importing a Hugging Face model into MAX as a custom architecture, you're
responsible for building the graph that the model's config.json describes.
This makes reading and interpreting the config your job. It's the entry point
of your model bring-up workflow and
the file you consult most frequently.
What is a model config?
A Hugging Face config.json is the serialized record of how a model was built.
It represents the
PretrainedConfig
class generated during training or weight serialization, carrying only the
model's architectural hyperparameters: the dimensions and structural flags that
define the graph's shape (rather than training settings like learning rate). It
contains neither the execution graph nor the tensor weights.
When you run inference with Hugging Face's
transformers library, you
rarely inspect this file directly: the library reads the configuration
automatically and instantiates the architecture for you. When you port a model
to MAX, that instantiation step is yours to write.
By reviewing the configuration fields, you can identify which values map directly to the MAX configuration schema, which flags require custom graph layers, and which entries might contain incorrect defaults that you need to verify against the weight dimensions. Use the sections below to interpret each field and translate it into a concrete architecture plan.
Read the config
Before you write any architecture code, load and inspect the complete configuration to trace the model's architectural characteristics. The dimensions map directly to your MAX configuration, while non-default flags show where the model's execution flow diverges from a standard transformer (a decoder-only stack of uniform attention and MLP blocks, like Llama).
You can read a model's configuration on Hugging Face by navigating to the
Files and versions tab of its model repository and viewing the config.json
file, or you can load the config locally and print each field. Start by looking
at the complete picture rather than the handful of fields a model card mentions.
To view the configuration without downloading the model's tensor weights,
create a script named load_config.py with the following code:
from transformers import AutoConfig
# Replace with your model ID.
config = AutoConfig.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", trust_remote_code=True)
print(config)

Three fields in the printed configuration frame how you read the rest. The
following snippet is from TinyLlama 1.1B, with architectures, model_type,
and torch_dtype highlighted. The remaining properties are the dimensions and
flags you'll map in the sections below:
LlamaConfig {
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "torch_dtype": "bfloat16",
  "eos_token_id": 2,
  "head_dim": 64,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 5632,
  "max_position_embeddings": 2048,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 22,
  "num_key_value_heads": 4,
  "pad_token_id": null,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_parameters": {
    "rope_theta": 10000.0,
    "rope_type": "default"
  },
  "tie_word_embeddings": false,
  "transformers_version": "5.9.0",
  "use_cache": true,
  "vocab_size": 32000
}

- architectures: Lists the architecture classes. The first entry (architectures[0]) is the exact string identifier you must register in your architecture's arch.py file.
- torch_dtype (or dtype): Defines the precision dtype in which the weights are serialized, establishing the default encoding declared in the pipeline. (torch_dtype is deprecated and replaced by dtype.)
- model_type: Identifies the model family (such as llama or qwen3), pointing to the closest baseline MAX architecture and indicating the expected weight serialization layout.
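As a minimal sketch of reading these three fields from a raw config.json (rather than through AutoConfig), the following helper works on a plain dictionary. The function name framing_fields and the inline sample are illustrative assumptions; the fallback from dtype to torch_dtype reflects the deprecation noted above.

```python
import json


def framing_fields(config: dict) -> dict:
    """Return the three fields that frame how to read the rest of the config."""
    architectures = config.get("architectures") or []
    return {
        # The first entry is the class name to register.
        "architecture": architectures[0] if architectures else None,
        "model_type": config.get("model_type"),
        # Prefer the newer "dtype" key and fall back to the deprecated "torch_dtype".
        "dtype": config.get("dtype", config.get("torch_dtype")),
    }


# Sample values copied from the TinyLlama config shown above.
sample = json.loads("""
{
  "architectures": ["LlamaForCausalLM"],
  "model_type": "llama",
  "torch_dtype": "bfloat16"
}
""")
fields = framing_fields(sample)
print(fields)
```

Working on the raw dictionary keeps the check independent of the installed transformers version, which can rename or regroup keys when it prints a config.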
Beyond these core identifiers, the remaining parameters are highly model-specific. They reflect the exact architectural dimensions, flags, and design choices established by the model authors during the training phase.
Because instruct, base, and chat variants often differ in parameters like
max_position_embeddings, rope_scaling, or tie_word_embeddings, always load
the config for the exact checkpoint you're porting. Copying values from a
sibling variant can result in a model that loads without errors but fails at
long sequence lengths or on the final logits projection.
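One way to catch variant differences early is to diff two configs key by key. The sketch below compares plain dictionaries; diff_configs is a hypothetical helper, and the sample values are invented for illustration rather than taken from real checkpoints.

```python
def diff_configs(a: dict, b: dict) -> dict:
    """Map each key whose value differs to a (value_in_a, value_in_b) pair."""
    missing = object()  # sentinel so an absent key differs from an explicit None
    changed = {}
    for key in sorted(set(a) | set(b)):
        left, right = a.get(key, missing), b.get(key, missing)
        if left != right:
            changed[key] = (
                "<absent>" if left is missing else left,
                "<absent>" if right is missing else right,
            )
    return changed


# Invented values for two sibling variants of the same model family.
base = {"hidden_size": 2048, "max_position_embeddings": 2048, "tie_word_embeddings": False}
chat = {
    "hidden_size": 2048,
    "max_position_embeddings": 4096,
    "tie_word_embeddings": False,
    "rope_scaling": {"type": "linear", "factor": 2.0},
}

for key, (left, right) in diff_configs(base, chat).items():
    print(f"{key}: {left!r} -> {right!r}")
```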
Explore other model configs
TinyLlama is a model checkpoint using the standard Llama architecture
(LlamaForCausalLM), where almost every field maps directly, and the only
special consideration is rope_scaling if porting a larger context sibling.
Examining the configuration files for other architectures supported by MAX
illustrates the spectrum of divergence you encounter:
- Qwen3
- DeepSeek-V3
Qwen3's configuration file looks nearly identical to Llama's, specifying grouped-query attention,
RoPE, and an explicit head_dim. However, its primary structural difference is completely
absent from the configuration file. Qwen3 applies an RMSNorm to the query and key projections
prior to attention. This operation is defined entirely in the model's execution code, rather
than its configuration JSON.
{
  "architectures": ["Qwen3ForCausalLM"],
  "model_type": "qwen3",
  "hidden_size": 4096,
  "num_hidden_layers": 36,
  "num_attention_heads": 32,
  "num_key_value_heads": 8,
  "head_dim": 128,
  "intermediate_size": 12288,
  "max_position_embeddings": 40960,
  "rope_theta": 1000000,
  "torch_dtype": "bfloat16"
}

Relying solely on this config leads you to reuse the Llama 3 architecture. The custom
qwen3 architecture exists to accommodate this unlisted difference, which is why you must
cross-reference config files with the source modeling scripts.
DeepSeek-V3 represents the opposite extreme, where the configuration file flags complex graph implementations: it combines multi-head latent attention (MLA), a mixture-of-experts (MoE) MLP block, YaRN context scaling, and FP8-quantized weights.
{
  "architectures": ["DeepseekV3ForCausalLM"],
  "model_type": "deepseek_v3",
  "hidden_size": 7168,
  "num_hidden_layers": 61,
  "num_attention_heads": 128,
  "q_lora_rank": 1536,
  "kv_lora_rank": 512,
  "n_routed_experts": 256,
  "num_experts_per_tok": 8,
  "n_shared_experts": 1,
  "rope_scaling": {"type": "yarn", "factor": 40},
  "quantization_config": {"quant_method": "fp8", "weight_block_size": [128, 128]},
  "torch_dtype": "bfloat16"
}

q_lora_rank and kv_lora_rank configure latent attention; n_routed_experts and
num_experts_per_tok dictate MoE routing; and quantization_config indicates that the released
weights are FP8. Each of these parameters signals custom graph logic, requiring a distinct
deepseek_v3 pipeline architecture rather than a dense decoder variant.
These configuration fields generally fall into three categories based on their architectural scope. This split establishes your development roadmap.
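As a rough illustration of this triage, the sketch below sorts config keys into direct-map dimensions, structural fields, and macro-architecture signals. The key sets are a small, non-exhaustive sample of fields named on this page, triage is a hypothetical helper, and is_encoder_decoder is an additional Hugging Face config flag assumed here as an encoder-decoder signal. Unknown keys stay unclassified so that you review them by hand.

```python
DIRECT = {
    "hidden_size", "num_hidden_layers", "num_attention_heads",
    "num_key_value_heads", "intermediate_size", "vocab_size",
    "max_position_embeddings", "rms_norm_eps", "hidden_act",
    "tie_word_embeddings", "rope_theta",
}
STRUCTURAL = {
    "sliding_window", "num_experts", "n_routed_experts", "num_experts_per_tok",
    "q_lora_rank", "kv_lora_rank", "rope_scaling", "quantization_config",
}
MACRO = {"vision_config", "is_encoder_decoder"}


def triage(config: dict) -> dict:
    """Group config keys by how much graph work they imply."""
    groups = {"direct": [], "structural": [], "macro": [], "unclassified": []}
    for key, value in config.items():
        if key in DIRECT:
            groups["direct"].append(key)
        elif key in STRUCTURAL or key in MACRO:
            if value in (None, False, 0):
                continue  # the flag is present but switched off
            groups["structural" if key in STRUCTURAL else "macro"].append(key)
        else:
            groups["unclassified"].append(key)
    return groups


# A subset of the DeepSeek-V3 fields shown above.
deepseek = {
    "model_type": "deepseek_v3",
    "hidden_size": 7168,
    "num_hidden_layers": 61,
    "q_lora_rank": 1536,
    "kv_lora_rank": 512,
    "n_routed_experts": 256,
    "rope_scaling": {"type": "yarn", "factor": 40},
    "quantization_config": {"quant_method": "fp8"},
}
result = triage(deepseek)
for name, keys in result.items():
    print(f"{name}: {keys}")
```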
Add dimensions directly to model_config.py
When mapping fields from config.json, start by identifying static dimensions
and hyperparameters. These fields parameterize existing math (such as sizing
matrices or defining scalar constants) but don't alter the sequence of
operations. You'll define these fields in your architecture's
model_config.py file, translating them into a typed configuration dataclass,
in which each field is a class attribute. For example:
@dataclass(kw_only=True)
class MyModelConfig(ArchConfigWithKVCache):
    hidden_size: int
    num_hidden_layers: int
    num_attention_heads: int
    # ... one attribute per config field you map ...

The following list shows common dimensional parameters that mean the same thing across nearly all transformer architectures. While this list isn't exhaustive, these are the most common fields that pass directly into standard layers without requiring graph changes:
- hidden_size
- num_hidden_layers
- num_attention_heads
- num_key_value_heads
- intermediate_size
- vocab_size
- max_position_embeddings
- rms_norm_eps
- hidden_act
- tie_word_embeddings
- rope_theta
- Special token IDs (such as bos_token_id and eos_token_id)
TinyLlama represents this simpler class of configuration. You declare these
fields in model_config.py (see
Map config fields).
Four of these parameters (num_key_value_heads, head_dim,
num_hidden_layers, and max_seq_len) also size the KV cache that MAX
allocates at startup (see
the cache parameters).
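For intuition about why these four values matter, a back-of-the-envelope estimate of the per-sequence cache footprint multiplies them together. This is a generic estimate under stated assumptions (one key tensor and one value tensor per layer, a fixed element size), not a description of how MAX actually sizes or pages its cache; kv_cache_bytes is a hypothetical helper.

```python
def kv_cache_bytes(
    num_hidden_layers: int,
    num_key_value_heads: int,
    head_dim: int,
    max_seq_len: int,
    bytes_per_element: int = 2,
) -> int:
    """Estimate bytes needed to cache keys and values for one full-length sequence."""
    # The factor of 2 counts one key vector and one value vector per head.
    per_token = 2 * num_hidden_layers * num_key_value_heads * head_dim
    return per_token * max_seq_len * bytes_per_element


# TinyLlama values from the config shown earlier, with 2-byte bfloat16 elements.
size = kv_cache_bytes(
    num_hidden_layers=22, num_key_value_heads=4, head_dim=64, max_seq_len=2048
)
print(f"{size / 2**20:.0f} MiB per sequence")  # 44 MiB per sequence
```

Because the estimate scales with num_key_value_heads rather than num_attention_heads, grouped-query models need far less cache than their query head count suggests.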
The model_config.py file defines a typed configuration dataclass and
implements an initialize_from_config() classmethod to parse the Hugging Face
JSON. While most fields map directly, others require preprocessing:
max_position_embeddings is clamped against the pipeline's runtime context
limit to produce max_seq_len, and the dtype is selected based on the target
quantization encoding rather than the config's torch_dtype.
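The clamping step can be pictured as taking the smaller of the two limits. The sketch below is an assumption about the general shape of that logic, not the body of MAX's calculate_max_seq_len; clamp_max_seq_len and its runtime_limit parameter are hypothetical names.

```python
from typing import Optional


def clamp_max_seq_len(
    max_position_embeddings: int, runtime_limit: Optional[int] = None
) -> int:
    """Use the checkpoint's limit unless the runtime asks for a shorter context."""
    if runtime_limit is None:
        return max_position_embeddings
    return min(max_position_embeddings, runtime_limit)


print(clamp_max_seq_len(40960, runtime_limit=8192))  # 8192
print(clamp_max_seq_len(2048, runtime_limit=8192))   # 2048
print(clamp_max_seq_len(2048))                       # 2048
```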
Even simple dimensions require validation: hidden_act maps to a single
parameter, but gelu, gelu_new, and gelu_tanh represent distinct activation
functions. Binding the wrong activation function introduces subtle numerical
drift in the MLP outputs.
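To see how such variants diverge, compare the exact (erf-based) GELU with its tanh approximation. The formulas are the standard ones, but which config string selects which formula depends on the framework, so check the source model's activation table rather than relying on the function names used here.

```python
import math


def gelu_erf(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)
    return 0.5 * x * (1.0 + math.tanh(inner))


for x in (-2.0, -0.5, 0.5, 2.0):
    exact, approx = gelu_erf(x), gelu_tanh(x)
    print(f"x={x:+.1f} erf={exact:.6f} tanh={approx:.6f} diff={abs(exact - approx):.2e}")
```

The per-element gap is small, but it applies in every MLP block, which is why a mismatch tends to show up as gradual drift rather than an outright failure.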
Depending on the similarity of your model's architecture to existing MAX architectures, you implement these mappings as a standalone base configuration or a concise subclass:
- Llama 3.1
- Qwen3
- DeepSeek-V3
Llama3Config is the baseline architecture configuration class that dense decoders extend. Open the configuration schema
in llama3/model_config.py
to inspect how each direct-map field is defined as a typed attribute:
@dataclass(kw_only=True)
class Llama3Config(ArchConfigWithStoredKVParams, ArchConfigWithKVCache):
    hidden_size: int
    num_attention_heads: int
    num_key_value_heads: int
    num_hidden_layers: int
    intermediate_size: int
    vocab_size: int
    max_seq_len: int
    rope_theta: float
    rms_norm_eps: float | None = None
    tie_word_embeddings: bool = False
    dtype: DType
    kv_params: KVCacheParams
    # ... plus rope scaling params, multipliers, devices, quantization ...

    @classmethod
    def initialize_from_config(cls, pipeline_config, huggingface_config, model_config=None):
        return cls(
            hidden_size=huggingface_config.hidden_size,
            num_attention_heads=huggingface_config.num_attention_heads,
            num_key_value_heads=huggingface_config.num_key_value_heads,
            num_hidden_layers=huggingface_config.num_hidden_layers,
            intermediate_size=huggingface_config.intermediate_size,
            vocab_size=huggingface_config.vocab_size,
            rope_theta=get_rope_theta(huggingface_config),
            # max_seq_len is clamped, not copied; dtype comes from the encoding.
            max_seq_len=Llama3Config.calculate_max_seq_len(pipeline_config, huggingface_config),
            dtype=supported_encoding_dtype(pipeline_config.model.quantization_encoding),
            # ... remaining fields ...
        )

Because Qwen3's direct-map fields match Llama's, the configuration class in
qwen3/model_config.py
subclasses Llama3Config to inherit its base fields and appends only the custom MoE properties:
@dataclass(kw_only=True)
class Qwen3Config(Llama3Config):
    # Inherits hidden_size, num_hidden_layers, vocab_size, rope_theta, and the
    # rest of the direct-map fields from Llama3Config. Adds the MoE fields.
    num_experts: int = 0
    num_experts_per_tok: int = 1
    moe_intermediate_size: int = 0

    @classmethod
    def initialize_from_config(cls, pipeline_config, huggingface_config, model_config=None):
        base = Llama3Config.initialize_from_config(
            pipeline_config, huggingface_config, model_config
        )
        return cls(
            hidden_size=base.hidden_size,
            num_attention_heads=base.num_attention_heads,
            # ... carry the rest of the direct-map fields from base ...
            num_experts=getattr(huggingface_config, "num_experts", 0),
            num_experts_per_tok=getattr(huggingface_config, "num_experts_per_tok", 1),
        )

DeepSeek-V3 diverges significantly from dense decoders. Open the standalone configuration class in
deepseekV3/model_config.py
to review the custom schema mapping general parameters alongside structural fields:
@dataclass(kw_only=True)
class DeepseekV3Config(ArchConfigWithKVCache):
    # The same universal fields, with DeepSeek's defaults.
    hidden_size: int = 7168
    num_hidden_layers: int = 61
    num_attention_heads: int = 128
    intermediate_size: int = 18432
    vocab_size: int = 129280
    rms_norm_eps: float = 1e-6
    rope_theta: float = 10000.0
    # Structural fields, carried as config values for the layer code to read.
    q_lora_rank: int = 1536  # multi-head latent attention
    kv_lora_rank: int = 512
    qk_nope_head_dim: int = 128
    qk_rope_head_dim: int = 64
    v_head_dim: int = 128
    n_routed_experts: int = 256  # mixture of experts
    num_experts_per_tok: int = 8
    n_shared_experts: int = 1

Build graph layers for structural fields
Structural fields signal that the model departs from a standard transformer
architecture and requires custom layer logic. When you identify these fields,
prepare to declare or extend custom layers in the model's graph implementation
file (such as llama3.py or deepseekV3.py) using
MAX modules:
- Grouped-query or multi-query attention (num_key_value_heads below num_attention_heads, or multi_query) shares keys and values across query heads.
- Sliding-window attention (sliding_window) or softcapping (attn_logit_softcapping, final_logit_softcapping) changes the attention mask or score path.
- Mixture of experts (num_experts, num_experts_per_tok, router flags) turns the MLP into a routed set of experts.
- Multi-head latent attention (q_lora_rank, kv_lora_rank, qk_nope_head_dim, qk_rope_head_dim) projects attention through a low-rank latent space.
- QK-norm, post-norm, or MuP scalars (use_qk_norm, use_post_norm, embedding_multiplier, logits_scaling) add norms or scale factors inside the block.
- Quantized weights (quantization_config) need a dequantization path for the released checkpoint (see Extend the compute graph).
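The first item in this list can be checked mechanically. The sketch below classifies the attention layout from the two head counts; attention_layout is a hypothetical helper, and it assumes that a missing num_key_value_heads means one key-value head per query head.

```python
from typing import Optional, Tuple


def attention_layout(
    num_attention_heads: int, num_key_value_heads: Optional[int] = None
) -> Tuple[str, int]:
    """Return the layout name and how many query heads share each key-value head."""
    kv_heads = num_key_value_heads or num_attention_heads
    if num_attention_heads % kv_heads != 0:
        raise ValueError("query heads must divide evenly across key-value heads")
    group_size = num_attention_heads // kv_heads
    if kv_heads == num_attention_heads:
        return "multi-head", group_size
    if kv_heads == 1:
        return "multi-query", group_size
    return "grouped-query", group_size


print(attention_layout(32, 4))  # TinyLlama values: ('grouped-query', 8)
print(attention_layout(32, 8))  # Qwen3 values: ('grouped-query', 4)
print(attention_layout(32))     # ('multi-head', 1)
```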
Often, the closest baseline MAX architecture already implements some of these mechanisms. The following three examples show what it looks like to implement custom layer logic based on these structural fields:
- Llama 3.1
- Qwen3
- DeepSeek-V3
Llama 3.1's only structural variation is rope_scaling. This requires no custom layers; instead,
read the scaling parameters from the configuration and pass them to the rotary embedding constructor in
llama3/model_config.py:
# Read rope_scaling into typed params if the config uses the llama3 scheme.
# Default to None so the constructor below still receives a value when the
# config has no llama3 scaling.
rope_scaling_params = None
rope_scaling = getattr(huggingface_config, "rope_scaling", None)
if rope_scaling and rope_scaling.get("rope_type") == "llama3":
    rope_scaling_params = Llama3RopeScalingParams(
        factor=rope_scaling["factor"],
        low_freq_factor=rope_scaling["low_freq_factor"],
        high_freq_factor=rope_scaling["high_freq_factor"],
        orig_max_position=rope_scaling["original_max_position_embeddings"],
    )

# Build the RoPE embedding (scaling_params=None yields standard RoPE).
rope = Llama3RotaryEmbedding(
    dim=hidden_size,
    n_heads=num_attention_heads,
    theta=rope_theta,
    max_seq_len=max_seq_len,
    scaling_params=rope_scaling_params,
)

Qwen3 applies an RMSNorm to the query and key projections before the RoPE calculation. To implement this
in the compute graph, define two additional normalization layers in
qwen3/layers/attention.py
and apply them to query and key projections in the block forward pass:
class Qwen3Attention(Module):
    def __init__(self, *, hidden_size, num_attention_heads, kv_params, ...):
        super().__init__()
        # ... standard QKV and output projections ...
        # Per-head RMSNorm for Q and K, the Qwen3-specific layers.
        self.q_norm = RMSNorm(kv_params.head_dim, dtype=norm_dtype, eps=qk_norm_eps)
        self.k_norm = RMSNorm(kv_params.head_dim, dtype=norm_dtype, eps=qk_norm_eps)

    def __call__(self, layer_idx, x, kv_collection, freqs_cis, input_row_offsets):
        head_dim = self.kv_params.head_dim
        qkv = self.qkv_proj(x)
        x_q, x_k, x_v = ops.split(qkv, [q_dim, kv_dim, kv_dim], axis=-1)
        # Apply per-head QK norm before RoPE. This is the structural change.
        x_q = self.q_norm(x_q.reshape((-1, self.n_heads, head_dim))).reshape((-1, q_dim))
        x_k = self.k_norm(x_k.reshape((-1, self.num_key_value_heads, head_dim))).reshape((-1, kv_dim))
        # ... Re-concat, apply RoPE, store to KV cache, flash attention, output projection ...
        qkv = ops.concat((x_q, x_k, x_v), axis=-1)
        ...

DeepSeek-V3 replaces the standard attention mechanism with Multi-head Latent Attention (MLA). The block
reads low-rank projection parameters directly from the configuration, instantiating a custom attention
module in deepseekV3/deepseekV3.py:
# Multi-head latent attention: q_lora_rank and kv_lora_rank project Q and KV
# through low-rank latent spaces. The head dimension splits into no-rope and
# rope parts, and the parallelism mode determines the concrete class.
self.self_attn = TensorParallelLatentAttentionWithRope(
    rope=rope,
    num_attention_heads=config.num_attention_heads,
    hidden_size=config.hidden_size,
    kv_params=config.kv_params,
    q_lora_rank=config.q_lora_rank,
    kv_lora_rank=config.kv_lora_rank,
    qk_nope_head_dim=config.qk_nope_head_dim,
    qk_rope_head_dim=config.qk_rope_head_dim,
    v_head_dim=config.v_head_dim,
    devices=config.devices,
)

Recognize macro-architecture changes
While most parameters dictate sizes or internal layer logic, some fields signal that the overall sequence of operations must change, requiring a completely new graph.
For example, a vision_config field implies the model is multimodal and needs
a separate vision tower (like
Pixtral) to embed images
before passing them to the transformer. An encoder-decoder structure (like T5)
breaks the standard decoder-only block sequence entirely. If a parameter alters
how blocks are wired together or what inputs they accept, you must write a new
macro-graph implementation rather than just swapping a custom layer into a
standard pipeline.
Next steps
Once you've mapped the configuration fields to config properties and compute graph requirements, the rest of the model port involves registering your new architecture, mapping checkpoint weights, and validating output logits against the source framework. The model bring-up workflow provides an end-to-end view of this sequence.
These pages go deeper on the steps that follow:
- Serve custom model architectures: Register your architecture package and serve it with max serve.
- Quantization: Work with quantized weight encodings when a checkpoint's quantization_config calls for them.