
Read a Hugging Face model config

When importing a Hugging Face model into MAX as a custom architecture, you're responsible for building the graph that the model's config.json describes. This makes reading and interpreting the config your job. It's the entry point of your model bring-up workflow and the file you consult most frequently.

What is a model config?

A Hugging Face config.json is the serialized record of how a model was built. It is the saved form of the model's PretrainedConfig object, written during training or weight serialization, and it carries only the model's architectural hyperparameters: the dimensions and structural flags that define the graph's shape (rather than training settings like learning rate). It contains neither the execution graph nor the tensor weights.

When you run inference with Hugging Face's transformers library, you rarely inspect this file directly: the library reads the configuration automatically and instantiates the architecture for you. When you port a model to MAX, that instantiation step is yours to write.

By reviewing the configuration fields, you can identify which values map directly to the MAX configuration schema, which flags require custom graph layers, and which entries might contain incorrect defaults that you need to verify against the weight dimensions. Use the sections below to interpret each field and translate it into a concrete architecture plan.

Read the config

Before you write any architecture code, load and inspect the complete configuration to trace the model's architectural characteristics. The dimensions map directly to your MAX configuration, while non-default flags locate where the model's execution flow diverges from a standard transformer: a decoder-only stack of uniform attention and MLP blocks, like Llama.

You can read a model's configuration on Hugging Face by navigating to the Files and versions tab of its model repository and viewing the config.json file, or you can load the config locally and print each field. Start by looking at the complete picture rather than the handful of fields a model card mentions. To view the configuration without downloading the model's tensor weights, create a script named load_config.py with the following code:

load_config.py
from transformers import AutoConfig

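# Replace with your model ID.
# trust_remote_code=True lets transformers execute Python code shipped in the
# model repository; enable it only for repositories you trust.
config = AutoConfig.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", trust_remote_code=True)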
print(config)
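
If you have already downloaded config.json (for example, from the Files and versions tab), a minimal sketch like the following prints every field without importing transformers at all. The function names `load_config_fields` and `format_fields` are illustrative and not part of MAX or transformers.

```python
import json
from pathlib import Path


def load_config_fields(path: str) -> dict:
    """Parse a local config.json and return its fields as a plain dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def format_fields(fields: dict) -> list:
    """Return one 'name = value' line per field, sorted by field name."""
    return [f"{name} = {fields[name]!r}" for name in sorted(fields)]
```

To use it, call `print("\n".join(format_fields(load_config_fields("config.json"))))` with the path to your downloaded file. Sorting by name makes it easier to compare the output with another checkpoint's config line by line.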

Three fields in the printed configuration frame how you read the rest: architectures, model_type, and torch_dtype. The following output is from TinyLlama 1.1B. The remaining properties are the dimensions and flags you'll map in the sections below:

config.json
LlamaConfig {
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "torch_dtype": "bfloat16",
  "eos_token_id": 2,
  "head_dim": 64,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 5632,
  "max_position_embeddings": 2048,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 22,
  "num_key_value_heads": 4,
  "pad_token_id": null,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_parameters": {
    "rope_theta": 10000.0,
    "rope_type": "default"
  },
  "tie_word_embeddings": false,
  "transformers_version": "5.9.0",
  "use_cache": true,
  "vocab_size": 32000
}
  • architectures: Lists the architecture classes. The first entry (architectures[0]) is the exact string identifier you must register in your architecture's arch.py file.
  • torch_dtype (or dtype): Specifies the data type in which the weights are serialized, which sets the default encoding declared in the pipeline. (torch_dtype is deprecated and replaced by dtype.)
  • model_type: Identifies the model family (such as llama or qwen3), pointing to the closest baseline MAX architecture and indicating the expected weight serialization layout.
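
As a rough illustration of reading these three identifiers programmatically, the sketch below pulls them out of a config dictionary and falls back from dtype to torch_dtype. The helper name `framing_fields` is hypothetical, and the sample dictionary copies only a few values from the TinyLlama output above.

```python
def framing_fields(cfg: dict) -> dict:
    """Extract the three identifiers that frame how to read the rest of the config."""
    architectures = cfg.get("architectures") or []
    return {
        # The first entry is the identifier to register for the architecture.
        "architecture": architectures[0] if architectures else None,
        "model_type": cfg.get("model_type"),
        # Newer configs use "dtype"; older ones use the deprecated "torch_dtype".
        "dtype": cfg.get("dtype") or cfg.get("torch_dtype"),
    }


tinyllama = {
    "architectures": ["LlamaForCausalLM"],
    "model_type": "llama",
    "torch_dtype": "bfloat16",
}
print(framing_fields(tinyllama))
```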

Beyond these core identifiers, the remaining parameters are highly model-specific. They reflect the exact architectural dimensions, flags, and design choices established by the model authors during the training phase.

Because instruct, base, and chat variants often differ in parameters like max_position_embeddings, rope_scaling, or tie_word_embeddings, always load the config for the exact checkpoint you're porting. Copying values from a sibling variant can result in a model that loads without errors but fails at long sequence lengths or on the final logits projection.
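
One way to catch variant drift is to compare the two configs field by field before reusing anything. The following is a minimal sketch: `diff_configs` is an illustrative helper, and the `base` and `chat` dictionaries hold made-up values rather than fields from any real checkpoint.

```python
MISSING = "<missing>"  # Marker for a field that one config lacks entirely.


def diff_configs(a: dict, b: dict) -> dict:
    """Return {field: (value_in_a, value_in_b)} for every field that differs."""
    diffs = {}
    for key in sorted(set(a) | set(b)):
        left = a.get(key, MISSING)
        right = b.get(key, MISSING)
        if left != right:
            diffs[key] = (left, right)
    return diffs


# Hypothetical sibling variants, for illustration only.
base = {"hidden_size": 2048, "max_position_embeddings": 2048, "tie_word_embeddings": False}
chat = {"hidden_size": 2048, "max_position_embeddings": 4096, "tie_word_embeddings": True}

for field, (old, new) in diff_configs(base, chat).items():
    print(f"{field}: {old!r} -> {new!r}")
```

Any field that appears in the output is one you should not copy across variants without checking.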

Explore other model configs

TinyLlama is a model checkpoint using the standard Llama architecture (LlamaForCausalLM), where almost every field maps directly, and the only special consideration is rope_scaling if porting a larger context sibling. Examining the configuration files for other architectures supported by MAX illustrates the spectrum of divergence you encounter:

Qwen3's configuration file looks nearly identical to Llama's, specifying grouped-query attention, RoPE, and an explicit head_dim. However, its primary structural difference is completely absent from the configuration file. Qwen3 applies an RMSNorm to the query and key projections prior to attention. This operation is defined entirely in the model's execution code, rather than its configuration JSON.

config.json (condensed)
{
  "architectures": ["Qwen3ForCausalLM"],
  "model_type": "qwen3",
  "hidden_size": 4096,
  "num_hidden_layers": 36,
  "num_attention_heads": 32,
  "num_key_value_heads": 8,
  "head_dim": 128,
  "intermediate_size": 12288,
  "max_position_embeddings": 40960,
  "rope_theta": 1000000,
  "torch_dtype": "bfloat16"
}

If you relied solely on this config, you would conclude that you can reuse the Llama 3 architecture unchanged. The custom qwen3 architecture exists to accommodate this unlisted difference, which is why you must cross-reference config files with the source modeling scripts.

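Configuration fields generally fall into three categories based on their architectural scope: dimensions that map directly into your configuration class, structural fields that require custom graph layers, and fields that signal a change to the overall architecture. This split establishes your development roadmap.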

Add dimensions directly to model_config.py

When mapping fields from config.json, start by identifying static dimensions and hyperparameters. These fields parameterize existing math (such as sizing matrices or defining scalar constants) but don't alter the sequence of operations. You'll define these fields in your architecture's model_config.py file, translating them into a typed configuration dataclass, in which each field is a class attribute. For example:

@dataclass(kw_only=True)
class MyModelConfig(ArchConfigWithKVCache):
    hidden_size: int
    num_hidden_layers: int
    num_attention_heads: int
    # ... one attribute per config field you map ...

The following list shows common dimensional parameters that mean the same thing across nearly all transformer architectures. While this list isn't exhaustive, these are the most common fields that pass directly into standard layers without requiring graph changes:

  • hidden_size
  • num_hidden_layers
  • num_attention_heads
  • num_key_value_heads
  • intermediate_size
  • vocab_size
  • max_position_embeddings
  • rms_norm_eps
  • hidden_act
  • tie_word_embeddings
  • rope_theta
  • Special token IDs (such as bos_token_id and eos_token_id)

TinyLlama represents this simpler class of configuration. You declare these fields in model_config.py (see Map config fields). Four values (num_key_value_heads, head_dim, num_hidden_layers, and max_seq_len, the last of which is derived from max_position_embeddings) also size the KV cache that MAX allocates at startup (see the cache parameters).
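
To see why those four values matter, here is a back-of-the-envelope estimate using the standard KV cache size formula: two tensors (keys and values) per layer, each holding num_key_value_heads × head_dim elements per token. This is only a sketch with a hypothetical helper name. It assumes 2 bytes per element (bfloat16) and a single sequence, and it does not model how MAX actually pages or batches the cache.

```python
def kv_cache_bytes(num_hidden_layers: int, num_key_value_heads: int,
                   head_dim: int, max_seq_len: int, bytes_per_element: int = 2) -> int:
    """Estimate KV cache bytes for one sequence of max_seq_len tokens."""
    # The leading 2 accounts for storing both keys and values in every layer.
    per_token = 2 * num_hidden_layers * num_key_value_heads * head_dim * bytes_per_element
    return per_token * max_seq_len


# Values taken from the TinyLlama config shown earlier.
tinyllama_bytes = kv_cache_bytes(
    num_hidden_layers=22, num_key_value_heads=4, head_dim=64, max_seq_len=2048
)
print(tinyllama_bytes / 2**20)  # 44.0 (MiB)
```

With only 4 key-value heads, the cache is an eighth of what 32 full heads would need, which is the practical payoff of grouped-query attention.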

The model_config.py file defines a typed configuration dataclass and implements an initialize_from_config() classmethod to parse the Hugging Face JSON. While most fields map directly, others require preprocessing: max_position_embeddings is clamped against the pipeline's runtime context limit to produce max_seq_len, and the dtype is selected based on the target quantization encoding rather than the config's torch_dtype.
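
The clamping step can be pictured with the sketch below. It assumes that "clamped" means taking the smaller of the model's trained limit and a runtime limit. The helper name and its behavior are assumptions for illustration, and the real calculate_max_seq_len in MAX may validate or resolve the values differently.

```python
from typing import Optional


def clamp_max_seq_len(max_position_embeddings: int, runtime_limit: Optional[int]) -> int:
    """Pick the effective sequence length (assumed behavior, not MAX's code)."""
    if runtime_limit is None:
        # No runtime limit configured: use what the model was trained for.
        return max_position_embeddings
    return min(max_position_embeddings, runtime_limit)


print(clamp_max_seq_len(40960, 8192))  # 8192
print(clamp_max_seq_len(2048, None))   # 2048
```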

Even simple dimensions require validation: hidden_act maps to a single parameter, but gelu, gelu_new, and gelu_tanh represent distinct activation functions. Binding the wrong activation function introduces subtle numerical drift in the MLP outputs.
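
To get a feel for the size of that drift, the sketch below compares the exact (erf-based) GELU with the common tanh approximation using only the standard library. Which config string selects which formula is an assumption you should confirm in the source modeling code. The function names here are illustrative.

```python
import math


def gelu_exact(x: float) -> float:
    """Exact GELU: x * Phi(x), written with the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh_approx(x: float) -> float:
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Sample the range [-4, 4] in steps of 0.25 and record the largest gap.
samples = [i / 4 for i in range(-16, 17)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh_approx(x)) for x in samples)
print(f"largest gap over samples: {max_gap:.2e}")
```

The gap per activation is small, but it is nonzero, and it accumulates across every MLP in the stack, which is why mismatched logits can be hard to trace back to this field.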

Depending on the similarity of your model's architecture to existing MAX architectures, you implement these mappings as a standalone base configuration or a concise subclass:

Llama3Config is the baseline architecture configuration class that dense decoders extend. Open the configuration schema in llama3/model_config.py to inspect how each direct-map field is defined as a typed attribute:

llama3/model_config.py
@dataclass(kw_only=True)
class Llama3Config(ArchConfigWithStoredKVParams, ArchConfigWithKVCache):
    hidden_size: int
    num_attention_heads: int
    num_key_value_heads: int
    num_hidden_layers: int
    intermediate_size: int
    vocab_size: int
    max_seq_len: int
    rope_theta: float
    rms_norm_eps: float | None = None
    tie_word_embeddings: bool = False
    dtype: DType
    kv_params: KVCacheParams
    # ... plus rope scaling params, multipliers, devices, quantization ...

    @classmethod
    def initialize_from_config(cls, pipeline_config, huggingface_config, model_config=None):
        return cls(
            hidden_size=huggingface_config.hidden_size,
            num_attention_heads=huggingface_config.num_attention_heads,
            num_key_value_heads=huggingface_config.num_key_value_heads,
            num_hidden_layers=huggingface_config.num_hidden_layers,
            intermediate_size=huggingface_config.intermediate_size,
            vocab_size=huggingface_config.vocab_size,
            rope_theta=get_rope_theta(huggingface_config),
            # max_seq_len is clamped, not copied; dtype comes from the encoding.
            max_seq_len=Llama3Config.calculate_max_seq_len(pipeline_config, huggingface_config),
            dtype=supported_encoding_dtype(pipeline_config.model.quantization_encoding),
            # ... remaining fields ...
        )
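
Before trusting the direct-map dimensions, it can help to cross-check them against the checkpoint's tensor shapes. The sketch below assumes Llama-style Hugging Face weight names and assumes you have already collected the shapes into a dictionary (for example, from the checkpoint's metadata). The helper names are hypothetical, and other model families name their tensors differently.

```python
def expected_shapes(cfg: dict) -> dict:
    """Shapes a Llama-style checkpoint should have, derived from config fields."""
    hidden = cfg["hidden_size"]
    # Fall back to hidden_size // num_attention_heads when head_dim is absent.
    head_dim = cfg.get("head_dim") or hidden // cfg["num_attention_heads"]
    return {
        "model.embed_tokens.weight": (cfg["vocab_size"], hidden),
        "model.layers.0.self_attn.q_proj.weight": (cfg["num_attention_heads"] * head_dim, hidden),
        "model.layers.0.self_attn.k_proj.weight": (cfg["num_key_value_heads"] * head_dim, hidden),
        "model.layers.0.mlp.gate_proj.weight": (cfg["intermediate_size"], hidden),
    }


def shape_mismatches(cfg: dict, actual_shapes: dict) -> list:
    """Describe every tensor whose shape disagrees with the config."""
    problems = []
    for name, want in expected_shapes(cfg).items():
        got = actual_shapes.get(name)
        if got is not None and tuple(got) != want:
            problems.append(f"{name}: config implies {want}, checkpoint has {tuple(got)}")
    return problems


# Dimensions copied from the TinyLlama config shown earlier.
tinyllama_cfg = {
    "hidden_size": 2048, "num_attention_heads": 32, "num_key_value_heads": 4,
    "head_dim": 64, "intermediate_size": 5632, "vocab_size": 32000,
}
print(expected_shapes(tinyllama_cfg)["model.layers.0.self_attn.k_proj.weight"])  # (256, 2048)
```

An empty list from `shape_mismatches` means the config and the weights agree on these dimensions. Any entry points at a field whose value you should take from the weights instead.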

Build graph layers for structural fields

Structural fields signal that the model departs from a standard transformer architecture and requires custom layer logic. When you identify these fields, prepare to declare or extend custom layers in the model's graph implementation file (such as llama3.py or deepseekV3.py) using MAX modules:

  • Grouped-query or multi-query attention (num_key_value_heads below num_attention_heads, or multi_query) shares keys and values across query heads.
  • Sliding-window attention (sliding_window) or softcapping (attn_logit_softcapping, final_logit_softcapping) changes the attention mask or score path.
  • Mixture of experts (num_experts, num_experts_per_tok, router flags) turns the MLP into a routed set of experts.
  • Multi-head latent attention (q_lora_rank, kv_lora_rank, qk_nope_head_dim, qk_rope_head_dim) projects attention through a low-rank latent space.
  • QK-norm, post-norm, or MuP scalars (use_qk_norm, use_post_norm, embedding_multiplier, logits_scaling) add norms or scale factors inside the block.
  • Quantized weights (quantization_config) need a dequantization path for the released checkpoint (see Extend the compute graph).
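
The list above can be turned into a quick first-pass scan of a config. The following is a sketch only: the field names are the ones listed above, the helper name `structural_findings` is hypothetical, and real models may spell these flags differently, so an empty result does not prove the model is a standard transformer (Qwen3's QK-norm, for instance, never appears in its config).

```python
STRUCTURAL_FIELDS = {
    "sliding-window attention": ("sliding_window",),
    "softcapping": ("attn_logit_softcapping", "final_logit_softcapping"),
    "mixture of experts": ("num_experts", "num_experts_per_tok"),
    "multi-head latent attention": (
        "q_lora_rank", "kv_lora_rank", "qk_nope_head_dim", "qk_rope_head_dim",
    ),
    "QK-norm, post-norm, or MuP scalars": (
        "use_qk_norm", "use_post_norm", "embedding_multiplier", "logits_scaling",
    ),
    "quantized weights": ("quantization_config",),
}


def structural_findings(cfg: dict) -> list:
    """List the structural mechanisms that a config's fields point to."""
    found = []
    kv_heads = cfg.get("num_key_value_heads")
    q_heads = cfg.get("num_attention_heads")
    if cfg.get("multi_query") or (kv_heads is not None and q_heads is not None and kv_heads < q_heads):
        found.append("grouped-query or multi-query attention")
    for label, names in STRUCTURAL_FIELDS.items():
        # None, False, and 0 are all treated as "not set".
        if any(cfg.get(name) not in (None, False) for name in names):
            found.append(label)
    return found


print(structural_findings({"num_attention_heads": 32, "num_key_value_heads": 4, "sliding_window": None}))
```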

Often, the closest baseline MAX architecture already implements some of these mechanisms. The following example shows how a structural field is handled in practice:

Llama 3.1's only structural variation is rope_scaling. This requires no custom layers; instead, read the scaling parameters from the configuration and pass them to the rotary embedding constructor in llama3/model_config.py:

llama3/model_config.py
# Read rope_scaling into typed params if the config uses the llama3 scheme.
# Default to None so the constructor below still works without scaling.
rope_scaling_params = None
rope_scaling = getattr(huggingface_config, "rope_scaling", None)
if rope_scaling and rope_scaling.get("rope_type") == "llama3":
    rope_scaling_params = Llama3RopeScalingParams(
        factor=rope_scaling["factor"],
        low_freq_factor=rope_scaling["low_freq_factor"],
        high_freq_factor=rope_scaling["high_freq_factor"],
        orig_max_position=rope_scaling["original_max_position_embeddings"],
    )

# Build the RoPE embedding (scaling_params=None yields standard RoPE).
rope = Llama3RotaryEmbedding(
    dim=hidden_size,
    n_heads=num_attention_heads,
    theta=rope_theta,
    max_seq_len=max_seq_len,
    scaling_params=rope_scaling_params,
)
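
To make the four scaling parameters concrete, the sketch below applies the llama3 frequency-scaling rule in plain Python: high-frequency components are left alone, low-frequency components are divided by factor, and the band in between is interpolated. It is written from the publicly described scheme and is not MAX's Llama3RotaryEmbedding code. The function name is illustrative, and the sample values (theta 500000, factor 8, low 1, high 4, original length 8192, head dimension 128) are assumed to match Llama 3.1's published config.

```python
import math


def llama3_scaled_inv_freq(head_dim, theta, factor, low_freq_factor,
                           high_freq_factor, orig_max_position):
    """Return (base, scaled) inverse frequencies for one attention head."""
    base = [theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]
    low_freq_wavelen = orig_max_position / low_freq_factor
    high_freq_wavelen = orig_max_position / high_freq_factor
    scaled = []
    for freq in base:
        wavelen = 2 * math.pi / freq
        if wavelen < high_freq_wavelen:
            scaled.append(freq)            # Short wavelengths: unchanged.
        elif wavelen > low_freq_wavelen:
            scaled.append(freq / factor)   # Long wavelengths: fully scaled.
        else:
            # In between: blend smoothly between scaled and unscaled.
            smooth = (orig_max_position / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor
            )
            scaled.append((1 - smooth) * freq / factor + smooth * freq)
    return base, scaled


base, scaled = llama3_scaled_inv_freq(128, 500000.0, 8.0, 1.0, 4.0, 8192)
print(scaled[0] == base[0], scaled[-1] / base[-1])
```

The first frequency is untouched and the last is divided by the full factor, which is how the scheme stretches the context window without disturbing short-range position information.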

Recognize macro-architecture changes

While most parameters dictate sizes or internal layer logic, some fields signal that the overall sequence of operations must change, requiring a completely new graph.

For example, a vision_config field implies the model is multimodal and needs a separate vision tower (like Pixtral) to embed images before passing them to the transformer. An encoder-decoder structure (like T5) breaks the standard decoder-only block sequence entirely. If a parameter alters how blocks are wired together or what inputs they accept, you must write a new macro-graph implementation rather than just swapping a custom layer into a standard pipeline.
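
A first-pass check for these signals might look like the sketch below. The vision_config field comes from the example above. The is_encoder_decoder flag is an assumption added for illustration: it is a standard Hugging Face config attribute, but this page does not mention it, and the helper name is hypothetical.

```python
def macro_architecture_signals(cfg: dict) -> list:
    """Flag config fields that suggest the graph itself must be restructured."""
    signals = []
    if cfg.get("vision_config") is not None:
        signals.append("vision tower (vision_config present)")
    if cfg.get("is_encoder_decoder"):
        signals.append("encoder-decoder stack (is_encoder_decoder is true)")
    return signals


print(macro_architecture_signals({"model_type": "llama"}))  # []
print(macro_architecture_signals({"vision_config": {"hidden_size": 1024}}))
```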

Next steps

Once you map the configuration fields to config properties and compute graph requirements, the remainder of the model port involves registering your new architecture, mapping checkpoint weights, and validating output logits against the source framework. The model bring-up workflow provides an end-to-end view of this sequence.
