
Speculative decoding

Speculative decoding accelerates LLM token generation without changing the model's outputs. A smaller, faster draft step proposes several candidate tokens, and the larger target model verifies them in one forward pass. MAX accepts the prefix the target agrees with and resamples at the first disagreement, so quality matches running the target alone.
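
As a rough illustration of "accept the agreed prefix, then resample at the first disagreement", here is a toy sketch of the simplest (greedy) form of verification. It is not MAX's implementation: `verify_greedy` and its arguments are hypothetical names, tokens are plain integers, and the target's choices are assumed to be precomputed for every draft position plus one extra position.

```python
def verify_greedy(draft_tokens, target_tokens):
    """Toy greedy verification of one speculative step.

    draft_tokens:  K token ids proposed by the draft.
    target_tokens: K + 1 token ids the target would pick, one per draft
                   position plus one "bonus" position after the last draft.
    Returns the tokens emitted for this step.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens needs one more entry than draft_tokens")

    emitted = []
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted != wanted:
            # First disagreement: keep the target's token and stop.
            emitted.append(wanted)
            return emitted
        emitted.append(drafted)

    # Every draft was accepted, so the target's extra token is kept too.
    emitted.append(target_tokens[-1])
    return emitted


print(verify_greedy([5, 7], [5, 7, 9]))  # all drafts accepted
print(verify_greedy([5, 7], [5, 8, 9]))  # disagreement at the second position
```

In both cases the emitted tokens are exactly what the target would have produced on its own, which is why the output quality is unchanged.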

The speedup comes from batching verification across K candidate positions. When the target accepts all K drafts, you get K+1 tokens per step instead of one, converting a memory-bandwidth-bound workload into one that better uses available compute.
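
To make the K+1 arithmetic concrete, the sketch below estimates the expected number of tokens per step. It assumes each step emits the accepted drafts plus one token from the target, and that each rate is the probability a position is accepted given that all earlier positions were accepted. The function name is hypothetical.

```python
def expected_tokens_per_step(conditional_accept_rates):
    """Expected tokens emitted per verification step.

    conditional_accept_rates[i] is the assumed probability that draft
    position i is accepted, given that all earlier positions were accepted.
    """
    expected = 1.0   # the target always contributes one token per step
    survive = 1.0    # probability that the prefix so far was fully accepted
    for rate in conditional_accept_rates:
        if not 0.0 <= rate <= 1.0:
            raise ValueError("acceptance rates must be between 0 and 1")
        survive *= rate
        expected += survive
    return expected


print(expected_tokens_per_step([1.0, 1.0]))  # perfect draft, K = 2
print(expected_tokens_per_step([0.5, 0.5]))  # weaker draft, K = 2
```

With a perfect draft the result is K+1. As acceptance drops, it falls toward one token per step, the same as running the target alone.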

Caution

Speculative decoding is in preview. APIs and behavior may change as the implementation matures.

We recommend running speculative decoding on an accelerator. For more information, see System requirements.

Supported methods

MAX supports the common speculative-decoding methods EAGLE, EAGLE3, and MTP. To experiment with your own target/draft pairs, use standalone, which works with any draft from the same model family as the target.

Method	Draft source	Supported targets and hardware
`eagle` / `eagle3`	A trained EAGLE draft that shares the target's embedding and `lm_head`	Llama 3 (1 GPU), Kimi K2.5 and K2.6 (8× B200).
`mtp`	A native multi-token prediction head inside the target checkpoint	DeepSeek V3 and derivatives (8× B200), Gemma 4 (1 GPU).
`standalone`	A separate small model you pass with `--draft-model-path`	Any target with a same-family draft, on any GPU the pair fits on.

Serve with speculative decoding

Pick the method you want to run. Each example starts max serve with the right target and draft, then sends a chat-completion request to the local endpoint from the OpenAI Python client.

To call the endpoint, install the OpenAI Python client with one of the following commands, depending on your package manager (pixi, uv, pip, or conda):

pixi add openai

uv add openai

pip install openai

conda install openai

EAGLE

Serve Llama 3.1 8B Instruct with a pretrained EAGLE checkpoint as the draft:

max serve \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --speculative-method eagle \
  --draft-model-path atomicapple0/EAGLE-LLaMA3.1-Instruct-8B \
  --num-speculative-tokens 2 \
  --devices gpu

We use atomicapple0/EAGLE-LLaMA3.1-Instruct-8B because it ships safetensors weights. For the full list of weight formats MAX supports, see WeightsFormat.

Once the server prints Server ready on http://0.0.0.0:8000, send a chat-completion request:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are the benefits of speculative decoding?"},
    ],
    max_tokens=500,
)

print(response.choices[0].message.content)

MTP

Serve DeepSeek V3 with its built-in MTP head. You don't pass a separate draft model because the MTP head already lives in the target checkpoint:

max serve \
  --model deepseek-ai/DeepSeek-V3 \
  --speculative-method mtp \
  --num-speculative-tokens 2 \
  --devices gpu:0,1,2,3,4,5,6,7

Once the server prints Server ready on http://0.0.0.0:8000, send a chat-completion request:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are the benefits of speculative decoding?"},
    ],
    max_tokens=500,
)

print(response.choices[0].message.content)

Standalone

Serve Llama 3.1 8B Instruct with Llama 3.2 1B Instruct as the draft:

max serve \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --speculative-method standalone \
  --draft-model-path meta-llama/Llama-3.2-1B-Instruct \
  --num-speculative-tokens 2 \
  --devices gpu

Once the server prints Server ready on http://0.0.0.0:8000, send a chat-completion request:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are the benefits of speculative decoding?"},
    ],
    max_tokens=500,
)

print(response.choices[0].message.content)

Read the startup architecture name

For EAGLE and MTP, MAX reports a unified architecture at startup, for example UnifiedEagleLlama3ForCausalLM, UnifiedMTPDeepseekV3ForCausalLM, or UnifiedMTPGemma4ForCausalLM. That's expected: MAX compiles the target, draft, and verifier into a single graph. The standalone method runs the target and draft as separate models, so it reports the target's normal architecture.

Monitor acceptance rates

Once you send traffic, the scheduler logs per-batch acceptance stats in this format:

Draft Tokens: 145/160 (90.62%) accepted, Acceptance Len: 1.45 / 2 toks, Per-Pos: [95%, 86%] |

Each field means:

Accepted / generated: tokens the target confirmed, over tokens the draft proposed.
Acceptance length: average number of drafted tokens accepted per verification pass. A value of 1.45 / 2 means on average 1.45 of the 2 drafted tokens survive verification.
Per-position: acceptance rate at each draft position, conditional on all earlier positions accepting. Later positions typically show lower rates.

Low acceptance rates (below roughly 50%) usually mean the draft doesn't match the target well. Try a smaller --num-speculative-tokens or a better-matched draft checkpoint.
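
If you want to track these numbers over time, a small parser along the following lines can pull them out of the log. This is a sketch that assumes the line format matches the example above exactly. `parse_acceptance_line` and `AcceptanceStats` are hypothetical names, not part of MAX, and the 0.5 threshold mirrors the rough 50% guideline.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Pattern written against the example log line; adjust it if your logs differ.
LOG_PATTERN = re.compile(
    r"Draft Tokens: (?P<accepted>\d+)/(?P<proposed>\d+) \([\d.]+%\) accepted, "
    r"Acceptance Len: (?P<accept_len>[\d.]+) / (?P<k>\d+) toks, "
    r"Per-Pos: \[(?P<per_pos>[^\]]*)\]"
)


@dataclass
class AcceptanceStats:
    accepted: int
    proposed: int
    acceptance_len: float
    num_speculative_tokens: int
    per_position: List[float]  # fractions in [0, 1]

    @property
    def acceptance_rate(self) -> float:
        return self.accepted / self.proposed if self.proposed else 0.0

    @property
    def looks_low(self) -> bool:
        # Rough guideline: below about 50% suggests a poorly matched draft.
        return self.acceptance_rate < 0.5


def parse_acceptance_line(line: str) -> Optional[AcceptanceStats]:
    """Return parsed stats, or None if the line is not an acceptance line."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    per_position = [
        float(part.strip().rstrip("%")) / 100
        for part in match["per_pos"].split(",")
        if part.strip()
    ]
    return AcceptanceStats(
        accepted=int(match["accepted"]),
        proposed=int(match["proposed"]),
        acceptance_len=float(match["accept_len"]),
        num_speculative_tokens=int(match["k"]),
        per_position=per_position,
    )


sample = (
    "Draft Tokens: 145/160 (90.62%) accepted, "
    "Acceptance Len: 1.45 / 2 toks, Per-Pos: [95%, 86%] |"
)
stats = parse_acceptance_line(sample)
print(stats, stats.looks_low)
```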

Tune speculative decoding

The following flags control how MAX drafts tokens and how verification decides to accept them. The first three map to fields on SpeculativeConfig when you configure a pipeline programmatically. The last two live on the draft model's config and on KVCacheConfig.

--num-speculative-tokens: number of tokens the draft proposes per step. Default is 2. Larger values raise peak speedup but hurt acceptance at later positions.
--rejection-sampling-strategy: how the target verifies drafted tokens. Accepts greedy, residual, typical-acceptance, or logit-comparison. When unset, EAGLE and MTP default to typical-acceptance; standalone defaults to residual.
--synthetic-acceptance-rate: benchmarking-only knob that accepts each drafted token with a calibrated probability, ignoring real logits. Use this to model hypothetical speedups without changing the draft.
--draft-devices: device list for the draft model. Useful when you want the draft and target on different GPUs.
--device-memory-utilization: fraction of device memory MAX may use. Speculative decoding allocates KV cache for both the target and the draft, so leave more headroom than you would for single-model serving.
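
When you sweep these flags in a benchmark script, it can help to assemble the command line in one place. The helper below is a hypothetical sketch, not part of MAX. It uses only flags listed on this page, assumes --device-memory-utilization takes a plain decimal fraction, and leaves out --draft-devices because its value format isn't shown here.

```python
from dataclasses import dataclass
from typing import List, Optional

# Strategy names as listed for --rejection-sampling-strategy.
STRATEGIES = {"greedy", "residual", "typical-acceptance", "logit-comparison"}


@dataclass
class SpecServeArgs:  # hypothetical helper for building a `max serve` command
    model: str
    method: str
    draft_model_path: Optional[str] = None
    num_speculative_tokens: int = 2
    rejection_sampling_strategy: Optional[str] = None
    device_memory_utilization: Optional[float] = None

    def to_argv(self) -> List[str]:
        if self.num_speculative_tokens < 1:
            raise ValueError("num_speculative_tokens must be at least 1")
        strategy = self.rejection_sampling_strategy
        if strategy is not None and strategy not in STRATEGIES:
            raise ValueError(f"unknown rejection sampling strategy: {strategy}")

        argv = [
            "max", "serve",
            "--model", self.model,
            "--speculative-method", self.method,
            "--num-speculative-tokens", str(self.num_speculative_tokens),
        ]
        if self.draft_model_path is not None:
            argv += ["--draft-model-path", self.draft_model_path]
        if strategy is not None:
            argv += ["--rejection-sampling-strategy", strategy]
        if self.device_memory_utilization is not None:
            # Assumed value format: a fraction such as 0.8.
            argv += ["--device-memory-utilization", str(self.device_memory_utilization)]
        return argv


args = SpecServeArgs(
    model="meta-llama/Llama-3.1-8B-Instruct",
    method="standalone",
    draft_model_path="meta-llama/Llama-3.2-1B-Instruct",
    rejection_sampling_strategy="greedy",
)
print(" ".join(args.to_argv()))
```

Pass the resulting list to `subprocess.run` or print it for copy and paste. Leaving the strategy unset keeps the per-method defaults described above.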

For EAGLE and MTP, MAX auto-enables the overlap scheduler and device graph capture. Both reduce per-step latency and need no additional flags.

Compatibility and limits

These restrictions apply when speculative decoding is enabled:

--enable-echo isn't supported alongside speculative decoding.
Structured output (JSON-schema-constrained generation) isn't supported alongside speculative decoding.
The target and draft must share the same max_length.
Repetition, frequency, and presence penalties are disabled when a separate draft model is configured (--draft-model-path). That covers EAGLE and standalone. MTP keeps the draft head inside the target checkpoint, so it supports penalties.
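
These restrictions can be checked before launching a server. The function below is a hypothetical pre-flight sketch, not a MAX API. The parameter names are made up for illustration, and the rules are copied from the list above.

```python
from typing import List, Optional


def find_conflicts(
    *,
    enable_echo: bool = False,
    structured_output: bool = False,
    uses_draft_model_path: bool = False,
    penalties_requested: bool = False,
    target_max_length: Optional[int] = None,
    draft_max_length: Optional[int] = None,
) -> List[str]:
    """Return human-readable conflicts with speculative decoding."""
    conflicts = []
    if enable_echo:
        conflicts.append("--enable-echo isn't supported with speculative decoding")
    if structured_output:
        conflicts.append("structured output isn't supported with speculative decoding")
    if (
        target_max_length is not None
        and draft_max_length is not None
        and target_max_length != draft_max_length
    ):
        conflicts.append("target and draft must share the same max_length")
    if uses_draft_model_path and penalties_requested:
        # Applies to EAGLE and standalone; MTP has no separate draft model.
        conflicts.append("penalties are disabled when --draft-model-path is set")
    return conflicts


for problem in find_conflicts(
    uses_draft_model_path=True,
    penalties_requested=True,
    target_max_length=8192,
    draft_max_length=4096,
):
    print("conflict:", problem)
```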

Next steps

You can combine speculative decoding with prefix caching and with disaggregated inference.
