Speculative decoding
Speculative decoding accelerates LLM token generation without changing the model's outputs. A smaller, faster draft step proposes several candidate tokens, and the larger target model verifies them in one forward pass. MAX accepts the prefix the target agrees with and resamples at the first disagreement, so quality matches running the target alone.
The speedup comes from batching verification across K candidate positions. When the target accepts all K drafts, you get K+1 tokens per step instead of one, converting a memory-bandwidth-bound workload into one that better uses available compute.
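The accept-then-correct step can be illustrated with a minimal sketch. It assumes plain greedy matching, where a drafted token is accepted only if it equals the target's own top choice. The function name `verify_greedy` and the token IDs are hypothetical, and this is not MAX's implementation.

```python
from typing import List


def verify_greedy(draft_tokens: List[int], target_tokens: List[int]) -> List[int]:
    """Return the tokens emitted by one verification step.

    draft_tokens:  the K tokens proposed by the draft.
    target_tokens: the target's greedy choice at each of the K draft
                   positions plus one extra position (K + 1 entries),
                   all produced by a single forward pass.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target must score K + 1 positions for K drafts")

    emitted: List[int] = []
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted != wanted:
            # First disagreement: keep the target's token and stop, because
            # later target choices were conditioned on a rejected draft token.
            emitted.append(wanted)
            return emitted
        emitted.append(drafted)

    # Every draft matched, so the target's extra position is a bonus token.
    emitted.append(target_tokens[-1])
    return emitted


if __name__ == "__main__":
    print(verify_greedy([5, 9], [5, 9, 3]))  # all drafts accepted
    print(verify_greedy([5, 9], [5, 7, 3]))  # second draft rejected
```

With K = 2, full agreement emits three tokens from one target pass, and a rejection at the second position still emits two. Sampling-based strategies replace the equality check with a probabilistic test, but the prefix logic stays the same.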
Supported methods
MAX supports common speculative-decoding methods like EAGLE, EAGLE3, and
MTP. For experimenting with arbitrary target/draft pairs, use
standalone.
| Method | Draft source | Supported targets and hardware |
|---|---|---|
| eagle / eagle3 | A trained EAGLE draft that shares the target's embedding and lm_head | Llama 3 (1 GPU), Kimi K2.5 (8× B200). |
| mtp | A native multi-token prediction head inside the target checkpoint | DeepSeek V3 and derivatives (8× B200). |
| standalone | A separate small model you pass with --draft-model-path | Any target with a same-family draft, on any GPU the pair fits on. |
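The draft-source column decides whether a launch needs a draft path. As a rough sketch, the helper below assembles a `max serve` argument list from the flags used in this page's serve examples. The helper name `build_serve_args` and the `NEEDS_DRAFT_PATH` mapping are hypothetical, and it is an assumption that the CLI accepts every method spelling listed in the table.

```python
from typing import List, Optional

# Methods and draft requirements as listed in the table above. Whether the CLI
# accepts exactly these spellings (for example "eagle3") is an assumption.
NEEDS_DRAFT_PATH = {
    "eagle": True,
    "eagle3": True,
    "standalone": True,
    "mtp": False,  # the draft head ships inside the target checkpoint
}


def build_serve_args(
    method: str,
    model: str,
    draft_model: Optional[str] = None,
    num_speculative_tokens: int = 2,
) -> List[str]:
    """Assemble a `max serve` argument list for one speculative method."""
    if method not in NEEDS_DRAFT_PATH:
        raise ValueError(f"unknown method: {method!r}")
    if NEEDS_DRAFT_PATH[method] and draft_model is None:
        raise ValueError(f"{method} needs a draft model path")
    if not NEEDS_DRAFT_PATH[method] and draft_model is not None:
        raise ValueError(f"{method} does not take a separate draft model")

    args = ["max", "serve", "--model", model, "--speculative-method", method]
    if draft_model is not None:
        args += ["--draft-model-path", draft_model]
    args += ["--num-speculative-tokens", str(num_speculative_tokens)]
    return args


if __name__ == "__main__":
    print(" ".join(build_serve_args("mtp", "deepseek-ai/DeepSeek-V3")))
```

Rejecting a draft path for mtp up front mirrors the table: the MTP head already lives in the target checkpoint, so there is nothing separate to load.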
Serve with speculative decoding
Each example below starts max serve with the right target and draft for one method, then sends a chat-completion request to the local endpoint from the OpenAI Python client.
To call the endpoint, install the OpenAI Python client with the package manager you use:
- pixi: pixi add openai
- uv: uv add openai
- pip: pip install openai
- conda: conda install openai
The examples cover EAGLE, MTP, and standalone, in that order.
Serve Llama 3.1 8B Instruct with a pretrained EAGLE checkpoint as the
draft:
max serve \
--model meta-llama/Llama-3.1-8B-Instruct \
--speculative-method eagle \
--draft-model-path atomicapple0/EAGLE-LLaMA3.1-Instruct-8B \
--num-speculative-tokens 2 \
--devices gpu
We use atomicapple0/EAGLE-LLaMA3.1-Instruct-8B because it ships safetensors weights. For the full list of weight formats MAX supports, see WeightsFormat.
Once the server prints Server ready on http://0.0.0.0:8000, send a
chat-completion request:
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are the benefits of speculative decoding?"},
    ],
    max_tokens=500,
)
print(response.choices[0].message.content)
Serve DeepSeek V3 with its built-in MTP head. You don't pass a separate draft model because the MTP head already lives in the target checkpoint:
max serve \
--model deepseek-ai/DeepSeek-V3 \
--speculative-method mtp \
--num-speculative-tokens 2 \
--devices gpu:0,1,2,3,4,5,6,7
Once the server prints Server ready on http://0.0.0.0:8000, send a chat-completion request:
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are the benefits of speculative decoding?"},
    ],
    max_tokens=500,
)
print(response.choices[0].message.content)
Serve Llama 3.1 8B Instruct with Llama 3.2 1B Instruct as the draft:
max serve \
--model meta-llama/Llama-3.1-8B-Instruct \
--speculative-method standalone \
--draft-model-path meta-llama/Llama-3.2-1B-Instruct \
--num-speculative-tokens 2 \
--devices gpu
Once the server prints Server ready on http://0.0.0.0:8000, send a chat-completion request:
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are the benefits of speculative decoding?"},
    ],
    max_tokens=500,
)
print(response.choices[0].message.content)
Read the startup architecture name
For EAGLE and MTP, MAX reports a unified architecture at startup, for
example UnifiedEagleLlama3ForCausalLM or
UnifiedMTPDeepseekV3ForCausalLM. That's expected: MAX compiles the
target, draft, and verifier into a single graph. standalone runs the
target and draft as separate models and reports the target's normal
architecture.
Monitor acceptance rates
Once you send traffic, the scheduler logs per-batch acceptance stats in this format:
Draft Tokens: 145/160 (90.62%) accepted, Acceptance Len: 1.45 / 2 toks, Per-Pos: [95%, 86%] |
Each field means:
- Accepted / generated: tokens the target confirmed, over tokens the draft proposed.
- Acceptance length: average number of drafted tokens accepted per verification pass. A value of 1.45 / 2 means that, on average, 1.45 of the 2 drafted tokens survive verification.
- Per-position: acceptance rate at each draft position, conditional on all earlier positions accepting. Later positions are reached less often, because a rejection at any earlier position ends the accepted prefix.
Low acceptance rates (below roughly 50%) usually mean the draft doesn't
match the target well. Try a smaller --num-speculative-tokens or a
better-matched draft checkpoint.
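To track these stats over time, you could pull the fields out of each log line. The parser below is a sketch that assumes the log line keeps exactly the format shown above. The names `parse_acceptance_line` and `is_low_acceptance` are hypothetical, and the 0.5 threshold only mirrors the rough 50% guideline.

```python
import re
from typing import Any, Dict

# Pattern written against the sample log line in this section; any change
# to the scheduler's log format would require updating it.
LOG_PATTERN = re.compile(
    r"Draft Tokens: (?P<accepted>\d+)/(?P<drafted>\d+) "
    r"\((?P<pct>[\d.]+)%\) accepted, "
    r"Acceptance Len: (?P<acc_len>[\d.]+) / (?P<max_len>\d+) toks, "
    r"Per-Pos: \[(?P<per_pos>[^\]]*)\]"
)


def parse_acceptance_line(line: str) -> Dict[str, Any]:
    """Extract the acceptance stats from one scheduler log line."""
    match = LOG_PATTERN.search(line)
    if match is None:
        raise ValueError("line does not contain acceptance stats")

    accepted = int(match["accepted"])
    drafted = int(match["drafted"])
    per_position = [
        float(part.strip().rstrip("%")) / 100
        for part in match["per_pos"].split(",")
        if part.strip()
    ]
    return {
        "accepted": accepted,
        "drafted": drafted,
        "acceptance_rate": accepted / drafted if drafted else 0.0,
        "acceptance_len": float(match["acc_len"]),
        "num_speculative_tokens": int(match["max_len"]),
        "per_position": per_position,
    }


def is_low_acceptance(stats: Dict[str, Any], threshold: float = 0.5) -> bool:
    """Flag batches whose overall acceptance rate falls below the threshold."""
    return stats["acceptance_rate"] < threshold


if __name__ == "__main__":
    sample = (
        "Draft Tokens: 145/160 (90.62%) accepted, "
        "Acceptance Len: 1.45 / 2 toks, Per-Pos: [95%, 86%] |"
    )
    stats = parse_acceptance_line(sample)
    print(stats, is_low_acceptance(stats))
```

The rate is recomputed from the raw counts rather than read from the printed percentage, which avoids rounding drift when you aggregate many batches.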
Tune speculative decoding
The following flags control how MAX drafts tokens and how verification decides to accept them. The first three map to fields on SpeculativeConfig when you configure a pipeline programmatically. The last two live on the draft model's config and on KVCacheConfig.
- --num-speculative-tokens: number of tokens the draft proposes per step. Default is 2. Larger values raise peak speedup but hurt acceptance at later positions.
- --rejection-sampling-strategy: how the target verifies drafted tokens. Accepts greedy, residual, typical-acceptance, or logit-comparison. When unset, EAGLE and MTP default to typical-acceptance; standalone defaults to residual.
- --synthetic-acceptance-rate: benchmarking-only knob that accepts each drafted token with a calibrated probability, ignoring real logits. Use this to model hypothetical speedups without changing the draft.
- --draft-devices: device list for the draft model. Useful when you want the draft and target on different GPUs.
- --device-memory-utilization: fraction of device memory MAX may use. Speculative decoding allocates KV cache for both the target and the draft, so leave more headroom than you would for single-model serving.
For EAGLE and MTP, MAX auto-enables the overlap scheduler and device graph capture. Both reduce per-step latency and need no additional flags.
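The trade-off behind --num-speculative-tokens can be estimated with a simple model. The sketch below assumes every drafted token is accepted independently with the same probability `alpha`, which is a simplification: real per-position rates differ. The function name `expected_tokens_per_step` is hypothetical. Under that assumption, a step with K drafts emits 1 + alpha + alpha^2 + ... + alpha^K tokens on average.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step.

    alpha: assumed per-token acceptance probability (same at every position).
    k:     number of drafted tokens per step.

    Position i is reached only if the i earlier drafts were all accepted,
    which happens with probability alpha ** i. The i = 0 term is the token
    the target always contributes, so the result is never below 1.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    return sum(alpha ** i for i in range(k + 1))


if __name__ == "__main__":
    for k in (1, 2, 4, 8):
        print(k, round(expected_tokens_per_step(0.8, k), 3))
```

Each extra draft position adds only alpha^K expected tokens while still costing draft compute, so the gain flattens quickly when acceptance is low. The estimate counts tokens per step only; it ignores the draft's own latency.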
Compatibility and limits
These restrictions apply when speculative decoding is enabled:
- --max-num-steps must be 1. Speculative-decoding pipelines manage multi-step KV allocation internally, and any other value raises a configuration error.
- --enable-echo isn't supported alongside speculative decoding.
- Structured output (JSON-schema-constrained generation) isn't supported alongside speculative decoding.
- The target and draft must share the same max_length.
- Repetition, frequency, and presence penalties are disabled when a separate draft model is configured (--draft-model-path). That covers EAGLE and standalone. MTP keeps the draft head inside the target checkpoint, so it supports penalties.
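If you generate launch configurations from scripts, a pre-flight check can catch these conflicts before the server starts. The sketch below encodes the restrictions listed above. The `ServeSettings` dataclass and its field names are hypothetical stand-ins, not MAX's configuration objects, and MAX's own validation may behave differently.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ServeSettings:
    """Hypothetical container for the settings this section mentions."""

    max_num_steps: int = 1
    enable_echo: bool = False
    structured_output: bool = False
    target_max_length: Optional[int] = None
    draft_max_length: Optional[int] = None
    draft_model_path: Optional[str] = None
    uses_penalties: bool = False


def check_speculative_limits(s: ServeSettings) -> List[str]:
    """Return one message per restriction the settings would violate."""
    problems: List[str] = []
    if s.max_num_steps != 1:
        problems.append("--max-num-steps must be 1")
    if s.enable_echo:
        problems.append("--enable-echo isn't supported")
    if s.structured_output:
        problems.append("structured output isn't supported")
    if (
        s.target_max_length is not None
        and s.draft_max_length is not None
        and s.target_max_length != s.draft_max_length
    ):
        problems.append("target and draft must share the same max_length")
    if s.draft_model_path is not None and s.uses_penalties:
        # Applies to EAGLE and standalone; MTP has no separate draft path.
        problems.append("penalties are disabled with a separate draft model")
    return problems


if __name__ == "__main__":
    settings = ServeSettings(max_num_steps=4, enable_echo=True)
    for problem in check_speculative_limits(settings):
        print(problem)
```

Returning a list instead of raising on the first violation lets a script report every conflict in one pass.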
Next steps
You can combine speculative decoding with prefix caching and with disaggregated inference. The following topics go deeper on performance and deployment: