
Speculative decoding

Speculative decoding accelerates LLM token generation without changing the model's outputs. A smaller, faster draft step proposes several candidate tokens, and the larger target model verifies them in one forward pass. MAX accepts the prefix the target agrees with and resamples at the first disagreement, so quality matches running the target alone.
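The accept-then-correct rule can be sketched for the greedy case as follows. This is a minimal illustration, not MAX's implementation: `verify_greedy` is a hypothetical helper, and it assumes the target's greedy choice at each of the K+1 positions is already available from a single forward pass over the drafted tokens.

```python
def verify_greedy(draft_tokens: list[int], target_tokens: list[int]) -> list[int]:
    """Return the tokens emitted by one verification step (greedy case).

    draft_tokens:  the K tokens proposed by the draft.
    target_tokens: the target's greedy choice at each of the K+1 positions,
                   where position i is conditioned on draft_tokens[:i].
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have exactly K+1 entries")

    emitted: list[int] = []
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted != wanted:
            # First disagreement: keep the target's token and stop.
            # Later target_tokens were conditioned on a rejected prefix.
            emitted.append(wanted)
            return emitted
        emitted.append(drafted)

    # Every draft was accepted, so the target's extra position is a bonus token.
    emitted.append(target_tokens[-1])
    return emitted


all_accepted = verify_greedy([1, 2, 3], [1, 2, 3, 4])    # [1, 2, 3, 4]
early_mismatch = verify_greedy([1, 9, 3], [1, 2, 3, 4])  # [1, 2]
```

In both cases the emitted tokens are exactly what the target would have produced on its own, which is why output quality is unchanged.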

The speedup comes from batching verification across K candidate positions. When the target accepts all K drafts, you get K+1 tokens per step instead of one, converting a memory-bandwidth-bound workload into one that better uses available compute.
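To get a rough feel for how acceptance translates into tokens per step, the sketch below uses a common simplification: every drafted token is accepted independently with the same probability `alpha`. Real acceptance varies by position and prompt, so treat the result as an estimate. The function name is illustrative.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step.

    Assumes each of the k drafted tokens is accepted independently with
    probability alpha. The step always emits at least one token (the
    target's correction or bonus token), and at most k + 1.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


estimate = expected_tokens_per_step(0.5, 2)  # 1.75
```

With `alpha = 1.0` the function returns K+1, matching the best case described above; with `alpha = 0.0` it returns 1, the same as decoding without a draft.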

Supported methods

MAX supports common speculative-decoding methods like EAGLE, EAGLE3, and MTP. For experimenting with arbitrary target/draft pairs, use standalone.

| Method | Draft source | Supported targets and hardware |
| --- | --- | --- |
| eagle / eagle3 | A trained EAGLE draft that shares the target's embedding and lm_head | Llama 3 (1 GPU), Kimi K2.5 (8× B200) |
| mtp | A native multi-token prediction head inside the target checkpoint | DeepSeek V3 and derivatives (8× B200) |
| standalone | A separate small model you pass with --draft-model-path | Any target with a same-family draft, on any GPU the pair fits on |

Serve with speculative decoding

The example below starts max serve with a target and an EAGLE draft, then sends a chat-completion request to the local endpoint from the OpenAI Python client.

To call the endpoint, install the OpenAI Python client:

pixi add openai

Serve Llama 3.1 8B Instruct with a pretrained EAGLE checkpoint as the draft:

max serve \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --speculative-method eagle \
  --draft-model-path atomicapple0/EAGLE-LLaMA3.1-Instruct-8B \
  --num-speculative-tokens 2 \
  --devices gpu

We use atomicapple0/EAGLE-LLaMA3.1-Instruct-8B because it ships safetensors weights. For the full list of weight formats MAX supports, see WeightsFormat.

Once the server prints Server ready on http://0.0.0.0:8000, send a chat-completion request:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are the benefits of speculative decoding?"},
    ],
    max_tokens=500,
)

print(response.choices[0].message.content)

Read the startup architecture name

For EAGLE and MTP, MAX reports a unified architecture at startup, for example UnifiedEagleLlama3ForCausalLM or UnifiedMTPDeepseekV3ForCausalLM. That's expected: MAX compiles the target, draft, and verifier into a single graph. standalone runs the target and draft as separate models and reports the target's normal architecture.

Monitor acceptance rates

Once you send traffic, the scheduler logs per-batch acceptance stats in this format:

Draft Tokens: 145/160 (90.62%) accepted, Acceptance Len: 1.45 / 2 toks, Per-Pos: [95%, 86%] |

Each field means:

  • Accepted / generated: tokens the target confirmed, over tokens the draft proposed.
  • Acceptance length: average number of drafted tokens accepted per verification pass. A value of 1.45 / 2 means on average 1.45 of the 2 drafted tokens survive verification.
  • Per-position: acceptance rate at each draft position, conditional on all earlier positions accepting. Acceptance typically drops at later positions.

Low acceptance rates (below roughly 50%) usually mean the draft doesn't match the target well. Try a smaller --num-speculative-tokens or a better-matched draft checkpoint.

Tune speculative decoding

The following flags control how MAX drafts tokens and how verification decides to accept them. The first three map to fields on SpeculativeConfig when you configure a pipeline programmatically. The last two live on the draft model's config and on KVCacheConfig.

  • --num-speculative-tokens: number of tokens the draft proposes per step. Default is 2. Larger values raise peak speedup but hurt acceptance at later positions.
  • --rejection-sampling-strategy: how the target verifies drafted tokens. Accepts greedy, residual, typical-acceptance, or logit-comparison. When unset, EAGLE and MTP default to typical-acceptance; standalone defaults to residual.
  • --synthetic-acceptance-rate: benchmarking-only knob that accepts each drafted token with a calibrated probability, ignoring real logits. Use this to model hypothetical speedups without changing the draft.
  • --draft-devices: device list for the draft model. Useful when you want the draft and target on different GPUs.
  • --device-memory-utilization: fraction of device memory MAX may use. Speculative decoding allocates KV cache for both the target and the draft, so leave more headroom than you would for single-model serving.
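For intuition about what a residual-style strategy does, the sketch below implements the standard speculative-sampling acceptance rule from the literature. Whether MAX's residual strategy follows this rule exactly is an assumption; the function name and the list-based probability distributions are illustrative only.

```python
import random


def accept_or_resample(
    token: int,
    p_target: list[float],
    q_draft: list[float],
    rng: random.Random,
) -> tuple[int, bool]:
    """Apply the standard speculative-sampling rule to one drafted token.

    Accept `token` with probability min(1, p/q). On rejection, resample from
    the normalized residual max(p - q, 0), which keeps the overall output
    distribution equal to the target's.
    """
    p, q = p_target[token], q_draft[token]
    if q > 0.0 and rng.random() < min(1.0, p / q):
        return token, True

    residual = [max(pt - qd, 0.0) for pt, qd in zip(p_target, q_draft)]
    if sum(residual) == 0.0:
        # Degenerate case: nothing left in the residual, fall back to the target.
        residual = list(p_target)
    resampled = rng.choices(range(len(residual)), weights=residual)[0]
    return resampled, False


rng = random.Random(0)
# Draft and target agree exactly, so the drafted token is always accepted.
same = accept_or_resample(0, [0.5, 0.5], [0.5, 0.5], rng)
# The target gives the drafted token zero probability, so it is always rejected.
opposite = accept_or_resample(0, [0.0, 1.0], [1.0, 0.0], rng)
```

A greedy strategy, by contrast, only compares the drafted token with the target's top choice, which is cheaper but applies only to deterministic decoding.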

For EAGLE and MTP, MAX auto-enables the overlap scheduler and device graph capture. Both reduce per-step latency and need no additional flags.

Compatibility and limits

These restrictions apply when speculative decoding is enabled:

  • --max-num-steps must be 1. Speculative-decoding pipelines manage multi-step KV allocation internally, and any other value raises a configuration error.
  • --enable-echo isn't supported alongside speculative decoding.
  • Structured output (JSON-schema-constrained generation) isn't supported alongside speculative decoding.
  • The target and draft must share the same max_length.
  • Repetition, frequency, and presence penalties are disabled when a separate draft model is configured (--draft-model-path). That covers EAGLE and standalone. MTP keeps the draft head inside the target checkpoint, so it supports penalties.
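The restrictions above can be checked before launching a server. The sketch below is a hypothetical pre-flight helper, not part of MAX: the dictionary keys are made-up names that mirror the flags and settings in the list, and the penalty defaults (0.0 for frequency and presence, 1.0 for repetition) are assumed neutral values.

```python
def check_spec_decoding_config(cfg: dict) -> list[str]:
    """Return human-readable problems for a speculative-decoding setup.

    `cfg` uses hypothetical keys that mirror the documented restrictions.
    An empty list means none of the listed restrictions is violated.
    """
    problems: list[str] = []

    if cfg.get("max_num_steps", 1) != 1:
        problems.append("--max-num-steps must be 1")
    if cfg.get("enable_echo", False):
        problems.append("--enable-echo isn't supported")
    if cfg.get("structured_output", False):
        problems.append("structured output isn't supported")

    target_len = cfg.get("target_max_length")
    draft_len = cfg.get("draft_max_length")
    if target_len is not None and draft_len is not None and target_len != draft_len:
        problems.append("target and draft must share the same max_length")

    # Penalties are disabled only when a separate draft model is configured.
    uses_penalties = (
        cfg.get("frequency_penalty", 0.0) != 0.0
        or cfg.get("presence_penalty", 0.0) != 0.0
        or cfg.get("repetition_penalty", 1.0) != 1.0
    )
    if cfg.get("draft_model_path") and uses_penalties:
        problems.append("penalties are disabled with a separate draft model")

    return problems


issues = check_spec_decoding_config(
    {
        "max_num_steps": 4,
        "draft_model_path": "atomicapple0/EAGLE-LLaMA3.1-Instruct-8B",
        "frequency_penalty": 0.5,
    }
)
```

Here `issues` contains two entries: one for `--max-num-steps` and one for the penalty setting.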

Next steps

You can combine speculative decoding with prefix caching and with disaggregated inference. The following topics go deeper on performance and deployment:
