Python class
InferenceSession
class max.engine.InferenceSession(devices=(), num_threads=None, *, custom_extensions=None)
Bases: object
Manages an inference session in which you can load and run models.
You need an InferenceSession instance to load a model as a
Model object. For example:
session = engine.InferenceSession(devices=[CPU()])
model = session.load(model_path)
For workflows that need to separate compilation from weight binding,
use compile() followed by init() or init_all().
For example:
session = engine.InferenceSession(devices=[CPU()])
compiled = session.compile(model_path)
model = session.init(compiled)
Parameters:
- devices (Iterable[Device]): A list of devices on which to run inference. The host CPU is always included automatically.
- num_threads (int | None): The number of execution threads. Defaults to None, which lets the runtime choose automatically.
- custom_extensions (CustomExtensionsType | None): The extensions to load for the model. Supports paths to a .mojoc/.mojopkg custom ops library or a .mojo source file.
compile()
compile(model, *, custom_extensions=None)
Compiles a model without binding weights or device memory.
Use this when you want to separate compilation from initialization, for
example to populate a compile cache ahead of time, including in
cross-compilation scenarios where the target device may not be
attached. The returned CompiledModel requires initialization
before execution. Pass it to init() or init_all() to
produce an executable Model.
Parameters:
- model (str | Path | Module | Graph): A Graph instance, a max.graph.Module containing one or more mo.graph ops, or the path to a saved model file (for example, a .mef file).
- custom_extensions (CustomExtensionsType | None): The extensions to load for the model. Supports paths to .mojopkg custom ops.
Returns:
A CompiledModel artifact ready to be initialized.
Raises:
RuntimeError: If the path provided is invalid or compilation fails.
Return type:
CompiledModel
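The cache-warming use mentioned above can be sketched as a small helper that compiles several models up front and records failures instead of stopping at the first one. This is a minimal sketch, not part of the MAX API: precompile and model_paths are hypothetical names, and session is assumed to be an already constructed InferenceSession. It relies only on the documented behavior that compile() raises RuntimeError for an invalid path or a failed compilation.

```python
def precompile(session, model_paths):
    """Compile each model path, collecting results and failures separately.

    Hypothetical helper: `session` is assumed to be an InferenceSession.
    """
    compiled = {}
    failures = {}
    for path in model_paths:
        try:
            # compile() does not bind weights or device memory.
            compiled[path] = session.compile(path)
        except RuntimeError as err:
            # Documented failure mode: invalid path or compilation failure.
            failures[path] = str(err)
    return compiled, failures
```

Each value in the first dictionary would still need init() or init_all() before it can execute.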
debug
debug: DebugConfig = <max.engine.DebugConfig object>
devices
The devices available to the session, including the host CPU.
enable_per_tensor_fp8_quantize()
enable_per_tensor_fp8_quantize(mode)
Enables per-tensor FP8 quantization.
Parameters:
mode (str): The enable/disable flag. Accepts "false", "off", "no", or "0" to disable. Any other value enables per-tensor FP8 quantization.
Return type:
None
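Because the flag is a string rather than a bool, it can help to spell out how a given value would be read. The following is a rough illustration of the rule stated above, not the runtime's actual parser: flag_enables is a hypothetical helper, and it assumes the four disabling strings are matched exactly and case-sensitively, which the description does not confirm.

```python
# The four values documented as disabling the feature.
_DISABLE_VALUES = frozenset({"false", "off", "no", "0"})


def flag_enables(mode: str) -> bool:
    """Return True if `mode` would enable the feature under the stated rule.

    Assumption: exact, case-sensitive comparison against the documented values.
    """
    return mode not in _DISABLE_VALUES
```

Under this reading, any string outside the four listed values counts as enabling, so passing one of the documented disabling values explicitly is the safest way to turn the feature off.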
gpu_profiling()
gpu_profiling(mode)
Enables GPU profiling instrumentation for the session.
Works with NVIDIA Nsight Systems and Nsight Compute. When enabled, the runtime adds CUDA driver calls and NVTX markers that allow profiling tools to correlate GPU kernel executions with host-side code.
For example, to enable detailed profiling for Nsight Systems
analysis, call gpu_profiling() before load():
from max.engine import InferenceSession
from max.driver import Accelerator
session = InferenceSession(devices=[Accelerator()])
session.gpu_profiling("detailed")
model = session.load(my_graph)
Then run it with nsys:
nsys profile --trace=cuda,nvtx python example.py
Instead of calling gpu_profiling() in code, you can set the
MODULAR_ENABLE_PROFILING environment variable when you call
nsys profile:
MODULAR_ENABLE_PROFILING=detailed nsys profile --trace=cuda,nvtx python script.py
Be aware that gpu_profiling() overrides the
MODULAR_ENABLE_PROFILING environment variable if also used.
Learn more in GPU profiling with Nsight Systems.
Parameters:
mode (Literal['off', 'on', 'detailed']): The profiling mode to set. One of:
- off: Disable profiling (default).
- on: Enable basic profiling with NVTX markers for kernel correlation.
- detailed: Enable detailed profiling with additional Python-level NVTX markers.
Return type:
None
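Since gpu_profiling() takes precedence over MODULAR_ENABLE_PROFILING, one way to keep the environment variable usable is to call the method only when a mode is requested explicitly. The sketch below illustrates that pattern; apply_profiling_mode is a hypothetical helper, session is assumed to be an InferenceSession, and the value returned for the environment case is only what the caller can expect to be in effect, not something read back from the runtime.

```python
import os

# The three modes listed for gpu_profiling().
VALID_MODES = ("off", "on", "detailed")


def apply_profiling_mode(session, mode=None):
    """Set the profiling mode only when one is given explicitly.

    With mode=None the session is left untouched, so a value supplied
    through MODULAR_ENABLE_PROFILING is not overridden.
    """
    if mode is None:
        # "off" is the documented default when nothing is set.
        return os.environ.get("MODULAR_ENABLE_PROFILING", "off")
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    session.gpu_profiling(mode)
    return mode
```

As with the example above, the call would belong before load() so that the setting is in place when the model is compiled.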
init()
init(compiled, *, weights_registry=None)
Initializes a compiled model with weights for execution.
Use this to complete the second half of a compile()/init()
pair when the artifact contains a single model. For artifacts with
more than one model, use init_all().
Parameters:
- compiled (CompiledModel): The compiled artifact returned by compile().
- weights_registry (Mapping[str, DLPackArray] | None): A mapping from model weight names to their values. The values should be DLPack arrays. If an array is a read-only NumPy array, you must ensure that its lifetime extends beyond the lifetime of the model. Although weights_registry is technically optional, you'll always need to load weights in practice.
Returns:
The initialized Model, ready to execute.
Return type:
Model
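The lifetime requirement on weight arrays is easy to violate when the registry is built inside a short-lived function. One possible way to handle it, sketched below under stated assumptions, is to return the registry together with the model so the caller holds both for the same duration. init_with_weights is a hypothetical helper, session is assumed to be an InferenceSession, and weights is any mapping of weight names to DLPack-compatible arrays.

```python
def init_with_weights(session, compiled, weights):
    """Bind weights to a single-model compiled artifact.

    Returns (model, registry). Keeping `registry` referenced for as long as
    `model` is in use keeps the underlying arrays alive.
    """
    # Copy into a plain dict so later changes to the caller's mapping
    # do not alter what was handed to init().
    registry = dict(weights)
    model = session.init(compiled, weights_registry=registry)
    return model, registry
```

Storing the returned pair on the same long-lived object (for example, a serving class) is a simple way to tie the two lifetimes together.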
init_all()
init_all(compiled, *, weights_registry=None)
Initializes all models in a compiled artifact for execution.
Use this to complete the second half of a
compile()/init_all() pair. Returns one Model per
top-level graph in the artifact,
keyed by sym_name.
Parameters:
- compiled (CompiledModel): The compiled artifact returned by compile().
- weights_registry (Mapping[str, DLPackArray] | None): A mapping from model weight names to their values. See init() for details.
Returns:
A mapping from each model's sym_name to its initialized Model, ready to execute.
Return type:
dict[str, Model]
load()
load(model, *, custom_extensions=None, weights_registry=None)
Loads a trained model and compiles it for inference.
Parameters:
- model (str | Path | Graph): A Graph instance, or the path to a saved model file (for example, a .mef file).
- custom_extensions (CustomExtensionsType | None): The extensions to load for the model. Supports paths to .mojoc/.mojopkg custom ops.
- weights_registry (Mapping[str, DLPackArray] | None): A mapping from model weight names to their values. The values should be DLPack arrays. If an array is a read-only NumPy array, you must ensure that its lifetime extends beyond the lifetime of the model. Although weights_registry is technically optional, you'll always need to load weights in practice.
Returns:
The loaded model, compiled and ready to execute.
Raises:
RuntimeError: If the path provided is invalid.
Return type:
Model
load_all()
load_all(model, *, custom_extensions=None, weights_registry=None)
Loads multiple models and compiles them for inference.
A compiled .mef artifact may contain more than one model (for
example, a vision encoder and a language model compiled together).
This method returns one Model per model encoded in the
artifact, keyed by the sym_name of the corresponding mo.graph
op (preserved through MEF serialization). For single-model
artifacts, the returned dict has exactly one entry.
Parameters:
- model (str | Path | Module | Graph): A max.graph.Module containing one or more mo.graph ops, the path to a saved multi-model file (for example, a .mef file), or a single Graph.
- custom_extensions (CustomExtensionsType | None): The extensions to load for the model. Supports paths to .mojoc/.mojopkg custom ops.
- weights_registry (Mapping[str, DLPackArray] | None): A mapping from model weight names to their values. The values should be DLPack arrays. If an array is a read-only NumPy array, you must ensure that its lifetime extends beyond the lifetime of the model. Although weights_registry is technically optional, you'll always need to load weights in practice.
Returns:
A mapping from each model's sym_name to its loaded Model, ready to execute.
Raises:
RuntimeError: If the path provided is invalid.
Return type:
dict[str, Model]
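Because the result is keyed by sym_name, callers typically look up the models they expect by name. The sketch below shows one way to do that with an explicit check for missing entries. load_named_models is a hypothetical helper, session is assumed to be an InferenceSession, and names such as "vision_encoder" in the usage note are placeholders, not names defined by MAX.

```python
def load_named_models(session, model, weights, required_names):
    """Load a multi-model artifact and return the requested models in order.

    Raises KeyError if any requested sym_name is absent from the artifact.
    """
    models = session.load_all(model, weights_registry=weights)
    missing = [name for name in required_names if name not in models]
    if missing:
        raise KeyError(
            f"no model with sym_name in {missing}; found {sorted(models)}"
        )
    return [models[name] for name in required_names]
```

A call might look like `encoder, lm = load_named_models(session, path, weights, ["vision_encoder", "language_model"])`, assuming those are the sym_name values in the artifact.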
set_debug_print_options()
set_debug_print_options(style=PrintStyle.COMPACT, precision=6, output_directory=None)
Sets the debug print options.
Affects debug printing across all model execution using the same
InferenceSession. See print().
Tensors saved with BINARY can be loaded using
max.driver.Buffer.mmap(), but you'll have to provide the
expected dtype and shape. Tensors saved with BINARY_MAX_CHECKPOINT
are saved with the shape and dtype information and can be loaded with
max.driver.buffer.load_max_buffer().
Parameters:
- style (str | PrintStyle): The print style for tensor values. One of COMPACT, FULL, BINARY, BINARY_MAX_CHECKPOINT, or NONE.
- precision (int): The digits of precision in the output, used when style is FULL.
- output_directory (str | Path | None): The directory to store output tensors, used when style is BINARY or BINARY_MAX_CHECKPOINT.
Raises:
- TypeError: If style is not a valid PrintStyle, if precision is not an int when style is FULL, or if output_directory is not a str or Path.
- ValueError: If output_directory is empty when style is BINARY or BINARY_MAX_CHECKPOINT.
Return type:
None
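The two binary styles differ in whether shape and dtype travel with the saved tensor, so a helper that chooses between them can make the trade-off explicit. This is a minimal sketch under stated assumptions: enable_tensor_dumps is a hypothetical helper, session is assumed to be an InferenceSession, and it assumes that the str form of style accepts the style's name (the signature allows str | PrintStyle, but the accepted strings are not listed here).

```python
from pathlib import Path


def enable_tensor_dumps(session, output_directory, with_metadata=True):
    """Route debug-printed tensors to files under `output_directory`.

    with_metadata=True picks BINARY_MAX_CHECKPOINT (shape and dtype saved);
    with_metadata=False picks BINARY (caller must supply shape and dtype
    when reading the files back).
    """
    out_dir = Path(output_directory)
    # Create the directory up front so the path passed on is never empty or missing.
    out_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: the style may be given by name as a string.
    style = "BINARY_MAX_CHECKPOINT" if with_metadata else "BINARY"
    session.set_debug_print_options(style=style, output_directory=out_dir)
    return out_dir
```

The setting applies to every model executed through the same session, so it is usually worth resetting the style once debugging is finished.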
set_mojo_assert_level()
set_mojo_assert_level(level)
Sets which Mojo asserts are kept in the compiled model.
Parameters:
level (AssertLevel): The assert level to use. One of AssertLevel.NONE, AssertLevel.WARN, AssertLevel.SAFE, or AssertLevel.ALL.
Return type:
None
set_mojo_log_level()
set_mojo_log_level(level)
Sets the verbosity of Mojo logging in the compiled model.
set_split_k_reduction_precision()
set_split_k_reduction_precision(precision)
Sets the accumulation precision for split-k reductions in large matmuls.
use_fi_topk_kernel()
use_fi_topk_kernel(mode)
Enables the fused-inference top-k kernel.
Parameters:
mode (str): The enable/disable flag. Accepts "false", "off", "no", or "0" to disable. Any other value enables the fused-inference top-k kernel.
Return type:
None
use_old_top_k_kernel()
use_old_top_k_kernel(mode)
Falls back to the previous top-k kernel implementation.
By default, the session uses a newer top-k kernel. Use this fallback only if you encounter correctness or performance issues with the default kernel.
Parameters:
mode (str): The enable/disable flag. Accepts "false", "off", "no", or "0" to disable. Any other value enables the old top-k kernel.
Return type:
None