Python module
engine
The APIs in this module allow you to run inference with MAX Engine—a graph compiler and runtime that accelerates your AI models on a wide variety of hardware.
InferenceSession
class max.engine.InferenceSession(devices, num_threads=None, *, custom_extensions=None)
Manages an inference session in which you can load and run models.
You need an instance of this to load a model as a Model object.
For example:
session = engine.InferenceSession(devices=[CPU()])
model_path = Path('bert-base-uncased')
model = session.load(model_path)

Construct an inference session.
Parameters:

- devices (Iterable[Device]) – A list of devices on which to run inference. Defaults to the host CPU only.
- num_threads (int | None) – Number of threads to use for the inference session. Defaults to the number of physical cores on your machine.
- custom_extensions (CustomExtensionsType | None) – The extensions to load for the model. Supports paths to a .mojopkg custom ops library or a .mojo source file.
devices
A list of available devices.
gpu_profiling()
gpu_profiling(mode)
Enables GPU profiling instrumentation for the session.
This enables GPU profiling instrumentation that works with NVIDIA Nsight Systems and Nsight Compute. When enabled, the runtime adds CUDA driver calls and NVTX markers that allow profiling tools to correlate GPU kernel executions with host-side code.
For example, to enable detailed profiling for Nsight Systems analysis,
call gpu_profiling() before load():
from max.engine import InferenceSession, GPUProfilingMode
from max.driver import Accelerator
session = InferenceSession(devices=[Accelerator()])
session.gpu_profiling(GPUProfilingMode.DETAILED)
model = session.load(my_graph)

Then run it with nsys:

nsys profile --trace=cuda,nvtx python example.py

Or, instead of calling session.gpu_profiling() in the code, you can set the MODULAR_ENABLE_PROFILING environment variable when you call nsys profile:

MODULAR_ENABLE_PROFILING=detailed nsys profile --trace=cuda,nvtx python script.py

Beware that gpu_profiling() overrides the MODULAR_ENABLE_PROFILING environment variable if both are used.
Parameters:

- mode (GPUProfilingMode) – The profiling mode to set. One of:
  - GPUProfilingMode.OFF: Disable profiling (default).
  - GPUProfilingMode.ON: Enable basic profiling with NVTX markers for kernel correlation.
  - GPUProfilingMode.DETAILED: Enable detailed profiling with additional Python-level NVTX markers.

Return type:

None
load()
load(model, *, custom_extensions=None, weights_registry=None)
Loads a trained model and compiles it for inference.
Parameters:

- model (str | Path | Graph) – A path to a saved model, or a Graph object to compile.
- custom_extensions (CustomExtensionsType | None) – The extensions to load for the model. Supports paths to .mojopkg custom ops libraries.
- weights_registry (Mapping[str, DLPackArray] | None) – A mapping from model weight names to their values. The values are currently expected to be DLPack arrays. If an array is a read-only NumPy array, you must ensure that its lifetime extends beyond the lifetime of the model.

Returns:

The loaded model, compiled and ready to execute.

Raises:

RuntimeError – If the path provided is invalid.

Return type:

Model
set_mojo_assert_level()
set_mojo_assert_level(level)
Sets which mojo asserts are kept in the compiled model.
Parameters:

level (AssertLevel)

Return type:

None
set_mojo_log_level()
set_mojo_log_level(level)
Sets the verbosity of mojo logging in the compiled model.
set_split_k_reduction_precision()
set_split_k_reduction_precision(precision)
Sets the accumulation precision for split k reductions in large matmuls.
Parameters:

precision (str | SplitKReductionPrecision)

Return type:

None
use_old_top_k_kernel()
use_old_top_k_kernel(mode)
Enables the old top-k kernel.
By default, the new top-k kernel is used, consistent with max/kernels/src/nn/topk.mojo.

Parameters:

mode (str) – String to enable or disable the old kernel. Accepts "false", "off", "no", or "0" to disable; any other value enables it.

Return type:

None
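The string handling described above can be sketched in plain Python. This helper is illustrative only, not part of the max.engine API:

```python
# Illustrative helper mirroring the documented string handling for
# use_old_top_k_kernel: the strings "false", "off", "no", and "0"
# disable the old kernel; any other value enables it.
# This is NOT part of the max.engine API.
def old_top_k_enabled(mode: str) -> bool:
    """Return True if `mode` enables the old top-k kernel."""
    return mode not in {"false", "off", "no", "0"}
```

Note that, per the documented behavior, any value outside the four disable strings (even an empty string) counts as enabling.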
Model
class max.engine.Model
A loaded model that you can execute.
Do not instantiate this class directly. Instead, create it with
InferenceSession.
__call__()
__call__(*args, **kwargs)
Call self as a function.
capture()
capture(*inputs)
Capture execution into a device graph keyed by input shapes/dtypes.
Capture is best-effort and model-dependent. If the model issues capture-unsafe operations (for example, host-device synchronization), graph capture may fail. Callers should choose capture-safe execution paths.
debug_verify_replay()
debug_verify_replay(*inputs)
Verify inputs match the captured graph’s baseline trace.
execute()
execute(*args)
Executes the model with the given inputs and returns the outputs.
input_metadata
property input_metadata
Metadata about the model’s input tensors, as a list of
TensorSpec objects.
For example, you can print the input tensor names, shapes, and dtypes:
for tensor in model.input_metadata:
    print(f'name: {tensor.name}, shape: {tensor.shape}, dtype: {tensor.dtype}')

output_metadata
property output_metadata
Metadata about the model’s output tensors, as a list of
TensorSpec objects.
For example, you can print the output tensor names, shapes, and dtypes:
for tensor in model.output_metadata:
    print(f'name: {tensor.name}, shape: {tensor.shape}, dtype: {tensor.dtype}')

replay()
replay(*inputs)
Replay the captured device graph for these inputs.
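The "keyed by input shapes/dtypes" behavior of capture() and replay() can be illustrated with a toy cache. This is a plain-Python sketch of the keying idea, not MAX internals:

```python
# Toy illustration of keying captured graphs by input shapes/dtypes,
# as capture()/replay() are described above. NOT MAX internals: inputs
# here are plain dicts with "shape" and "dtype" fields.
from typing import Any


class CaptureCache:
    def __init__(self) -> None:
        self._graphs: dict[tuple, str] = {}

    def _key(self, *inputs: Any) -> tuple:
        # Each input is characterized by its shape and dtype only.
        return tuple((tuple(i["shape"]), i["dtype"]) for i in inputs)

    def capture(self, *inputs: Any) -> None:
        # In MAX this would record a device graph; here we store a marker.
        self._graphs[self._key(*inputs)] = "captured-graph"

    def has_graph(self, *inputs: Any) -> bool:
        # replay() would only succeed for shapes/dtypes seen at capture time.
        return self._key(*inputs) in self._graphs
```

The point of the sketch: inputs with the same shapes and dtypes hit the same captured graph, while a different batch size or dtype misses the cache.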
GPUProfilingMode
class max.engine.GPUProfilingMode(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
The supported modes for GPU profiling.
GPU profiling modes control the level of instrumentation when profiling MAX applications with NVIDIA Nsight Systems or Nsight Compute. Higher levels provide more detail but may introduce additional overhead.
DETAILED
DETAILED = 'detailed'
Enable detailed GPU profiling with additional NVTX markers from Python code. This mode provides the most visibility into which Python operations correspond to which GPU kernels, but has the highest overhead.
OFF
OFF = 'off'
Disable GPU profiling instrumentation. This is the default mode and incurs no profiling overhead.
ON
ON = 'on'
Enable basic GPU profiling. Adds CUDA driver calls and NVTX markers for correlating kernel executions with host-side code.
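Because the enum values are the strings "off", "on", and "detailed", the MODULAR_ENABLE_PROFILING environment variable maps directly onto them (e.g. MODULAR_ENABLE_PROFILING=detailed selects DETAILED). A standalone sketch of that mapping, using a local stand-in enum rather than the real max.engine class:

```python
# Stand-in enum mirroring the documented GPUProfilingMode string values.
# NOT the real max.engine class -- just an illustration of how the
# MODULAR_ENABLE_PROFILING value maps onto the enum members.
from enum import Enum


class GPUProfilingMode(Enum):
    OFF = "off"
    ON = "on"
    DETAILED = "detailed"


def mode_from_env(value: str) -> GPUProfilingMode:
    """Map an environment-variable string to a profiling mode."""
    return GPUProfilingMode(value.lower())
```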
LogLevel
class max.engine.LogLevel(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
Specifies the log level used by Mojo ops.
CRITICAL
CRITICAL = 'critical'
DEBUG
DEBUG = 'debug'
ERROR
ERROR = 'error'
INFO
INFO = 'info'
NOTSET
NOTSET = 'notset'
TRACE
TRACE = 'trace'
WARNING
WARNING = 'warning'
TensorSpec
class max.engine.TensorSpec
Defines the properties of a tensor, including its name, shape and data type.
For usage examples, see Model.input_metadata.
dtype
property dtype
The tensor's data type.
name
property name
The tensor's name.
shape
property shape
The shape of the tensor as a list of integers.
If a dimension size is unknown/dynamic (such as the batch size), its
value is None.
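Because dynamic dimensions appear as None in the shape list, downstream code often needs to detect or fill them. These helpers are illustrative only, not part of max.engine:

```python
# Illustrative helpers (NOT part of max.engine): a TensorSpec shape is a
# list of ints where unknown/dynamic dimensions are None, as described
# above.
from typing import Optional, Sequence


def is_static(shape: Sequence[Optional[int]]) -> bool:
    """Return True if every dimension of the shape is known."""
    return all(dim is not None for dim in shape)


def bind_dynamic(shape: Sequence[Optional[int]], size: int) -> list[int]:
    """Fill any dynamic (None) dimension with a concrete size."""
    return [size if dim is None else dim for dim in shape]
```

For example, a model with a dynamic batch size might report a shape of [None, 128], which bind_dynamic turns into [8, 128] for a batch of 8.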
CustomExtensionsType
max.engine.CustomExtensionsType = collections.abc.Sequence[str | pathlib.Path] | str | pathlib.Path
A PEP 604 union type: either a single path (str or pathlib.Path) or a sequence of paths.
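Code accepting a CustomExtensionsType value typically normalizes the single-path and sequence cases into one list. A hedged sketch of that normalization (the helper name is hypothetical, not a max.engine function):

```python
# Hypothetical helper (NOT part of max.engine): normalize a
# CustomExtensionsType value -- a single str/Path or a sequence of
# them -- into a list of pathlib.Path objects.
from collections.abc import Sequence
from pathlib import Path


def normalize_extensions(ext) -> list[Path]:
    # Check str/Path first: a str is itself a Sequence, so order matters.
    if isinstance(ext, (str, Path)):
        return [Path(ext)]
    if isinstance(ext, Sequence):
        return [Path(e) for e in ext]
    raise TypeError(f"unsupported extensions value: {ext!r}")
```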