Python class

InferenceSession

class max.engine.InferenceSession(devices=(), num_threads=None, *, custom_extensions=None)

Bases: object

Manages an inference session in which you can load and run models.

You need an instance of this to load a model as a Model object. For example:

from pathlib import Path
from max import engine
from max.driver import CPU
session = engine.InferenceSession(devices=[CPU()])
model_path = Path('bert-base-uncased')
model = session.load(model_path)

Construct an inference session.

Parameters:

  • devices (Iterable[Device]) – A list of devices on which to run inference. The host CPU is always included automatically.
  • num_threads (int | None) – Number of threads to use for the inference session. This defaults to the number of physical cores on your machine.
  • custom_extensions (CustomExtensionsType | None) – The extensions to load for the model. Supports paths to a .mojopkg custom ops library or a .mojo source file.
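
For example, a minimal sketch of a session pinned to a fixed thread count with a custom ops package (the .mojopkg path here is hypothetical):

from max.driver import CPU
from max.engine import InferenceSession

session = InferenceSession(
    devices=[CPU()],
    num_threads=4,
    custom_extensions="my_ops.mojopkg",  # hypothetical custom ops library
)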

debug

debug: DebugConfig

The debug configuration for this session.

devices

property devices: list[Device]

A list of available devices.
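
For example, you can inspect the devices attached to a session (a minimal sketch; the output depends on your hardware):

from max.driver import CPU
from max.engine import InferenceSession

session = InferenceSession(devices=[CPU()])
for device in session.devices:
    print(device)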

enable_per_tensor_fp8_quantize()

enable_per_tensor_fp8_quantize(mode)

Enables per-tensor FP8 quantization.

Parameters:

mode (str) – String to enable or disable the feature. Pass “false”, “off”, “no”, or “0” to disable; any other value enables it.

Return type:

None
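
For example, a minimal sketch using the string convention above, assuming session is an InferenceSession created as shown earlier:

session.enable_per_tensor_fp8_quantize("on")   # any value outside the disable list enables
session.enable_per_tensor_fp8_quantize("off")  # "false", "off", "no", or "0" disables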

gpu_profiling()

gpu_profiling(mode)

Enables GPU profiling instrumentation for the session.

The instrumentation works with NVIDIA Nsight Systems and Nsight Compute. When enabled, the runtime adds CUDA driver calls and NVTX markers that allow profiling tools to correlate GPU kernel executions with host-side code.

For example, to enable detailed profiling for Nsight Systems analysis, call gpu_profiling() before load():

from max.engine import InferenceSession
from max.driver import Accelerator

session = InferenceSession(devices=[Accelerator()])
session.gpu_profiling("detailed")
model = session.load(my_graph)

Then run it with nsys:

nsys profile --trace=cuda,nvtx python example.py

Or, instead of calling session.gpu_profiling() in the code, you can set the MODULAR_ENABLE_PROFILING environment variable when you call nsys profile:

MODULAR_ENABLE_PROFILING=detailed nsys profile --trace=cuda,nvtx python example.py

Note that gpu_profiling() overrides the MODULAR_ENABLE_PROFILING environment variable if both are used.

Parameters:

mode (Literal['off', 'on', 'detailed']) –

The profiling mode to set. One of:

  • off: Disable profiling (default).
  • on: Enable basic profiling with NVTX markers for kernel correlation.
  • detailed: Enable detailed profiling with additional Python-level NVTX markers.

Return type:

None

load()

load(model, *, custom_extensions=None, weights_registry=None)

Loads a trained model and compiles it for inference.

Parameters:

  • model (str | Path | Graph) – Path to a model, or a Graph to compile and load.
  • custom_extensions (CustomExtensionsType | None) – The extensions to load for the model. Supports paths to .mojopkg custom ops.
  • weights_registry (Mapping[str, DLPackArray] | None) – Model weight names mapped to their values. The values should be DLPack arrays. If an array is a read-only NumPy array, you must ensure that its lifetime extends beyond the lifetime of the model. Although weights_registry is technically optional, you’ll always need to load weights in practice.

Returns:

The loaded model, compiled and ready to execute.

Raises:

RuntimeError – If the path provided is invalid.

Return type:

Model
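
For example, a minimal sketch of loading a graph with an explicit weights registry. Here my_graph stands in for a max.graph.Graph you have built elsewhere, and the weight name is hypothetical:

import numpy as np
from max.driver import CPU
from max.engine import InferenceSession

session = InferenceSession(devices=[CPU()])

# "linear.weight" is a hypothetical name declared in my_graph; values must be
# DLPack-compatible arrays, such as NumPy arrays.
weights = {"linear.weight": np.ones((16, 16), dtype=np.float32)}
model = session.load(my_graph, weights_registry=weights)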

load_all()

load_all(model, *, custom_extensions=None, weights_registry=None)

Loads all trained models in a compiled artifact and compiles them for inference.

A compiled MEF artifact may contain more than one model (for example a vision encoder and a language model compiled together). This method returns one Model per model encoded in the artifact, in MEF order. For single-model artifacts the returned list has exactly one element.

Parameters:

  • model (str | Path | Graph) – Path to a model, or a Graph to compile and load.
  • custom_extensions (CustomExtensionsType | None) – The extensions to load for the model. Supports paths to .mojopkg custom ops.
  • weights_registry (Mapping[str, DLPackArray] | None) – Model weight names mapped to their values. The values should be DLPack arrays. If an array is a read-only NumPy array, you must ensure that its lifetime extends beyond the lifetime of the model. Although weights_registry is technically optional, you’ll always need to load weights in practice.

Returns:

The loaded models, compiled and ready to execute, one per model primitive encoded in the compiled artifact.

Raises:

RuntimeError – If the path provided is invalid.

Return type:

list[Model]
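
For example, a minimal sketch of unpacking a multi-model artifact. The file name and the two-model layout are assumptions for illustration:

from max.driver import CPU
from max.engine import InferenceSession

session = InferenceSession(devices=[CPU()])

# Hypothetical MEF artifact containing a vision encoder and a language model,
# returned in MEF order.
vision_encoder, language_model = session.load_all("multimodal.mef")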

set_debug_print_options()

set_debug_print_options(style=PrintStyle.COMPACT, precision=6, output_directory=None)

Sets the debug print options.

See Value.print.

This affects debug printing across all model executions that use the same InferenceSession.

Tensors saved with BINARY can be loaded using max.driver.Buffer.mmap(), but you will have to provide the expected dtype and shape.

Tensors saved with BINARY_MAX_CHECKPOINT are saved with the shape and dtype information, and can be loaded with max.driver.buffer.load_max_buffer().

Warning: Even with style set to NONE, debug print ops in the graph can prevent optimizations. If you see performance issues, try fully removing debug print ops.

Parameters:

  • style (str | PrintStyle) – How the values will be printed. Can be COMPACT, FULL, BINARY, BINARY_MAX_CHECKPOINT, or NONE.
  • precision (int) – The number of digits of precision in the output when style is FULL.
  • output_directory (str | Path | None) – The directory in which to store output tensors when style is BINARY.

Return type:

None
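
For example, a minimal sketch that switches to full-precision printing, assuming PrintStyle is importable from max.engine:

from max.driver import CPU
from max.engine import InferenceSession, PrintStyle  # PrintStyle location assumed

session = InferenceSession(devices=[CPU()])

# Print full tensor contents with 3 digits of precision.
session.set_debug_print_options(style=PrintStyle.FULL, precision=3)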

set_mojo_assert_level()

set_mojo_assert_level(level)

Sets which Mojo asserts are kept in the compiled model.

Parameters:

level (AssertLevel)

Return type:

None

set_mojo_log_level()

set_mojo_log_level(level)

Sets the verbosity of Mojo logging in the compiled model.

Parameters:

level (str | LogLevel)

Return type:

None
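
For example, a one-line sketch; the level name "debug" is an assumption here, and the accepted names are defined by LogLevel:

session.set_mojo_log_level("debug")  # assumed level name; see LogLevel for accepted values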

set_split_k_reduction_precision()

set_split_k_reduction_precision(precision)

Sets the accumulation precision for split-k reductions in large matmuls.

Parameters:

precision (str | SplitKReductionPrecision)

Return type:

None

use_fi_topk_kernel()

use_fi_topk_kernel(mode)

Enables the fused-inference top-k kernel.

Parameters:

mode (str) – String to enable or disable the feature. Pass “false”, “off”, “no”, or “0” to disable; any other value enables it.

Return type:

None

use_old_top_k_kernel()

use_old_top_k_kernel(mode)

Enables the old top-k kernel.

By default, the new top-k kernel is used, consistent with max/kernels/src/nn/topk.mojo.

Parameters:

mode (str) – String to enable or disable the feature. Pass “false”, “off”, “no”, or “0” to disable; any other value enables it.

Return type:

None