InferenceSession
class max.engine.InferenceSession(devices=(), num_threads=None, *, custom_extensions=None)
Bases: object
Manages an inference session in which you can load and run models.
You need an instance of this to load a model as a Model object.
For example:
session = engine.InferenceSession(devices=[CPU()])
model_path = Path('bert-base-uncased')
model = session.load(model_path)

Construct an inference session.
Parameters:

- devices (Iterable[Device]) – A list of devices on which to run inference. The host CPU is always included automatically.
- num_threads (int | None) – Number of threads to use for the inference session. This defaults to the number of physical cores on your machine.
- custom_extensions (CustomExtensionsType | None) – The extensions to load for the model. Supports paths to a .mojopkg custom ops library or a .mojo source file.
debug
debug: DebugConfig
devices
A list of available devices.
enable_per_tensor_fp8_quantize()
enable_per_tensor_fp8_quantize(mode)
Enables per-tensor FP8 quantization.
Parameters:

mode (str) – String to enable/disable. Accepts “false”, “off”, “no”, or “0” to disable; any other value enables.

Return type:

None
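Several InferenceSession toggles (enable_per_tensor_fp8_quantize(), use_fi_topk_kernel(), use_old_top_k_kernel()) share this same string convention. A minimal pure-Python sketch of the documented rule, for reference only — the helper name and its case/whitespace normalization are our assumptions, not part of the API:

```python
def mode_enabled(mode: str) -> bool:
    """Mirror the documented toggle convention: "false", "off", "no",
    and "0" disable; any other string enables.

    Lowercasing and stripping whitespace here are our assumption about
    leniency; the docs only list the literal disable values."""
    return mode.strip().lower() not in {"false", "off", "no", "0"}

print(mode_enabled("on"))        # True
print(mode_enabled("off"))       # False
print(mode_enabled("0"))         # False
print(mode_enabled("anything"))  # True
```

So, for example, `session.enable_per_tensor_fp8_quantize("no")` disables the feature, while any unrecognized string such as "yes" enables it.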
gpu_profiling()
gpu_profiling(mode)
Enables GPU profiling instrumentation for the session.
This enables GPU profiling instrumentation that works with NVIDIA Nsight Systems and Nsight Compute. When enabled, the runtime adds CUDA driver calls and NVTX markers that allow profiling tools to correlate GPU kernel executions with host-side code.
For example, to enable detailed profiling for Nsight Systems analysis,
call gpu_profiling() before load():
from max.engine import InferenceSession
from max.driver import Accelerator

session = InferenceSession(devices=[Accelerator()])
session.gpu_profiling("detailed")
model = session.load(my_graph)

Then run it with nsys:

nsys profile --trace=cuda,nvtx python example.py

Or, instead of calling session.gpu_profiling() in the code, you can set the MODULAR_ENABLE_PROFILING environment variable when you call nsys profile:

MODULAR_ENABLE_PROFILING=detailed nsys profile --trace=cuda,nvtx python script.py

Beware that gpu_profiling() overrides the MODULAR_ENABLE_PROFILING environment variable if both are used.
Parameters:

- mode (Literal['off', 'on', 'detailed']) – The profiling mode to set. One of:
  - off: Disable profiling (default).
  - on: Enable basic profiling with NVTX markers for kernel correlation.
  - detailed: Enable detailed profiling with additional Python-level NVTX markers.

Return type:

None
load()
load(model, *, custom_extensions=None, weights_registry=None)
Loads a trained model and compiles it for inference.
Parameters:

- model (str | Path | Graph) – Path to a model.
- custom_extensions (CustomExtensionsType | None) – The extensions to load for the model. Supports paths to .mojopkg custom ops.
- weights_registry (Mapping[str, DLPackArray] | None) – Model weight names mapped to their values. The values should be DLPack arrays. If an array is a read-only NumPy array, you must ensure that its lifetime extends beyond the lifetime of the model. Although weights_registry is technically optional, you'll always need to load weights in practice.

Returns:

The loaded model, compiled and ready to execute.

Raises:

RuntimeError – If the path provided is invalid.

Return type:

Model
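As a sketch of preparing a weights_registry, here is one way to build the mapping from plain NumPy arrays (which are DLPack-compatible). The weight names and shapes below are hypothetical placeholders, not real model weights; the commented-out load call shows where the registry would be used:

```python
import numpy as np

# Map weight names to DLPack-compatible arrays. The names and shapes
# here are made up for illustration; real names must match the graph's
# declared weights.
weights_registry = {
    "embedding.weight": np.zeros((30522, 768), dtype=np.float32),
    "encoder.layer0.bias": np.zeros((768,), dtype=np.float32),
}

# For read-only NumPy arrays, keep the registry (or the arrays) alive
# for as long as the loaded model is in use.
# model = session.load(graph, weights_registry=weights_registry)

print(sorted(weights_registry))
```

The same registry shape applies to load_all() below.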
load_all()
load_all(model, *, custom_extensions=None, weights_registry=None)
Loads all trained models in an artifact and compiles them for inference.
A compiled MEF artifact may contain more than one model (for example a
vision encoder and a language model compiled together). This method
returns one Model per model encoded in the artifact, in MEF
order. For single-model artifacts the returned list has exactly one
element.
Parameters:

- model (str | Path | Graph) – Path to a model.
- custom_extensions (CustomExtensionsType | None) – The extensions to load for the model. Supports paths to .mojopkg custom ops.
- weights_registry (Mapping[str, DLPackArray] | None) – Model weight names mapped to their values. The values should be DLPack arrays. If an array is a read-only NumPy array, you must ensure that its lifetime extends beyond the lifetime of the model. Although weights_registry is technically optional, you'll always need to load weights in practice.

Returns:

The loaded models, compiled and ready to execute, one per model primitive encoded in the compiled artifact.

Raises:

RuntimeError – If the path provided is invalid.

Return type:

list[Model]
set_debug_print_options()
set_debug_print_options(style=PrintStyle.COMPACT, precision=6, output_directory=None)
Sets the debug print options.
See Value.print.
This affects debug printing across all model execution using the same InferenceSession.
Tensors saved with BINARY can be loaded using max.driver.Buffer.mmap(), but you will have to provide the expected dtype and shape.
Tensors saved with BINARY_MAX_CHECKPOINT are saved with the shape and dtype information, and can be loaded with max.driver.buffer.load_max_buffer().
Warning: Even with style set to NONE, debug print ops in the graph can stop optimizations. If you see performance issues, try fully removing debug print ops.
Parameters:

- style (str | PrintStyle) – How the values will be printed. Can be COMPACT, FULL, BINARY, BINARY_MAX_CHECKPOINT, or NONE.
- precision (int) – If the style is FULL, the digits of precision in the output.
- output_directory (str | Path | None) – If the style is BINARY, the directory to store output tensors.

Return type:

None
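Because a BINARY-style dump is raw tensor bytes with no header, the reader must already know the dtype and shape. The round trip can be sketched with plain NumPy as a stand-in for max.driver.Buffer.mmap() — the file name, dtype, and shape below are hypothetical:

```python
import os
import tempfile

import numpy as np

# Write a tensor the way a headerless binary dump stores it: bare
# bytes, no dtype or shape metadata.
tensor = np.arange(12, dtype=np.float32).reshape(3, 4)
path = os.path.join(tempfile.mkdtemp(), "debug_tensor.bin")
tensor.tofile(path)

# Reading it back requires supplying the expected dtype and shape
# out of band.
loaded = np.fromfile(path, dtype=np.float32).reshape(3, 4)
print(np.array_equal(tensor, loaded))  # True
```

This is why BINARY_MAX_CHECKPOINT is the more convenient style when you control both ends: it stores the dtype and shape alongside the data.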
set_mojo_assert_level()
set_mojo_assert_level(level)
Sets which mojo asserts are kept in the compiled model.
Parameters:

level (AssertLevel)

Return type:

None
set_mojo_log_level()
set_mojo_log_level(level)
Sets the verbosity of mojo logging in the compiled model.
set_split_k_reduction_precision()
set_split_k_reduction_precision(precision)
Sets the accumulation precision for split k reductions in large matmuls.
Parameters:

precision (str | SplitKReductionPrecision)

Return type:

None
use_fi_topk_kernel()
use_fi_topk_kernel(mode)
Enables the fused-inference top-k kernel.
Parameters:

mode (str) – String to enable/disable. Accepts “false”, “off”, “no”, or “0” to disable; any other value enables.

Return type:

None
use_old_top_k_kernel()
use_old_top_k_kernel(mode)
Enables the old top-k kernel.
By default, the new top-k kernel is used, to stay consistent with max/kernels/src/nn/topk.mojo.
Parameters:

mode (str) – String to enable/disable. Accepts “false”, “off”, “no”, or “0” to disable; any other value enables.

Return type:

None