Run inference with C
Our C API allows you to integrate MAX Engine into your high-performance application code, and run inference with models from PyTorch and ONNX.
This tutorial shows how to use the MAX Engine C API to load a BERT model and run inference. We'll walk through a complete example that demonstrates loading a model, preparing inputs, and executing inference.
Create a virtual environment
Using a virtual environment ensures that you have the Python version and packages that are compatible with this project. We'll use the Magic CLI to create the environment and install the required packages.
If you don't have the magic CLI yet, you can install it on macOS and Ubuntu Linux with this command:
curl -ssL https://magic.modular.com/ | bash
Then run the source command that's printed in your terminal.
Initialize the runtime context
The first step in using the MAX Engine C API is initializing the runtime context. This context manages resources like thread pools and memory allocators that are needed during inference.
Create a new file called main.c. We'll need to create three key objects:
// Helper macro for error checking.
// Each line but the last ends with a backslash so the macro spans all of them.
#define CHECK(x)               \
  if (M_isError(x)) {          \
    logError(M_getError(x));   \
    return EXIT_FAILURE;       \
  }
M_Status *status = M_newStatus();
M_RuntimeConfig *runtimeConfig = M_newRuntimeConfig();
M_RuntimeContext *context = M_newRuntimeContext(runtimeConfig, status);
CHECK(status);
M_RuntimeContext is an application-level object that sets up resources such as the thread pool and allocators used during inference. We recommend that you create one context and use it throughout your application.
M_RuntimeContext requires two objects:
- M_RuntimeConfig: This configures details about the runtime context such as the number of threads to use and the logging level.
- M_Status: This is the object through which MAX Engine passes all error messages.
Notice that this code uses M_isError() to check whether the M_Status object holds an error, and exits if it does.
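The early-return behavior of a CHECK-style macro can be tried without MAX Engine installed. The following is a minimal sketch under stated assumptions: FakeStatus, fakeIsError(), fakeGetError(), and runStep() are hypothetical stand-ins invented for illustration (they are not part of the MAX Engine C API), and logError() is assumed to be a simple helper that prints to stderr.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

// Hypothetical stand-in for M_Status so the sketch compiles without MAX.
typedef struct {
  bool failed;
  const char *message;
} FakeStatus;

// Hypothetical stand-ins for M_isError() and M_getError().
bool fakeIsError(const FakeStatus *s) { return s->failed; }
const char *fakeGetError(const FakeStatus *s) { return s->message; }

// Assumed logging helper: prints the message to stderr.
void logError(const char *message) {
  fprintf(stderr, "ERROR: %s\n", message);
}

// Every line except the last ends with a backslash, so the preprocessor
// treats the whole body as a single macro definition.
#define CHECK(x)                 \
  if (fakeIsError(x)) {          \
    logError(fakeGetError(x));   \
    return EXIT_FAILURE;         \
  }

// The `return` inside CHECK leaves runStep() itself, which is why the
// macro is only usable inside functions that return int.
int runStep(bool simulateFailure) {
  FakeStatus status = {simulateFailure, "simulated failure"};
  CHECK(&status);
  return EXIT_SUCCESS;
}
```

Calling runStep(true) logs the simulated message and returns EXIT_FAILURE, while runStep(false) falls through to EXIT_SUCCESS. Because the macro hides a return statement, resources acquired before a failing CHECK are not released automatically.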
Compile the model
After initializing the runtime, you'll need to compile your model. MAX Engine supports both PyTorch's TorchScript format and ONNX models. The process differs slightly depending on your model format.
To compile the model, pass your model path to M_setModelPath(), along with an M_CompileConfig object. Then call M_compileModel().
- PyTorch
- ONNX
PyTorch models require additional input shape specifications since these aren't included in TorchScript format.
// Create compilation config and set model path
logInfo("Compiling Model");
M_CompileConfig *compileConfig = M_newCompileConfig();
const char *modelPath = argv[1];
M_setModelPath(compileConfig, /*path=*/modelPath);

// Define input specifications for PyTorch model
// Input IDs specification
int64_t inputShape[] = {1, 512}; // Example shape for BERT-like model
M_TorchInputSpec *inputSpec = M_newTorchInputSpec(
    inputShape,          // Shape array
    /*dimNames=*/NULL,   // Dimension names (optional)
    /*rankSize=*/2,      // Number of dimensions
    /*dtype=*/M_INT32,   // Data type
    status
);
CHECK(status);

// Attention mask specification
int64_t maskShape[] = {1, 512};
M_TorchInputSpec *maskSpec = M_newTorchInputSpec(
    maskShape,
    /*dimNames=*/NULL,
    /*rankSize=*/2,
    /*dtype=*/M_INT32,
    status
);
CHECK(status);

// Set input specifications for compilation
M_TorchInputSpec *inputSpecs[] = {inputSpec, maskSpec};
M_setTorchInputSpecs(compileConfig, inputSpecs, /*numInputs=*/2);

// Compile and initialize the model
M_AsyncCompiledModel *compiledModel = M_compileModel(
    context,
    &compileConfig,
    status
);
CHECK(status);
The M_CompileConfig takes the model path set by M_setModelPath(). M_setTorchInputSpecs() takes the input specs: the shape, rank, and data type of each input.
Because the TorchScript model does not include metadata about the input specs, you must supply them yourself. The snippet above hard-codes example shapes, whereas the input-preparation code later in this tutorial loads shapes from .bin files that were generated earlier. You can see an example of how to generate these files in our download-model.py script for bert-c-torchscript on GitHub.
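For readers who prefer to produce a shape file from C rather than from the Python script, here is a minimal sketch. It assumes the shape file is nothing more than the dimensions stored as raw, native-endian int64_t values, which is inferred from the way this tutorial casts the file contents to int64_t * and is not confirmed by the script itself. The function names writeShapeFile() and numElementsFromShape() are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

// Writes `rank` dimensions as raw native-endian int64_t values.
// Returns 0 on success and -1 on any I/O failure.
// Assumption: this raw layout matches what the tutorial's loader expects.
int writeShapeFile(const char *path, const int64_t *shape, size_t rank) {
  FILE *file = fopen(path, "wb");
  if (!file) {
    return -1;
  }
  size_t written = fwrite(shape, sizeof(int64_t), rank, file);
  int closeResult = fclose(file);
  return (written == rank && closeResult == 0) ? 0 : -1;
}

// Returns the number of elements a tensor of the given shape holds,
// i.e. the product of all dimensions (1 for rank 0).
int64_t numElementsFromShape(const int64_t *shape, size_t rank) {
  int64_t count = 1;
  for (size_t i = 0; i < rank; ++i) {
    count *= shape[i];
  }
  return count;
}
```

For the {1, 512} example shape used above, numElementsFromShape() yields 512, so an M_INT32 input buffer for that shape needs 512 * sizeof(int32_t) bytes.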
ONNX models include input shape information, making the process simpler:
// Set the model path
M_CompileConfig *compileConfig = M_newCompileConfig();
M_setModelPath(compileConfig, /*path=*/modelPath);

// Compile the model
M_AsyncCompiledModel *compiledModel =
    M_compileModel(context, &compileConfig, status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}
The M_CompileConfig needs just the model path set by M_setModelPath(). Once set, call M_compileModel() to compile the model.
MAX Engine now begins compiling the model asynchronously; M_compileModel() returns immediately.
Initialize the model
Now that the model is compiled, you can initialize it. Call M_initModel(), which returns an instance of M_AsyncModel. This step prepares the compiled model for fast execution by running and initializing some of the graph operations that are input-independent.
M_AsyncModel *model = M_initModel(
    context,
    compiledModel,
    /*weightsRegistry=*/NULL,
    status
);
CHECK(status);

// Wait for compilation to complete
logInfo("Waiting for model compilation to finish");
M_waitForModel(model, status);
CHECK(status);
You don't need to wait for the compilation started by M_compileModel() to finish before calling M_initModel(), because M_initModel() waits for it internally. If you want to wait explicitly, add a call to M_waitForCompilation() before you call M_initModel(). This is the general pattern followed by all MAX Engine APIs that accept an asynchronous value as an argument.
M_initModel() is also asynchronous and returns immediately. If you want to wait for it to finish, add a call to M_waitForModel().
Prepare input tensors
Before running inference, you need to prepare your input data in the format expected by the model. This involves creating an M_AsyncTensorMap and adding your input tensors:
// Define the tensor spec
int64_t *inputIdsShape =
    (int64_t *)readFileOrExit("inputs/input_ids_shape.bin");
M_TensorSpec *inputIdsSpec =
    M_newTensorSpec(inputIdsShape, /*rankSize=*/2, /*dtype=*/M_INT32,
                    /*tensorName=*/"input_ids");
free(inputIdsShape);

// Create the tensor map
M_AsyncTensorMap *inputToModel = M_newAsyncTensorMap(context);

// Add an input to the tensor map
int32_t *inputIdsTensor = (int32_t *)readFileOrExit("inputs/input_ids.bin");
M_borrowTensorInto(inputToModel, inputIdsTensor, inputIdsSpec, status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}
Add each input by calling M_borrowTensorInto(), passing it the input tensor and the corresponding tensor specification (shape, type, etc.) as an M_TensorSpec.
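The snippet above calls readFileOrExit() without defining it. One possible implementation is sketched below. This is an assumption about what such a helper might look like, not the version from the example repository: it reads the whole file into a heap buffer that the caller must release with free(), and it terminates the process on any failure.

```c
#include <stdio.h>
#include <stdlib.h>

// Hypothetical implementation of the readFileOrExit() helper.
// Reads the entire file at `path` into a malloc'd buffer and returns it.
// The caller owns the buffer and must free() it. Exits on any error.
char *readFileOrExit(const char *path) {
  FILE *file = fopen(path, "rb");
  if (!file) {
    fprintf(stderr, "failed to open %s. Aborting.\n", path);
    exit(EXIT_FAILURE);
  }

  // Find the file size by seeking to the end, then rewind to the start.
  long size = -1;
  if (fseek(file, 0, SEEK_END) == 0) {
    size = ftell(file);
  }
  if (size < 0 || fseek(file, 0, SEEK_SET) != 0) {
    fprintf(stderr, "failed to determine size of %s. Aborting.\n", path);
    exit(EXIT_FAILURE);
  }

  // Allocate at least one byte so an empty file still yields a valid pointer.
  char *buffer = malloc(size > 0 ? (size_t)size : 1);
  if (!buffer || fread(buffer, 1, (size_t)size, file) != (size_t)size) {
    fprintf(stderr, "failed to read %s. Aborting.\n", path);
    exit(EXIT_FAILURE);
  }

  fclose(file);
  return buffer;
}
```

Since the tensor is borrowed rather than copied, a buffer returned by a helper like this presumably has to stay allocated until inference has finished with it.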
Run inference
With your input data prepared, you can now run inference with M_executeModelSync():
logInfo("Running Inference...");
M_AsyncTensorMap *outputs = M_executeModelSync(context, model, inputToModel, status);
CHECK(status);
M_AsyncValue *resultValue = M_getValueByNameFrom(outputs, "result0", status);
CHECK(status);
Process the output
After inference completes, you'll need to process the output tensors:
logInfo("Extracting output values");
M_AsyncTensor *result = M_getTensorFromValue(resultValue);
size_t numElements = M_getTensorNumElements(result);
// %zu is the portable format specifier for size_t
printf("Tensor size: %zu\n", numElements);
M_Dtype dtype = M_getTensorType(result);

// Save output to file
const char *outputFilePath = "outputs.bin";
FILE *file = fopen(outputFilePath, "wb");
if (!file) {
  printf("failed to open %s. Aborting.\n", outputFilePath);
  return EXIT_FAILURE;
}
fwrite(M_getTensorData(result), M_sizeOf(dtype), numElements, file);
fclose(file);
The output is returned in an M_AsyncTensorMap, and you can get individual outputs from it with M_getTensorByNameFrom(). If you don't know the tensor name, you can get it from M_getTensorNameAt().
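To inspect the saved outputs.bin afterwards, a small reader along the following lines may help. It is a sketch that assumes the output tensor holds 32-bit floats written in the machine's native byte order; check the dtype reported by M_getTensorType() before relying on that. The readFloats() name is hypothetical.

```c
#include <stdio.h>

// Reads up to `maxCount` float values from a raw binary file into `out`.
// Returns the number of values actually read (0 if the file can't be opened).
// Assumption: the file contains native-endian 32-bit floats and nothing else.
size_t readFloats(const char *path, float *out, size_t maxCount) {
  FILE *file = fopen(path, "rb");
  if (!file) {
    return 0;
  }
  size_t count = fread(out, sizeof(float), maxCount, file);
  fclose(file);
  return count;
}
```

A caller can pass a buffer sized from the element count printed during inference and compare the returned count against it to detect a truncated file.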
Clean up
In this guide, you learned how to use the MAX Engine C API to run machine learning inference in C applications. You now know how to initialize the runtime environment, load models, prepare input data, execute inference, and process results, all in C.
Don't forget to free every object you created. See the types reference to find each free function.
For more example code, see our GitHub repo.