Run inference with C
Our C API allows you to integrate MAX Engine into your high-performance application code, and run inference with models from PyTorch and ONNX.
This page shows how to use the MAX Engine C API to load a model and execute it with MAX Engine.
For a complete code example, check out our GitHub repo.
Create a runtime context
The first thing you need is an M_RuntimeContext, which is an application-level object that sets up various resources, such as the thread pool and allocators, used during inference. We recommend you create one context and use it throughout your application.
To create an M_RuntimeContext, you need two other objects:
- M_RuntimeConfig: This configures details about the runtime context, such as the number of threads to use and the logging level.
- M_Status: This is the object through which MAX Engine passes all error messages.
Here's how you can create both of these objects and then create the M_RuntimeContext:
M_Status *status = M_newStatus();
M_RuntimeConfig *runtimeConfig = M_newRuntimeConfig();
M_RuntimeContext *context = M_newRuntimeContext(runtimeConfig, status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}
Notice that this code uses M_isError() to check whether the M_Status object holds an error, and exits if it does.
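The snippets on this page call logError(), which is not defined here and is not presented as part of the MAX Engine API. The following is a minimal sketch of what such a helper might look like, using only the C standard library. The formatError() function and the "Error: " prefix are assumptions made for illustration.

```c
#include <stdio.h>

// Hypothetical helper (not part of the MAX Engine API): formats an error
// message into buf and returns the value reported by snprintf().
int formatError(char *buf, size_t size, const char *message) {
  return snprintf(buf, size, "Error: %s", message ? message : "(no message)");
}

// Hypothetical logError(): writes the formatted message to stderr.
// Messages longer than the local buffer are truncated.
void logError(const char *message) {
  char buf[512];
  formatError(buf, sizeof buf, message);
  fprintf(stderr, "%s\n", buf);
}
```

Keeping the formatting separate from the printing makes the helper easy to check in isolation, and lets you redirect the message to your own logging system later.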
Compile the model
Now you can compile your PyTorch or ONNX model.
PyTorch models must be in TorchScript format.
Generally, you do that by passing your model path to M_setModelPath(), along with an M_CompileConfig object, and then calling M_compileModel().
However, the MAX Engine compiler needs to know the model input shapes, which are not specified in a TorchScript file (they are specified in TF SavedModel and ONNX files). So, you need some extra code if you're loading a TorchScript model, as shown in the following PyTorch tab.
- PyTorch
- ONNX
If you're using a PyTorch model (it must be in TorchScript format), the M_CompileConfig needs the model path, via M_setModelPath(), and the input specs (shape, rank, and types), via M_setTorchInputSpecs().
Although you must specify all input shapes, the shapes can be dynamic: use M_getDynamicDimensionValue() for any dimension size that's dynamic. For more detail, see M_newTorchInputSpec().
Here's an abbreviated example:
// Set the model path
M_CompileConfig *compileConfig = M_newCompileConfig();
M_setModelPath(compileConfig, /*path=*/modelPath);

// Create torch input specs
int64_t *inputIdsShape =
    (int64_t *)readFileOrExit("inputs/input_ids_shape.bin");
M_TorchInputSpec *inputIdsInputSpec =
    M_newTorchInputSpec(inputIdsShape, /*dimNames=*/NULL, /*rankSize=*/2,
                        /*dtype=*/M_INT32, status);

// ... Similar code here to also create M_TorchInputSpec for
// attentionMaskInputSpec and tokenTypeIdsInputSpec

// Set the input specs
M_TorchInputSpec *inputSpecs[3] = {inputIdsInputSpec, attentionMaskInputSpec,
                                   tokenTypeIdsInputSpec};
M_setTorchInputSpecs(compileConfig, inputSpecs, 3);

// Compile the model
M_AsyncCompiledModel *compiledModel =
    M_compileModel(context, &compileConfig, status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}
Because the TorchScript model does not include metadata about the input specs, this code loads the input shapes from .bin files that were generated earlier. You can see an example of how to generate these files in our download-model.py script for bert-c-torchscript on GitHub.
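The readFileOrExit() helper used in these snippets is not defined on this page. As a rough sketch, assuming each .bin file holds raw values in the machine's native byte order (int64_t for shapes, as the cast in the snippet suggests), it could be written with the C standard library like this. The error messages and the exact behavior are illustrative assumptions, not the implementation from the GitHub example.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

// Hypothetical sketch of readFileOrExit(): reads a whole file into a
// malloc'd buffer and terminates the process on any failure.
// The caller owns the returned buffer and must free() it.
void *readFileOrExit(const char *path) {
  FILE *file = fopen(path, "rb");
  if (!file) {
    fprintf(stderr, "Failed to open %s\n", path);
    exit(EXIT_FAILURE);
  }
  // Find the file size by seeking to the end (works on common platforms).
  if (fseek(file, 0, SEEK_END) != 0) {
    fprintf(stderr, "Failed to seek in %s\n", path);
    exit(EXIT_FAILURE);
  }
  long size = ftell(file);
  if (size < 0) {
    fprintf(stderr, "Failed to get the size of %s\n", path);
    exit(EXIT_FAILURE);
  }
  rewind(file);
  // Allocate at least one byte so an empty file still yields a valid pointer.
  void *buffer = malloc(size > 0 ? (size_t)size : 1);
  if (!buffer) {
    fprintf(stderr, "Out of memory while reading %s\n", path);
    exit(EXIT_FAILURE);
  }
  if (fread(buffer, 1, (size_t)size, file) != (size_t)size) {
    fprintf(stderr, "Failed to read %s\n", path);
    exit(EXIT_FAILURE);
  }
  fclose(file);
  return buffer;
}
```

Exiting on failure keeps the calling code short, which suits a small example program. A long-running application would more likely return an error to the caller instead.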
If you're using an ONNX model, the M_CompileConfig needs just the model path, via M_setModelPath(). Then, you can call M_compileModel():
// Set the model path
M_CompileConfig *compileConfig = M_newCompileConfig();
M_setModelPath(compileConfig, /*path=*/modelPath);
// Compile the model
M_AsyncCompiledModel *compiledModel =
    M_compileModel(context, &compileConfig, status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}
MAX Engine now begins compiling the model asynchronously; M_compileModel() returns immediately. Note that an M_CompileConfig can only be used for a single compilation call. Any subsequent calls require a new M_CompileConfig.
Initialize the model
The M_AsyncCompiledModel returned by M_compileModel() is not ready for inference yet. You now need to initialize the model by calling M_initModel(), which returns an instance of M_AsyncModel.
This step prepares the compiled model for fast execution by running and initializing some of the graph operations that are input-independent.
M_AsyncModel *model = M_initModel(context, compiledModel, status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}
You don't need to wait for compilation to finish before calling M_initModel(), because M_initModel() internally waits for it. If you want to wait explicitly, add a call to M_waitForCompilation() before you call M_initModel(). This is the general pattern followed by all MAX Engine APIs that accept an asynchronous value as an argument.
M_initModel() is also asynchronous and returns immediately. If you want to wait for it to finish, add a call to M_waitForModel().
Prepare input tensors
The last step before you run an inference is to collect all input tensors in a single M_AsyncTensorMap. You can add each input by calling M_borrowTensorInto(), passing it the input tensor and the corresponding tensor specification (shape, type, etc.) as an M_TensorSpec.
// Define the tensor spec
int64_t *inputIdsShape =
    (int64_t *)readFileOrExit("inputs/input_ids_shape.bin");
M_TensorSpec *inputIdsSpec =
    M_newTensorSpec(inputIdsShape, /*rankSize=*/2, /*dtype=*/M_INT32,
                    /*tensorName=*/"input_ids");
free(inputIdsShape);

// Create the tensor map
M_AsyncTensorMap *inputToModel = M_newAsyncTensorMap(context);

// Add an input to the tensor map
int32_t *inputIdsTensor = (int32_t *)readFileOrExit("inputs/input_ids.bin");
M_borrowTensorInto(inputToModel, inputIdsTensor, inputIdsSpec, status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}
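Because the input data and its spec are loaded separately, it may be worth checking that the buffer is as large as the shape implies before you add it to the map. The sketch below computes the expected size in bytes for a fully static shape. tensorByteSize() is a hypothetical helper, not part of the MAX Engine API, and it does not guard against arithmetic overflow.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper (not part of the MAX Engine API): returns the number
// of bytes occupied by a dense tensor with the given static shape.
// Returns 0 if any dimension is not positive, so it is only meaningful
// for shapes without dynamic dimensions. A rank of 0 is treated as a scalar.
size_t tensorByteSize(const int64_t *shape, size_t rank, size_t elementSize) {
  size_t count = 1;
  for (size_t i = 0; i < rank; ++i) {
    if (shape[i] <= 0) {
      return 0;
    }
    count *= (size_t)shape[i];
  }
  return count * elementSize;
}
```

For example, a static shape of 1 x 128 with int32_t elements gives 512 bytes, which you could compare against the size of the data file you loaded.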
Run an inference
Now you're ready to run an inference with M_executeModelSync():
M_AsyncTensorMap *outputs =
    M_executeModelSync(context, model, inputToModel, status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}
Process the output
The output is returned in an M_AsyncTensorMap, and you can get individual outputs from it with M_getTensorByNameFrom().
M_AsyncTensor *logits =
    M_getTensorByNameFrom(outputs,
                          /*tensorName=*/"logits", status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}
If you don't know the tensor name, you can get it from M_getTensorNameAt().
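What you do with an output tensor depends on your model. As an illustration only, suppose you have obtained a pointer to the output data and its element count through the tensor accessors in the API reference, and that the logits are a contiguous array of 32-bit floats (an assumption about the model, not something this page states). A common next step is to pick the highest-scoring index. argmaxFloat() below is a hypothetical helper, not part of the MAX Engine API.

```c
#include <stddef.h>

// Hypothetical post-processing helper (not part of the MAX Engine API):
// returns the index of the largest value in the array, or (size_t)-1 if
// the array is empty. Ties resolve to the lowest index.
size_t argmaxFloat(const float *values, size_t count) {
  if (count == 0) {
    return (size_t)-1;
  }
  size_t best = 0;
  for (size_t i = 1; i < count; ++i) {
    if (values[i] > values[best]) {
      best = i;
    }
  }
  return best;
}
```

How the resulting index maps to a label or token is specific to the model you loaded.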
Clean up
That's it! Don't forget to free every object you created. See the types reference to find each free function.
For a complete code example, check out our GitHub repo. In particular, see the bert-c-torchscript example.