Run inference with C

Our C API allows you to integrate MAX Engine into your high-performance application code and run inference with models from TensorFlow, PyTorch, and ONNX.

This page shows how to use the MAX Engine C API to load a model and execute it.

For a complete code example, check out our GitHub repo.

Create a runtime context

The first thing you need is an M_RuntimeContext, which is an application-level object that sets up various resources, such as thread pools and allocators, that are used during inference. We recommend that you create one context and use it throughout your application.

To create an M_RuntimeContext, you need two other objects:

  • M_RuntimeConfig: This configures details about the runtime context such as the number of threads to use and the logging level.
  • M_Status: This is the object through which MAX Engine passes all error messages.

Here's how you can create both of these objects and then create the M_RuntimeContext:

M_Status *status = M_newStatus();
M_RuntimeConfig *runtimeConfig = M_newRuntimeConfig();
M_RuntimeContext *context = M_newRuntimeContext(runtimeConfig, status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}

Notice that this code uses M_isError() to check whether the M_Status object holds an error, and exits if it does.

Compile the model

Now you can compile your TensorFlow, PyTorch, or ONNX model.

note

TensorFlow models must be in SavedModel format and PyTorch models must be in TorchScript format.

Generally, you do that by passing your model path to M_setModelPath() along with an M_CompileConfig object, and then calling M_compileModel().

However, the MAX Engine compiler needs to know the model input shapes, which are not specified in a TorchScript file (they are specified in TF SavedModel and ONNX files). So, you need some extra code if you're loading a TorchScript model, as shown in the following PyTorch tab.

When loading a TensorFlow SavedModel or ONNX model, the M_CompileConfig needs just the model path, via M_setModelPath(). Then, you can call M_compileModel():

// Set the model path
M_CompileConfig *compileConfig = M_newCompileConfig();
M_setModelPath(compileConfig, /*path=*/modelPath);

// Compile the model
M_AsyncCompiledModel *compiledModel =
    M_compileModel(context, &compileConfig, status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}

MAX Engine now begins compiling the model asynchronously; M_compileModel() returns immediately. Note that an M_CompileConfig can only be used for a single compilation call. Any subsequent calls require a new M_CompileConfig.
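
For example, if your application compiles a second model (or recompiles the same one), create a new M_CompileConfig for that call. Here's a minimal sketch, where otherModelPath is a hypothetical path used only for illustration:

// Each call to M_compileModel() needs its own M_CompileConfig.
// "otherModelPath" is a hypothetical path used only for illustration.
M_CompileConfig *otherCompileConfig = M_newCompileConfig();
M_setModelPath(otherCompileConfig, /*path=*/otherModelPath);

M_AsyncCompiledModel *otherCompiledModel =
    M_compileModel(context, &otherCompileConfig, status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}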

Initialize the model

The M_AsyncCompiledModel returned by M_compileModel() is not ready for inference yet. You now need to initialize the model by calling M_initModel(), which returns an instance of M_AsyncModel.

This step prepares the compiled model for fast execution by running and initializing some of the graph operations that are input-independent.

M_AsyncModel *model = M_initModel(context, compiledModel, status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}

You don't need to wait for compilation to finish before calling M_initModel(), because M_initModel() internally waits for the compiled model to be ready. If you want to explicitly wait for compilation, add a call to M_waitForCompilation() before you call M_initModel(). This is the general pattern followed by all MAX Engine APIs that accept an asynchronous value as an argument.

M_initModel() is also asynchronous and returns immediately. If you want to wait for it to finish, add a call to M_waitForModel().
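
If you do want to block at either point, here's a sketch of where the wait calls go. It assumes that M_waitForCompilation() and M_waitForModel() each take the corresponding async object and the M_Status object; check the API reference for the exact signatures:

// Optionally block until compilation has finished, before M_initModel().
// Assumption: M_waitForCompilation() takes the compiled model and a status object.
M_waitForCompilation(compiledModel, status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}

// ... call M_initModel() as shown above ...

// Optionally block until model initialization has finished.
// Assumption: M_waitForModel() takes the model and a status object.
M_waitForModel(model, status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}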

Prepare input tensors

The last step before you run an inference is to add each input tensor to a single M_AsyncTensorMap. You add each input by calling M_borrowTensorInto(), passing it the input tensor and the corresponding tensor specification (shape, type, etc.) as an M_TensorSpec.

// Define the tensor spec
int64_t *inputIdsShape =
    (int64_t *)readFileOrExit("inputs/input_ids_shape.bin");
M_TensorSpec *inputIdsSpec =
    M_newTensorSpec(inputIdsShape, /*rankSize=*/2, /*dtype=*/M_INT32,
                    /*tensorName=*/"input_ids");
free(inputIdsShape);

// Create the tensor map
M_AsyncTensorMap *inputToModel = M_newAsyncTensorMap(context);
// Add an input to the tensor map
int32_t *inputIdsTensor = (int32_t *)readFileOrExit("inputs/input_ids.bin");
M_borrowTensorInto(inputToModel, inputIdsTensor, inputIdsSpec, status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}

Run an inference

Now you're ready to run an inference with M_executeModelSync():

M_AsyncTensorMap *outputs =
    M_executeModelSync(context, model, inputToModel, status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}

Process the output

The output is returned in an M_AsyncTensorMap, and you can get individual outputs from it with M_getTensorByNameFrom().

M_AsyncTensor *logits =
    M_getTensorByNameFrom(outputs,
                          /*tensorName=*/"logits", status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}
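
To then read values out of the tensor, you can use the tensor accessor functions. Here's a minimal sketch that assumes the M_getTensorNumElements() and M_getTensorData() accessors; verify the exact names and return types against the API reference:

// Sketch only: assumes M_getTensorNumElements() and M_getTensorData()
// accessors, and that this model's logits are 32-bit floats.
size_t numElements = M_getTensorNumElements(logits);
const float *logitsData = (const float *)M_getTensorData(logits);
for (size_t i = 0; i < numElements; i++) {
  printf("logits[%zu] = %f\n", i, logitsData[i]);
}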

If you don't know the tensor name, you can get it from M_getTensorNameAt().

Clean up

That's it! When you're finished, be sure to free everything you created; see the types reference to find the free function for each type.
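
For example, a cleanup sequence for the objects created on this page might look like the following sketch. The free-function names are assumptions based on the naming pattern in the types reference, so double-check each one there:

// Assumed free-function names; confirm each one in the types reference.
M_freeTensor(logits);
M_freeAsyncTensorMap(outputs);
M_freeAsyncTensorMap(inputToModel);
M_freeTensorSpec(inputIdsSpec);
// Assumption: M_borrowTensorInto() only borrows the buffer, so the caller
// still owns (and frees) the input data it read from disk.
free(inputIdsTensor);
M_freeModel(model);
M_freeCompiledModel(compiledModel);
M_freeRuntimeConfig(runtimeConfig);
M_freeRuntimeContext(context);
M_freeStatus(status);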

For a complete code example, check out our GitHub repo. In particular, see the bert-c-tensorflow and bert-c-torchscript examples.