Get started in C

A walkthrough of the C API, showing how to load and run a trained model.

This is a preview of the Modular AI Engine. It is not publicly available yet and APIs are subject to change.

If you’re interested, please sign up for early access.

Our C API allows you to integrate the Modular AI Engine into your high-performance application code and run inference on any model from TensorFlow or PyTorch.

This page shows how you can use the C API to load a trained model and execute it with the Modular AI Engine.

We also offer a Python API, and a C++ API is coming soon.

Create a runtime context

The first thing you need is an M_RuntimeContext, which is an application-level object that sets up various resources, such as thread pools and allocators, used during inference. We recommend you create one context and use it throughout your application.

To create an M_RuntimeContext, you need two other objects:

  • M_RuntimeConfig: This configures details about the runtime context such as the number of threads to use and the logging level.
  • M_Status: This is the object through which the AI Engine passes all error messages.

Here’s how you can create both of these objects and then create the M_RuntimeContext:

M_Status *status = M_newStatus();
M_RuntimeConfig *runtimeConfig = M_newRuntimeConfig();
M_RuntimeContext *context = M_newRuntimeContext(runtimeConfig, status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}

Notice that this code checks whether the M_Status object holds an error, using M_isError(), and exits if it does.
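
If you want to tune the runtime, set options on the M_RuntimeConfig before creating the context. Here’s a minimal sketch; it assumes a setter such as M_setNumThreads() is available, so check context.h for the configuration functions your release actually provides:

M_RuntimeConfig *runtimeConfig = M_newRuntimeConfig();
// Assumed setter; see context.h for the options available in your release.
M_setNumThreads(runtimeConfig, /*numThreads=*/4);
M_RuntimeContext *context = M_newRuntimeContext(runtimeConfig, status);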

Compile the model

Now you can compile your trained TensorFlow or PyTorch model.

You just need to create an M_CompileConfig, specify the path to your model with M_setModelPath(), and then call M_compileModel(), as shown here:

M_CompileConfig *compileConfig = M_newCompileConfig();
const char *modelPath = argv[1];
M_setModelPath(compileConfig, /*path=*/modelPath);
M_AsyncCompiledModel *compiledModel =
    M_compileModel(context, compileConfig, status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}

Compilation happens asynchronously and M_compileModel() returns immediately.

Initialize the model

The M_AsyncCompiledModel returned above is not ready for inference yet. You now need to initialize the model by calling M_initModel(), which returns an instance of M_AsyncModel:

M_AsyncModel *model = M_initModel(context, compiledModel, status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}

You don’t need to await the M_AsyncCompiledModel (returned by M_compileModel()) before calling M_initModel(), because M_initModel() internally waits for compilation to complete before initializing the model.

If you do need to wait for the M_AsyncCompiledModel, you can insert a call to M_waitForCompilation() before you call M_initModel(). This is the general pattern followed by all Modular AI Engine APIs that accept an asynchronous value as an argument.
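
For example, here’s a minimal sketch of that pattern, assuming M_waitForCompilation() takes the compiled model and a status object:

// Optionally block until compilation has finished before initializing.
M_waitForCompilation(compiledModel, status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}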

M_initModel() is also asynchronous and returns immediately. This step prepares the compiled model for fast execution by running and initializing some of the graph operations that are input-independent.
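
Before you prepare the inputs, you can also list the model’s input tensor names. This snippet is taken from the full example at the end of this page (it’s also where the tensorNames freed in the clean-up step below comes from):

M_TensorNameArray *tensorNames = M_getInputNames(compiledModel, status);
size_t numInputs = M_getNumModelInputs(compiledModel, status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}

logDebug("Model input names:");
for (size_t i = 0; i < numInputs; i++) {
  const char *tensorName = M_getTensorNameAt(tensorNames, i);
  printf("%s ", tensorName);
}
printf("\n");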

Prepare input tensors

This example prepares three input tensors for the “bert-base-uncased” model. The input values are meaningless; they just need to match each input’s rank and data type. Both the batch size and the sequence length are dynamic (each input is a 2D tensor of variable shape), so this example uses a batch size of 1 and a sequence length of 5:

// Define the input tensor specs.
const int64_t shape[] = {1, 5};
M_TensorSpec *inputIdsSpec =
    M_newTensorSpec(shape, /*rankSize=*/2, /*dtype=*/M_INT32,
                    /*tensorName=*/"input_ids");
M_TensorSpec *attentionMaskSpec =
    M_newTensorSpec(shape, /*rankSize=*/2, /*dtype=*/M_INT32,
                    /*tensorName=*/"attention_mask");
M_TensorSpec *tokenTypeIdsSpec =
    M_newTensorSpec(shape, /*rankSize=*/2, /*dtype=*/M_INT32,
                    /*tensorName=*/"token_type_ids");

// Create the input tensors and borrow them into the model's input map.
// Borrowing means no copy is made, and the caller is responsible for
// keeping the inputs alive until the inference is complete.
M_AsyncTensorMap *inputToModel = M_newAsyncTensorMap(context);
int32_t inputIdsTensor[1][5] = {{101, 7592, 2088, 999, 102}};
M_borrowTensorInto(inputToModel, inputIdsTensor, inputIdsSpec, status);
int32_t attentionMaskTensor[1][5] = {{1, 1, 1, 1, 1}};
M_borrowTensorInto(inputToModel, attentionMaskTensor, attentionMaskSpec,
                    status);
int32_t tokenTypeIdsTensor[1][5] = {{0, 0, 0, 0, 0}};
M_borrowTensorInto(inputToModel, tokenTypeIdsTensor, tokenTypeIdsSpec,
                    status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}

Run an inference

Now you’re ready to run an inference:

// Run the inference.
// This function blocks until the inference is complete.
logInfo("Running Inference");
M_AsyncTensorMap *outputs =
    M_executeModelSync(context, model, inputToModel, status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}

Process the output

Now you can read output tensors by name using M_getTensorByNameFrom(). If you only know the tensor index, you can get the name using M_getTensorNameAt().

logInfo("Inference successfully completed");
M_AsyncTensor *tensor =
    M_getTensorByNameFrom(outputs,
                          /*tensorName=*/"pooler_output", status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}

size_t numElements = M_getTensorNumElements(tensor);
printf("Output length: %zu\n", numElements);

As we mentioned above, this code is just meant to demonstrate the Modular AI Engine C API, so we don’t process the output here and the results are meaningless.
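
If you do want to peek at the raw values, here’s a minimal sketch. It assumes the API exposes the tensor’s buffer through an accessor such as M_getTensorData() (check tensor.h for the exact name) and that the “pooler_output” tensor holds 32-bit floats:

// Assumed accessor; see tensor.h for how to get the underlying buffer.
const float *data = (const float *)M_getTensorData(tensor);
for (size_t i = 0; i < numElements && i < 5; ++i) {
  printf("output[%zu] = %f\n", i, data[i]);
}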

Clean up

Finally, you need to clean up all the resources.

M_freeTensor(tensor);
M_freeTensorSpec(inputIdsSpec);
M_freeTensorSpec(attentionMaskSpec);
M_freeTensorSpec(tokenTypeIdsSpec);
M_freeAsyncTensorMap(inputToModel);
M_freeAsyncTensorMap(outputs);
M_freeTensorNameArray(tensorNames);

M_freeModel(model);
M_freeCompiledModel(compiledModel);
M_freeCompileConfig(compileConfig);
M_freeRuntimeContext(context);
M_freeRuntimeConfig(runtimeConfig);
M_freeStatus(status);

You can see the complete program below.

Full example

#include "modular/c/common.h"
#include "modular/c/context.h"
#include "modular/c/model.h"
#include "modular/c/tensor.h"

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

void logHelper(const char *level, const char *message, const char delimiter) {
  printf("%s: %s%c", level, message, delimiter);
}

void logDebug(const char *message) { logHelper("DEBUG", message, ' '); }

void logInfo(const char *message) { logHelper("INFO", message, '\n'); }

void logError(const char *message) { logHelper("ERROR", message, '\n'); }

int main(int argc, char **argv) {
  if (argc != 2) {
    printf("Usage: bert-example <path to bert saved model>");
    return EXIT_FAILURE;
  }

  M_Status *status = M_newStatus();

  M_RuntimeConfig *runtimeConfig = M_newRuntimeConfig();
  M_RuntimeContext *context = M_newRuntimeContext(runtimeConfig, status);
  if (M_isError(status)) {
    logError(M_getError(status));
    return EXIT_FAILURE;
  }

  logInfo("Compiling Model");
  M_CompileConfig *compileConfig = M_newCompileConfig();
  const char *modelPath = argv[1];
  M_setModelPath(compileConfig, /*path=*/modelPath);
  M_AsyncCompiledModel *compiledModel =
      M_compileModel(context, compileConfig, status);
  if (M_isError(status)) {
    logError(M_getError(status));
    return EXIT_FAILURE;
  }

  logInfo("Initializing Model");
  M_AsyncModel *model = M_initModel(context, compiledModel, status);
  if (M_isError(status)) {
    logError(M_getError(status));
    return EXIT_FAILURE;
  }

  M_TensorNameArray *tensorNames = M_getInputNames(compiledModel, status);
  size_t numInputs = M_getNumModelInputs(compiledModel, status);
  if (M_isError(status)) {
    logError(M_getError(status));
    return EXIT_FAILURE;
  }

  logDebug("Model input names:");
  for (size_t i = 0; i < numInputs; i++) {
    const char *tensorName = M_getTensorNameAt(tensorNames, i);
    printf("%s ", tensorName);
  }
  printf("\n");

  // Define the input tensor specs.
  const int64_t shape[] = {1, 5};
  M_TensorSpec *inputIdsSpec =
      M_newTensorSpec(shape, /*rankSize=*/2, /*dtype=*/M_INT32,
                      /*tensorName=*/"input_ids");
  M_TensorSpec *attentionMaskSpec =
      M_newTensorSpec(shape, /*rankSize=*/2, /*dtype=*/M_INT32,
                      /*tensorName=*/"attention_mask");
  M_TensorSpec *tokenTypeIdsSpec =
      M_newTensorSpec(shape, /*rankSize=*/2, /*dtype=*/M_INT32,
                      /*tensorName=*/"token_type_ids");

  // Create the input tensors and borrow them into the model's input map.
  // Borrowing means no copy is made, and the caller is responsible for
  // keeping the inputs alive until the inference is complete.
  M_AsyncTensorMap *inputToModel = M_newAsyncTensorMap(context);
  int32_t inputIdsTensor[1][5] = {{101, 7592, 2088, 999, 102}};
  M_borrowTensorInto(inputToModel, inputIdsTensor, inputIdsSpec, status);
  int32_t attentionMaskTensor[1][5] = {{1, 1, 1, 1, 1}};
  M_borrowTensorInto(inputToModel, attentionMaskTensor, attentionMaskSpec,
                     status);
  int32_t tokenTypeIdsTensor[1][5] = {{0, 0, 0, 0, 0}};
  M_borrowTensorInto(inputToModel, tokenTypeIdsTensor, tokenTypeIdsSpec,
                     status);
  if (M_isError(status)) {
    logError(M_getError(status));
    return EXIT_FAILURE;
  }

  // Run the inference.
  // This function blocks until the inference is complete.
  logInfo("Running Inference");
  M_AsyncTensorMap *outputs =
      M_executeModelSync(context, model, inputToModel, status);
  if (M_isError(status)) {
    logError(M_getError(status));
    return EXIT_FAILURE;
  }

  logInfo("Inference successfully completed");
  M_AsyncTensor *tensor =
      M_getTensorByNameFrom(outputs,
                            /*tensorName=*/"pooler_output", status);
  if (M_isError(status)) {
    logError(M_getError(status));
    return EXIT_FAILURE;
  }

  size_t numElements = M_getTensorNumElements(tensor);
  printf("Output length: %zu\n", numElements);

  M_freeTensor(tensor);
  M_freeTensorSpec(inputIdsSpec);
  M_freeTensorSpec(attentionMaskSpec);
  M_freeTensorSpec(tokenTypeIdsSpec);
  M_freeAsyncTensorMap(inputToModel);
  M_freeAsyncTensorMap(outputs);
  M_freeTensorNameArray(tensorNames);

  M_freeModel(model);
  M_freeCompiledModel(compiledModel);
  M_freeCompileConfig(compileConfig);
  M_freeRuntimeContext(context);
  M_freeRuntimeConfig(runtimeConfig);
  M_freeStatus(status);
  return EXIT_SUCCESS;
}

The AI Engine is not yet available to the public, but if you’d like to become an early access partner, please join the waitlist here.