Inference Engine FAQ

Answers to questions we expect about the Modular Inference Engine.

We tried to anticipate your questions about the Modular Inference Engine on this page. If this page doesn’t answer all your questions, please ask us on our Discord channel.

Motivation

Do we really need yet another inference engine?

We believe so, because the AI deployment landscape is littered with tools that provide only a small portion of what AI developers need. They support only a single framework, a single hardware target, a single OS and platform, a small subset of models, or only small- to medium-sized model execution. As a result, developers end up dealing with flaky model converters, constantly rewriting and reoptimizing their models to work with different tools, and working across different test and deployment environments. This is why we built the Modular Inference Engine: the only inference engine in the world that doesn't require you to compromise, because it is both general purpose and incredibly fast.

Distribution

When will it be generally available to the public?

The Inference Engine is available today as an early preview for select users. If you are interested in joining the waitlist, please request access to the Inference Engine.

We will share more details about general availability timelines in the near future.

Will it be open-sourced?

We want to contribute a lot to open source, but we also want to do it right. Our team has decades of experience building open-source projects, and we have learned that the important thing is to create an inclusive and vibrant community – and that takes a lot of work. We will need to figure out the details, but as we do so, we will share more. Please stay tuned.

Functionality

How do I use the Inference Engine?

Although it’s not generally available today, the Inference Engine APIs are designed so you can execute models trained in TensorFlow or PyTorch on a variety of target hardware. You can read more about our APIs and view sample code in the Inference Engine docs.
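As a rough sketch of the intended developer flow (not the final API), loading and running a saved TensorFlow or PyTorch model looks something like the Python example below. The module, class, and method names here (modular.engine, InferenceSession, load, execute) are illustrative placeholders, not the confirmed interface; refer to the Inference Engine docs for the actual API and sample code.

    import numpy as np

    # Illustrative placeholder names only; the real package, class, and
    # method names may differ -- see the Inference Engine docs.
    from modular import engine

    # Create a session and load a trained model (for example, a TensorFlow
    # SavedModel or a TorchScript file), targeting whatever hardware is
    # available on the deployment machine.
    session = engine.InferenceSession()
    model = session.load("resnet50_saved_model")  # hypothetical model path

    # Run inference with an ordinary NumPy array as input.
    input_batch = np.random.rand(1, 224, 224, 3).astype(np.float32)
    outputs = model.execute(input_batch)
    print(outputs)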

What types of models does the Inference Engine support and from which training frameworks?

It currently supports any type of model, from traditional ML algorithms to deep learning architectures. It can run TensorFlow and PyTorch models out of the box, with no conversion required. Support for other model formats, such as models trained with scikit-learn, will be added in the future.

Does the Inference Engine support models in the ONNX format?

Yes, we support ONNX models. We are primarily focused on native TensorFlow and PyTorch support, but if you have models you’ve already converted to ONNX, we can execute those too.

What hardware is currently supported by the Inference Engine?

The engine currently supports all CPU types from major vendors, including Intel, AMD, and AWS Graviton.

Does the Inference Engine support generic ARM architectures?

Yes, both Mojo and the Modular Inference Engine support generic ARM architectures, such as Apple's ARM chips. We formally benchmark ourselves on Graviton because it's the most commonly used ARM chip for server deployments, and our benchmarks are designed to match what users most commonly run in production.

Which operating systems does the Inference Engine support?

The engine currently works on Linux, macOS, and Windows.

Which programming languages does it support? Can I use Mojo?

The Inference Engine APIs are currently available in Python and C, with C++ coming soon. We will be rolling out support for APIs in additional languages in the future. If there are other languages you’d like to see us support, please share in our Discord.

How quickly will it support new model architectures as they become available as part of TensorFlow or PyTorch?

One of the design principles of the Modular Inference Engine is full compatibility with models that can be trained via TensorFlow and PyTorch. Hence, we will quickly update the engine to support new operators as soon as they are available in a stable release of the training framework.

Will I need different packages depending on the platform or hardware where my code needs to be deployed?

No. We believe forcing developers to deploy and manage different packages for different deployment targets is a major friction factor that shouldn’t exist. The same engine package works regardless of the hardware available in your deployment platform.

Will I need to change my code if I want to switch deployment to hardware from a different manufacturer?

No. With the Modular Inference Engine you write your code once and deploy anywhere. For example, if you are currently running on an Intel instance in AWS but want to experiment with a Graviton instance, that’s no problem. Just redeploy the same engine package to the Graviton instance, and you’re good to go.

Can I use Modular’s Inference Engine on existing cloud platforms?

Yes. You can deploy the Inference Engine to infrastructure provided by any major cloud platform (AWS, GCP, Azure) via traditional container solutions.

Does the Inference Engine support quantization and sparsity?

It supports quantized models today (models with integer data types), and we will add support for sparsity soon.

Will the Inference Engine support distributed inference of large models?

Yes, it will support executing large models that do not fit into the memory of a single device. Please stay tuned for more details!

Will the Inference Engine support mobile?

We are currently focused on server deployment, but we plan to support deployment to many different platforms, including mobile. We will share more about our mobile support in the future, so stay tuned!

Can I extend the Inference Engine with a new hardware backend?

Yes, you can. The Inference Engine can support other hardware backends, including specialized accelerators, via Mojo, which can talk directly to MLIR and LLVM abstractions (for an example of Mojo talking directly to MLIR, see our low-level IR in Mojo notebook). The exact details about how to add additional hardware support are still being ironed out, but you can read our vision for pluggable hardware.

Can I integrate ops written in Mojo into the Inference Engine?

Yes, Mojo works natively with the Inference Engine. In fact, all of the Inference Engine’s in-house operations are written in Mojo. We will publish more details about how to integrate your Mojo ops with the Inference Engine shortly.

Can the runtime be separated from model compilation (for edge deployment)?

Yes, our runtime (and our entire stack) is designed to be modular. It scales down very well, supports heterogeneous configurations, and also scales up to distributed settings. We will share more details about the various deployment scenarios we support over time.

Performance

Why do you only show performance numbers on the CPU?

Our stack is completely extensible to any type of hardware architecture, from commodity silicon to newer types of AI-specific accelerators. We are starting with CPUs because many real-world inference workloads still depend heavily on CPUs. Our stack currently supports x86-64 CPUs from all major hardware vendors, such as Intel and AMD, as well as the Graviton CPUs available in AWS and RISC-V CPUs. We are actively bringing up support for GPU execution, starting with NVIDIA GPUs. We will share more about our GPU support soon.

Why test with batch size 1?

We started our benchmarking with batch size 1 for a couple of reasons: 1) it's a common batch size for production inference scenarios, and 2) it puts runtime efficiencies front and center, which helps ensure we are building the most performant stack possible. We have also tested and seen the same relative performance improvements with larger batch sizes. We'll be releasing those results on our performance dashboard in the near future.

Do you have any benchmarks on GPUs?

We are currently working on adding GPU support. Stay tuned for benchmarks in the near future.

Why are you benchmarking across so many different sequence lengths?

Production NLP deployment scenarios typically involve variable sequence lengths. One of the defining features of the Modular Inference Engine is that it supports full dynamic shapes, meaning it does not pad shorter sequences or recompile when sequence lengths change. We therefore benchmark on a variety of sequence lengths to show the relative speedups you should expect, depending on the distribution of sequence lengths in your data.
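As a hedged sketch of what dynamic shapes mean in practice (reusing the illustrative placeholder API names from the example earlier on this page, which may not match the actual API), the same loaded model can accept inputs of different sequence lengths with no padding or recompilation step in user code:

    import numpy as np
    from modular import engine  # illustrative placeholder names, as above

    session = engine.InferenceSession()
    model = session.load("bert_saved_model")  # hypothetical sequence model

    # Feed batches with different sequence lengths through the same loaded
    # model; no padding to a fixed length and no recompilation required.
    for seq_len in (16, 128, 512):
        token_ids = np.zeros((1, seq_len), dtype=np.int64)
        outputs = model.execute(token_ids)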

Future work

Your launch keynote also mentioned a “cloud serving platform” – what’s the deal with that?

We are excited to bring state-of-the-art innovations to many layers of the AI lifecycle, including the layers necessary to serve increasingly large AI models on cloud infrastructure. To that end, as mentioned in our launch keynote, we are on a journey to build a next-generation cloud compute platform that will significantly improve server utilization by distributing inference across many nodes. It will effectively scale out and back down to meet dynamic changes in traffic volume, and it will significantly reduce the amount of time and effort required to bring up and maintain large models in production. We will share more about our serving platform later this year, so stay tuned for additional details! If you are interested in this space and have specific requests, please express your interest in the Modular Compute Platform.

Do you also intend to support training?

Right now we are focused on inference because the inference landscape is more fragmented than training, and because it's where organizations incur the majority of their AI operating expenses. That said, there's no reason the technology we've built for the Inference Engine can't scale to support training workloads with similar performance improvements. Stay tuned!