MAX Engine FAQ

Answers to questions we expect about MAX Engine.

If this page doesn’t answer all your questions, please ask us on our Discord channel.


Do we really need yet another inference engine?

We believe so, because the AI deployment landscape is littered with tools that provide only a small portion of what AI developers need: they support only a single framework, a single hardware target, a single OS and platform, a small subset of models, or only small- to medium-sized model execution. As a result, developers end up dealing with flaky model converters, constantly rewriting and re-optimizing their models to work with different tools, and juggling different test and deployment environments. This is why we built MAX Engine: the only inference engine in the world that doesn’t require you to compromise, because it is both general purpose and incredibly fast.


When will it be generally available to the public?

It will be available in a free Developer Edition and a commercial Enterprise Edition in Q1 2024. Sign up for updates.

Will it be open-sourced?

We want to contribute a lot to open source, but we also want to do it right. Our team has decades of experience building open-source projects, and we have learned that the important thing is to create an inclusive and vibrant community – and that takes a lot of work. We will need to figure out the details, but as we do so, we will share more. Please stay tuned.


How do I use MAX Engine?

You can execute any model trained in TensorFlow or PyTorch using our Python and C API libraries. You can read more about our APIs and view sample code in the MAX Engine docs.

What types of models does MAX Engine support and from which training frameworks?

It currently supports any type of model architecture, from traditional ML algorithms to deep learning architectures. It can run TensorFlow and PyTorch models out of the box, with no conversion required. Support for other model formats, such as models trained using scikit-learn, will be added in the future.

Does MAX Engine support models in the ONNX format?

Yes, we support ONNX models. We are primarily focused on native TensorFlow and PyTorch support, but if you have models you’ve already converted to ONNX, we can execute those too.

What hardware is currently supported by MAX Engine?

MAX Engine currently supports CPUs from all major vendors (Intel, AMD, and AWS Graviton).

Does MAX Engine support generic ARM architectures?

Yes, both Mojo and MAX Engine support generic ARM architectures like Apple ARM chips. We formally benchmark ourselves on Graviton because it’s the most commonly used ARM chip for server deployments, and our benchmarks are designed to match what users use most often in production.

Which operating systems does MAX Engine support?

MAX Engine currently supports Linux. In future releases, we’ll add support for macOS and Windows.

Which programming languages does it support? Can I use Mojo?

The MAX Engine APIs are currently available in Python and C, with Mojo coming soon. We will be rolling out support for APIs in additional languages in the future. If there are other languages you’d like to see us support, please share in our Discord.

How quickly will it support new model architectures as they become available as part of TensorFlow or PyTorch?

One of the design principles of MAX Engine is full compatibility with models that can be trained via TensorFlow and PyTorch. Hence, we will quickly support new operators as soon as a stable version of the training framework is available.

Will I need different packages depending on the platform or hardware where my code needs to be deployed?

No. We believe forcing developers to deploy and manage different packages for different deployment targets is a major friction factor that shouldn’t exist. The same engine package works regardless of the hardware available in your deployment platform.

Will I need to change my code if I want to switch deployment to hardware from a different manufacturer?

No. With MAX Engine, you write your code once and deploy anywhere. For example, if you are currently running on an Intel instance in AWS but want to experiment with a Graviton instance, that’s no problem. Just redeploy the same engine package to the Graviton instance, and you’re good to go.

Can I use Modular’s MAX Engine on existing cloud platforms?

Yes. You can deploy MAX Engine to infrastructure provided by any major cloud platform (AWS, GCP, Azure) via traditional container solutions. For more information, read about MAX Serving.

Does MAX Engine support quantization and sparsity?

It supports quantized models today (models with integer data types), and we will be adding support for sparsity soon.
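To make the idea concrete, here is a minimal sketch of symmetric int8 post-training quantization, the kind of integer-typed model representation referred to above. This is a generic, standalone illustration, not MAX Engine API code; the weight values are made up for the example.

```python
def quantize_int8(weights):
    """Map float weights to int8 values with a single symmetric scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]  # each q fits in [-127, 127]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return [v * scale for v in q]

weights = [0.42, -1.27, 0.08, 0.9]  # hypothetical float weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding error is bounded by half a quantization step.
assert all(abs(a - b) <= scale / 2 for a, b in zip(weights, restored))
```

The int8 values and a single scale factor replace the float weights, which is why quantized models are both smaller and cheaper to execute on integer-friendly hardware.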

Will MAX Engine support distributed inference of large models?

Yes, it will support executing large models that do not fit into the memory of a single device. Please stay tuned for more details!

Will MAX Engine support mobile?

We are currently focused on server deployment, but we plan to support deployment to many different platforms, including mobile. We will share more about our mobile support in the future, so stay tuned!

Can I extend MAX Engine with a new hardware backend?

Yes, you can. MAX Engine can support other hardware backends, including specialized accelerators, via Mojo, which can talk directly to MLIR and LLVM abstractions (for an example of Mojo talking directly to MLIR, see our low-level IR in Mojo notebook). The exact details about how to add additional hardware support are still being ironed out, but you can read our vision for pluggable hardware.

Can I integrate ops written in Mojo into MAX Engine?

Yes, Mojo works natively with MAX Engine. In fact, all of MAX Engine’s in-house operations are written in Mojo. We will publish more details about how to integrate your Mojo ops with MAX Engine shortly.

Can the runtime be separated from model compilation (for edge deployment)?

Yes, our runtime (and our entire stack) is designed to be modular. It scales down very well, supports heterogeneous configs, and scales up to distributed settings as well. That being said, we will share more details about various deployment scenarios we support over time.


How can I see some real performance numbers?

When MAX Engine is available in Q1 2024, it will include benchmarking tools. Until then, we’ve built an interactive dashboard where you can select from a number of industry-standard models and production-grade compute instances to see our real inference performance compared to other frameworks. Take a look at our performance dashboard.

Why do you only show performance numbers on the CPU?

Our stack is completely extensible to any type of hardware architecture, from commodity silicon to newer AI-specific accelerators. We are starting with CPUs because many real-world inference workloads still depend heavily on CPUs. Our stack currently supports x86-64 CPUs from all major hardware vendors, such as Intel and AMD, as well as the Graviton CPUs available in AWS and RISC-V CPUs. We are actively bringing up support for GPU execution, starting with NVIDIA GPUs, and we will share more about our GPU support soon.

Why test with batch size 1?

We started our benchmarking with batch size 1 for a couple of reasons: 1) it’s a common batch size for production inference scenarios, and 2) it puts runtime efficiencies front and center, which helps ensure we are building the most performant stack possible. We have also tested and seen the same relative performance improvements with larger batch sizes, and we’ll release those results on our performance dashboard in the near future.
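The second point can be sketched with a toy latency model: if every inference call carries a fixed runtime overhead on top of per-item compute, small batches spend a much larger fraction of each call inside the runtime, so batch-1 benchmarks reward a lean runtime. The numbers below are illustrative assumptions, not MAX Engine measurements.

```python
OVERHEAD_MS = 2.0           # hypothetical fixed runtime cost per call
COMPUTE_MS_PER_ITEM = 5.0   # hypothetical compute cost per input item

def overhead_fraction(batch_size):
    """Share of a call's total latency spent in fixed runtime overhead."""
    total = OVERHEAD_MS + COMPUTE_MS_PER_ITEM * batch_size
    return OVERHEAD_MS / total

# At batch 1, overhead is a far larger share of latency than at batch 32,
# so batch-1 results expose runtime inefficiencies that big batches hide.
assert overhead_fraction(1) > 10 * overhead_fraction(32)
```

With these made-up costs, overhead is about 29% of a batch-1 call but barely 1% of a batch-32 call, which is why a fast runtime matters most at small batch sizes.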

Do you have any benchmarks on GPUs?

We are currently working on adding GPU support. Stay tuned for benchmarks in the near future.

Why are you benchmarking across so many different sequence lengths?

Production NLP deployment scenarios typically involve variable sequence lengths. One of the defining features of MAX Engine is that it supports full dynamic shapes, meaning it doesn’t pad shorter sequences or recompile when sequence lengths change. We therefore benchmark on a variety of sequence lengths to show the relative speedups you should expect given the distribution of sequence lengths in your data.
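To see why padding matters, here is a small sketch of the cost a static-shape engine pays on variable-length inputs: padding every sequence in a batch up to the longest one spends compute on pad tokens, which an engine with true dynamic shapes avoids. The sequence lengths are hypothetical values chosen for the example.

```python
def padded_waste(seq_lens):
    """Fraction of token compute spent on padding when batching to max length."""
    padded_tokens = max(seq_lens) * len(seq_lens)  # static-shape batch size
    real_tokens = sum(seq_lens)                    # tokens that carry data
    return (padded_tokens - real_tokens) / padded_tokens

lengths = [12, 48, 25, 128]  # hypothetical token counts in one batch
waste = padded_waste(lengths)

# With one long outlier, more than half of the padded batch is pad tokens.
assert waste > 0.5
```

The more skewed the length distribution, the bigger the gap between a padding-based approach and one that executes each shape natively, which is why benchmarking across many sequence lengths gives a truer picture.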

Future work

Your launch keynote also mentioned a “cloud serving platform” – what’s the deal with that?

We are excited to bring state-of-the-art innovations to many layers of the AI lifecycle, including the layers necessary to serve increasingly large AI models on cloud infrastructure. To that end, as mentioned in our launch keynote, we are on a journey to build a next-generation cloud compute platform that will significantly improve server utilization by distributing inference across many nodes. It will effectively scale out and down to meet dynamic changes in traffic volume, and it will significantly reduce the time and effort required to bring up and maintain large models in production. Stay tuned for more information about our plans here.

Do you also intend to support training?

Right now we are focused on inference because it’s a more fragmented landscape than training and because it is where organizations have a majority of their AI operating expenses. That being said, there’s no reason why the technology we’ve built for MAX Engine can’t scale to support training workloads with similar performance improvements. Stay tuned!