MAX FAQ

We tried to anticipate your questions about MAX Engine on this page. If this page doesn't answer all your questions, please ask us on our Discord channel.

Motivation

Do we really need yet another inference engine?

We believe so, because the AI deployment landscape is littered with tools that provide only a small portion of what AI developers need. They support only a single framework, a single hardware target, a single OS and platform, a small subset of models, or just small- to medium-sized model execution. As a result, developers end up dealing with flaky model converters, constantly rewriting and re-optimizing their models to work with different tools, and working across different test and deployment environments. This is why we built MAX Engine: the only inference engine in the world that doesn’t require you to compromise, because it is both general purpose and incredibly fast.

Distribution

When will it be generally available to the public?

The Developer Edition is available now! Get started here.

Very soon, we will also release the commercially licensed Enterprise Edition. Sign up for updates.

Will it be open-sourced?

We want to contribute a lot to open source, but we also want to do it right. Our team has decades of experience building open-source projects, and we have learned that the important thing is to create an inclusive and vibrant community – and that takes a lot of work. We will need to figure out the details, but as we do so, we will share more. Please stay tuned.

Why bundle Mojo with MAX?

By integrating all our products into a unified solution, we're simplifying the installation process for all our AI infrastructure tools within MAX. Mojo is designed to work seamlessly with these tools, enhancing interoperability and user experience.

Critically, nothing has changed for Mojo developers—you can still build and develop in Mojo like you always have—but now many AI developers can also realize the benefits of Mojo + MAX Engine. This consolidation not only streamlines the deployment process but also ensures a more cohesive and efficient workflow for MAX developers, enabling them to leverage the full potential of our AI ecosystem.

The MAX package seems large, why?

Sorry about that. We’re doing everything we can to reduce it. We currently include a number of third-party tools, some of which are very large, and we're working to reduce the number of packaged dependencies in future releases.

Functionality

How do I use MAX Engine?

You can execute any model trained in TensorFlow or PyTorch using our Python and C API libraries. You can read more about our APIs and view sample code in the MAX Engine docs.

Why does MAX Engine take so long to "load" a model?

The first time you load your model (such as with the Python load() function), MAX Engine must compile it.

This might seem strange if you're used to "eager execution" in PyTorch or TensorFlow, but this compilation step is how MAX Engine optimizes the graph to deliver more performance. This is an up-front cost that occurs only when you first load the model, and it pays dividends with major latency savings provided by our next-generation graph compiler. So it's worth the wait. :)
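To see how much of the wait is the one-time compile, you can time the load separately from steady-state inference. The sketch below is framework-agnostic and only illustrates the amortization argument: `timed` and `amortized_latency` are hypothetical helpers written for this example, not MAX Engine APIs, and whatever callable you pass to `timed` stands in for your own load or execute call.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn with the given arguments and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def amortized_latency(compile_s: float, per_request_s: float, requests: int) -> float:
    """Average seconds per request once the one-time compile cost is spread
    across the given number of requests."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    return (compile_s + per_request_s * requests) / requests
```

With made-up numbers (a 10-second compile and 10 ms per request), `amortized_latency(10.0, 0.01, 1000)` works out to 0.02 seconds per request over the first 1,000 requests, and the average keeps approaching 0.01 as more requests are served.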

What types of models does MAX Engine support and from which training frameworks?

MAX Engine supports most models built with TensorFlow (in SavedModel format) and PyTorch (in TorchScript format), and also ONNX models.

What hardware is currently supported by MAX Engine?

MAX Engine currently supports all CPU types from major vendors (Intel, AMD, Graviton).

Support for NVIDIA GPUs is in the works.

Does MAX Engine support generic ARM architectures?

Yes, but we officially support only Graviton because it’s the most commonly used ARM chip for server deployments, and our benchmarks are designed to match what users use most often in production.

We are working to add support for MAX Engine development on other ARM systems, such as Apple Silicon (M1/M2). (You can already get Mojo for Mac.)

Which operating systems does MAX Engine support?

MAX Engine currently supports Linux. In future releases, we'll add support for macOS and Windows.

Which programming languages does it support? Can I use Mojo?

We currently provide MAX Engine API bindings in Python, C, and—yes—Mojo. If there are other languages you’d like to see us support, please share in our Discord.

How quickly will it support new model architectures as they become available as part of TensorFlow or PyTorch?

One of the design principles of MAX Engine is full compatibility with models that can be trained via TensorFlow and PyTorch. Hence, we will quickly support new operators as soon as a stable version of the training framework is available.

Will I need different packages depending on the platform or hardware where my code needs to be deployed?

No. We believe forcing developers to deploy and manage different packages for different deployment targets is a major friction factor that shouldn’t exist. The same engine package works regardless of the hardware available in your deployment platform.

Will I need to change my code if I want to switch deployment to hardware from a different manufacturer?

No. With MAX Engine, you write your code once and deploy anywhere. For example, if you are currently running on an Intel instance in AWS but want to experiment with a Graviton instance, that’s no problem. Just redeploy the same engine package to the Graviton instance, and you’re good to go.

Can I use Modular’s MAX Engine on existing cloud platforms?

Yes. You can deploy MAX Engine to infrastructure provided by any major cloud platform (AWS, GCP, Azure) via traditional container solutions. For more information, read about MAX Serving.

Does MAX Engine support quantization and sparsity?

It supports some quantized models today (models with integer data types), and we are working to add support for more. We will also be adding support for sparsity soon.

Will MAX Engine support distributed inference of large models?

Yes, it will support executing large models that do not fit into the memory of a single device. This isn't available yet, so stay tuned!

Will MAX Engine support mobile?

We are currently focused on server deployment, but we plan to support deployment to many different platforms, including mobile. We will share more about our mobile support in the future, so stay tuned!

Can I extend MAX Engine with a new hardware backend?

Yes, you can. MAX Engine can support other hardware backends, including specialized accelerators, via Mojo, which can talk directly to MLIR and LLVM abstractions (for an example of Mojo talking directly to MLIR, see our low-level IR in Mojo notebook). The exact details about how to add additional hardware support are still being ironed out, but you can read our vision for pluggable hardware.

Can I integrate ops written in Mojo into MAX Engine?

Yes, Mojo works natively with MAX Engine! In fact, all of the MAX Engine’s in-house operations are written in Mojo. In an upcoming release, you'll also be able to write your own custom ops in Mojo.

Can the runtime be separated from model compilation (for edge deployment)?

Yes, our runtime (and our entire stack) is designed to be modular. It scales down very well, supports heterogeneous configs, and scales up to distributed settings as well. That being said, this isn't available yet, but we'll share details about more deployment scenarios we support over time.

Performance

How can I see some real performance numbers?

Just install the MAX SDK and then use our benchmark tool!

We've also created an interactive dashboard where you can select from a number of industry-standard models and production-grade compute instances to see our real inference performance compared to other frameworks. Take a look at performance.modular.com.

Why does MAX Engine perform slowly on my computer compared to PyTorch or TensorFlow?

So far, we've been focused on building optimizations for data center CPUs. In many cases, these optimizations carry over to desktop x86 CPUs. However, modern desktop and laptop CPUs often use an "asymmetric" core design not found in data center CPUs, pairing performance cores ("P cores") with efficiency cores ("E cores"). If your CPU includes these P/E cores, MAX Engine doesn't yet schedule work across them efficiently. We have plans to fix this.

Additionally, we haven't optimized for systems with 32+ cores yet. This will also be fixed soon.

If you're seeing slow results that aren't explained here, please let us know.

Why do you only show performance numbers on the CPU?

Our stack is completely extensible to any type of hardware architecture, from commodity silicon to newer types of AI-specific accelerators. We are starting with CPUs because many real-world inference workloads still heavily depend on CPUs. Our stack currently supports x86-64 CPUs from all major hardware vendors, such as Intel and AMD. It also supports the Graviton CPUs available in AWS. We are actively bringing up support for GPU execution, starting with NVIDIA GPUs. We will share more about our GPU support soon.

Why test with batch size 1?

We started our benchmarking with batch size 1 for a couple of reasons: 1) it’s a common batch size for production inference scenarios, and 2) it puts runtime efficiencies front and center, which helps ensure we are building the most performant stack possible. We have also tested larger batch sizes and seen the same relative performance improvements. We’ll be releasing those results on our performance dashboard in the near future.

Do you have any benchmarks on GPUs?

We are currently working on adding GPU support. Stay tuned for benchmarks in the near future.

Why are you benchmarking across so many different sequence lengths?

Production NLP deployment scenarios typically involve variable sequence lengths. One of the defining features of MAX Engine is that it supports full dynamic shapes, meaning that it’s not padding shorter sequence lengths or having to recompile when sequence lengths change. Therefore, we benchmark on a variety of sequence lengths to show the relative speedups you should expect depending on the distribution of sequence lengths in your data.
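To make the cost of padding concrete, here is a minimal sketch that estimates how much compute a fixed-shape pipeline would spend on padding tokens for a given mix of sequence lengths. The `padding_overhead` function and the sample lengths are hypothetical and purely illustrative; they are not part of MAX Engine and are not benchmark data.

```python
from typing import Sequence


def padding_overhead(lengths: Sequence[int], padded_length: int) -> float:
    """Fraction of processed tokens that are padding when every sequence
    is padded up to padded_length."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    if any(n > padded_length for n in lengths):
        raise ValueError("padded_length must be >= every sequence length")
    real_tokens = sum(lengths)
    total_tokens = padded_length * len(lengths)
    return (total_tokens - real_tokens) / total_tokens
```

For example, `padding_overhead([16, 32, 64, 128], 128)` returns 0.53125, meaning more than half of the processed tokens in that made-up batch would be padding. An engine with dynamic shapes avoids this overhead, which is why the speedup you see depends on how your sequence lengths are distributed.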

Future work

Your launch keynote also mentioned a “cloud serving platform” – what’s the deal with that?

We are excited to bring state-of-the-art innovations across many layers of the AI lifecycle, including layers that are necessary to serve increasingly large AI models on cloud infrastructure. To that end, as mentioned in our launch keynote, we are on a journey to build a next-generation cloud compute platform that will significantly improve server utilization by distributing inference across many nodes. It will scale up and down to meet dynamic changes in traffic volume, and it will significantly reduce the time and effort required to bring up and maintain large models in production. Stay tuned for more information about our plans here.

Do you also intend to support training?

Right now we are focused on inference because it’s a more fragmented landscape than training and because it is where organizations have a majority of their AI operating expenses. That being said, there’s no reason why the technology we’ve built for MAX Engine can’t scale to support training workloads with similar performance improvements. Stay tuned!