Cloud deployments with Modular

Modular offers two managed deployment paths for teams running production inference at scale: a fully managed service on Modular's infrastructure, and a bring-your-own-cloud option that runs Modular's inference stack inside your own VPC. Both run on MAX's inference engine and support NVIDIA and AMD GPUs.

Modular's managed cloud

Modular's managed cloud gives you full-stack control from kernel to cloud on unified infrastructure. Modular manages the GPUs, networking, and scaling so your team can focus on your application.

Modular handles the following on your behalf:

  • Infrastructure provisioning and GPU allocation (NVIDIA and AMD)
  • Full graph compilation for maximum performance
  • Intelligent batching that adapts to traffic spikes
  • Hardware arbitrage that routes each workload to the best available accelerators
  • Granular metrics, dashboards, and uptime monitoring
  • Forward-deployed engineering support for throughput and latency tuning

Deployment options

Modular's managed cloud supports multiple inference patterns:

  • Serverless endpoints: Pay by token usage with no idle GPU cost.
  • Dedicated endpoints: Reserved GPU capacity billed by time used.
  • Custom model inference: Serve fine-tuned or custom model architectures on dedicated endpoints.
  • Batch inference: Process large offline workloads efficiently with the /v1/batches API.
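Serverless and batch endpoints follow OpenAI-compatible API conventions. As an illustrative sketch, the helper below prepares input lines for a /v1/batches submission; the JSONL line schema shown is the OpenAI batch format, and it is an assumption (not a documented specific) that Modular's /v1/batches API accepts the same shape. The model name and prompts are placeholders.

```python
import json

def make_batch_line(custom_id: str, model: str, prompt: str) -> str:
    """Build one JSONL line for an OpenAI-style /v1/batches input file.

    NOTE: this follows the OpenAI batch-input schema; treating Modular's
    /v1/batches API as accepting the same shape is an assumption.
    """
    return json.dumps({
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    })

# Assemble a two-request batch input file (newline-delimited JSON).
lines = [
    make_batch_line("req-1", "my-model", "Summarize MAX in one sentence."),
    make_batch_line("req-2", "my-model", "Which GPUs are supported?"),
]
batch_file = "\n".join(lines)
```

Each line carries a `custom_id` so results can be matched back to requests after the offline batch completes.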

Key features

The managed cloud is built for teams that need performance without operational overhead:

  • Full stack control: Kernel-to-cloud deployment on unified infrastructure
  • Full customization: Serve your own weights, models, and performance profiles
  • Deep observability: Low-level telemetry that reveals bottlenecks in your inference workload
  • Portability across accelerators: Run on NVIDIA and AMD without code rewrites

Deploy in your own cloud

The bring-your-own-cloud option runs Modular's inference stack inside your own cloud environment. Your data plane stays inside your VPC, while Modular's control plane handles endpoint lifecycle and scaling.

This option splits responsibilities between two planes:

  • Control plane (outside your VPC): Modular manages endpoint lifecycles, auto-scaling policies, monitoring dashboards, and model registration.
  • Data plane (inside your VPC): You run MAX containers with compiled models, and all inference traffic stays within your infrastructure.
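From an application's point of view, the data plane looks like any OpenAI-compatible HTTP endpoint inside your network. A minimal sketch, assuming a hypothetical in-VPC hostname and the standard /v1/chat/completions route (the hostname, port, and model name below are placeholders, not Modular defaults):

```python
import json
import urllib.request

# Hypothetical in-VPC address; replace with your data-plane endpoint.
ENDPOINT = "http://max.internal.example:8000/v1/chat/completions"

def build_request(prompt: str, model: str = "my-model") -> urllib.request.Request:
    """Construct a chat-completion request aimed at the in-VPC data plane.

    Traffic to ENDPOINT never leaves your VPC; only lifecycle and
    scaling decisions come from the external control plane.
    """
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_request("Hello from inside the VPC")
# urllib.request.urlopen(req)  # uncomment once the endpoint is live
```

Because the request format is the standard OpenAI chat schema, existing OpenAI client libraries can usually be pointed at the in-VPC endpoint by overriding their base URL.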

Key features

The bring-your-own-cloud option is built for teams with strict data residency or compliance requirements:

  • Data residency in your VPC: Inference stays isolated inside your cloud environment
  • Multi-cloud and multi-vendor GPU support: Runs on AWS, GCP, Azure, and OCI with NVIDIA, AMD, and Apple Silicon
  • Auto-scaling with compiler awareness: Scales workloads based on traffic and model compilation state
  • Forward-deployed engineers: Dedicated Modular engineers profile your workloads, identify bottlenecks, and write custom Mojo kernels for your use cases

Other deployment options

Before scaling with managed cloud deployments, you can test your models locally, benchmark performance on your own hardware, or validate your setup with self-hosted endpoints.

  • Self-hosted endpoints: Run MAX entirely on your own hardware with a step-by-step guide for AWS, GCP, and Azure.
  • Containers: Deploy MAX as a Docker container in your own environment.
  • Benchmarking: Benchmark a MAX inference endpoint for metrics such as throughput, latency, and token-processing speed.
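As a rough illustration of what an endpoint benchmark measures, the sketch below times sequential requests and reports latency percentiles. The request itself is passed in as a callable (for example, one POST to the endpoint); everything here is a generic timing harness, not Modular's benchmarking tool.

```python
import math
import time

def percentile(samples: list, p: float) -> float:
    """Nearest-rank percentile of a list of latency samples (seconds)."""
    ordered = sorted(samples)
    k = math.ceil(p / 100 * len(ordered))
    return ordered[max(k - 1, 0)]

def bench(send_request, n: int = 100) -> dict:
    """Time n sequential calls to send_request and report latency stats.

    send_request is any zero-argument callable, e.g. a function that
    POSTs one chat completion to the endpoint under test.
    """
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        send_request()
        latencies.append(time.perf_counter() - start)
    return {
        "p50": percentile(latencies, 50),
        "p99": percentile(latencies, 99),
        "mean": sum(latencies) / len(latencies),
    }
```

Sequential timing like this captures per-request latency; measuring throughput under load would additionally require concurrent requests.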

Next steps

Reach out to learn more about managed cloud or bring-your-own-cloud options for your organization.
