Cloud deployments with Modular
Modular offers two managed deployment paths for teams running production inference at scale: a fully managed service on Modular's infrastructure, and a bring-your-own-cloud option that runs Modular's inference stack inside your own VPC. Both run on MAX's inference engine and support NVIDIA and AMD GPUs.
Modular's managed cloud
Modular's managed cloud gives you full-stack control from kernel to cloud on unified infrastructure. Modular manages the GPUs, networking, and scaling so your team can focus on your application.
Modular handles the following on your behalf:
- Infrastructure provisioning and GPU allocation (NVIDIA and AMD)
- Full graph compilation for maximum performance
- Intelligent batching that adapts to traffic spikes
- Hardware arbitrage that runs each workload on the best available hardware
- Granular metrics, dashboards, and uptime monitoring
- Forward-deployed engineering support for throughput and latency tuning
Deployment options
Modular's managed cloud supports multiple inference patterns:
- Serverless endpoints: Pay by token usage with no idle GPU cost (a minimal request sketch follows this list).
- Dedicated endpoints: Reserved GPU capacity billed by time used.
- Custom model inference: Serve fine-tuned or custom model architectures on dedicated endpoints.
- Batch inference: Process large offline workloads efficiently with the /v1/batches API.
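Serverless and dedicated endpoints speak MAX's OpenAI-compatible API, so any OpenAI client library can call them. The sketch below is a minimal Python example; the base URL, API key environment variable, and model name are placeholders, not real values, so substitute the details from your own Modular endpoint.

```python
# Minimal sketch of a serverless endpoint call, assuming the endpoint
# exposes MAX's OpenAI-compatible API. The base URL, environment
# variable, and model name are placeholders; substitute the values
# from your own Modular endpoint.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://your-endpoint.example.com/v1",  # placeholder endpoint URL
    api_key=os.environ["MODULAR_API_KEY"],            # hypothetical key variable
)

response = client.chat.completions.create(
    model="your-model-name",  # placeholder model identifier
    messages=[{"role": "user", "content": "Summarize MAX in one sentence."}],
)
print(response.choices[0].message.content)
```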
Key features
The managed cloud is built for teams that need performance without operational overhead:
- Full-stack control: Kernel-to-cloud deployment on unified infrastructure
- Full customization: Serve your own weights, models, and performance profiles
- Deep observability: Low-level telemetry that reveals bottlenecks in your inference workload
- Portability across accelerators: Run on NVIDIA and AMD without code rewrites
Deploy in your own cloud
The bring-your-own-cloud option runs Modular's inference stack inside your own cloud environment. Your data plane stays inside your VPC, while Modular's control plane handles endpoint lifecycle and scaling.
This option splits responsibilities between two planes:
- Control plane (outside your VPC): Modular manages endpoint lifecycles, auto-scaling policies, monitoring dashboards, and model registration
- Data plane (inside your VPC): You run MAX containers with compiled models, and all inference traffic stays within your infrastructure
Key features
The bring-your-own-cloud option is built for teams with strict data residency or compliance requirements:
- Data residency in your VPC: Inference stays isolated inside your cloud environment
- Multi-cloud and multi-vendor GPU support: Runs on AWS, GCP, Azure, and OCI with NVIDIA, AMD, and Apple Silicon
- Auto-scaling with compiler awareness: Scales workloads based on traffic and model compilation state
- Forward-deployed engineers: Dedicated Modular engineers profile your workloads, identify bottlenecks, and write custom Mojo kernels for your use cases
Other deployment options
Before scaling with managed cloud deployments, you can test your models locally, benchmark performance on your own hardware, or validate your setup with self-hosted endpoints.
- Self-hosted endpoints: Run MAX on infrastructure you control, with step-by-step guides for AWS, GCP, and Azure.
- Containers: Deploy MAX as a Docker container in your own environment.
- Benchmarking: Benchmark a MAX inference endpoint for metrics such as throughput, latency, and token-processing speed (see the sketch after this list).
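As a rough starting point before reaching for a dedicated benchmarking tool, the sketch below times a few requests against an OpenAI-compatible MAX endpoint and reports average latency and completion-token throughput. The local base URL, placeholder API key, and model name are assumptions; adjust them for your deployment.

```python
# Rough benchmarking sketch against an OpenAI-compatible MAX endpoint:
# sends a handful of requests and reports average latency and a coarse
# token-throughput figure. The base URL, API key, and model name are
# placeholders; for rigorous numbers, use a dedicated benchmarking tool.
import time

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # placeholder: a locally served MAX endpoint
    api_key="EMPTY",                      # placeholder: local servers often ignore the key
)

N = 5
latencies = []
completion_tokens = 0
for _ in range(N):
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model="your-model-name",  # placeholder model identifier
        messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
        max_tokens=64,
    )
    latencies.append(time.perf_counter() - start)
    completion_tokens += resp.usage.completion_tokens

print(f"avg latency: {sum(latencies) / N:.2f}s")
print(f"throughput: {completion_tokens / sum(latencies):.1f} tokens/s")
```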
Next steps
Reach out to learn more about managed cloud or bring-your-own-cloud options for your organization.
Talk to an AI Expert
Connect with our product experts to explore how we can help you deploy and serve AI models with high performance, scalability, and cost-efficiency.