Intro to Serving

Deploying AI at scale is complicated because the tooling is fragmented, and each tool has its own trade-offs and limitations. Figuring out which combination of tools and hardware provides the best performance-to-cost trade-off for a given model can seem like an impossible task. MAX Engine solves these problems by providing a state-of-the-art compiler and inference runtime that works with a variety of models and hardware types.

After you integrate MAX Engine as part of your AI development workflow, your next step is to deploy your models into production, using trustworthy tools that include robust scaling, monitoring, and deployment templates. This part of an AI pipeline is referred to as serving.

With MAX Engine as your backend inference engine, you can serve your model using NVIDIA's Triton Server, a serving framework that simplifies deploying AI models at scale in production environments. To learn more, see Serving with NVIDIA's Triton Server.
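To make the serving flow concrete, here is a minimal sketch of how a client might talk to a running Triton server over its standard KServe v2 HTTP inference protocol. The model name (`my_model`), tensor name (`input0`), shape, and data values are hypothetical placeholders; in practice you would use the names and shapes defined by your model's configuration.

```python
import json

def build_infer_request(tensor_name, shape, data):
    """Return a KServe v2 inference request body as a JSON string.

    This is the standard request shape accepted by Triton's HTTP
    inference endpoint; only stdlib json is used here so the sketch
    stays self-contained (no server required).
    """
    body = {
        "inputs": [
            {
                "name": tensor_name,   # must match the model's input name
                "shape": shape,        # tensor dimensions, e.g. [batch, features]
                "datatype": "FP32",    # 32-bit float tensor
                "data": data,          # flattened row-major values
            }
        ]
    }
    return json.dumps(body)

# A client would POST this payload to the server, e.g.:
#   http://<host>:8000/v2/models/my_model/infer
payload = build_infer_request("input0", [1, 3], [0.1, 0.2, 0.3])
```

In a real deployment you would send this payload with an HTTP client (or use NVIDIA's `tritonclient` library, which wraps this protocol for you) and parse the `outputs` field of the JSON response.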