GPU errors
GPU operations normally dispatch asynchronously, so when a kernel fails, the host usually reports the failure at a later op that happens to synchronize next. GPU debugging forces synchronous dispatch and turns on kernel-level assertions so errors surface at the op that caused them.
Force synchronous device execution
A GPU kernel is a function that runs on the device. When MAX runs your model, it launches a sequence of kernels on the GPU (one per op). By default, the host enqueues each kernel onto a device stream and continues immediately without waiting for the kernel to finish. This is called asynchronous dispatch, and it's what lets the host and device make progress in parallel.
Async dispatch is fast in the common case but makes debugging painful.
The host only observes a kernel failure at the next synchronization
point (for example, when copying output back to host memory), so the
reported location can be many ops past the one that actually failed.
When you enable device-sync-mode, MAX waits for each kernel to
complete before queuing the next. Throughput drops, but a failure now
surfaces at the exact op boundary that produced it.
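To see why deferred reporting makes the failure hard to attribute, here is a toy Python sketch (not MAX code) of asynchronous dispatch: the host keeps enqueueing kernels after one fails, and the error only surfaces at the synchronization point. The `ToyDevice` class and kernel names are illustrative assumptions, not part of MAX.

```python
import queue
import threading

class ToyDevice:
    """Toy "device" that runs enqueued kernels on a worker thread.

    Failures are recorded, not raised, until the host synchronizes --
    mimicking asynchronous GPU dispatch."""

    def __init__(self):
        self.q = queue.Queue()
        self.error = None
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            name, fn = self.q.get()
            if self.error is None:  # skip work after the first failure
                try:
                    fn()
                except Exception as e:
                    self.error = (name, e)
            self.q.task_done()

    def launch(self, name, fn):
        # Async dispatch: the host returns immediately.
        self.q.put((name, fn))

    def synchronize(self):
        # The host only observes the failure here, possibly many ops
        # after the kernel that actually caused it.
        self.q.join()
        if self.error is not None:
            name, e = self.error
            raise RuntimeError(f"kernel '{name}' failed: {e!r}")

dev = ToyDevice()
dev.launch("matmul", lambda: None)
dev.launch("softmax", lambda: 1 / 0)  # the op that actually fails
dev.launch("argmax", lambda: None)    # host keeps going regardless
try:
    dev.synchronize()                 # failure surfaces only now
except RuntimeError as err:
    print(err)
```

Synchronous mode corresponds to calling `synchronize()` after every `launch()`: slower, but the exception then points at the offending op.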
Enable device-sync-mode to force synchronous execution:
```shell
MODULAR_DEBUG=device-sync-mode max serve --model modularai/Llama-3.1-8B-Instruct-GGUF
```

Enable kernel-level bounds checking
Mojo's standard library ships with assertions that catch common
kernel-authoring bugs, including out-of-bounds accesses on
LayoutTensor, which happen
when a kernel reads or writes outside the allocated region of a tensor. These
assertions are compiled out by default, so a buggy kernel produces corrupted
output downstream instead of a clear error at the kernel site.
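To illustrate the failure mode, here is a toy Python sketch (not Mojo or MAX code) of why a compiled-out bounds check turns an out-of-bounds write into silent downstream corruption, while an enabled assertion fails at the fault site. The `ToyTensor` class is a hypothetical stand-in for a device tensor.

```python
class ToyTensor:
    """Toy tensor whose backing storage is larger than its logical
    extent, mimicking adjacent device memory a buggy kernel can clobber."""

    def __init__(self, size, checked=False):
        self.storage = [0.0] * (size + 8)  # over-allocated backing buffer
        self.size = size                   # logical extent of the tensor
        self.checked = checked             # analogous to enabling assertions

    def store(self, i, value):
        if self.checked:
            # With assertions on, the bad access fails right here.
            assert 0 <= i < self.size, f"index {i} out of bounds [0, {self.size})"
        # With assertions compiled out, this silently corrupts a neighbor.
        self.storage[i] = value

t = ToyTensor(4)
t.store(6, 1.0)              # no error; corruption surfaces much later

t_checked = ToyTensor(4, checked=True)
# t_checked.store(6, 1.0)    # would raise AssertionError at the fault site
```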
The assert-level option controls assertions in the Mojo standard library:
- `none`: no assertions (default, best performance)
- `warn`: log warnings for out-of-bounds accesses
- `safe`: assertions on bounds checks that are unlikely to trigger in correct code
- `all`: full bounds checking on every access
```shell
MODULAR_DEBUG=assert-level=safe max serve --model modularai/Llama-3.1-8B-Instruct-GGUF
```

Next steps
Now that you can pin down GPU failures to the op that caused them, explore related debug options and profiling tools:
- Diagnose runtime errors: When a GPU failure escalates to a crash, capture Mojo stack traces and IR dumps for deeper investigation.
- Trace op execution: Use op-level tracing alongside synchronous dispatch to see the exact sequence that led to the failure.
- GPU system profiling: Move beyond correctness debugging to profile GPU workloads for performance bottlenecks.