Mojo module

trace_buf

Zero-overhead per-CTA trace buffer for GPU kernel instrumentation.

A TraceBuf is a kernel-arg–shaped handle to a per-CTA timestamp slot buffer. Implementations:

  • NullTrace is zero-sized, so passing it as a kernel argument adds no bytes to the kernel ABI. Its store body is pass, so the body of the surrounding comptime if enable_trace: block is stripped entirely at compile time.
  • GmemTrace wraps a single UnsafePointer[UInt64] to a buffer sized for num_blocks * events_per_block slots and records timestamps via PTX globaltimer (lowered from global_perf_counter_ns).
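
The two implementations can be modelled on the host to make the slot layout concrete. The following is a minimal Python sketch, not the module's Mojo code: the class names mirror NullTrace and GmemTrace, but the constructor arguments, the record helper, and the use of a Python list in place of an UnsafePointer[UInt64] buffer are all assumptions made for illustration.

```python
# Hypothetical host-side model of the two trace-buffer shapes described above.
# This is NOT the Mojo API; it only illustrates the flat slot layout.

class NullTrace:
    """Stand-in for the zero-sized handle: store does nothing."""

    def store(self, slot: int, timestamp_ns: int) -> None:
        pass


class GmemTrace:
    """Stand-in for the global-memory handle: one flat buffer of slots."""

    def __init__(self, num_blocks: int, events_per_block: int) -> None:
        # Sized for num_blocks * events_per_block slots, as in the text.
        self.buf = [0] * (num_blocks * events_per_block)

    def store(self, slot: int, timestamp_ns: int) -> None:
        # Mask to 64 bits to mimic a UInt64 slot.
        self.buf[slot] = timestamp_ns & 0xFFFFFFFFFFFFFFFF


def record(trace, block: int, role: int, events_per_block: int,
           timestamp_ns: int) -> None:
    """Write one event into the slot owned by (block, role)."""
    trace.store(block * events_per_block + role, timestamp_ns)


gmem = GmemTrace(num_blocks=3, events_per_block=4)
record(gmem, block=2, role=1, events_per_block=4, timestamp_ns=123)
record(NullTrace(), block=2, role=1, events_per_block=4, timestamp_ns=123)
```

Both classes accept the same store call, which is the property that lets a kernel take either one through a single trait-bound parameter.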

Usage pattern (see nn/gemv_partial_norm.mojo and the SM100 grouped SwiGLU+NVFP4 kernel):

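fn my_kernel[..., enable_trace: Bool = False, TraceBufT: TraceBuf](
    ..., trace_buf: TraceBufT
):
    # Compile-time branch: the body is compiled in only when enable_trace is True.
    comptime if enable_trace:
        # A single thread per CTA records the event.
        if thread_idx.x == 0:
            trace_buf.store(
                # Slot index: this block's base offset plus role.
                Int(block_idx.x) * EVENTS_PER_BLOCK + role,
                UInt64(global_perf_counter_ns()),
            )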

When enable_trace=False (the default), every comptime if enable_trace: block is stripped at compile time, and the resulting PTX is byte-identical to a build with no trace plumbing at all.
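
Once a traced run has finished and the buffer has been copied back to the host, the flat slot layout from the usage pattern (block index times EVENTS_PER_BLOCK, plus role) can be unpacked per block. The following is a minimal Python sketch under that assumption. The function names, and the idea of subtracting two roles to get an elapsed time, are illustrative and not part of the module.

```python
# Hypothetical host-side decoder for a flat trace buffer laid out as
# slot = block * events_per_block + role. Names here are illustrative.

def split_by_block(buf, num_blocks, events_per_block):
    """Return one list of timestamps per block."""
    if len(buf) != num_blocks * events_per_block:
        raise ValueError("buffer length does not match num_blocks * events_per_block")
    return [
        buf[b * events_per_block:(b + 1) * events_per_block]
        for b in range(num_blocks)
    ]


def elapsed_ns(buf, num_blocks, events_per_block, start_role, end_role):
    """Per-block difference between two recorded events, in nanoseconds."""
    rows = split_by_block(buf, num_blocks, events_per_block)
    return [row[end_role] - row[start_role] for row in rows]


# Two blocks, three events each (made-up timestamps for illustration).
trace = [100, 150, 190,
         105, 160, 230]
per_block = elapsed_ns(trace, num_blocks=2, events_per_block=3,
                       start_role=0, end_role=2)
print(per_block)  # [90, 125]
```

Slots that were never written keep whatever value the buffer was initialized with, so a real decoder would likely need to account for unwritten slots before subtracting.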

Structs

Traits

  • TraceBuf: Trace-buffer interface. Implementations: NullTrace, GmemTrace.