Mojo trait

SwiGLUOutput

Trait for fused SwiGLU + NVFP4 output destinations.

Implementations:

NullSwiGLUOutput: zero-sized no-op for the BF16-output (non-fused) mode. Every method body compiles away, and the struct contributes 0 kernel-argument bytes.
RealSwiGLUOutput: carries the packed-NVFP4 output tensor, the 5D FP8-E4M3 scale tile, and the per-active-expert input scales.

Trace instrumentation lives on the TraceBufT kernel parameter (see structured_kernels/trace_buf.mojo), not on this trait. This keeps the output-destination concern separate from observability.

Implemented traits

AnyType, Copyable, DevicePassable, ImplicitlyCopyable, ImplicitlyDestructible, Movable, RegisterPassable, TrivialRegisterPassable

`comptime` members

`device_type`

comptime device_type

Indicate the type being used on accelerator devices.

Required methods

`__init__`

__init__(out self: _Self, *, copy: _Self)

Create a new instance of the value by copying an existing one.

Args:

copy (_Self): The value to copy.

Returns:

_Self

__init__(out self: _Self, *, deinit take: _Self)

Create a new instance of the value by moving the value of another.

Args:

take (_Self): The value to move.

Returns:

_Self

`store_packed_byte`

store_packed_byte(self: _Self, m: Int, byte_pos: Int, val: UInt8)

Store one packed-NVFP4 byte (= 2 nibbles) at GMEM (m, byte_pos).

`store_packed_word`

store_packed_word(self: _Self, m: Int, byte_pos: Int, val: UInt32)

Store one packed-NVFP4 word (= 8 nibbles, 4 bytes) at GMEM (m, byte_pos). byte_pos and the row stride must both be 4-byte aligned. This yields coalesced 32-bit writes instead of four scalar 1-byte writes via store_packed_byte.
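The relationship between the byte and word stores can be modeled on the host. The following Python sketch is illustrative only and is not the kernel code: it assumes a row-major output buffer with a hypothetical row_stride in bytes, a low-nibble-first packing order, and little-endian word layout. None of these are stated by the trait itself. Under those assumptions, one word store writes the same bytes as four consecutive byte stores.

```python
import struct


def pack_nibbles_to_byte(lo: int, hi: int) -> int:
    """Pack two 4-bit codes into one byte (assumed order: first code in the low nibble)."""
    assert 0 <= lo < 16 and 0 <= hi < 16
    return lo | (hi << 4)


def pack_bytes_to_word(b: list[int]) -> int:
    """Combine four packed bytes into one little-endian 32-bit word."""
    assert len(b) == 4
    return b[0] | (b[1] << 8) | (b[2] << 16) | (b[3] << 24)


def store_packed_byte(buf: bytearray, row_stride: int, m: int, byte_pos: int, val: int) -> None:
    """Host-side model: write one packed byte at row m, byte offset byte_pos."""
    buf[m * row_stride + byte_pos] = val


def store_packed_word(buf: bytearray, row_stride: int, m: int, byte_pos: int, val: int) -> None:
    """Host-side model: write one 32-bit word; both offsets must be 4-byte aligned."""
    assert byte_pos % 4 == 0 and row_stride % 4 == 0
    struct.pack_into("<I", buf, m * row_stride + byte_pos, val)


# Eight 4-bit codes -> four packed bytes -> one packed word.
nibbles = [1, 2, 3, 4, 5, 6, 7, 8]
packed_bytes = [pack_nibbles_to_byte(nibbles[i], nibbles[i + 1]) for i in range(0, 8, 2)]
packed_word = pack_bytes_to_word(packed_bytes)

ROW_STRIDE = 16  # hypothetical row stride in bytes
via_bytes = bytearray(2 * ROW_STRIDE)
via_word = bytearray(2 * ROW_STRIDE)

for i, b in enumerate(packed_bytes):
    store_packed_byte(via_bytes, ROW_STRIDE, 1, 4 + i, b)
store_packed_word(via_word, ROW_STRIDE, 1, 4, packed_word)
```

In this model both buffers end up identical, which is why the word path can replace four byte stores whenever the alignment requirement holds.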

`set_sf`

set_sf(self: _Self, m: Int, post_col: Int, sf: Float8_e4m3fn)

Set the per-(m, post_col) FP8-E4M3 scale factor.

`input_scale`

input_scale(self: _Self, active_expert_idx: Int) -> Float32

Read the per-active-expert input scale (tensor_sf in ep_comm).

Returns:

Float32

`pad_sf_zero_block`

pad_sf_zero_block(self: _Self, sf_block_base: Int, tokens_e: Int, tid: Int, stride: Int)

Zero-fill the per-expert SF tail-pad rows in [tokens_e, ceildiv(tokens_e, 128) * 128) across all post-SwiGLU channels, distributed over stride threads keyed by tid.

Called once per expert, on the CTA that processed the last live tile, so the host doesn't need to memset the SF buffer.
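The tail-pad range and the tid/stride split can be checked with a small host-side sketch. It is an illustration under assumptions, not the kernel implementation: it assumes a plain strided loop over rows only, and it does not model the post-SwiGLU channel dimension or the 5D scale-tile layout. The helper names are hypothetical.

```python
def ceildiv(a: int, b: int) -> int:
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def pad_rows(tokens_e: int, tile_m: int = 128) -> range:
    """Rows in [tokens_e, ceildiv(tokens_e, tile_m) * tile_m) that need zero-filling."""
    return range(tokens_e, ceildiv(tokens_e, tile_m) * tile_m)


def rows_for_thread(tokens_e: int, tid: int, stride: int, tile_m: int = 128) -> list[int]:
    """Rows handled by thread tid when the pad range is split across stride threads
    (assumed distribution: thread tid takes every stride-th row starting at tokens_e + tid)."""
    end = ceildiv(tokens_e, tile_m) * tile_m
    return list(range(tokens_e + tid, end, stride))


# Example: 200 live tokens round up to 256 rows, leaving 56 pad rows.
tokens_e, stride = 200, 32
per_thread = [rows_for_thread(tokens_e, tid, stride) for tid in range(stride)]
covered = sorted(r for rows in per_thread for r in rows)
```

With tokens_e already a multiple of 128 (or zero), the pad range is empty and no thread writes anything.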

`get_type_name`

static get_type_name() -> String

Gets the name of the host type (the one implementing this trait). For example, Int would return "Int", DeviceBuffer[DType.float32] would return "DeviceBuffer[DType.float32]". This is used for error messages when passing types to the device. TODO: This method will be retired soon when better kernel call error messages arrive.

Returns:

String: The host type's name.

Provided methods

`copy`

copy(self: _Self) -> _Self

Explicitly construct a copy of self, a convenience method for Self(copy=self) when the type is inconvenient to write out.

Returns:

_Self: A copy of this value.
