Mojo trait
SwiGLUOutput
Trait for fused SwiGLU + NVFP4 output destinations.
Implementations:
NullSwiGLUOutput: zero-sized no-op for the BF16-output (non-fused) mode. Every method body compiles away; struct contributes 0 kernel-arg bytes.RealSwiGLUOutput: carries packed-NVFP4 output tensor, 5D FP8-E4M3 scale tile, and per-active-expert input scales.
Trace instrumentation lives on the TraceBufT kernel parameter (see
structured_kernels/trace_buf.mojo), not on this trait β keeps the
output destination concern separate from observability.
Implemented traitsβ
AnyType,
Copyable,
DevicePassable,
ImplicitlyCopyable,
ImplicitlyDestructible,
Movable,
RegisterPassable,
TrivialRegisterPassable
comptime membersβ
device_typeβ
comptime device_type
Indicate the type being used on accelerator devices.
Required methodsβ
__init__β
__init__(out self: _Self, *, copy: _Self)
Create a new instance of the value by copying an existing one.
Args:
- βcopy (
_Self): The value to copy.
Returns:
_Self
__init__(out self: _Self, *, deinit take: _Self)
Create a new instance of the value by moving the value of another.
Args:
- βtake (
_Self): The value to move.
Returns:
_Self
store_packed_byteβ
store_packed_byte(self: _Self, m: Int, byte_pos: Int, val: UInt8)
Store one packed-NVFP4 byte (= 2 nibbles) at GMEM (m, byte_pos).
store_packed_wordβ
store_packed_word(self: _Self, m: Int, byte_pos: Int, val: UInt32)
Store one packed-NVFP4 word (= 8 nibbles, 4 bytes) at GMEM (m, byte_pos). byte_pos and the row stride must both be 4-byte aligned. Coalesced 32-bit writes, vs four scalar 1-byte writes via store_packed_byte.
set_sfβ
set_sf(self: _Self, m: Int, post_col: Int, sf: Float8_e4m3fn)
Set the per-(m, post_col) FP8-E4M3 scale factor.
input_scaleβ
input_scale(self: _Self, active_expert_idx: Int) -> Float32
Read per-active-expert input scale (tensor_sf in ep_comm).
Returns:
pad_sf_zero_blockβ
pad_sf_zero_block(self: _Self, sf_block_base: Int, tokens_e: Int, tid: Int, stride: Int)
Zero-fill the per-expert SF tail-pad rows in [tokens_e, ceildiv(tokens_e, 128) * 128) across all post-SwiGLU channels, distributed over stride threads keyed by tid.
Called once per expert, on the CTA that processed the last live tile, so the host doesn't need to memset the SF buffer.
get_type_nameβ
static get_type_name() -> String
Gets the name of the host type (the one implementing this trait). For example, Int would return "Int", DeviceBuffer[DType.float32] would return "DeviceBuffer[DType.float32]". This is used for error messages when passing types to the device. TODO: This method will be retired soon when better kernel call error messages arrive.
Returns:
String: The host type's name.
Provided methodsβ
copyβ
copy(self: _Self) -> _Self
Explicitly construct a copy of self, a convenience method for Self(copy=self) when the type is inconvenient to write out.
Returns:
_Self: A copy of this value.
Was this page helpful?
Thank you! We'll create more content like this.
Thank you for helping us improve!