
Mojo struct

FlashAttentionGPU

struct FlashAttentionGPU

Registers the mo.mha.no_cache graph op with the graph compiler.

Implemented traits

AnyType, ImplicitlyDeletable

Methods

`execute`

static def execute[rank: Int, //, target: StringSlice[ImmStaticOrigin], mask_str: StringSlice[ImmStaticOrigin], local_window_size: Int = Int(-1)](output: ManagedTensorSlice[IOSpec[_, _].Output, static_spec=output.static_spec], q: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=q.static_spec], k: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=k.static_spec], v: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=v.static_spec], scale: Float32, ctx: DeviceContext)

mo.mha.no_cache is a hand-fused operator that is roughly equivalent to the following sequence of operations.

**Step 0:** Transpose

query_processed = transpose(query) # BSHD --> BHSD

key_processed = transpose(key) # BSHD --> BHDS

value_processed = transpose(value) # BSHD --> BHSD

**Step 1:**

attentionMatrix = query_processed @ key_processed

**Step 2:**

norm = broadcast_to(normScalar, shape_of(attentionMatrix))

**Step 3:** Normalize and apply masking

attentionMatrixNormMasked = mask_functor(attentionMatrix * scale)

**Step 4:** Apply softmax and reproject result

attentionMatrixSoftMax = softmax(attentionMatrixNormMasked)

answer = attentionMatrixSoftMax @ value_processed

answer = transpose(answer) # BHSD --> BSHD
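The steps above can be sketched as an unfused reference in plain Python. This is only an illustration of the math, not the kernel's implementation: the names `mha_reference`, `causal_mask`, and `no_mask` are hypothetical, the tensors are nested lists indexed as `[b][s][h][d]`, the transposes are replaced by direct indexing, and the broadcast in Step 2 is assumed to be implicit in multiplying each score by `scale`.

```python
import math


def no_mask(score, q_idx, k_idx):
    """Hypothetical mask functor that leaves every score unchanged."""
    return score


def causal_mask(score, q_idx, k_idx):
    """Hypothetical mask functor that hides keys positioned after the query."""
    return score if k_idx <= q_idx else float("-inf")


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def mha_reference(q, k, v, scale, mask_functor=no_mask):
    """Unfused attention over BSHD nested lists (q[b][s][h][d])."""
    B, S, H, D = len(q), len(q[0]), len(q[0][0]), len(q[0][0][0])
    S_kv = len(k[0])
    out = [[[[0.0] * D for _ in range(H)] for _ in range(S)] for _ in range(B)]
    for b in range(B):
        for h in range(H):
            for i in range(S):
                # Steps 0-1: indexing per (b, h) stands in for the transposes;
                # each score is the dot product of one query with one key.
                scores = [
                    sum(q[b][i][h][d] * k[b][j][h][d] for d in range(D))
                    for j in range(S_kv)
                ]
                # Steps 2-3: scale every score, then apply the mask functor.
                scores = [mask_functor(s * scale, i, j) for j, s in enumerate(scores)]
                # Step 4: softmax over keys, then weight the value vectors.
                probs = softmax(scores)
                for d in range(D):
                    out[b][i][h][d] = sum(
                        probs[j] * v[b][j][h][d] for j in range(S_kv)
                    )
    return out
```

Because the output is written back at `[b][i][h][d]`, the result is already in BSHD layout, which corresponds to the final transpose in Step 4.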

Compared to the CPU patterns, the notable difference is that the transposes are part of the kernel itself.

Finally, this pattern supports grouped attention. With G groups, let h = H / G. In this case key and value may be in BShD layout; if one of them is BShD, the other must be as well. When they are, the following is effectively run before Step 0:

**Step -1:**

key = concat(key, ...) # concat BShD --> BSHD

value = concat(value, ...) # concat BShD --> BSHD
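As a rough illustration of Step -1, the sketch below expands a BShD tensor to BSHD in plain Python. The helper name `expand_kv_heads` is hypothetical, and the head ordering is an assumption: the description above only says `concat(key, ...)`, so this sketch follows the common grouped-attention convention in which consecutive query heads share one key/value head.

```python
def expand_kv_heads(x, num_q_heads):
    """Expand x[b][s][h][d] (BShD) to BSHD by repeating each of the h heads.

    Assumption: query head i reads key/value head i // (H / h).
    """
    h = len(x[0][0])
    if num_q_heads % h != 0:
        raise ValueError("H must be a multiple of h")
    repeats = num_q_heads // h
    return [
        # Copy each head vector so the expanded heads do not alias each other.
        [[list(pos[head // repeats]) for head in range(num_q_heads)] for pos in batch]
        for batch in x
    ]
```

The fused kernel does not need to materialize this expanded tensor; the sketch only shows which key/value head each query head would see under the stated assumption.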

The underlying fusion follows ideas taken from the 2022 FlashAttention paper by Tri Dao et al.
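One of the central ideas in that paper is computing softmax in a streaming fashion, so that the full attention matrix never has to be stored. The sketch below shows that idea for a single query row in plain Python. It is a simplified illustration and not this kernel's code: the function name `attention_row_streaming` and the `block_size` parameter are hypothetical, and all scores are assumed to be finite (no fully masked block).

```python
import math


def attention_row_streaming(scores, values, block_size):
    """Compute softmax(scores) @ values for one query row, one key block at a time.

    Assumes every score is finite.
    """
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(scores), block_size):
        block_scores = scores[start:start + block_size]
        block_values = values[start:start + block_size]
        new_max = max(running_max, max(block_scores))
        # Rescale everything accumulated under the previous running max.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v_row in zip(block_scores, block_values):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v_row)]
        running_max = new_max
    # Normalize once at the end instead of after every block.
    return [a / running_sum for a in acc]
```

The result matches a softmax computed over the whole row at once, up to floating-point rounding, regardless of `block_size`.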

Parameters:

rank (Int): Rank of the q, k, v, and output tensors (inferred).
target (StringSlice[ImmStaticOrigin]): Target device identifier; must resolve to a GPU.
mask_str (StringSlice[ImmStaticOrigin]): Attention mask type string passed to dispatch_mask to select the mask functor.
local_window_size (Int): Sliding window size for windowed attention (defaults to -1, meaning no window).

Args:

output (ManagedTensorSlice[IOSpec[_, _].Output, static_spec=output.static_spec]): Output attention tensor with the same shape as q.
q (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=q.static_spec]): Query tensor in BSHD layout.
k (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=k.static_spec]): Key tensor in BSHD layout, or BShD for grouped attention.
v (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=v.static_spec]): Value tensor in BSHD layout, or BShD for grouped attention.
scale (Float32): Scaling factor applied to attention scores before softmax.
ctx (DeviceContext): Device context for GPU execution.
