Mojo struct

RaggedFlashAttentionGPU

struct RaggedFlashAttentionGPU

Registers the mo.mha.ragged.no_cache graph op with the graph compiler.

Implemented traits

AnyType, ImplicitlyDeletable

Methods

`execute`

static def execute[rank: Int, //, target: StringSlice[ImmStaticOrigin], mask_str: StringSlice[ImmStaticOrigin], local_window_size: Int = Int(-1)](output: ManagedTensorSlice[IOSpec[_, _].Output, static_spec=output.static_spec], q: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=q.static_spec], k: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=k.static_spec], v: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=v.static_spec], input_row_offsets: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=input_row_offsets.static_spec], q_max_seq_len: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=q_max_seq_len.static_spec], scale: Float32, ctx: DeviceContext)

mo.mha.ragged.no_cache computes flash attention for ragged inputs without a KV cache.

The inputs q, k, and v are in ragged format with shape [total_seq_len, num_heads, head_dim]: the sequences of a batch are concatenated along the first dimension rather than padded to a common length. input_row_offsets indicates where each sequence starts and ends in the ragged tensors.
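As a rough illustration of the ragged layout, the offsets can be built as a running sum of per-sequence lengths. This is a plain-Python sketch of the data layout only, not the Mojo API; the helper names `build_row_offsets` and `sequence_bounds` are hypothetical, and the half-open `[start, end)` convention is an assumption consistent with the `[batch_size + 1]` shape.

```python
from itertools import accumulate


def build_row_offsets(seq_lens):
    """Return batch_size + 1 offsets: a running sum of lengths, starting at 0."""
    return [0] + list(accumulate(seq_lens))


def sequence_bounds(row_offsets, batch_idx):
    """Return the assumed [start, end) row range of one sequence."""
    return row_offsets[batch_idx], row_offsets[batch_idx + 1]


seq_lens = [3, 1, 4]                       # three sequences of different lengths
row_offsets = build_row_offsets(seq_lens)  # one entry more than the batch size
total_seq_len = row_offsets[-1]            # number of rows in q, k, and v
q_max_seq_len = max(seq_lens)              # longest query sequence in the batch
```

With this layout, the rows of the second sequence are `q[start:end]` for `start, end = sequence_bounds(row_offsets, 1)`, and no padding rows are stored.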

Parameters:

rank (Int): Rank of the q, k, v, and output tensors (inferred).
target (StringSlice[ImmStaticOrigin]): Target device identifier; must resolve to a GPU.
mask_str (StringSlice[ImmStaticOrigin]): Attention mask type string passed to dispatch_mask to select the mask functor.
local_window_size (Int): Sliding window size for windowed attention (defaults to -1, meaning no window).
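The page does not spell out the exact window boundaries, so the following is only a sketch of one common sliding-window convention combined with a causal mask. The function `is_visible` and its boundary rule (a query sees itself and the `local_window_size - 1` keys before it) are assumptions for illustration, not the behavior of the mask functor that dispatch_mask selects.

```python
def is_visible(q_pos, k_pos, local_window_size=-1):
    """Causal visibility with an optional sliding window (assumed convention)."""
    if k_pos > q_pos:
        return False  # causal: a query never attends to a later key
    if local_window_size < 0:
        return True   # -1 means no window, so all earlier keys are visible
    # Assumed rule: only the most recent `local_window_size` keys are visible.
    return q_pos - k_pos < local_window_size
```

Positions here are relative to the start of a single sequence, not to the ragged tensor as a whole.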

Args:

output (ManagedTensorSlice[IOSpec[_, _].Output, static_spec=output.static_spec]): Output attention tensor with the same shape as q.
q (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=q.static_spec]): Query tensor in ragged format with shape [total_seq_len, num_heads, head_dim].
k (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=k.static_spec]): Key tensor in ragged format with shape [total_seq_len, num_heads, head_dim].
v (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=v.static_spec]): Value tensor in ragged format with shape [total_seq_len, num_heads, head_dim].
input_row_offsets (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=input_row_offsets.static_spec]): Row offsets of shape [batch_size + 1] marking the start and end of each sequence in the ragged tensors.
q_max_seq_len (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=q_max_seq_len.static_spec]): Maximum query sequence length across the batch.
scale (Float32): Scaling factor applied to attention scores before softmax.
ctx (DeviceContext): Device context for GPU execution.
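To make the roles of these arguments concrete, here is a slow reference sketch of the math in plain Python, under stated assumptions. It is not the GPU kernel and does not use flash-attention tiling. The function name `ragged_attention_reference`, the nested-list tensors, and the `causal` flag (standing in for whatever mask mask_str selects) are all hypothetical. Attention is computed independently per sequence and per head, so tokens never attend across sequence boundaries.

```python
import math


def ragged_attention_reference(q, k, v, row_offsets, scale, causal=False):
    """q, k, v: nested lists shaped [total_seq_len][num_heads][head_dim]."""
    num_heads = len(q[0])
    head_dim = len(q[0][0])
    out = [[[0.0] * head_dim for _ in range(num_heads)] for _ in range(len(q))]
    for b in range(len(row_offsets) - 1):
        start, end = row_offsets[b], row_offsets[b + 1]
        for h in range(num_heads):
            for i in range(start, end):
                # With the (assumed) causal mask, a query sees keys up to itself.
                last = i + 1 if causal else end
                scores = [
                    scale * sum(q[i][h][d] * k[j][h][d] for d in range(head_dim))
                    for j in range(start, last)
                ]
                # Numerically stable softmax over the visible keys.
                m = max(scores)
                weights = [math.exp(s - m) for s in scores]
                denom = sum(weights)
                for d in range(head_dim):
                    out[i][h][d] = sum(
                        w * v[start + n][h][d] for n, w in enumerate(weights)
                    ) / denom
    return out
```

The result has the same shape as q, matching the description of output. A reference like this is mainly useful for checking kernel results on small inputs.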
