Mojo function

shuffle_idx

shuffle_idx[type: DType, simd_width: Int, //](val: SIMD[type, simd_width], offset: SIMD[uint32, 1]) -> SIMD[type, simd_width]

Copies a value from a source lane to other lanes in a warp.

Broadcasts a value from a source thread in a warp to all participating threads
without using shared memory. This is a convenience wrapper that uses the full
warp mask by default.

Example:

    from gpu.warp import shuffle_idx

    val = SIMD[DType.float32, 16](1.0)

    # Broadcast value from lane 0 to all lanes
    result = shuffle_idx(val, 0)

    # Get value from lane 5
    result = shuffle_idx(val, 5)

Parameters:

type (DType): The data type of the SIMD elements (e.g. float32, int32, float16).
simd_width (Int): The number of elements in each SIMD vector.

Args:

val (SIMD[type, simd_width]): The SIMD value to be broadcast from the source lane.
offset (SIMD[uint32, 1]): The source lane ID to copy the value from.

Returns:

A SIMD vector where all lanes contain the value from the source lane specified by offset.
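
The broadcast semantics described above can be modeled on the host. The following is a minimal Python sketch, not Mojo and not the library's implementation: `shuffle_idx_model` and `WARP_SIZE` are hypothetical names, a 32-lane warp is assumed, and each list element stands for the value held by one lane.

```python
WARP_SIZE = 32  # assumption: a 32-lane warp


def shuffle_idx_model(lane_values, src_lane):
    """Model of shuffle_idx(val, offset) under the full warp mask.

    lane_values[i] is the value lane i passes as `val`;
    src_lane plays the role of `offset`.
    Returns the value each lane would receive.
    """
    if len(lane_values) != WARP_SIZE:
        raise ValueError("expected one value per lane")
    if not 0 <= src_lane < WARP_SIZE:
        raise ValueError("source lane out of range")
    # Every lane reads the value held by the source lane.
    return [lane_values[src_lane]] * WARP_SIZE


# Each lane starts with its own lane ID as a float.
lane_values = [float(lane) for lane in range(WARP_SIZE)]

from_lane0 = shuffle_idx_model(lane_values, 0)  # every lane sees lane 0's value
from_lane5 = shuffle_idx_model(lane_values, 5)  # every lane sees lane 5's value
```

The model makes one point explicit: the result depends on the value held by the source lane, not on the calling lane's own `val`.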

shuffle_idx[type: DType, simd_width: Int, //](mask: UInt, val: SIMD[type, simd_width], offset: SIMD[uint32, 1]) -> SIMD[type, simd_width]

Copies a value from a source lane to other lanes in a warp with explicit mask control.

Broadcasts a value from a source thread in a warp to participating threads specified by
the mask. This provides fine-grained control over which threads participate in the shuffle
operation.

Example:

    from gpu.warp import shuffle_idx

    # Only broadcast to first 16 lanes
    var mask: UInt = 0xFFFF  # low 16 bits set: lanes 0-15 participate
    var val = SIMD[DType.float32, 32](1.0)
    var result = shuffle_idx(mask, val, 5)

Parameters:

type (DType): The data type of the SIMD elements (e.g. float32, int32, float16).
simd_width (Int): The number of elements in each SIMD vector.

Args:

mask (UInt): A bit mask specifying which lanes participate in the shuffle (1 bit per lane).
val (SIMD[type, simd_width]): The SIMD value to be broadcast from the source lane.
offset (SIMD[uint32, 1]): The source lane ID to copy the value from.

Returns:

A SIMD vector where participating lanes (set in mask) contain the value from the source lane specified by offset. Non-participating lanes retain their original values.
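
The effect of the mask can be illustrated with a small host-side model. This is a Python sketch under stated assumptions, not Mojo and not the library's implementation: `shuffle_idx_masked_model` and `WARP_SIZE` are hypothetical names, a 32-lane warp is assumed, and the handling of non-participating lanes simply follows the Returns description above.

```python
WARP_SIZE = 32  # assumption: a 32-lane warp


def shuffle_idx_masked_model(mask, lane_values, src_lane):
    """Model of shuffle_idx(mask, val, offset).

    Bit i of `mask` marks lane i as participating.
    Participating lanes receive the source lane's value;
    other lanes keep their own value, as the Returns text states.
    """
    if len(lane_values) != WARP_SIZE:
        raise ValueError("expected one value per lane")
    if not 0 <= src_lane < WARP_SIZE:
        raise ValueError("source lane out of range")
    result = []
    for lane, own_value in enumerate(lane_values):
        participates = (mask >> lane) & 1
        result.append(lane_values[src_lane] if participates else own_value)
    return result


# Each lane starts with its own lane ID as a float.
lane_values = [float(lane) for lane in range(WARP_SIZE)]

# 0xFFFF sets bits 0-15, so only lanes 0-15 take part.
masked = shuffle_idx_masked_model(0xFFFF, lane_values, 5)
```

In this model lanes 0-15 end up with lane 5's value and lanes 16-31 are unchanged. The source lane (5) is itself inside the mask, which is the usual way to call a masked shuffle.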