Using LayoutTensor
A `LayoutTensor` provides a view of multi-dimensional data stored in a linear array. `LayoutTensor` abstracts the logical organization of multi-dimensional data from its actual arrangement in memory. You can generate new tensor "views" of the same data without copying the underlying data. This facilitates essential patterns for writing performant computational algorithms, such as:
- Extracting tiles (sub-tensors) from existing tensors. This is especially valuable on the GPU, allowing a thread block to load a tile into shared memory, for faster access and more efficient caching.
- Vectorizing tensors—reorganizing them into multi-element vectors for more performant memory loads and stores.
- Partitioning a tensor into thread-local fragments to distribute work across a thread block.
`LayoutTensor` is especially valuable for writing GPU kernels, and a number of its APIs are GPU-specific. However, `LayoutTensor` can also be used for CPU-based algorithms.
A `LayoutTensor` consists of three main properties:
- A layout, defining how the elements are laid out in memory.
- A `DType`, defining the data type stored in the tensor.
- A pointer to the memory where the data is stored.
Figure 1 shows the relationship between the layout and the storage: a 2D column-major layout and the corresponding linear array of storage. The values shown inside the layout are offsets into the storage, so the coordinates (0, 1) correspond to offset 2 in the storage.
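If you want to see a layout's coordinate-to-offset mapping directly, you can evaluate the layout itself at a coordinate. The following minimal sketch (using a hypothetical 2x3 column-major layout, not necessarily the exact layout in the figure) prints the storage offset for each coordinate:

```mojo
from layout import IntTuple, Layout

# A 2x3 column-major layout: offset = row + column * 2.
alias small_layout = Layout.col_major(2, 3)

for r in range(2):
    for c in range(3):
        # Calling the layout with a coordinate returns its storage offset.
        print("(", r, ",", c, ") -> offset", small_layout(IntTuple(r, c)))
```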
Because `LayoutTensor` is a view, creating a new tensor based on an existing tensor doesn't require copying the underlying data. So you can easily create a new view that represents a tile (sub-tensor) or that accesses the elements in a different order. These views all access the same data, so changing the stored data through one view changes the data seen by all of the views.
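For example, the following sketch (CPU-only, using the `tile()` method described later in Tiling tensors) shows that a write through one view is visible through the original tensor:

```mojo
from layout import Layout, LayoutTensor

alias layout = Layout.row_major(4, 4)
var storage = InlineArray[Float32, 4 * 4](uninitialized=True)
var tensor = LayoutTensor[DType.float32, layout](storage).fill(0)

# A tile is a view over the same storage, not a copy.
var corner = tensor.tile[2, 2](0, 0)
corner[0, 0] = 42.0
print(tensor[0, 0])  # Prints 42.0: both views share the same data
```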
Each element in a tensor can be either a single (scalar) value or a SIMD vector of values. For a vectorized layout, you can specify an element layout that determines how the vector elements are laid out in memory. For more information, see Vectorizing tensors.
Accessing tensor elements
For tensors with simple row-major or column-major layouts, you can address a layout tensor like a multidimensional array to access elements:
```mojo
element = tensor2d[x, y]
tensor2d[x, y] = z
```
The number of indices passed to the subscript operator must match the number of coordinates required by the tensor. For simple layouts, this is the same as the layout's rank: two for a 2D tensor, three for a 3D tensor, and so on. If the number of indices is incorrect, you may see a cryptic runtime error.
```mojo
# Indexing into a 2D tensor requires two indices
el1 = tensor2d[x, y]  # Works
el2 = tensor2d[x]     # Runtime error
```
For more complicated "nested" layouts, such as tiled layouts, the number of indices doesn't match the rank of the tensor. For details, see Tensor indexing and nested layouts.
The `__getitem__()` method returns a SIMD vector of elements, and the compiler can't statically determine the size of the vector (which is the size of the tensor's element layout). This can cause type checking errors at compile time, because some APIs can only accept scalar values (SIMD vectors of width 1). For example, consider the following code:
```mojo
i: Int = SIMD[DType.int32, width](15)
```
If `width` is 1, the vector can be implicitly converted to an `Int`, but if `width` is any other value, the vector can't be implicitly converted. If `width` isn't known at compile time, this produces an error.
If your tensor stores scalar values, you can work around this by explicitly taking the first item in the vector:
```mojo
element = tensor[x, y][0]  # element is guaranteed to be a scalar value
```
You can also access elements using the `load()` and `store()` methods, which let you specify the vector width explicitly:
```mojo
var elements: SIMD[DType.float32, 4]
elements = tensor.load[4](x, y)
elements = elements * 2
tensor.store(x, y, elements)
```
Tensor indexing and nested layouts
A tensor's layout may have nested modes (or sub-layouts), as described in Introduction to layouts. These layouts have one or more of their dimensions divided into sub-layouts. For example, Figure 2 shows a tensor with a nested layout:
Figure 2 shows a tensor with a tile-major nested layout. Instead of being addressed with a single coordinate on each axis, it has a pair of coordinates per axis. For example, the coordinates `((1, 0), (0, 1))` map to the offset 6.
You can't pass nested coordinates to the subscript operator (`[]`), but you can pass a flattened version of the coordinates. For example:
```mojo
# Retrieve the value at ((1, 0), (0, 1))
element = nested_tensor[1, 0, 0, 1][0]
```
The number of indices passed to the subscript operator must match the flattened rank of the tensor.
You can't currently use the `load()` and `store()` methods for tensors with nested layouts. However, these methods are usually used on tensors that have been tiled, which yields a tensor with a simple layout.
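For example, here's a minimal sketch, assuming `nested_tensor` is composed of 2x2 tiles: extracting a single tile that matches the tile boundaries yields a tensor with a simple layout, on which `load()` and `store()` work:

```mojo
# Assumption: nested_tensor is built from 2x2 tiles, so the extracted
# tile has a simple 2x2 row-major layout.
var simple_tile = nested_tensor.tile[2, 2](1, 0)
# load() and store() are supported on the simple-layout tile.
var pair = simple_tile.load[2](0, 0)
simple_tile.store(0, 0, pair * 2)
```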
Creating a LayoutTensor
There are several ways to create a `LayoutTensor`, depending on where the tensor data resides:
- On the CPU.
- In GPU global memory.
- In GPU shared or local memory.
In addition to methods for creating a tensor from scratch, `LayoutTensor` provides a number of methods for producing a new view of an existing tensor.
Creating a LayoutTensor on the CPU
While `LayoutTensor` is often used on the GPU, you can also use it to create tensors for use on the CPU.
To create a `LayoutTensor` for use on the CPU, you need a `Layout` and a block of memory to store the tensor data. A common way to allocate memory for a `LayoutTensor` is to use an `InlineArray`:
```mojo
from layout import Layout, LayoutTensor

alias rows = 8
alias columns = 16
alias layout = Layout.row_major(rows, columns)
var storage = InlineArray[Float32, rows * columns](uninitialized=True)
var tensor = LayoutTensor[DType.float32, layout](storage).fill(0)
```
`InlineArray` is a statically-sized, stack-allocated array, so it's a fast and efficient way to allocate storage for most kinds of `LayoutTensor`. There are target-dependent limits on how much memory can be allocated this way, however.
You can also create a `LayoutTensor` using an `UnsafePointer`. This may be preferable for very large tensors.
```mojo
from layout import Layout, LayoutTensor
from memory import UnsafePointer, memset

alias rows = 1024
alias columns = 1024
alias buf_size = rows * columns
alias layout = Layout.row_major(rows, columns)
var ptr = UnsafePointer[Float32].alloc(buf_size)
memset(ptr, 0, buf_size)
var tensor = LayoutTensor[DType.float32, layout](ptr)
# The memory isn't freed automatically; call ptr.free() when done with it.
```
Note that this example uses `memset()` instead of the `LayoutTensor` `fill()` method. The `fill()` method performs elementwise initialization of the tensor, so it may be slow for large tensors.
Creating a LayoutTensor on the GPU
When creating a `LayoutTensor` for use on the GPU, you need to consider which memory space the tensor data will be stored in:
- GPU global memory can only be allocated from the host (CPU), as a `DeviceBuffer`.
- GPU shared or local memory can be statically allocated on the GPU.
Creating a LayoutTensor in global memory
You must allocate global memory from the host side, by allocating a `DeviceBuffer`. You can either construct a `LayoutTensor` using this memory on the host side, before invoking a GPU kernel, or you can construct a `LayoutTensor` inside the kernel itself:
- On the CPU, you can construct a `LayoutTensor` using a `DeviceBuffer` as its storage. Although you can create this tensor on the CPU and pass it in to a kernel function, you can't directly modify its values on the CPU, since the memory is on the GPU.
- On the GPU, when a `DeviceBuffer` is passed in to `enqueue_function()`, the kernel receives a corresponding `UnsafePointer` in place of the `DeviceBuffer`. The kernel can then create a `LayoutTensor` using the pointer.
In both cases, if you want to initialize data for the tensor from the CPU, you can call `enqueue_copy()` or `enqueue_memset()` on the buffer prior to invoking the kernel. The following example shows initializing a `LayoutTensor` from the CPU and passing it to a GPU kernel.
```mojo
from gpu import global_idx
from gpu.host import DeviceContext
from layout import Layout, LayoutTensor

fn initialize_tensor_from_cpu_example():
    alias dtype = DType.float32
    alias rows = 32
    alias cols = 8
    alias block_size = 8
    alias num_blocks = rows * cols // (block_size * block_size)
    alias input_layout = Layout.row_major(rows, cols)

    fn kernel(tensor: LayoutTensor[dtype, input_layout, MutableAnyOrigin]):
        # Guard against out-of-bounds threads.
        if global_idx.x < tensor.shape[0]() and global_idx.y < tensor.shape[1]():
            tensor[global_idx.x, global_idx.y] = (
                tensor[global_idx.x, global_idx.y] + 1
            )

    try:
        var ctx = DeviceContext()
        var host_buf = ctx.enqueue_create_host_buffer[dtype](rows * cols)
        ctx.synchronize()
        var expected_values = List[Scalar[dtype]](capacity=rows * cols)
        for i in range(rows * cols):
            host_buf[i] = i
            expected_values.append(i + 1)
        var dev_buf = ctx.enqueue_create_buffer[dtype](rows * cols)
        ctx.enqueue_copy(dev_buf, host_buf)
        var tensor = LayoutTensor[dtype, input_layout](dev_buf)
        ctx.enqueue_function[kernel](
            tensor,
            grid_dim=(num_blocks, num_blocks),
            block_dim=(block_size, block_size),
        )
        ctx.enqueue_copy(host_buf, dev_buf)
        ctx.synchronize()
        for i in range(rows * cols):
            if host_buf[i] != expected_values[i]:
                raise Error(
                    String("Unexpected value ", host_buf[i], " at position ", i)
                )
        print("Success")
    except error:
        print(error)
```
Creating a LayoutTensor in shared or local memory
To create a tensor on the GPU in shared memory or local memory, use the `LayoutTensor.stack_allocation()` static method to create a tensor with backing memory in the appropriate memory space.
Both shared and local memory are very limited resources, so a common pattern is to copy a small tile of a larger tensor into shared memory or local memory to reduce memory access time.
```mojo
alias tile_layout = Layout.row_major(16, 16)
var shared_tile = LayoutTensor[
    dtype,
    tile_layout,
    MutableAnyOrigin,
    address_space = AddressSpace.SHARED,
].stack_allocation()
```
Tiling tensors
A fundamental pattern for using a layout tensor is to divide the tensor into smaller tiles to achieve better data locality and cache efficiency. In a GPU kernel you may want to select a tile that corresponds to the size of a thread block. For example, given a 2D thread block of 16x16 threads, you could use a 16x16 tile (with each thread handling one element in the tile) or a 64x16 tile (with each thread handling 4 elements from the tensor).
Tiles are most commonly 1D or 2D. For element-wise calculations, where the output value for a given tensor element depends on only one input value, 1D tiles are easy to reason about. For calculations that involve neighboring elements, 2D tiles can help maintain data locality. For example, matrix multiplication or 2D convolution operations usually use 2D tiles.
`LayoutTensor` provides a `tile()` method for extracting a single tile. You can also iterate through tiles using the `LayoutTensorIter` type.
When tiling a tensor that isn't an exact multiple of the tile size, you can create the tensor as a masked tensor (with the optional `masked` parameter set to `True`). When tiling a masked tensor, the tile operations return partial tiles at the edges of the tensor. These tiles are smaller than the requested tile size. You can use the `tensor.dim(axis)` method to query the tile dimensions at runtime.
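For example, the following sketch (with hypothetical dimensions, and assuming the `masked` parameter can be set as a keyword parameter) tiles a 10x10 tensor with 4x4 tiles and queries each tile's runtime dimensions:

```mojo
from layout import Layout, LayoutTensor
from math import ceildiv

alias rows = 10
alias cols = 10
alias tile_size = 4
alias layout = Layout.row_major(rows, cols)

var storage = InlineArray[Float32, rows * cols](uninitialized=True)
# Assumption: masked is set as a keyword parameter.
var tensor = LayoutTensor[DType.float32, layout, masked=True](storage).fill(0)

for i in range(ceildiv(rows, tile_size)):
    for j in range(ceildiv(cols, tile_size)):
        var t = tensor.tile[tile_size, tile_size](i, j)
        # Edge tiles are partial: 4x2, 2x4, or 2x2 instead of 4x4.
        print("tile (", i, ",", j, ") is", t.dim(0), "x", t.dim(1))
```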
Extracting a tile
The `LayoutTensor.tile()` method extracts a tile with a given size at a given set of coordinates:
```mojo
tensor.tile[32, 32](0, 1)
```
This call defines a 32x32 tile size and extracts the tile at row 0, column 1, as shown in Figure 3. Note that the coordinates are specified in units of tiles, not elements.
The layout of the extracted tile depends on the layout of the parent tensor. For example, if the parent tensor has a column-major layout, the extracted tile has a column-major layout.
If you're extracting a tile from a tensor with a tiled layout, the extracted tile must match the tile boundaries of the parent tensor. For example, if the parent tensor is composed of 8x8 row-major tiles, a tile size of 8x8 yields an extracted tile with an 8x8 row-major layout.
If you need to know the type of the tile (to declare a variable, for example), you can use the `tile_type()` method, with the same tile size parameters. Only use `tile_type()` inside the `__type_of()` operator.
```mojo
alias MyTileType = __type_of(tensor.tile_type[32, 32]())
var my_tile: MyTileType
```
Tiled iterators
The `LayoutTensorIter` struct provides a way to iterate through a block of memory, generating a layout tensor for each position. There are two ways to use `LayoutTensorIter`:
- Starting with a memory buffer, you can generate a series of tiles.
- Given an existing layout tensor, you can extract a set of tiles along a given axis.
Tiling a memory buffer
When you start with a memory buffer, `LayoutTensorIter` iterates through the memory one tile at a time. This essentially treats the memory as a flat array of tiles.
```mojo
from layout import Layout, LayoutTensorIter

alias buf_size = 16
var storage = InlineArray[Int16, buf_size](uninitialized=True)
for i in range(buf_size):
    storage[i] = i
alias tile_layout = Layout.row_major(2, 2)
var iter = LayoutTensorIter[DType.int16, tile_layout, MutableAnyOrigin](
    storage.unsafe_ptr(), buf_size
)
for i in range(buf_size // tile_layout.size()):
    tile = iter[]
    print(tile)
    print(" - ")
    iter += 1
```
The iterator constructor takes all of the parameters you'd use to construct a `LayoutTensor` (a `DType`, a layout, and an origin), and as arguments it takes a pointer and the size of the memory buffer.
Note that the iterator doesn't work like a standard iterator: you can't use it directly in a `for` statement like you would use a collection. Instead, you can use either the dereference operator (`iter[]`) or the `get()` method to retrieve a `LayoutTensor` representing the tile at the current position.
You can advance the iterator by incrementing it, as shown above. The iterator also supports `next()` and `next_unsafe()` methods, which return a copy of the iterator incremented by a specified offset (default 1). This means you can also use a pattern like this:
```mojo
for i in range(num_tiles):
    current_tile = iter.next(i)[]
    # …
```
`LayoutTensorIter` also has an optional boolean `circular` parameter. A `LayoutTensorIter` created with `circular=True` treats the memory buffer as circular; when it hits the end of the buffer, it starts over again at the beginning.
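Here's a hedged sketch, reusing the `storage`, `buf_size`, and `tile_layout` definitions from the buffer example above, with `circular` set as a keyword parameter:

```mojo
var circular_iter = LayoutTensorIter[
    DType.int16, tile_layout, MutableAnyOrigin, circular=True
](storage.unsafe_ptr(), buf_size)

# One full pass plus one extra step: the final dereference wraps
# around to the first tile instead of running off the end.
for _ in range(buf_size // tile_layout.size() + 1):
    print(circular_iter[])
    circular_iter += 1
```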
Tiling a LayoutTensor
To iterate over a tensor, call the `tiled_iterator()` method, which produces a `LayoutTensorIter`:
```mojo
from layout import LayoutTensor
from math import ceildiv

# Given a tensor of size rows x cols
alias num_row_tiles = ceildiv(rows, tile_size)
alias num_col_tiles = ceildiv(cols, tile_size)

for i in range(num_row_tiles):
    var iter = tensor.tiled_iterator[tile_size, tile_size, axis=1](i, 0)
    for _ in range(num_col_tiles):
        tile = iter[]
        # … do something with the tile
        iter += 1
```
Vectorizing tensors
When working with tensors, it's frequently efficient to access more than one value at a time. For example, having a single GPU thread calculate multiple output values ("thread coarsening") can frequently improve performance. Likewise, when copying data from one memory space to another, it's often helpful for each thread to copy a SIMD vector worth of values, instead of a single value. Many GPUs have vectorized copy instructions that can make copying more efficient.
To choose the optimum vector size, you need to know the vector operations supported by your current GPU for the data type you're working with. (For example, if you're working with 32-bit values and the GPU supports 128-bit copy operations, you can use a vector width of 4.) You can use the `simdwidthof()` function to find the appropriate vector width:
```mojo
from sys.info import simdwidthof
from gpu.host.compile import get_gpu_target

alias simd_width = simdwidthof[DType.float32, get_gpu_target()]()
```
The `vectorize()` method creates a new view of the tensor where each element of the tensor is a vector of values.
```mojo
v_tensor = tensor.vectorize[1, simd_width]()
```
The vectorized tensor is a view of the original tensor, pointing to the same data. The underlying number of scalar values remains the same, but the tensor layout and element layout change, as shown in Figure 4.
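Here's a small sketch (with assumed dimensions) showing the change of shape: vectorizing a 4x8 tensor with 1x4 elements yields a 4x2 view whose elements are 1x4 vectors:

```mojo
from layout import Layout, LayoutTensor

alias layout = Layout.row_major(4, 8)
var storage = InlineArray[Float32, 4 * 8](uninitialized=True)
var tensor = LayoutTensor[DType.float32, layout](storage).fill(0)

# Each element of the view is a 1x4 SIMD vector.
var v_tensor = tensor.vectorize[1, 4]()
print(v_tensor.shape[0](), "x", v_tensor.shape[1]())  # Prints 4 x 2
```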
Partitioning a tensor across threads
When working with tensors on the GPU, it's sometimes desirable to distribute the elements of a tensor across the threads in a thread block. The `distribute()` method takes a thread layout and a thread ID and returns a thread-specific fragment of the tensor.
The thread layout is tiled across the tensor. The Nth thread receives a fragment consisting of the Nth value from each tile. For example, Figure 5 shows how `distribute()` forms fragments given a 4x4, row-major tensor and a 2x2, column-major thread layout:
In Figure 5, the numbers in the data layout represent offsets into storage, as usual. The numbers in the thread layout represent thread IDs.
The example in Figure 5 uses a small thread layout for illustration purposes. In practice, it's usually optimal to use a thread layout that's the same size as the warp size of your GPU, so the work is divided across all available threads. For example, the following code vectorizes and partitions a tensor over a full warp worth of threads:
```mojo
alias thread_layout = Layout.row_major(WARP_SIZE // simd_size, simd_size)
var fragment = tile.vectorize[1, simd_size]().distribute[thread_layout](lane_id())
```
Given a 16x16 tile size, a warp size of 32, and a `simd_size` of 4, this code produces a 4x16 tensor of 1x4 vectors. The thread layout is an 8x4 row-major layout.
Copying tensors
The `layout` package provides a large set of utilities for copying tensors. A number of these are specialized for copying between various GPU memory spaces. All of the layout tensor copy methods respect the layouts, so you can transform a tensor by copying it to a tensor with a different layout.
`LayoutTensor` itself provides two methods for copying tensor data:
- `copy_from()` copies data from a source tensor to the current tensor, which may be in a different memory space.
- `copy_from_async()` is an optimized copy mechanism for asynchronously copying from GPU global memory to shared memory.
Both of these methods copy the entire source tensor. To divide the copying work among multiple threads, you need to use `distribute()` to create thread-local tensor fragments, as described in Partitioning a tensor across threads.
The following code sample demonstrates using both copy methods to copy data to and from shared memory.
```mojo
from gpu import WARP_SIZE, barrier, block_idx, global_idx, lane_id, thread_idx
from gpu.host import DeviceContext
from gpu.host.compile import get_gpu_target
from gpu.memory import AddressSpace, async_copy_wait_all
from layout import Layout, LayoutTensor
from sys.info import simdwidthof

fn copy_from_async_example():
    alias dtype = DType.float32
    alias in_size = 128
    alias block_size = 16
    alias num_blocks = in_size // block_size  # number of blocks in one dimension
    alias input_layout = Layout.row_major(in_size, in_size)
    alias simd_width_gpu = simdwidthof[dtype, get_gpu_target()]()

    fn kernel(tensor: LayoutTensor[dtype, input_layout, MutableAnyOrigin]):
        # Extract a tile from the input tensor.
        var global_tile = tensor.tile[block_size, block_size](
            block_idx.x, block_idx.y
        )
        alias tile_layout = Layout.row_major(block_size, block_size)
        var shared_tile = LayoutTensor[
            dtype,
            tile_layout,
            MutableAnyOrigin,
            address_space = AddressSpace.SHARED,
        ].stack_allocation()
        # Create thread layouts for copying.
        alias thread_layout = Layout.row_major(
            WARP_SIZE // simd_width_gpu, simd_width_gpu
        )
        var global_fragment = global_tile.vectorize[
            1, simd_width_gpu
        ]().distribute[thread_layout](lane_id())
        var shared_fragment = shared_tile.vectorize[
            1, simd_width_gpu
        ]().distribute[thread_layout](lane_id())
        shared_fragment.copy_from_async(global_fragment)
        async_copy_wait_all()
        # Put some data into the shared tile that we can verify on the host.
        if global_idx.x < in_size and global_idx.y < in_size:
            shared_tile[thread_idx.x, thread_idx.y] = (
                global_idx.x * in_size + global_idx.y
            )
        barrier()
        global_fragment.copy_from(shared_fragment)

    try:
        var ctx = DeviceContext()
        var host_buf = ctx.enqueue_create_host_buffer[dtype](in_size * in_size)
        var dev_buf = ctx.enqueue_create_buffer[dtype](in_size * in_size)
        ctx.enqueue_memset(dev_buf, 0.0)
        var tensor = LayoutTensor[dtype, input_layout](dev_buf)
        ctx.enqueue_function[kernel](
            tensor,
            grid_dim=(num_blocks, num_blocks),
            block_dim=(block_size, block_size),
        )
        ctx.enqueue_copy(host_buf, dev_buf)
        ctx.synchronize()
        for i in range(in_size * in_size):
            if host_buf[i] != i:
                raise Error(
                    String("Unexpected value ", host_buf[i], " at position ", i)
                )
        print("Success!")
    except error:
        print(error)
```
Thread-aware copy functions
The `layout_tensor` module also includes a number of specialized copy functions for different scenarios, such as copying from shared memory to local memory. These functions are all thread-aware: instead of passing in tensor fragments, you pass in a thread layout, which the function uses to partition the work.
As with the `copy_from()` and `copy_from_async()` methods, use the `vectorize()` method prior to copying to take advantage of vectorized copy operations.
Many of the thread-aware copy functions have very specific requirements for the shape of the copied tensor and thread layout, based on the specific GPU and data type in use.
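As a hedged illustration, the following sketch assumes the `copy_dram_to_sram()` function from the `layout_tensor` module, which takes a thread layout parameter and partitions the copy across threads; it reuses the `shared_tile`, `global_tile`, `thread_layout`, and `simd_width_gpu` names from the earlier kernel example, and the exact signature may differ:

```mojo
from layout.layout_tensor import copy_dram_to_sram

# Inside a kernel: copy a global-memory (DRAM) tile into shared memory
# (SRAM), dividing the work across threads according to thread_layout.
# No explicit distribute() call is needed; the function is thread-aware.
copy_dram_to_sram[thread_layout](
    shared_tile.vectorize[1, simd_width_gpu](),
    global_tile.vectorize[1, simd_width_gpu](),
)
```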
Summary
In this document, we've explored the fundamental concepts and practical usage of `LayoutTensor`. At its core, `LayoutTensor` provides a powerful abstraction for working with multi-dimensional data. By combining a layout (which defines memory organization), a data type, and a memory pointer, `LayoutTensor` enables flexible and efficient data manipulation without unnecessary copying of the underlying data.
We covered several essential tensor operations that form the foundation of working with `LayoutTensor`, including creating tensors, accessing tensor elements, and copying data between tensors.
We also covered key patterns for optimizing data access:
- Tiling tensors for data locality. Accessing tensors one tile at a time can improve cache efficiency. On the GPU, tiling can allow the threads of a thread block to share high-speed access to a subset of a tensor.
- Vectorizing tensors for more efficient data loads and stores.
- Partitioning or distributing tensors into thread-local fragments for processing.
These patterns provide the building blocks for writing efficient kernels in Mojo while maintaining clean, readable code.
To see some practical examples of `LayoutTensor` in use, see Optimize custom ops for GPUs with Mojo.