GPU memory

GPU memory consists of both on-chip memory and external dynamic random-access memory (DRAM), often referred to as device memory (in contrast to the host memory used by the CPU).

On-chip memory includes:

  • A register file for each streaming multiprocessor (SM), containing the registers used by threads executing on the SM's cores

  • An L1 cache for each SM to cache reads from global memory

  • Shared memory for each SM, containing data explicitly shared between the threads of a given thread block executing on the SM (see the sketch after this list)

  • A read-only constant cache for each SM, which caches data read from the constant memory space in global memory

  • An L2 cache shared by all SMs that is used to cache accesses to local or global memory, including temporary register spills
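
To make the shared-memory item concrete, here is a minimal CUDA sketch of a kernel that stages a block-sized tile in shared memory before writing results back to global memory. The kernel name reverseTile, the TILE size, and the assumption that n is a multiple of TILE are illustrative, not from the original text.

```cuda
#include <cuda_runtime.h>

#define TILE 256  // threads per block; illustrative size

// Reverse each block-sized tile of the input. For simplicity, n is
// assumed to be a multiple of TILE.
__global__ void reverseTile(const float* in, float* out, int n)
{
    // tile[] lives in on-chip shared memory, visible to every thread
    // of this thread block while the block executes on its SM.
    __shared__ float tile[TILE];

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        tile[threadIdx.x] = in[idx];         // global -> shared

    __syncthreads();                         // wait until all loads finish

    int rev = blockDim.x - 1 - threadIdx.x;  // mirrored index within the tile
    if (idx < n)
        out[idx] = tile[rev];                // shared -> global
}

// Launched, for example, as: reverseTile<<<n / TILE, TILE>>>(d_in, d_out, n);
```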

Device memory includes:

  • Global memory, which contains data accessible to all threads

  • Constant memory, which contains data explicitly identified as read-only by the programmer, and which is accessible to all threads (a constant-memory sketch follows this list)

  • Local memory, which contains data private to an individual thread, such as statically allocated arrays, spilled registers, and other elements of the thread's call stack
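
As a sketch of the constant-memory item, the following CUDA fragment declares a small read-only table in the constant memory space, copies values into it from the host, and reads it from a kernel. The names coeffs and applyCoeffs and the table size are illustrative assumptions.

```cuda
#include <cuda_runtime.h>

// coeffs lives in the constant memory space in device memory; reads
// from it are served through each SM's read-only constant cache.
__constant__ float coeffs[16];

__global__ void applyCoeffs(const float* in, float* out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = in[idx] * coeffs[idx % 16];  // read-only table lookup
}

int main()
{
    float host_coeffs[16];
    for (int i = 0; i < 16; ++i)
        host_coeffs[i] = 1.0f / (i + 1);

    // Copy from host memory into the constant memory symbol on the device.
    cudaMemcpyToSymbol(coeffs, host_coeffs, sizeof(host_coeffs));

    // ... allocate in/out buffers and launch applyCoeffs as usual ...
    return 0;
}
```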

Data in global memory persists until it is explicitly freed, even across kernel launches. This means that one kernel can write data to global memory and a subsequent kernel can read that data, as the sketch below illustrates.
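
Here is a minimal CUDA sketch of that pattern: one kernel writes to a global-memory allocation, and a second kernel launched afterward in the same (default) stream reads what the first one wrote. The kernel names produce and consume and the buffer size are illustrative assumptions.

```cuda
#include <cuda_runtime.h>

__global__ void produce(int* buf, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        buf[idx] = idx * idx;          // write to global memory
}

__global__ void consume(const int* buf, int* sum, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        atomicAdd(sum, buf[idx]);      // read what produce() wrote
}

int main()
{
    const int n = 1024;
    int *buf, *sum;
    cudaMalloc(&buf, n * sizeof(int)); // global memory: persists until freed
    cudaMalloc(&sum, sizeof(int));
    cudaMemset(sum, 0, sizeof(int));

    produce<<<4, 256>>>(buf, n);       // first kernel writes buf
    consume<<<4, 256>>>(buf, sum, n);  // second kernel reads buf; the default
                                       // stream orders the two launches
    cudaDeviceSynchronize();

    cudaFree(buf);
    cudaFree(sum);
    return 0;
}
```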