How does it work?
To start, let's take a quick example. Let's say we have two big arrays, A and B, and we want to create a third array C, such that for each element c = a2 + b. This is a vastly parallel problem, perfect for GPU acceleration. Here's how this could be done in a vanilla C program:
The main() function is pretty simple; it allocates the three arrays using the standard C runtime function malloc(), initializes the first two arrays A and B, then calls a function called domath() to do the actual computations into array C. And then it deallocates the arrays using the standard free() function.
The domath() function consists of a simple loop through all the array elements. It calls a subfunction called mathalgo() which performs the actual computation.
This vanilla C program is single-threaded; all the computations are executed on a CPU core in a serial fashion. Let's see how this could be accelerated. Here's the same logic implemented in a "chocolate" CUDA program:
There are relatively few changes and poof! we have a CUDA program. At the top, we included two new headers which declare the CUDA API. The main function now has a scope qualifier, __host__, which designates the function to be Host code. This is the default, but for this example we made it explicit. The storage allocation calls to malloc() have been replaced by calls to cudaMallocHost(). They work the same way, allocating a region of storage, but the storage is accessible to the Device as well as the Host. We'll talk about memory in more detail a bit later.
The domath() function is now called with two extra parameters preceding the usual ones, enclosed by triple angle brackets. This is a CUDA extension for "invoking a kernel", that is to say, for calling a function from the Host which will run on the Device. This call is followed by a new call to cudaDeviceSynchronize(), which halts the CPU thread until [all of the] GPU threads have completed. And then the storage is deallocated with cudaFreeHost() instead of free().
Moving up to the domath() function, you'll note it has a scope qualifier of __global__. Global functions run on the Device, but are callable from the Host. Under the covers there is a driver call from the CPU to the GPU which passes the bytecode of the kernel and causes it to be executed. The mathalgo() function has a scope qualifier of __device__, which means it is Device code and only callable from other Device code.
Let's go back and look at those new parameters on the domath() call. They define the number of parallel processing threads to be created on the GPU. We'll talk about those in more detail a bit later, but for now, just know that this creates 1,024 x 64 = 65,636 threads!
The loop in the domath() function has been changed a bit too. If we didn't change it, we'd simply execute the same loop 65,536 times. Instead we want to break up the processing into 65,536 pieces. The CUDA global data items threadId and blockId define unique thread and block Ids for each thread. Using these values, we can give each thread a different starting point - the value of index - and then step through the array to bypass all the other threads - the value of stride. So thread #100 processes array element 100, then 100 + 65,536, then 100 + 2 x 65,536, and so on.
With these changes we've done two things - we've offloaded the computation to the GPU, and we've organized the processing to compute 65,536 array cells in parallel. This is the essence of CUDA coding.