Hi all, this is the first of a series of posts about CUDA and GPU acceleration. (Next post here)
For some time I've been aware of GPU acceleration, and NVidia, and CUDA, but it was a bit of a black box. Recently I've been working on a cool project which has enabled me to double-click on this to understand what's inside the box.
Maybe it would be good to start with an introduction: what is a GPU, why GPU acceleration, who are NVidia, and what is CUDA.
GPU is an acronym for Graphics Processing Unit. The diagram at right shows the overall architecture of a modern workstation (aka PC).
There's a CPU, with several "cores" (maybe 10 or so), and a GPU, with many many cores (maybe 100 or so). The cores on a GPU are often called Stream Processors, or SPs, for reasons that will be apparent a bit later.
In the parlance of GPUs, the CPU is referred to as the Host, and the GPU is called the Device*.
In addition to the CPU cores the Host has Main Memory (maybe 16GB or so). This memory is somewhat more complicated than a simple box, but for now we'll treat it as a big blob of storage for data. The Device also has its own Graphics Memory (maybe 16GB or so, maybe more). Again, it's more involved than a box, but to start we'll treat it as such. The Device also has a video interface for connecting one or more monitors. This was the original reason for the existence of GPUs, but as we'll see more recently they've been used for other purposes.
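If you have an NVidia GPU and the CUDA toolkit installed, you can ask the Device to describe itself. Here's a minimal sketch using the CUDA runtime API; the file name and the helper bytes_to_gib are my own choices for illustration. Note that multiProcessorCount reports the number of multiprocessors, each of which contains many individual cores, so it is not a count of individual cores.

```cuda
// device_query.cu : print a few properties of each CUDA Device.
// Build (assuming the CUDA toolkit is installed):
//   nvcc -o device_query device_query.cu
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: convert a byte count to GiB (2^30 bytes).
double bytes_to_gib(size_t bytes) {
    return bytes / (1024.0 * 1024.0 * 1024.0);
}

int main() {
    int device_count = 0;
    cudaError_t err = cudaGetDeviceCount(&device_count);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    for (int i = 0; i < device_count; ++i) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, i);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                         cudaGetErrorString(err));
            return 1;
        }
        std::printf("Device %d: %s\n", i, prop.name);
        std::printf("  Graphics Memory: %.1f GiB\n",
                    bytes_to_gib(prop.totalGlobalMem));
        std::printf("  Multiprocessors: %d\n", prop.multiProcessorCount);
    }
    return 0;
}
```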
The CPU and GPU (or, as we shall say, Host and Device) communicate over a Bus. The Bus is fast (a PCIe 4.0 x16 link peaks at roughly 32GB/s in each direction), but not as fast as Main Memory (typically tens of GB/s), and far slower than Graphics Memory (often hundreds of GB/s).
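Rather than take anyone's word for the speed of the Bus, you can measure it. The sketch below times a single copy from Main Memory to Graphics Memory using CUDA events. The 256 MiB buffer size and the helper gigabytes_per_second are arbitrary choices of mine, and the result will vary with your hardware. It also uses ordinary (pageable) Host memory, which typically measures slower than the Bus's peak rate.

```cuda
// copy_bandwidth.cu : time one Host-to-Device copy across the Bus.
// Build (assuming the CUDA toolkit is installed):
//   nvcc -o copy_bandwidth copy_bandwidth.cu
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: throughput in GB/s, where 1 GB = 1e9 bytes.
double gigabytes_per_second(size_t bytes, float milliseconds) {
    return (bytes / 1e9) / (milliseconds / 1e3);
}

int main() {
    const size_t bytes = 256u * 1024u * 1024u;  // 256 MiB, arbitrary size

    void* host_buf = std::malloc(bytes);        // Main Memory (pageable)
    void* device_buf = nullptr;                 // Graphics Memory
    if (host_buf == nullptr || cudaMalloc(&device_buf, bytes) != cudaSuccess) {
        std::fprintf(stderr, "allocation failed\n");
        std::free(host_buf);
        return 1;
    }
    std::memset(host_buf, 0, bytes);  // touch the pages before timing

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Record events on either side of the copy, then wait for the second.
    cudaEventRecord(start);
    cudaError_t err =
        cudaMemcpy(device_buf, host_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    if (err == cudaSuccess) {
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        std::printf("Copied %zu bytes in %.2f ms (%.2f GB/s)\n",
                    bytes, ms, gigabytes_per_second(bytes, ms));
    } else {
        std::fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(device_buf);
    std::free(host_buf);
    return err == cudaSuccess ? 0 : 1;
}
```

The practical point: every byte the Device works on has to cross the Bus first, so it pays to move data over once and do as much work on it as possible before bringing results back.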
The history goes back to the earliest days of workstations (PCs).
GPUs optimized for highly parallel computing were perfect for applications like image and video processing, and AI/ML computation.
NVidia was founded in 1993, making Graphics Adapters and chips. In 1999 they shipped their first "GPU", coining the term, and heralding the evolution to come. Around 2003 researchers at Stanford, led by Ian Buck, developed a programming system called "Brook" which provided a way for applications to use GPUs for general-purpose parallel computing. Buck went on to join NVidia, and in 2006 they announced CUDA (first released in 2007), a whole software environment for developing applications which used GPUs. The initial application was of course gaming, which continues to be an important use case today, but it enabled many other uses as well, including, importantly, accelerating the execution of neural networks, which have revolutionized AI and ML applications.
In 2008 Apple, working through the Khronos Group with other vendors including AMD, Intel, and NVidia, proposed OpenCL, an "open" approach to GPU computing closely patterned on CUDA (which was and remains NVidia-only). The first implementations shipped in 2009.
Today you can write applications with CUDA for NVidia GPUs [only], or you can write applications with OpenCL which will run on virtually any GPU [including NVidia]. But CUDA is optimized for NVidia, and NVidia remains the leading GPU vendor. If you've developed an application in CUDA it isn't too difficult to migrate to OpenCL, because of the architectural similarity.
CUDA is an environment with four main pieces:
CUDA programs are written in C/C++, and are compiled with NVidia's compiler driver, nvcc. There are slight extensions to the language, and source files are named .cu instead of .c or .cpp. A CUDA program contains some logic which runs on the CPU/Host, and some which runs on the GPU/Device. nvcc separates the two. The Host code is preprocessed and then passed through to the platform's C++ compiler (gcc, the GNU compiler, on Linux, or Visual C++ on Windows) for code generation. The Device code is compiled into an NVidia-specific intermediate representation called PTX, a kind of virtual assembly language.
PTX is stored as data inside the generated code. CUDA programs are linked with CUDA libraries, which implement the CUDA API. A single EXE results, which contains the Host code as machine instructions and the Device code as PTX (often alongside precompiled machine code for specific GPU architectures). At execution time, if no precompiled code matches the installed GPU, the NVidia driver JIT-compiles the PTX into that GPU's native instructions and loads the result onto the Device. This architecture is simple - only one EXE file contains both the Host and Device logic - and enables support for a wide variety of devices, since the PTX is translated for the specific Device at execution time.
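To make this concrete, here's a minimal sketch of a .cu file with both kinds of logic in it: one function marked __global__, which is Device code, and ordinary C++ functions, which are Host code. The names add_vectors and run_vector_add and the choice of 256 threads per block are mine, purely for illustration.

```cuda
// vector_add.cu : Host and Device code in one source file.
// Build (assuming the CUDA toolkit is installed):
//   nvcc -o vector_add vector_add.cu
// To look at the generated PTX instead of building an executable:
//   nvcc -ptx vector_add.cu
#include <cstddef>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Device code: each GPU thread adds one pair of elements.
__global__ void add_vectors(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = a[i] + b[i];
    }
}

// Host code: copy inputs to the Device, launch the kernel, copy the result
// back, and return true if every element matches the CPU's answer.
bool run_vector_add(int n) {
    std::vector<float> a(n), b(n), out(n, 0.0f);
    for (int i = 0; i < n; ++i) {
        a[i] = static_cast<float>(i);
        b[i] = 2.0f * static_cast<float>(i);
    }

    const size_t bytes = n * sizeof(float);
    float *d_a = nullptr, *d_b = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_a, bytes) != cudaSuccess ||
        cudaMalloc(&d_b, bytes) != cudaSuccess ||
        cudaMalloc(&d_out, bytes) != cudaSuccess) {
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_out);
        return false;
    }
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    // Launch enough blocks of 256 threads to cover all n elements.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    add_vectors<<<blocks, threads>>>(d_a, d_b, d_out, n);
    cudaError_t err = cudaGetLastError();  // catches launch failures

    // This copy waits for the kernel to finish before reading the result.
    if (err == cudaSuccess) {
        err = cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        if (out[i] != a[i] + b[i]) return false;
    }
    return true;
}

int main() {
    const bool ok = run_vector_add(1 << 20);
    std::printf("vector add %s\n", ok ? "succeeded" : "failed");
    return ok ? 0 : 1;
}
```

The <<<blocks, threads>>> launch syntax is one of the "slight extensions" to C++; everything else in the Host functions is plain C++ calling the CUDA API.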
NVidia provides great integration for CUDA programs and nvcc with many development tools, including importantly Visual Studio and Xcode.
It's worth mentioning that there are many libraries available which use CUDA "under the covers" for programs written in other languages, including importantly Python (CuPy, Numba, PyCUDA, PyTorch, MXNet, and TensorFlow), as well as bindings for C# and Java. For many applications it isn't necessary to delve this deeply into CUDA programming; you can just pick the right library, use it, and happily take advantage of GPU acceleration.