While my example is graphical, only the SDL frame buffer capability of the GPU is being used. All object rendering computations are done by my CUDA code running on the GPU -- the card doesn't know this is a graphical computation.
The NVIDIA GPU has multiple multiprocessors: 12 on the GeForce 8800 GTS I tested. Each multiprocessor has 8 processors, and with them each multiprocessor can support up to 768 concurrent threads.
Under CUDA the programmer writes a kernel which is the code executed by a single thread. Threads are placed in two or three-dimensional groupings called blocks, and blocks in turn are placed in two-dimensional groupings called grids. The run-time system schedules blocks and grids to run as resources are available. The general computational model is SIMD: the same kernel instructions are executed by all threads. The NVIDIA C compiler supports fairly standard C with branches, loops, etc., but no recursion.
Like the SPUs on the PS3 Cell processor, GPU memory is limited and access to host memory is slow. Data is explicitly moved from the host to the device by the host program. Here data is in device global memory (shared between all multiprocessors), which is still quite slow. The kernel threads can copy data from global memory to shared memory (shared between processors in a single multiprocessor), though shared memory is limited to 16K per multiprocessor.
With this design CUDA is not a practical way to implement a ray-tracer. The 16K shared memory restriction makes realistic world models impractical. It appears the number of threads which can be started simultaneously is constrained by the size of the world state structure:
Here we see that with a small number of objects in the world the program runs faster with shared memory. With a large number of objects in the world it is actually faster without shared memory (and shared memory cannot fit more than 225 objects). The Shared memory speedup is independent of whether all the objects are actually rendered. The program bottleneck is not in looping over the objects, but in the speed of access to object data and the number of threads which can be simultaneously started.
Below shows CPU and memory utilization during NVIDIA CUDA GPGPU, generic 1 pthread, and generic 4 pthreads tests. In the CUDA test my ray program used 92% of one CPU and the Xorg X-windows server used 12% of another CPU.
int screenX;
#ifdef USE_OMP
#pragma omp parallel for firstprivate(portPoint)
#endif
for (screenX = 0;
screenX < SCREEN_WIDTH;
screenX++) {
With this change each pixel is rendered by a separate thread, with a maximum
of 4 (the number of cores) running simultaneously. This use of threads is
an alternative to the explicit pthread threading tested above, where a
separate thread
is used per line of the screen. In this case OpenMP is simpler than explicit
threading, though it yields slightly lower performance.
GCC 4.2 also supports OpenMP, but its performance is much worse than Intel.
Threads Frames Per Second ---------------------------- 1 8.5 2 17.4 GPU 16.7
Intel(R) Core(TM) i7 CPU X 980 @ 3.33GHz Linux hex 2.6.35-22-generic #35-Ubuntu SMP Sat Oct 16 20:45:36 UTC 2010 x86_64 GNU/Linux Ubuntu 10.10
This testing was paired with Go ACO TSP.
Here hyper-threading has improved performance for both programs in all configurations. At one core both programs were sped up 28%. At 6 cores ray tracing was sped up 9% versus non-HT, while ACO TSP was sped up 18% versus non-HT. The higher speedup for ACO TSP is expected as it matches the general speedup from increased cores.
Both programs were run with 24 threads in all test configurations. Having more program threads than cores*hardware_threads_per_core is key. Otherwise the OS may schedule multiple threads on the same core while leaving others idle. Linux is hyper-theading aware and attempts to avoid this scheduling problem, but doesn't always get it right. In testing with 6 threads and 6 cores, I saw 28% decreases in performance for both programs when hyper-threading was enabled.
The Intel i7-980X also supports "Turbo Boost", where one core is automatically overclocked when other cores are idle. Turbo boost was disabled for the tests above. With one core and HT disabled, Turbo boost provides a speedup of 8% for both programs. Surprisingly, with six cores and HT enabled (so all cores should be fully utilized) Turbo boost provides a speedup of 5% for both programs.
The OpenCL CPU version provided similar results to the pthreads-generic version tested earlier. With 10 objects in the scene the GPU performance was only slightly better than the CPU version. But as the number of objects increased the GPU version maintained performance much better than the CPU version. This indicates that with low object counts the actual work done by each GPU core is small relative to the setup and copying results back overheads. Yes, a realistic ray tracer would be culling the scene graph so fewer objects would be evaluated for each point. And the GPU version ran out of constant memory above 800 objects. But this is much better scaling and performance than the NVIDIA CUDA test I did a few years ago.