[PD-dev] CUDA discussion

Charles Henry czhenry at gmail.com
Mon Nov 2 15:18:28 CET 2009


Dear list,

I'd like to start a conversation about CUDA and Pd.

For those of you who don't know, CUDA is NVIDIA's C-based programming
model for doing single-precision floating point calculations on
NVIDIA GPUs.  Blocks of data are copied to GPU device memory, and
kernels operate on that data with thread blocks whose sizes are kept
to multiples of 32 threads (the warp size).  A complete set of
floating point math functions is available for CUDA, and the CUDA
compiler, nvcc, works very well alongside gcc.
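
To make that pattern concrete, here is a minimal, self-contained
sketch (not Pd-specific; the kernel and variable names are just for
illustration): allocate device memory, copy a block of samples over,
run a kernel with a thread-block size that is a multiple of 32, and
copy the result back.

    #include <cuda_runtime.h>
    #include <stdio.h>

    /* kernel: each thread scales one sample */
    __global__ void scale(float *d, float g, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= g;
    }

    int main(void)
    {
        const int n = 1024;
        float h[n];
        for (int i = 0; i < n; i++) h[i] = (float)i;

        float *d;
        cudaMalloc((void **)&d, n * sizeof(float));
        cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

        /* 128 threads per block: a multiple of the 32-thread warp */
        int threads = 128;
        int blocks = (n + threads - 1) / threads;
        scale<<<blocks, threads>>>(d, 0.5f, n);

        cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d);
        printf("h[2] = %f\n", h[2]);  /* expect 1.0 */
        return 0;
    }

Something like "nvcc -o scale scale.cu" builds it; nvcc compiles the
kernel and hands the host code off to gcc, which is why the two
compilers coexist so easily.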

I've been studying it at work, but have not coded anything for Pd
yet.  There are a lot of performance issues that hinge on tiny
details in the documentation--implementing CUDA-based externals could
be made fairly simple for developers if a complete set of CUDA<->Pd
extensions were coded from the beginning.

Any project worth doing is worth doing right.  So, I want to figure
out: a) whether it's worth doing, and b) how to do it right.

I've got a first draft of top-down design issues, and I'd like to make
a list of incremental milestones that would prove the concept is
sound.

top-down design issues:
1.  The essential CUDA<->Pd functions should be kept separate from
the CUDA-based Pd externals themselves, with their own header file,
and compilable to both shared and static libraries.  (A rough sketch
of such a header follows this list.)
2.  The set of CUDA<->Pd extensions needs to be able to manage
multiple devices, including device query, initialization, and setting
global parameter sets per GPU.  Most likely, this means a custom data
structure and an object-based method system.
3.  Compilation--how to create the build system and handle
dependencies for a library of CUDA-based externals, especially
management of the CUDA libraries (the CUDA runtime and CUBLAS).
4.  Testing and initialization.  At setup time, a CUDA-based external
should be able to find out whether it has a usable device and is
ready to run.
5.  Abstraction of major device and memory operations.  What makes up
a sufficient and complete set of operations?  This list will most
likely grow through experimentation, but a good preliminary list of
operations will help get things started on the right footing.
6.  Performance.  How do we profile or benchmark and compare
implementations?  The single greatest performance issue I have
identified is keeping data cached on the GPU: host<->device memory
transfers can be eliminated in some cases, allowing CUDA-based
externals to follow one another in the DSP tree with faster
scheduling and better runtime performance.
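
To make issue 1 (and the buffer hand-off idea in issue 6) more
concrete, here is a rough sketch of what the shared header might look
like.  Every name in it (cudapd.h, t_cudapd_device, and so on) is a
hypothetical placeholder, not a worked-out API:

    /* cudapd.h -- hypothetical CUDA<->Pd glue layer (names illustrative) */
    #ifndef CUDAPD_H
    #define CUDAPD_H

    #include <cuda_runtime.h>

    /* per-device state (issue 2): one record per GPU found at query time */
    typedef struct _cudapd_device {
        int index;                    /* CUDA device index */
        struct cudaDeviceProp prop;   /* from cudaGetDeviceProperties() */
        int initialized;
    } t_cudapd_device;

    /* a signal block that stays in device memory (issue 6): externals
       hand these to each other instead of copying back to the host */
    typedef struct _cudapd_buffer {
        float *d_samples;   /* device pointer */
        int n;              /* block size */
        int device;         /* which GPU owns the buffer */
    } t_cudapd_buffer;

    int cudapd_query(t_cudapd_device **devs);      /* milestone 1 below */
    int cudapd_init(t_cudapd_device *dev);         /* milestone 2 below */
    int cudapd_upload(t_cudapd_buffer *b, const float *h, int n);
    int cudapd_download(float *h, const t_cudapd_buffer *b);

    #endif /* CUDAPD_H */

The t_cudapd_buffer type is the caching idea from issue 6 in
miniature: if one CUDA-based external can pass a device pointer
straight to the next one in the DSP chain, the host<->device copies
at each boundary disappear.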

(proposed) incremental milestones:
1.  Create an external that checks for GPUs and hands error messages
back to Pd (a rough sketch of such an object follows this list).
2.  Create an external that initializes GPUs.
3.  Create an external that performs host<->device memory transfers
and runs an operation.
4.  Create an external that performs an operation and compares the
time it takes against the same operation on the CPU.  At that point,
it should be possible to identify and hopefully quantify the
potential speedup on the GPU, and to decide whether or not it is
worth it.
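
For milestone 1, the external could be as small as the following
sketch, which uses the standard Pd class API plus the CUDA runtime's
device-query calls (the object name cudainfo is just a placeholder):

    #include "m_pd.h"
    #include <cuda_runtime.h>

    static t_class *cudainfo_class;

    typedef struct _cudainfo {
        t_object x_obj;
    } t_cudainfo;

    /* on bang: query the devices and report to the Pd console */
    static void cudainfo_bang(t_cudainfo *x)
    {
        int count = 0;
        cudaError_t err = cudaGetDeviceCount(&count);
        if (err != cudaSuccess) {
            pd_error(x, "cudainfo: %s", cudaGetErrorString(err));
            return;
        }
        post("cudainfo: %d CUDA device(s) found", count);
        for (int i = 0; i < count; i++) {
            struct cudaDeviceProp prop;
            if (cudaGetDeviceProperties(&prop, i) == cudaSuccess)
                post("  device %d: %s, compute %d.%d, %d multiprocessors",
                     i, prop.name, prop.major, prop.minor,
                     prop.multiProcessorCount);
        }
    }

    static void *cudainfo_new(void)
    {
        return (void *)pd_new(cudainfo_class);
    }

    void cudainfo_setup(void)
    {
        cudainfo_class = class_new(gensym("cudainfo"),
            (t_newmethod)cudainfo_new, 0, sizeof(t_cudainfo),
            CLASS_DEFAULT, 0);
        class_addbang(cudainfo_class, (t_method)cudainfo_bang);
    }

Milestone 4 could then reuse the same skeleton, timing a kernel with
cudaEventRecord() against the equivalent loop on the CPU.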

That's enough for now.  I'd like to know if anyone else has been
thinking along similar lines (CUDA has been out for, like, 2 years or
so now, so I bet that many people know about it), and if you have any
input on the design issues.

Chuck



