[PD] [PD-dev] CUDA discussion

Charles Henry czhenry at gmail.com
Mon Nov 16 00:52:48 CET 2009


On Sat, Nov 14, 2009 at 3:57 AM, Claude Heiland-Allen
<claudiusmaximus at goto10.org> wrote:
>> Probably, if we come up with anything in
>> the way of top-down design, we should be able to apply the same
>> framework to different platforms and make comparisons.
>
> My initial thoughts on top down design, having read only the OpenCL intro
> slides and not written any code:
>
> It will be difficult to make an "OpenCL API as Pd objects" approach
> efficient; there are too many timing/asynchrony issues as far as I can tell
> so far.  It would also be undesirable to expect all users to know
> how OpenCL works in detail, as it's quite tricky...

It is tricky.  I favor an incremental approach, treating the timing
issues as later optimizations.  I think an open-platform OpenCL library
would eventually make a CUDA library obsolete, but CUDA seems simpler at
the moment, so I can say with some confidence what it does and how it
works.  Once it's finished for CUDA, the same design should carry over
to OpenCL.

With CUDA, kernel launches are non-blocking, but the host can call
cudaThreadSynchronize() to wait until they finish, and your users don't
even have to know they were non-blocking.  Even without any
optimizations, CUDA/OpenCL will be useful for speeding up single
operations on very large block sizes, and the externals can be written
to require no special knowledge of what's going on underneath.
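
To make that concrete, something like this (an untested sketch; the
kernel and the names are invented for illustration) is all a
user-facing external would have to hide:

    /* device code: one thread per sample, like a scalar [*~] */
    __global__ void scale(float *buf, float g, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[i] *= g;
    }

    /* host code: the launch returns immediately... */
    scale<<<(n + 255) / 256, 256>>>(d_buf, 0.5f, n);
    /* ...and this blocks until the kernel has actually finished */
    cudaThreadSynchronize();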

One approach is to implement only the objects that stand to gain a lot
from the GPU, [fft~] and [*~] for example.  Here we get the benefit of
using higher overlap factors and larger block sizes in patches without
audio dropouts.  It's transparent, simple, and minimal effort.
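
For instance, an [fft~]-style external could create a cuFFT plan once
when the dsp chain is built and then execute it per block.  A rough
sketch, assuming single-precision real input and block size n (buffer
names are made up, error checking omitted):

    #include <cufft.h>
    #include <cuda_runtime.h>

    cufftHandle plan;
    cufftReal *d_in;
    cufftComplex *d_out;

    /* once, at "; pd dsp 1" time */
    cudaMalloc((void **)&d_in, n * sizeof(cufftReal));
    cudaMalloc((void **)&d_out, (n / 2 + 1) * sizeof(cufftComplex));
    cufftPlan1d(&plan, n, CUFFT_R2C, 1);

    /* every block: copy in, transform, copy out */
    cudaMemcpy(d_in, in, n * sizeof(float), cudaMemcpyHostToDevice);
    cufftExecR2C(plan, d_in, d_out);
    cudaMemcpy(out, d_out, (n / 2 + 1) * sizeof(cufftComplex),
               cudaMemcpyDeviceToHost);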

> An alternative idea is to have an [opencl~] object that sits in a subpatch
> like [switch~] or [block~] does, that analyses the DSP sub-graph for that
> subpatch and "lifts" the whole dsp part of the subpatch to the GPU.
>  Probably it would require some magic to get [+~] etc to work directly, so
> maybe a compromise would be [cl/+~] etc, then the [opencl~] object would
> "compile" these objects into a kernel at "; pd dsp 1" time.  Then [inlet~]
> and [outlet~] would do the transfer to/from GPU (if the containing patch
> isn't on GPU, otherwise it would just pass a pointer to GPU memory...).
>
> And if [import] or whatever works to a sufficient extent(*), it should be
> possible to go back to using [+~] and [import cl], with a backend switchable
> so you could [import simd] instead to use regular SIMD-optimized dsp
> objects, or [import vanilla] to use the normal unoptimized Pd objects,
> etc...
>
> The first idea is simpler to implement but is horrible for users; the second
> idea is simpler for users but requires horrible knowledge of Pd internals to
> implement.  The goal of libraries is to make life better for library users,
> I think - so that makes the second idea more attractive.
>
> (*) by which I mean [import] or other similar object works on a subpatch
> basis, or at least on a .pd file basis, so abstractions can have different
> libraries to their containing patches if they desire without clobbering
> anything globally, and imported objects can override global objects with the
> same name, etc...
>
>
> Claude
> --
> http://claudiusmaximus.goto10.org
>

I like where this is going.  [cl/inlet~] and [cl/outlet~] would be the
points where data is explicitly transferred to/from the GPU.  Without
[import cl] they would default to vanilla inlets, and once [import cl]
works, they would act just like regular inlets except that they perform
the transfer.
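
In pseudo-C, the perform step of a [cl/outlet~] might reduce to
something like this (just a sketch of the idea, not working code;
downstream_on_gpu and x_dev are invented names):

    /* hypothetical [cl/outlet~] perform step */
    if (downstream_on_gpu)
        out = x->x_dev;              /* hand over the device pointer */
    else
        cudaMemcpy(out, x->x_dev, n * sizeof(float),
                   cudaMemcpyDeviceToHost);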

By the way, the reason caching on the GPU is an important feature is
that each transfer has latency and is slow compared to GPU memory speed.
High-end GPUs can have more than 50 GB/s of memory bandwidth, but over
PCIe x16 you will only see about 3 GB/s.  Kernel scheduling on the GPU
is fast, so the gap between successive calls can be kept short as long
as no data transfer is involved.
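
To put rough numbers on it: a 64-sample float block is only 256 bytes,
which at 3 GB/s moves in well under a microsecond, so the fixed
per-transfer latency (on the order of 10 microseconds per PCIe round
trip, give or take) dominates.  Even a 65536-sample block is only
256 KB, roughly 85 microseconds each way, but paying that between every
pair of objects adds up fast.  Hence keeping intermediate buffers
resident on the GPU.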

Wrapping up the [cl/] functionality to work just the way you described
is great, because it would give us a good framework for optimizations
between objects.  There are all kinds of scenarios for misuse, but it
should be simple enough for users to learn.

To begin with, I'm writing externals with explicit prefixes, like
[cuda_query~] and [cuda_copy~].  Once those work, I'll drop the
prefixes and test the next stage.
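
As a sketch of what the second of those could look like, here's a
pass-through [cuda_copy~] that just round-trips each signal block
through device memory (untested; error handling and a free method are
omitted, and it assumes t_sample is single-precision float):

    #include "m_pd.h"
    #include <cuda_runtime.h>

    static t_class *cuda_copy_tilde_class;

    typedef struct _cuda_copy_tilde {
        t_object x_obj;
        t_float x_f;        /* dummy for CLASS_MAINSIGNALIN */
        float *x_dev;       /* device-side buffer */
        int x_n;            /* allocated block size */
    } t_cuda_copy_tilde;

    static t_int *cuda_copy_tilde_perform(t_int *w)
    {
        t_cuda_copy_tilde *x = (t_cuda_copy_tilde *)(w[1]);
        t_sample *in = (t_sample *)(w[2]);
        t_sample *out = (t_sample *)(w[3]);
        int n = (int)(w[4]);
        /* round trip: host -> device -> host */
        cudaMemcpy(x->x_dev, in, n * sizeof(float),
                   cudaMemcpyHostToDevice);
        cudaMemcpy(out, x->x_dev, n * sizeof(float),
                   cudaMemcpyDeviceToHost);
        return (w + 5);
    }

    static void cuda_copy_tilde_dsp(t_cuda_copy_tilde *x, t_signal **sp)
    {
        if (x->x_n != sp[0]->s_n) {    /* (re)allocate device buffer */
            if (x->x_dev) cudaFree(x->x_dev);
            cudaMalloc((void **)&x->x_dev, sp[0]->s_n * sizeof(float));
            x->x_n = sp[0]->s_n;
        }
        dsp_add(cuda_copy_tilde_perform, 4,
                x, sp[0]->s_vec, sp[1]->s_vec, sp[0]->s_n);
    }

    static void *cuda_copy_tilde_new(void)
    {
        t_cuda_copy_tilde *x =
            (t_cuda_copy_tilde *)pd_new(cuda_copy_tilde_class);
        outlet_new(&x->x_obj, &s_signal);
        x->x_dev = 0;
        x->x_n = 0;
        return (void *)x;
    }

    void cuda_copy_tilde_setup(void)
    {
        cuda_copy_tilde_class = class_new(gensym("cuda_copy~"),
            (t_newmethod)cuda_copy_tilde_new, 0,
            sizeof(t_cuda_copy_tilde), CLASS_DEFAULT, 0);
        CLASS_MAINSIGNALIN(cuda_copy_tilde_class, t_cuda_copy_tilde, x_f);
        class_addmethod(cuda_copy_tilde_class,
            (t_method)cuda_copy_tilde_dsp, gensym("dsp"), A_CANT, 0);
    }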

Chuck



