[PD-dev] per-thread storage in Pd in support of pdlib - discussion?

Charles Henry czhenry at gmail.com
Thu Jan 26 21:45:07 CET 2012


On Wed, Jan 25, 2012 at 5:32 PM, Peter Brinkmann
<peter.brinkmann at googlemail.com> wrote:

> I don't think users have anything to gain from fine-grained control of
> threads.  That seems like an optimization hint that may or may not be
> helpful, depending on a lot of factors that are not obvious and will differ
> from machine to machine.  In any case, I don't want to have to think about
> threads when patching any more than I want to think about, say, NEON
> optimizations.

I'm still making the case here:
Suppose you're writing a patch and you run up against the limitations
of a single-threaded process.  Then you move some portion into a
sub-patch and drop in a "thread~" object.  You're able to selectively
add the functionality where it matters to you *and* only when you
actually need it.
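To make that concrete, here's a rough sketch (plain C with pthreads,
not actual Pd code--every name here is mine) of the kind of
worker-thread handoff a "thread~" object could wrap.  For simplicity
this version blocks until the worker is done; a real thread~ would
instead return the *previous* block's result, hiding the worker behind
one block of latency:

```c
#include <pthread.h>
#include <string.h>

#define BLOCKSIZE 64

/* hypothetical per-object state for a "thread~"-style wrapper */
typedef struct {
    float in[BLOCKSIZE], out[BLOCKSIZE];
    pthread_mutex_t lock;
    pthread_cond_t todo, done;
    int have_work, have_result, quit;
} tblock_t;

/* worker thread: runs the wrapped subpatch's DSP (a simple gain
   stands in for it here) off the main audio thread */
static void *worker_main(void *arg)
{
    tblock_t *b = (tblock_t *)arg;
    pthread_mutex_lock(&b->lock);
    for (;;) {
        while (!b->have_work && !b->quit)
            pthread_cond_wait(&b->todo, &b->lock);
        if (b->quit)
            break;
        for (int i = 0; i < BLOCKSIZE; i++)
            b->out[i] = 0.5f * b->in[i];   /* stand-in for heavy DSP */
        b->have_work = 0;
        b->have_result = 1;
        pthread_cond_signal(&b->done);
    }
    pthread_mutex_unlock(&b->lock);
    return NULL;
}

static void tblock_start(tblock_t *b, pthread_t *thr)
{
    memset(b, 0, sizeof *b);
    pthread_mutex_init(&b->lock, NULL);
    pthread_cond_init(&b->todo, NULL);
    pthread_cond_init(&b->done, NULL);
    pthread_create(thr, NULL, worker_main, b);
}

/* the object's perform routine: hand a block over, wait for the result */
static void tblock_process(tblock_t *b, const float *in, float *out)
{
    pthread_mutex_lock(&b->lock);
    memcpy(b->in, in, sizeof b->in);
    b->have_work = 1;
    b->have_result = 0;
    pthread_cond_signal(&b->todo);
    while (!b->have_result)
        pthread_cond_wait(&b->done, &b->lock);
    memcpy(out, b->out, sizeof b->out);
    pthread_mutex_unlock(&b->lock);
}
```

The point is just that the synchronization stays local to the object,
so the rest of the patch never has to know about it.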

The generalizable case is much preferable, I agree, but as you say
further on, an automatic approach might incur significant
overhead--and may not be appropriate for all applications.

The design rationale for PdCUDA (in progress..grumble) is to expose
the programming costs through benchmarks and measurement tools, while
providing user-level control.  It's a different sort of problem--where
mixtures of different kinds of processors are concerned, the
application being designed may not be appropriate for one kind (e.g.
recursive filters on CUDA will run slowly, so just don't put them
there!).

Again... my head's in audio.  I'm still puzzling over the other ideas on topic.

>> > I believe it's much simpler than that.  It should be enough to just do a
>> > topological sort of the signal processing graph; that'll tell you which
>> > objects are ready to run at any given time, and then you can parallelize
>> > the
>> > invocation of their perform functions (or not, depending on how many
>> > processors are available).  I don't think there's any need to explicitly
>> > synchronize much; tools like OpenMP should be able to handle this
>> > implicitly.
>> > Cheers,
>> >      Peter
>>
>> For that--the dspchain (an array of int*) makes a very bad structure.
>> So, you'll want to re-write a handful of functions and data structures
>> around having multiple concurrent branches of computations.  I
>> actually really like this problem :D  I can picture a linked list of
>> dspchains to do this.  But... the description of the sort algorithm
>> really will determine what the data structure ought to be.
>
>
> Hmm, I think that's a pretty standard scheduling problem.  All the necessary
> information is already in Pd, and it's being used when dsp_chain is
> computed.  It's just a matter of representing it as a dependency tree rather
> than a list, and then traversing the tree in the correct order.

I'm going to expand on this a bit here (I think we're on the same
page, but that just doesn't say enough about it).

For clarification/scope (and correct me if I'm wrong here): the
algorithm in d_ugen.c works this way within each dspcontext (see the
functions ugen_done_graph and ugen_doit):
Create a list of all ugens in the current context.  Loop over the
list, and for each ugen with no inlet connections, call
ugen_doit--which calls the ugen's "dsp" function and proceeds
depth-first, scheduling any connected ugens that have no unfilled
inlets.
If any ugens remain unscheduled (i.e. there's a dsp loop), throw an
error and make sure the outlet signals are new and set to zero in the
parent context.
Then delete all the ugens in the graph.
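Sketched as a ready-set (Kahn-style) sort rather than the depth-first
recursion d_ugen.c actually uses--both produce valid orders--that pass
looks roughly like this (toy arrays, not Pd's data structures):

```c
#include <stddef.h>

#define MAXFANOUT 4

/* indegree[u]: number of unfilled inlet connections into ugen u
   out[u]: the ugens fed by u's outlets; order[]: the schedule produced */
static int schedule(int nugens, int indegree[], int nout[],
                    int out[][MAXFANOUT], int order[])
{
    int nsched = 0, head = 0;
    /* seed with ugens that have no inlet connections */
    for (int u = 0; u < nugens; u++)
        if (indegree[u] == 0)
            order[nsched++] = u;
    /* each scheduled ugen "fills" its downstream inlets; a ugen whose
       inlets are all filled becomes ready (this is where ugen_doit
       would call the dsp method and emit perform entries) */
    while (head < nsched) {
        int u = order[head++];
        for (int i = 0; i < nout[u]; i++)
            if (--indegree[out[u][i]] == 0)
                order[nsched++] = out[u][i];
    }
    return nsched;  /* < nugens means a dsp loop: error + zero outlets */
}
```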

The dependency information is currently just being used as an
intermediate state while creating the dsp_chain structure (upon which
there is no strict dependency information).

The tree structure needs to be generated in place of a single
dsp_chain.  It would just be made up of short dsp_chain structures,
each one representing a depth-first branch.  Within a branch the
dependency information is essential--these are strictly serial
portions, run in a single thread (yay!--the code from dsp_tick goes
here).  Then, on a higher level, you dispatch/schedule those branches.
At that higher level (the tree), you need to represent the
dependencies between branches; at first guess, that's a doubly-linked
list in which the breadth-wise links are potentially concurrent.
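For what I mean by "short dsp_chain structures plus dependency links,"
something like this--a guess, with stand-in typedefs for Pd's
t_int/perform-routine convention, and the struct and function names
all hypothetical:

```c
#include <stddef.h>

/* minimal stand-ins for Pd's t_int / perform-routine convention */
typedef long t_int;
typedef t_int *(*t_perform)(t_int *args);

typedef struct _dspbranch {
    t_int *chain;              /* flattened perform calls, dsp_chain-style */
    int ndeps;                 /* static count of upstream branches */
    int npending;              /* runtime countdown, reset each dsp tick */
    struct _dspbranch **succ;  /* branches unblocked when this finishes */
    int nsucc;
} t_dspbranch;

/* run one branch the way dsp_tick runs the global chain: each perform
   routine returns a pointer to the next entry (zero-terminated here) */
static void dspbranch_run(t_dspbranch *b)
{
    t_int *ip = b->chain;
    while (ip && *ip)
        ip = ((t_perform)(*ip))(ip);
}

/* when a branch finishes, count down its successors and collect any
   that just became runnable */
static int dspbranch_finish(t_dspbranch *b, t_dspbranch **ready)
{
    int n = 0;
    for (int i = 0; i < b->nsucc; i++)
        if (--b->succ[i]->npending == 0)
            ready[n++] = b->succ[i];
    return n;
}

/* toy perform routine for demonstration: adds its one argument to a
   global accumulator and returns a pointer past its two-word entry */
static double g_acc = 0;
static t_int *add_perform(t_int *w) { g_acc += (double)w[1]; return w + 2; }
```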

hmm... I guess that won't work for block_prolog and block_epilog as
written.... but it would be fine for most perform routines.

You wouldn't be using the dependency information in real-time to make
decisions about what to schedule--you'd just generate, ahead of time,
the new data structure that tells how the program runs.
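And a rough picture of what that precomputed structure buys you at
tick time, per Peter's OpenMP remark: if the sort groups branches into
levels, the scheduler is just a serial loop over levels with a
parallel loop inside.  All names here are hypothetical, and with
OpenMP disabled the pragma is ignored and everything runs
serially--which is exactly the n == 1 fallback:

```c
/* each branch is abstracted to a callable; in Pd it would run one of
   the short serial dsp chains described above */
typedef void (*branchfn)(void);

/* branches within one level have no dependencies on each other, so a
   level can run in parallel; levels run strictly in order */
static void dsp_tick_parallel(branchfn **levels, int nlevels,
                              const int *nbranches)
{
    for (int l = 0; l < nlevels; l++) {
        #pragma omp parallel for  /* no-op if not built with OpenMP */
        for (int i = 0; i < nbranches[l]; i++)
            levels[l][i]();
    }
}

/* toy branches: level 0 writes a source value, level 1 reads it */
static int stage[3];
static void src(void)  { stage[0] = 42; }
static void tapA(void) { stage[1] = stage[0]; }
static void tapB(void) { stage[2] = stage[0]; }
```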

Now, that description neglects other concerns I know Miller has
wrestled with--thread-safe delwrite~/delread~, tabwrite~/tabread~,
throw~/catch~, more?

> The real question is whether there's anything to gain from this at all, or
> whether the overhead from parallelization will destroy any gains.  I always
> remember the cautionary tale of a complex system that was carefully designed
> to work with an arbitrary number n of threads, until it was profiled and the
> designers found that it works best when n == 1.

When talking about cluster computing, I had someone once ask: "Is that
a case where the whole is greater than the sum of its parts?"
"It's less.  Always less."

Chuck


