[PD-dev] threads

Thu Apr 11 23:21:26 CEST 2024

> Yeah, that's just the way that Windows rolls for any program that has 
> a message loop. 
Where did you get that information? A message loop does not create 
threads, it's the other way round: you create a message loop on existing 
threads, e.g. by creating a window (typically on the main thread).

Not to mention that Pd does not contain any windows event loops (except 
for some externals).

> z_ringbuffer appears to be deployed in pretty much only one context 
> which appears to be pretty much isolated to always being in the same 
> thread. 
What do you mean by "in the same thread"? The ringbuffer is used for 
queuing outgoing messages; the message is sent on the audio thread and 
receeived on some other thread (e.g. the UI thread).

Just to be clear: both the portaudio and libpd ringbuffer are 
single-producer-single-consumer, i.e. there must only be a single writer 
thread and a single reader thread at a time.

> As I said above, I am pretty sure that z_ringbuffer's use of atomics 
> is actually incorrect.
It's only "wrong" in the sense that it uses more complex atomics and 
stronger memory order than required, but it does not make the code 
incorrect.

To be more specific: the libpd ringbuffer uses atomic read-modify-write 
operations (with dummy arguments) instead of atomic loads and stores. 
Again, these are hacks from pre-C11 times. Unfortunately, the C11 
version follows this pattern instead of using the more appropriate 
atomic_load[_explicit] and atomic_store_[explicit] functions. Also, it 
implicitly uses the default memory ordering (memory_order_seq_cst) for 
all atomic operations, which is much stronger than what we actually need.

> I haven't yet taken the time to prove this as the z_ringbuffer call 
> sites are very limited, and it clearly works in context
SPSC ringbuffers are a solved problem and one of the most basic lockfree 
datastructures there is. If you understand how atomics work you can just 
look at the code and immediately tell if it's correct or not. There's 
not much to "prove". Here's how a lockfree SPSC ring buffer works, using 
the libpd ringbuffer API as an example:

1. rb_available_to_write: atomically load and compare both the read and 
write pointer. Actually, the loads can be relaxed (the proof is left as 
an excercise to the reader :), but it does not hurt to use 
memory_order_acquire (or higher).

2: rb_available_to_read: same as 1.

3. rb_write_to_buffer: load the write head and write the data, then 
increment it and store it back atomically with memory_order_release (or 
higher)

4. rb_read_from_buffer: first load the read head and read the data, then 
increment it and store it back atomically with memory_order_release (or 
higher)

Now where do you think the libpd ringbuffer implementation is incorrect 
in the sense that it would cause race conditions?

For comparison, here's my own little implementation in C++11: 
https://github.com/Spacechild1/vstplugin/blob/3f0ed8a800ea238bf204a2ead940b2d1324ac909/vst/Lockfree.h#L10-L58 
(which you can assume to be correct :)

If you still don't trust the libpd ringbuffer, feel free to use that 
instead. I hope you are not using C, but if you do, the code can be 
easily adapted to the equivalent C11 functions (see 
https://en.cppreference.com/w/c/thread)

Christof

On 11.04.2024 14:08, Caoimhe &co wrote:
> On Thu, 11 Apr 2024 at 07:52, Christof Ressi <info at christofressi.com> 
> wrote:
> > I get at least four threads. Just for some context: here on Windows 
> I get 10 threads when I open Pd and start DSP, but only 2 of these are 
> active and the remaining 8 or idle.
>
> Yeah, that's just the way that Windows rolls for any program that has 
> a message loop. It's also different when you run under a debugger than 
> when you run  standalone. Why yes, I do get paid not enough to do deep 
> Windows debugging, why do you ask?
>
> > The Pd core itself does not spawn any threads, only the audio 
> backend and certain objects/externals do (notably [readsf~] and 
> [writesf~]).
>
> I was aware of that as the standard story, which is why I was 
> surprised to see four threads, and that it was in (relatively) 
> invariant even when you included to more 'primitive' back-ends like 
> OSS. It seems like this is a bit of lore that should be known, if not 
> particularly documented anywhere.
>
> > But why do you care about the number of threads in the first place?
>
> Because I am working on code which is trying to handle some of the 
> *other* JACK data streams. Ambiguity in thread functionality makes for 
> ambiguity in debugging.
>
> >> As an aside: is the code in z_ringbuffer.{c,h} considered 
> trustworthy? I note that the other code in PD
> >> appears to use the sys_ringbuffer* API, which seems to be built on 
> the PA ringbuffer.
>
> > Is the PA ringbuffer considered trustworthy?
>
> Well it is *in use*, which means that *somebody* considers it 
> trustworthy (in a multi-threaded context). z_ringbuffer appears to be 
> deployed in pretty much only one context which appears to be pretty 
> much isolated to always being in the same thread. But see my above 
> question about threading. I had debugging artifacts which looked like 
> z_ringbuffer was behaving badly under thread race conditions, so I 
> looked at the code, and I am pretty sure that z_ringbuffer's use of 
> atomics is actually incorrect. I haven't yet taken the time to prove 
> this as the z_ringbuffer call sites are very limited.
>
> > Note that the ringbuffer code in "s_audio_ringbuf.c" - for whatever 
> reason - is missing all the memory
> > barriers from the original PA implementation. This happens to work 
> as the implementation is in another
> > source file and (non-inline) function calls act as compiler barriers 
> and Intel has a strong memory model,
> > but if compiled with LTO this code may very well fail on other 
> platforms, particularly on ARM.
>
>  Yeah, I saw that. It's actually *worse* because the header file 
> multi-include protection uses the same preprocessor symbol as the JACK 
> ringbuffer implementation. I fixed that in my local git repo. How do I 
> know this? After a light code read, I switched to using the JACK 
> ringbuffer implementation, which I *do* trust.
>
> >> I ask because I had some problems with z_ringbuffer.c and after a 
> code read, there are some
> >> bits which look sketchy enough to me that I decided to stop using it.
>
> > Which problems did you have? And which bits look sketchy? There are 
> some things that could be
> > improved. The original code has been written before C11, i.e. before 
> C/C++ got an official memory
> > model. As a consequence, the platform specific atomic instructions / 
> memory barriers are stronger
> > than required. In general, SYNC_FETCH should really be called 
> SYNC_LOAD and SYNC_COMPARE_AND_SWAP
> > should be called SYNC_STORE. With C11, SYNC_LOAD could be just an 
> atomic_load (with
> > memory_order_acquire) and SYNC_STORE could be an atomic_store (with 
> memory_order_release).
> > Apart from that, the code looks fine to me.
>
> As I said above, I am pretty sure that z_ringbuffer's use of atomics 
> is actually incorrect. I haven't yet taken the time to prove this as 
> the z_ringbuffer call sites are very limited, and it clearly works in 
> context. I don't think it will work in a more difficult context. The 
> last time I worked with atomics, I ended up writing an automated proof 
> checker to make sure that all of the cases worked correctly. I work 
> with PD when I want to make music more than I want to do advanced CS. 
> And then there's the question of the marginal utility of PD having its 
> own implementation of a ringbuffer, but I will leave that for all of 
> you who have dedicated far more time than I to the maintenance of this 
> project.
>
> - c&co
>
> _______________________________________________
> Pd-dev mailing list
> Pd-dev at lists.iem.at
> https://lists.puredata.info/listinfo/pd-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.puredata.info/pipermail/pd-dev/attachments/20240411/45be8aef/attachment-0001.htm>