[PD] how to iterate over left and right channel separately in one Pd class?

Hans-Christoph Steiner hans at at.or.at
Sun Jan 13 03:49:06 CET 2013


Yeah, that makes sense.  With all the auto-vectorization and SIMD support in
recent versions of gcc, it seems a better approach is to tailor the C code to
work well with SIMD-aware compilers.
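
As a rough illustration of what "tailored to SIMD-aware compilers" could mean, here is a minimal sketch. It is not taken from Pd's sources; the name scale_block() and the typedef are assumptions made for the example. The idea is a plain counted loop with no branches or function calls in the body, which is the shape auto-vectorizers tend to handle best.

```c
#include <stddef.h>

typedef float t_float; /* assumption: single precision, as in a default Pd build */

/* Hypothetical helper: multiply a block of samples by a constant gain.
   A simple counted loop, one load, one multiply and one store per
   sample. It also works in place (in == out), because each sample is
   read before it is written. */
void scale_block(const t_float *in, t_float *out, t_float gain, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * gain;
}
```

With gcc, loop vectorization is switched on at -O3 or with -ftree-vectorize. Whether this particular loop ends up vectorized depends on the compiler version and the target, so it is worth checking the generated code or profiling.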

.hc

On 01/12/2013 04:45 PM, katja wrote:
> It's interesting, but rather compiler-and-processor-specific. Such
> code is maintenance-intensive. At the moment, ARM processors are
> screaming loudest for optimization. Best thing for a community project
> is probably plain C code which reckons with parallel processing,
> because that won't go away for the next few decades. Functions like
> copy_perform8(), times_perform8() etc. can profit from SIMD
> instructions without a need for compiler intrinsics and asm code.
> Well-structured data storage and access can make a 50 % or more
> performance gain, in my experience.
> 
> Another important thing: avoid float precision conversions. Throughout
> Pd there are many untyped float defines and literal constants which
> default to double, and I have introduced more when making libs
> double-ready. Not good. I'll come back to this in another thread.
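
To illustrate the literal issue: in C an unsuffixed constant such as 0.5 has type double, so a float operand is converted to double for the operation and the result is converted back on store. A minimal sketch, assuming t_float is float; both function names are made up for the example.

```c
#include <stddef.h>

typedef float t_float; /* assumption: single-precision build */

/* 0.5 is a double literal: each sample is promoted to double,
   multiplied in double, then converted back to float on store. */
void halve_with_double_literal(t_float *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = buf[i] * 0.5;
}

/* 0.5f is a float literal: the multiply stays in single precision. */
void halve_with_float_literal(t_float *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = buf[i] * 0.5f;
}
```

A compiler may or may not remove the conversions in the first version. gcc's -Wdouble-promotion warning can help find such spots. In code that must also build with a double t_float, a cast such as (t_float)0.5 might be a better fit than the f suffix.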
> 
> Katja
> 
> 
> On Sat, Jan 12, 2013 at 8:14 PM, Hans-Christoph Steiner <hans at at.or.at> wrote:
>>
>> If you are interested, there is still the hand-coded SIMD stuff from pd-devel:
>> https://pure-data.svn.sourceforge.net/svnroot/pure-data/branches/pd-devel/v0-39
>>
>> .hc
>>
>> On 01/12/2013 09:34 AM, katja wrote:
>>> Function copy_perform8() is also eligible for SIMD processing. I used
>>> memcpy() because it is straightforward to use, while Pd's functions
>>> pointed to the wrong locations for this case. On the reverb's total
>>> load there is no significant performance difference.
>>>
>>> Katja
>>>
>>>
>>> On Sat, Jan 12, 2013 at 1:00 AM, Hans-Christoph Steiner <hans at at.or.at> wrote:
>>>>
>>>> I recently learned that libc's memcpy actually uses things like SSE2 or SSSE3
>>>> so it can be quite fast on CPUs from the past 10 years, especially of the last
>>>> 5 years.
>>>>
>>>> It would be worth profiling to see if that's noticeable.
>>>>
>>>> .hc
>>>>
>>>> On 01/11/2013 05:12 PM, katja wrote:
>>>>> Ok so I did the ugly thing with the right channel input and output pointers:
>>>>>
>>>>> memcpy(outR, inR, vectorsize * sizeof(t_float));
>>>>> inR = outR;
>>>>>
>>>>> Works like a charm, thanks again.
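
For readers who want to see the trick in context, here is a minimal sketch of a two-pass stereo routine built around those two lines. It is not the actual reverb code: process_channel() is a hypothetical stand-in, and the sketch assumes that outR does not alias inL and that Pd buffers are either identical or fully distinct.

```c
#include <string.h>

typedef float t_float; /* in a real external this comes from m_pd.h */

/* Hypothetical stand-in for one channel of the effect. It is safe when
   in == out because each sample is read before it is written. */
static void process_channel(const t_float *in, t_float *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = in[i] * 0.5f;
}

void stereo_perform(const t_float *inL, const t_float *inR,
                    t_float *outL, t_float *outR, int n)
{
    /* Save the right input before the left pass can overwrite it.
       Assumption: outR does not alias inL. */
    if (outR != inR)
        memcpy(outR, inR, (size_t)n * sizeof(t_float));
    inR = outR;

    process_channel(inL, outL, n); /* may clobber the old inR buffer */
    process_channel(inR, outR, n); /* runs in place on the saved copy */
}
```

The pointer comparison is there because memcpy() is not defined for overlapping source and destination, so the copy is skipped when both pointers are the same buffer.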
>>>>>
>>>>> Katja
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Jan 11, 2013 at 10:05 PM, Miller Puckette <msp at ucsd.edu> wrote:
>>>>>> copy_perform assumes the data is 4-byte aligned so might save a test
>>>>>> or two compared to memcpy() - but I really don't know.  I never
>>>>>> benchmarked the two against each other :)
>>>>>>
>>>>>> M
>>>>>>
>>>>>> On Fri, Jan 11, 2013 at 09:36:41PM +0100, katja wrote:
>>>>>>> Hi Miller,
>>>>>>>
>>>>>>> Thanks for the solution. The routines are in place so copying the
>>>>>>> right channel input to output should do it. Is there any reason to
>>>>>>> prefer copy_perform() over memcpy()? I'm trying to make the most
>>>>>>> efficient reverb for RPi & Co.
>>>>>>>
>>>>>>> Katja
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jan 11, 2013 at 7:57 PM, Miller Puckette <msp at ucsd.edu> wrote:
>>>>>>>> Hi Katja -
>>>>>>>>
>>>>>>>> There's one example of this in sigfft_dspx() - a complex FFT that 'natively'
>>>>>>>> works on 2 signals in-place but has to deal with various cases in which
>>>>>>>> buffers get re-used.  It's ugly but the basic idea is first to get the
>>>>>>>> inputs copied to the outputs (unless they're already there in the correct
>>>>>>>> order in which case nothing needs to be done) and then run the in-place
>>>>>>>> algorithm.
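
The "copy inputs to outputs first" idea can be sketched roughly as follows. This is not the code of sigfft_dspx(), only an illustration of the cases described above; the function name is hypothetical, and it assumes out1 and out2 are distinct buffers and that any two buffers are either identical or non-overlapping.

```c
#include <string.h>

typedef float t_float; /* in a real external this comes from m_pd.h */

/* Hypothetical helper: after the call, out1 holds the in1 data and
   out2 holds the in2 data, whatever the aliasing between them. */
void copy_inputs_to_outputs(const t_float *in1, const t_float *in2,
                            t_float *out1, t_float *out2, int n)
{
    size_t bytes = (size_t)n * sizeof(t_float);

    if (in1 == out2 && in2 == out1) {
        /* Fully swapped: exchange the two buffers sample by sample. */
        for (int i = 0; i < n; i++) {
            t_float tmp = out1[i];
            out1[i] = out2[i];
            out2[i] = tmp;
        }
    } else if (in2 == out1) {
        /* Writing out1 first would destroy in2, so move in2 first. */
        memcpy(out2, in2, bytes);
        if (in1 != out1)
            memcpy(out1, in1, bytes);
    } else {
        /* out1 does not alias in2, so this order is safe. */
        if (in1 != out1)
            memcpy(out1, in1, bytes);
        if (in2 != out2)
            memcpy(out2, in2, bytes);
    }
}
```

When the inputs are already in the right output buffers, neither memcpy() runs and the call costs only a few pointer comparisons. The in-place algorithm can then run on out1 and out2.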
>>>>>>>>
>>>>>>>> If the algo only works out-of-place (i.e. you need 4 distinct buffers, 2
>>>>>>>> in and 2 out) the only way out is to (at least conditionally) allocate temporary
>>>>>>>> copies of the inputs before writing to any outputs.
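
For the out-of-place case, a conditional temporary copy might look like the sketch below. Everything here is an assumption made for illustration: cross_mix() is a made-up algorithm that needs distinct buffers, MAX_BLOCK is a hypothetical upper bound on the block size, and fixed-size scratch arrays stand in for whatever allocation a real external would set up in its dsp method.

```c
#include <string.h>

typedef float t_float; /* in a real external this comes from m_pd.h */

#define MAX_BLOCK 64 /* hypothetical limit on the block size */

/* Hypothetical out-of-place algorithm. If out1 aliased in1, the second
   statement would read a value that was already overwritten. */
static void cross_mix(const t_float *in1, const t_float *in2,
                      t_float *out1, t_float *out2, int n)
{
    for (int i = 0; i < n; i++) {
        out1[i] = in1[i] + in2[i];
        out2[i] = in1[i] - in2[i];
    }
}

/* Copy an input to scratch space only if it aliases an output.
   Returns 0 on success, -1 if the block is larger than MAX_BLOCK. */
int safe_cross_mix(const t_float *in1, const t_float *in2,
                   t_float *out1, t_float *out2, int n)
{
    t_float tmp1[MAX_BLOCK], tmp2[MAX_BLOCK];

    if (n < 0 || n > MAX_BLOCK)
        return -1;
    if (in1 == out1 || in1 == out2) {
        memcpy(tmp1, in1, (size_t)n * sizeof(t_float));
        in1 = tmp1;
    }
    if (in2 == out1 || in2 == out2) {
        memcpy(tmp2, in2, (size_t)n * sizeof(t_float));
        in2 = tmp2;
    }
    cross_mix(in1, in2, out1, out2, n);
    return 0;
}
```

When all four buffers are distinct, no copy is made and the wrapper only adds the pointer comparisons.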
>>>>>>>>
>>>>>>>> I may be able to add an optional way tilde objects can request that output
>>>>>>>> buffers be distinct from input ones sometime in the future - but this is a
>>>>>>>> couple of steps away for me right now :)
>>>>>>>>
>>>>>>>> M
>>>>>>>>
>>>>>>>> On Fri, Jan 11, 2013 at 03:32:09PM +0100, katja wrote:
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I'm working on a Pd class with stereo channels (reverb), and the
>>>>>>>>> routine happens to be most efficient when iterating over the samples
>>>>>>>>> per channel, instead of left and right together in the perform loop.
>>>>>>>>> However, when doing two while loops in one object, one for left and
>>>>>>>>> one for right, the right channel samples get overwritten because of
>>>>>>>>> sample-wise in-place computation. Is this an inescapable truth? I
>>>>>>>>> mean, I could write a left channel class and a right channel class
>>>>>>>>> (actually did that to verify that it works), but it's inconvenient to
>>>>>>>>> use. What could be an efficient way to get them in one object?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Katja
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Pd-list at iem.at mailing list
>>>>>>>>> UNSUBSCRIBE and account-management -> http://lists.puredata.info/listinfo/pd-list


