[PD] Making a Realtime Convolution External

Matt Barber brbrofsvl at gmail.com
Wed Apr 6 06:32:47 CEST 2011


> Just scanned the source... big difference would be performance, and if
> you're picky (you have to be pretty picky, honestly), some difference
> in accuracy due to floating point's reduced precision at large/small
> values. Convolution is still expensive enough for performance to
> really matter.
>
> the biggies:
> - partconv implements a single fixed block size, but freq domain
> convolution is faster by far on bigger blocks (peak on a core duo is
> near 4k sample blocks). implementing growing block sizes makes a big
> difference to low latency performance (e.g. 64 64 128 128 256 256 512
> 512 1024 1024 2048 2048 4096 4096), as you can get low latency while
> most of your convolutions operate at the ideal high-performance
> block size.
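For intuition, here is a rough cost model of why larger FFT blocks win for partitioned convolution. The weights below are made-up illustration values, not measurements, and the model deliberately ignores the cache effects that produce the real-world sweet spot near 4k blocks:

```python
# Rough, illustrative cost model for uniformly partitioned convolution:
# per output sample, FFT work grows like log2(2N), while the
# frequency-domain multiply/accumulate work shrinks like L/N partitions.
# fft_weight and mac_weight are arbitrary weights for illustration only.

from math import log2

def cost_per_sample(block, ir_len, fft_weight=1.0, mac_weight=0.25):
    """Estimated work per output sample for uniform partition size `block`."""
    partitions = ir_len / block
    fft_cost = fft_weight * log2(2 * block)   # forward + inverse FFT, amortized
    mac_cost = mac_weight * partitions        # one complex MAC pass per partition
    return fft_cost + mac_cost

ir_len = 2 ** 17  # e.g. a ~3 s impulse response at 44.1 kHz
for block in (64, 256, 1024, 4096, 16384):
    print(block, round(cost_per_sample(block, ir_len), 1))
```

In this simplified model the per-sample cost keeps falling as blocks grow; in practice cache behavior and latency requirements cap the useful block size.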


I put one of these together in Pd vanilla with dynamic patching as an
exercise a few years back, though I ran into some problems. I think
you can just do a simple 64 128 256 512 etc. and let the block delay
take care of the timing automatically, but I actually found the kind
of schedule you posted here to work a little better. Another that
worked even better was something like 64 32 32 64 64 128 128 256 256
etc., which seemed to front-load some of the calculation a little.
(With this one and the one you posted, if Pd's block size were 1, you
could do the first block as a direct convolution for extremely low
latency.)
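One way to compare these schedules is a quick "self-timing" check. This is only a sketch, under the simplifying assumption that a size-N partition needs N samples of lead time to collect a full block before its output is due; it glosses over Pd's actual reblocking delays:

```python
# Sketch of a sanity check for growing-partition schedules.
# Partition i covers impulse-response samples [offset, offset + size),
# where offset is the sum of the earlier partition sizes.  Assumption:
# a size-N partition needs offset >= N so it can fill its block in time.

def check_schedule(sizes):
    """Return (self_timing, total_ir_coverage) for a partition-size list."""
    offset = 0
    self_timing = True
    for i, size in enumerate(sizes):
        # The first partition runs at base latency; every later one
        # must satisfy offset >= size to have a full block ready.
        if i > 0 and offset < size:
            self_timing = False
        offset += size
    return self_timing, offset

posted = [64, 64, 128, 128, 256, 256, 512, 512, 1024, 1024]
frontloaded = [64, 32, 32, 64, 64, 128, 128, 256, 256]
doubling = [64, 128, 256, 512]  # relies on extra (reblock) delay instead

for name, s in [("posted", posted), ("frontloaded", frontloaded),
                ("doubling", doubling)]:
    print(name, check_schedule(s))
```

Under this strict check the doubled and front-loaded schedules pass, while plain doubling does not, which is consistent with plain doubling leaning on the block delay for its timing.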

Anyway, this brings up a problem I've been wondering about with Pd.
Suppose you have lots of reblocking going on: one patch blocked at 64,
another at 128, and others at 256, 512, 1024, 2048, and 4096. I've
been assuming that at the end of the 4096 block all 7 patches will
have just finished a block cycle, so there will be a CPU spike there
relative to other points between the beginning and end of the 4096
block, as the calculation for all 7 is done at once. Is there a way in
Pd to offset larger blocks by a given number of samples so that the
calculation for that block happens at a different time? It's easy
enough to delay the samples -- that's not what I want. I want to delay
the calculation as well, so that you could deliberately stagger the
blocks and more evenly distribute the calculation in CPU-intensive
situations. I'm imagining something like two 4096 blocks running, say,
64 samples apart, so that one does its calculation while the other is
still collecting samples.
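A toy simulation makes the spike concrete. The per-block "offsets" here are the hypothetical stagger parameter I'm wishing for; Pd itself has no such setting:

```python
# Toy simulation of the CPU-spike concern: seven subpatches reblocked
# to 64..4096, all phase-aligned.  At each 64-sample tick, count how
# many subpatches must run a full block's worth of DSP at once.
# The offsets are a hypothetical stagger parameter, not a Pd feature.

def load_profile(blocks, offsets, ticks, base=64):
    profile = []
    for t in range(1, ticks + 1):
        sample = t * base
        firing = sum(1 for b, off in zip(blocks, offsets)
                     if (sample - off) % b == 0)
        profile.append(firing)
    return profile

blocks = [64, 128, 256, 512, 1024, 2048, 4096]

aligned = load_profile(blocks, [0] * 7, 64)
print("aligned, worst tick:", max(aligned))      # all 7 land on tick 64

staggered = load_profile(blocks, [0, 64, 128, 192, 256, 320, 384], 64)
print("staggered, worst tick:", max(staggered))  # spikes are spread out
```

Even these arbitrary offsets flatten the worst-case tick considerably; a smarter choice could spread the load further.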

Matt


> - vectorization (sse/altivec) of partconv would give a 2-3.5x performance boost
>
> -seth


