[PD-dev] [GEM] Further CVS changes

chris clepper cclepper at artic.edu
Thu Jan 30 20:59:53 CET 2003

>>  > I have also done experiments with MMX (which, I have to
>>  > admit, did not give the results I had hoped for, but maybe just
>>  > because I did not really know what I was doing).
>>  I have added MMX code to my software; the asm code is generated with a
>>  script. The results I get with int32 are slightly slower than GCC's
>>  non-MMX output, and I'm doing pretty close to my best. However with int16
>>  and uint8 the MMX gets a certain percentage of improvement, though really
>>  not extraordinary... 30-40% ? maybe it's all the packet-handling going
>>  on around that makes the improvement appear less than it really is?
>i got some (at first glance) counterintuitive results using mmx in pdp too. i
>guess a lot of this strangeness has to do with memory bandwidth. simple
>operations like add or scale are not much faster than their scalar integer c
>counterparts. i did get a lot of speedup for the more compute intensive stuff
>like the biquad filters, iterated convolution and basically anything that needs
>to do a lot of clipping. also i try to limit the data copying to a minimum in
>pdp, this seems to help too..
>the general rule seems to be: keep your memory accesses local and your data
>size small: do as much as possible inside the pixel loop, or iterate several
>times over 1 scanline instead of the whole image.

These points seem to hold true for all SIMD types.  Altivec is 
pretty much limited by memory bandwidth, so it pays to do as much 
calculation as possible on the data between memory accesses.  In 
Altivec, there are also cache control functions to open up dedicated 
cache-lines to the vector unit, which help decrease memory load 
latencies.  Maybe there exists something similar to this for MMX?

The structure of the processing chain is also a big factor.  GEM is 
basically a chain of for loops, which probably isn't ideal, but it is 
quite flexible.
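
As a rough picture of what "a chain of for loops" means, here is a 
minimal sketch in C.  It is not GEM's actual code: pixel_fn, the three 
example stages and both run_* functions are hypothetical names 
invented for this illustration.  The first function loops over the 
whole buffer once per stage; the second is one possible alternative, a 
single loop that pushes each pixel through every stage taken from a 
table before moving on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A processing stage that maps one 8-bit pixel to another. */
typedef uint8_t (*pixel_fn)(uint8_t);

/* Three made-up stages, just to have something to chain. */
uint8_t invert(uint8_t p)    { return (uint8_t)(255 - p); }
uint8_t halve(uint8_t p)     { return (uint8_t)(p >> 1); }
uint8_t threshold(uint8_t p) { return (uint8_t)(p >= 64 ? 255 : 0); }

/* Chain of for loops: each stage makes a full pass over the buffer,
   so the buffer is read and written nstages times. */
void run_chain_of_loops(uint8_t *buf, size_t n,
                        const pixel_fn *stages, size_t nstages)
{
    size_t s, i;
    assert(buf != NULL && stages != NULL);
    for (s = 0; s < nstages; s++)
        for (i = 0; i < n; i++)
            buf[i] = stages[s](buf[i]);
}

/* Single loop: each pixel is loaded once, run through every stage
   in the table, and stored once. */
void run_single_loop(uint8_t *buf, size_t n,
                     const pixel_fn *stages, size_t nstages)
{
    size_t s, i;
    assert(buf != NULL && stages != NULL);
    for (i = 0; i < n; i++) {
        uint8_t p = buf[i];
        for (s = 0; s < nstages; s++)
            p = stages[s](p);
        buf[i] = p;
    }
}
```

Both functions give the same output for stages that only look at one 
pixel at a time.  The single loop trades memory traffic for an 
indirect call per pixel per stage, so a table of per-scanline 
functions might be a better balance in practice; this sketch does not 
try to settle that.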

Matju, is GridFlow building a single loop and filling it with 
functions from a table?  That seems like it could be really 
efficient, especially with the decrease in memory accesses between 

It's great that we are sharing all of these tips, findings and ideas 
among developers working on various projects.


