[PD] Re: [PD-dev] [GEM] Further CVS changes

Thu Jan 30 21:21:17 CET 2003

hi,

On Thursday, 30 January 2003 20:39, Tom Schouten wrote:
> i got some (at first glance) counterintuitive results using mmx in pdp too.
> i guess a lot of this strangeness has to do with memory bandwidth. simple
> operations like add or scale are not much faster than their scalar integer
> c counterparts. i did get a lot of speedup for the more compute intensive
> stuff like the biquad filters, iterated convolution and basically anything
> that needs to do a lot of clipping. also i try to limit the data copying to
> a minimum in pdp, this seems to help too..
>
> the general rule seems to be: keep your memory accesses local and your data
> size small: do as much as possible inside the pixel loop, or iterate
> several times over 1 scanline instead of the whole image.
>
> tom

a while ago i did some extensive research on mmx coding styles.
there are some important issues to watch when writing mmx code; my
first attempts were also less efficient than the c versions.

first, the cpu has two instruction pipes. not all mmx instructions
can be executed in either pipe: some can use both, and some can only
use one of them. you need to pair your mmx instructions with that
in mind.

yes, mmx is cache sensitive. when you first load a register from memory, a
whole cache line gets loaded into the cache. it is best to load as many mmx
regs as possible at once, process them, and write them back. this improves
the cache hit/miss ratio, since the cache line becomes invalid once it is
written back.

some mmx instructions need more than one cycle, so if another instruction
depends on the result of such an operation, you get stalls and penalty
cycles.

there is a good page on that topic: 
http://www.ce.unipr.it/~tommesa/Pixel64.html

it does not cover the basics of mmx, but it walks through optimization
issues on some practical examples and explains them in detail. that page
helped me a lot. the server sometimes seems to be down; i have a local copy
of the site, so if anyone needs it i can put it on my site for you to grab.

you can also take a look at the mmx code i have written so far.
a lot of functions are done and i tried to make them as fast as possible.
they are for rgba (= 32 bit) pixels. they are intended to work on chunks
of an image, but you may try them with a whole image as well.
my guess is that, if you do several ops on an image, it is faster
to do chunk1->op1->op2->result1, chunk2->op1->op2->result2 ... than
image->op1->op2->result, but i may be wrong (i haven't tested it).
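
to make the two orderings concrete, here is a minimal sketch in plain c
(not mmx). the op names, the chunk size and both driver functions are
hypothetical, chosen only for illustration; they are not taken from the
actual routines. both orderings give the same result, the only difference
is how often each byte is pulled through the cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* assumed chunk size: small enough to stay in a typical L1 data cache */
#define CHUNK_BYTES 4096

/* op1: invert every byte (hypothetical stand-in for a real pixel op) */
static void op_invert(uint8_t *p, size_t n)
{
    for (size_t i = 0; i < n; i++)
        p[i] = (uint8_t)(255 - p[i]);
}

/* op2: halve every byte (hypothetical stand-in for a real pixel op) */
static void op_halve(uint8_t *p, size_t n)
{
    for (size_t i = 0; i < n; i++)
        p[i] = (uint8_t)(p[i] >> 1);
}

/* image->op1->op2->result: each op walks the whole image once */
void process_whole(uint8_t *img, size_t n)
{
    assert(img != NULL || n == 0);
    op_invert(img, n);
    op_halve(img, n);
}

/* chunk->op1->op2->result: both ops run on a chunk while it is still cached */
void process_chunked(uint8_t *img, size_t n)
{
    assert(img != NULL || n == 0);
    for (size_t off = 0; off < n; off += CHUNK_BYTES) {
        size_t len = (n - off < CHUNK_BYTES) ? (n - off) : CHUNK_BYTES;
        op_invert(img + off, len);
        op_halve(img + off, len);
    }
}
```

whether the chunked version is actually faster depends on image size,
cache size and the cost of the ops, so it needs to be measured.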

the data passed to the routines, however, needs to be a multiple of 64 bytes,
as some routines load all regs with consecutive pixels at once.
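
the 64 comes from the register file: mmx has eight 64-bit registers, so
loading all of them takes 8 x 8 = 64 bytes. below is a rough sketch of such
a block loop using the compiler intrinsics from <mmintrin.h>
(_mm_adds_pu8, _mm_empty). the function names, the saturating-add op and
the 8-byte alignment assumption are made up for this example and are not
the actual routines; with intrinsics the compiler decides which values
really stay in registers. a scalar fallback is used where mmx is not
available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__MMX__)
#include <mmintrin.h>
#endif

#define BLOCK_BYTES 64  /* 8 mmx registers x 8 bytes each */

/* round a byte count up to the next multiple of 64, e.g. to size a buffer */
size_t round_up_block(size_t nbytes)
{
    return (nbytes + BLOCK_BYTES - 1) / BLOCK_BYTES * BLOCK_BYTES;
}

/* dst[i] = min(255, dst[i] + src[i]) for nbytes bytes.
 * nbytes must be a multiple of 64; buffers are assumed 8-byte aligned. */
void add_saturate_u8(uint8_t *dst, const uint8_t *src, size_t nbytes)
{
    assert(nbytes % BLOCK_BYTES == 0);
#if defined(__MMX__)
    for (size_t off = 0; off < nbytes; off += BLOCK_BYTES) {
        __m64 *d = (__m64 *)(dst + off);
        const __m64 *s = (const __m64 *)(src + off);
        __m64 r[8];
        for (int k = 0; k < 8; k++)      /* load one 64-byte block */
            r[k] = d[k];
        for (int k = 0; k < 8; k++)      /* process: unsigned saturating add */
            r[k] = _mm_adds_pu8(r[k], s[k]);
        for (int k = 0; k < 8; k++)      /* write the block back */
            d[k] = r[k];
    }
    _mm_empty();                         /* leave mmx state before fpu code runs */
#else
    for (size_t i = 0; i < nbytes; i++) { /* scalar fallback, same result */
        unsigned sum = (unsigned)dst[i] + src[i];
        dst[i] = (uint8_t)(sum > 255 ? 255 : sum);
    }
#endif
}
```

a caller with, say, 100 rgba pixels (400 bytes) would allocate
round_up_block(400) = 448 bytes and process the padded length.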

hope that helped a little,

greets,

chris