[PD-dev] [GEM] Further CVS changes

Mathieu Bouchard matju at sympatico.ca
Sun Feb 2 20:39:40 CET 2003


On Thu, 30 Jan 2003, chris clepper wrote:
> On Thu, 30 Jan 2003, Tom Schouten wrote:
>> I got some (at first glance) counterintuitive results using MMX in
>> pdp too. I guess a lot of this strangeness has to do with memory
>> bandwidth. Simple operations like add or scale are not much faster
>> than their scalar integer C counterparts.

You have to be careful to do the same optimisations in asm as in the C
code... I have made that mistake myself...
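(A rough illustration of the kind of comparison involved; this is not code
from PDP or GridFlow, and the function names are made up. An MMX loop
touches 8 bytes per instruction, yet on large buffers the plain C loop
often keeps up because both end up waiting on memory:)

  #include <mmintrin.h>  /* MMX intrinsics */
  #include <stddef.h>

  /* scalar C: per-byte saturated add */
  void add_sat_scalar(unsigned char *dst, const unsigned char *a,
                      const unsigned char *b, size_t n) {
      for (size_t i = 0; i < n; i++) {
          int s = a[i] + b[i];
          dst[i] = (unsigned char)(s > 255 ? 255 : s);
      }
  }

  /* MMX: 8 bytes per iteration; assumes n is a multiple of 8 */
  void add_sat_mmx(unsigned char *dst, const unsigned char *a,
                   const unsigned char *b, size_t n) {
      for (size_t i = 0; i < n; i += 8) {
          __m64 va = *(const __m64 *)(a + i);
          __m64 vb = *(const __m64 *)(b + i);
          *(__m64 *)(dst + i) = _mm_adds_pu8(va, vb);  /* saturating add */
      }
      _mm_empty();  /* emms: give the FPU registers back */
  }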

>> also i try to limit the data copying to a minimum in pdp,
>> this seems to help too..

I haven't got to the point where data copying is really minimized in my
software. That task is much more of a challenge for me: the data my
objects accept have wildly differing dimensions, so there's a clash
between the strategies for handling small chunks and big chunks of data.

> The structure of the processing chain is also a big factor.  GEM is 
> basically a chain of for loops, which probably isn't ideal, but it is 
> quite flexible.
> Matju, is GridFlow building a single loop and filling it with
> functions from a table?

No, it is a packet-based system. A grid message contains the GridOutlet
pointer of the sender; the receivers return their GridInlet pointers. Then
the sender tells the GridOutlet to send() or give(), which may buffer
and/or send packets to the GridInlets. The GridInlets may then slightly
repacketize the data so that it arrives in nice multiples of N (as
specified by the receivers for their own convenience...). And then
object-specific code is called with a packet as a parameter... and so on...
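If I sketch it in C++ (the names GridOutlet/GridInlet, send() and give()
are the real ones; everything else below is invented for the sketch and
much simplified):

  #include <vector>

  struct GridInlet {
      int factor;              // receiver asks for packets in multiples of this
      std::vector<int> buf;    // leftover data waiting for a full multiple
      void receive(const int *data, int n);
  };

  struct GridOutlet {
      std::vector<GridInlet *> inlets;  // filled when receivers answer the grid message
      // send() may copy/buffer; give() would hand the buffer over instead
      void send(const int *data, int n) {
          for (size_t i = 0; i < inlets.size(); i++) inlets[i]->receive(data, n);
      }
  };

  void GridInlet::receive(const int *data, int n) {
      buf.insert(buf.end(), data, data + n);
      int usable = (int)buf.size() - (int)buf.size() % factor;  // repacketize
      if (usable > 0) {
          // ... object-specific code gets called here with buf[0..usable) as one packet ...
          buf.erase(buf.begin(), buf.begin() + usable);
      }
  }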

In GridFlow there is a distinction between a "numeric operator" and a
"grid operator". The former is really simple: it is a function that takes
one or two arguments of a given number type (though every such operator
actually comes with six differently vectorized versions of itself). The
latter is quite complex code that may involve any number of nested
for-loops in non-obviously-optimisable ways.
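Sketched in C++ (hypothetical names; there is one such record per operator
and per number type, and the vectorized variants are what the inner loops
actually call):

  // one scalar definition plus a few vectorized entry points over buffers
  struct NumOpInt32 {
      int  (*op)(int a, int b);                      // the operation itself
      void (*map) (int *as, int n, int b);           // as[i] = op(as[i], b)
      void (*zip) (int *as, const int *bs, int n);   // as[i] = op(as[i], bs[i])
      void (*fold)(int *acc, const int *as, int n);  // *acc = op(*acc, as[i])
      // ... in GridFlow each operator carries six such variants.
  };

  static int  add_op  (int a, int b)                   { return a + b; }
  static void add_map (int *as, int n, int b)          { for (int i = 0; i < n; i++) as[i] += b; }
  static void add_zip (int *as, const int *bs, int n)  { for (int i = 0; i < n; i++) as[i] += bs[i]; }
  static void add_fold(int *acc, const int *as, int n) { for (int i = 0; i < n; i++) *acc += as[i]; }

  static NumOpInt32 op_add = { add_op, add_map, add_zip, add_fold };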

So what happens typically is that on a given chunk of data, a vectorized
numeric operator is called, and then another is called, etc., in the midst
of higher-level loops and memcpy()'s and so on. The result is that there
is a lot of RAM access, *BUT* it is mostly at the cache level because of
the packetizing.
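Purely as an illustration of why the accesses stay in cache (made-up
numbers and names, not GridFlow code):

  static const int PACKET = 4096;  // small enough that one packet stays cache-resident

  void process(int *data, int total) {
      for (int off = 0; off < total; off += PACKET) {
          int n = (total - off < PACKET) ? total - off : PACKET;
          for (int i = 0; i < n; i++) data[off + i] += 42;  // first "numeric operator"
          for (int i = 0; i < n; i++) data[off + i] *= 3;   // second one re-reads warm cache
      }
  }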

> That seems like it could be really efficient, especially with the
> decrease in memory accesses between objects.

I can very well see myself adding a runtime machine code generator to
GridFlow, which would take a few loop forms and fill in the blanks. Right
now, however, the strangest thing I've done is writing a script that
generates GridFlow's _asm_ code. I still have to use the "nasm" program to
convert it to *.o and link it into gridflow.so ... And then I still have
to figure out how that generator would fit with my processing model (I
already have a few ideas).
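Just to illustrate the "fill in the blanks" idea (everything below is made
up, including the loop skeleton; the generated text would still have to go
through nasm and the linker as described above):

  #include <string>
  #include <cstdio>

  // emit a nasm loop skeleton with a slot for the per-element instruction
  std::string gen_loop(const std::string &inner) {
      return std::string()
          + "loop_top:\n"
          + "    mov eax, [esi]\n"   // load one element
          + "    " + inner + "\n"    // the blank, e.g. "add eax, [edi]"
          + "    mov [esi], eax\n"   // store it back
          + "    add esi, 4\n"
          + "    add edi, 4\n"
          + "    dec ecx\n"
          + "    jnz loop_top\n";
  }

  int main() {
      std::string src = gen_loop("add eax, [edi]");
      std::fwrite(src.data(), 1, src.size(), stdout);
      return 0;
  }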

This would be, of course, Pentium-only (... for as long as I'm the only
person working on GridFlow; I only have PCs).

________________________________________________________________
Mathieu Bouchard                       http://artengine.ca/matju




