[PD] [OT] SSE/MMX tips?

Thu Sep 8 03:32:40 CEST 2011

On Wed, Sep 7, 2011 at 7:59 PM, Mathieu Bouchard <matju at artengine.ca> wrote:
> On Wed, 7 Sep 2011, Mathieu Bouchard wrote:
>
>> On Wed, 7 Sep 2011, Bill Gribble wrote:
>>
>>> So far iteration on plain floats seems to be the best I can come up with,
>>> but HADDPS is tantalizingly close to what I want to do.  Any hints?

Sorry, what's HADDPS?

>>
>> Once I thought that with some commutativity you could speed things up like
>> this :
>>
>> (f0+f1+f2+f3)+(f4+f5+f6+f7)+...
>>
>> can be rearranged as :
>>
>> (f0+f4+...)+(f1+f5+...)+(f2+f6+...)+(f3+f7+...)
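
The rearrangement above can be sketched in plain C, as a rough illustration rather than anyone's actual code.  The function name sum4 is made up here, and the sketch assumes n is a multiple of 4.  Each of the four accumulators plays the role of one SIMD lane, so the four adds in the loop body are independent and could be done by one packed add.  Note that reordering float additions can change the rounding of the result slightly.

```c
#include <stddef.h>

/* Hypothetical helper: sum n floats with four independent accumulators,
   one per SIMD lane.  Assumes n is a multiple of 4 (a simplification). */
float sum4(const float *f, size_t n)
{
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (size_t i = 0; i < n; i += 4) {
        /* These four adds do not depend on each other. */
        acc[0] += f[i + 0];   /* f0 + f4 + ... */
        acc[1] += f[i + 1];   /* f1 + f5 + ... */
        acc[2] += f[i + 2];   /* f2 + f6 + ... */
        acc[3] += f[i + 3];   /* f3 + f7 + ... */
    }
    /* Final horizontal step: combine the four lane totals. */
    return ((acc[0] + acc[1]) + acc[2]) + acc[3];
}
```
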
>
> But what I said does not apply to your case, because you want a scan,
> whereas I didn't really read and assumed a fold.
>
> I don't know how to optimise a scan.
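
For what it's worth, one known way to do a scan (running sum) in SSE is to shift and add inside the register, then carry the last lane into the next group of four.  The following is only a sketch under assumptions: an x86 target with SSE2, n a multiple of 4, and a made-up function name scan_sse.  Whether it actually beats the plain float loop would have to be measured, since the carry still links each group to the previous one.

```c
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 intrinsics (assumes an x86 target) */

/* Hypothetical helper: inclusive running sum,
   out[i] = in[0] + ... + in[i].  Assumes n is a multiple of 4. */
void scan_sse(const float *in, float *out, size_t n)
{
    __m128 carry = _mm_setzero_ps();  /* total so far, in all four lanes */
    for (size_t i = 0; i < n; i += 4) {
        __m128 x = _mm_loadu_ps(in + i);              /* a, b, c, d */
        /* Shift up by one float and add: a, a+b, b+c, c+d */
        x = _mm_add_ps(x, _mm_castsi128_ps(
                _mm_slli_si128(_mm_castps_si128(x), 4)));
        /* Shift up by two floats and add: a, a+b, a+b+c, a+b+c+d */
        x = _mm_add_ps(x, _mm_castsi128_ps(
                _mm_slli_si128(_mm_castps_si128(x), 8)));
        x = _mm_add_ps(x, carry);                     /* add previous total */
        _mm_storeu_ps(out + i, x);
        /* Broadcast the last lane as the carry for the next group. */
        carry = _mm_shuffle_ps(x, x, _MM_SHUFFLE(3, 3, 3, 3));
    }
}
```
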

This is really interesting.  Your compiler may know how to vectorize
this kind of loop.  SSE is really about packing data.  The
instructions operate on several floats packed into one wide register.
In SSE (through SSE3), this means 4 single-precision floats in a
single 128-bit operation.
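
As a rough illustration of the 4-floats-per-operation idea, here is a small sketch using the SSE intrinsics from xmmintrin.h.  The function name add4 is made up for this example, and it assumes an x86 target.  The single _mm_add_ps call adds four pairs of floats at once (the ADDPS instruction).

```c
#include <xmmintrin.h>  /* SSE intrinsics (assumes an x86 target) */

/* Hypothetical helper: out[k] = a[k] + b[k] for k = 0..3,
   done with one packed add instead of four scalar adds. */
void add4(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);             /* load 4 floats (unaligned ok) */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));  /* 4 adds in one operation */
}
```
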

AVX has 256-bit wide registers (8 single-precision floats per
instruction).  The latest increase in single-threaded computing power
favors single-precision float (lucky for Pd-ers)