[PD] [OT] SSE/MMX tips?

Thu Sep 8 03:30:00 CEST 2011

I noticed that your suggestion did not apply, but assumed it was a subtle riddle taunting me for an off-topic post!

I think the best I can do is 2 vector adds and 2 shifts in place of 4 scalar float adds per 4 floats. That's not much of a saving, but once loop and fetch overhead are counted it may be worth it. I'll benchmark and see!
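
For concreteness, here is a rough sketch of the "2 adds and 2 shifts" idea in C with SSE2 intrinsics. It is an illustration under assumptions, not tested against any real external: the names scan4 and prefix_sum_sse are made up for this example, the data is assumed to be a plain float array, and unaligned loads are used for simplicity. Note that carrying the running total from one group of 4 to the next costs one extra add and a shuffle on top of the 2 adds and 2 shifts, and that the additions happen in a different order than in the scalar loop, so results can differ in the last bits.

```c
#include <stddef.h>
#include <emmintrin.h>  /* SSE2: _mm_slli_si128 and the cast intrinsics */

/* Inclusive scan inside one vector: [a, b, c, d] -> [a, a+b, a+b+c, a+b+c+d]. */
static inline __m128 scan4(__m128 x)
{
    /* Shift up by one float (4 bytes), filling with zero, then add:
       [a, b, c, d] + [0, a, b, c] = [a, a+b, b+c, c+d] */
    x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4)));
    /* Shift up by two floats (8 bytes), then add:
       [a, a+b, b+c, c+d] + [0, 0, a, a+b] = [a, a+b, a+b+c, a+b+c+d] */
    x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8)));
    return x;
}

/* Hypothetical driver: out[i] = in[0] + ... + in[i] for i < n. */
void prefix_sum_sse(const float *in, float *out, size_t n)
{
    __m128 carry = _mm_setzero_ps();  /* running total, broadcast to all 4 lanes */
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        __m128 x = scan4(_mm_loadu_ps(in + i));
        x = _mm_add_ps(x, carry);     /* add the total of all earlier groups */
        _mm_storeu_ps(out + i, x);
        /* The top lane now holds the new running total; broadcast it. */
        carry = _mm_shuffle_ps(x, x, _MM_SHUFFLE(3, 3, 3, 3));
    }

    /* Scalar tail for the last n % 4 elements. */
    float running = _mm_cvtss_f32(carry);
    for (; i < n; i++) {
        running += in[i];
        out[i] = running;
    }
}
```

Calling prefix_sum_sse(in, out, n) on an input of 1, 2, 3, 4, ... should give the running totals 1, 3, 6, 10, ... in out. Whether this beats the plain float loop depends on the machine, so it still needs the benchmark.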

It's really just for fun anyway. 

Thanks,
Bill Gribble

On Sep 7, 2011, at 20:59, Mathieu Bouchard <matju at artengine.ca> wrote:

> On Wed, 7 Sep 2011, Mathieu Bouchard wrote:
> 
>> On Wed, 7 Sep 2011, Bill Gribble wrote:
>> 
>>> So far iteration on plain floats seems to be the best I can come up with, but HADDPS is tantalizingly close to what I want to do.  Any hints?
>> 
>> Once I thought that with some commutativity and associativity you could speed things up like this:
>> 
>> (f0+f1+f2+f3)+(f4+f5+f6+f7)+...
>> 
>> can be rearranged as :
>> 
>> (f0+f4+...)+(f1+f5+...)+(f2+f6+...)+(f3+f7+...)
> 
> But what I said does not apply to your case, because you want a scan, whereas I didn't really read it and assumed a fold.
> 
> I don't know how to optimise a scan.
> 
> _______________________________________________________________________
> | Mathieu Bouchard ---- tél: +1.514.383.3801 ---- Villeray, Montréal, QC