Fri Apr 22 10:53:15 CEST 2005

> neither is sickle version of matrix~, nor any other sickle clone.
> When I was experimenting back then with gcc-2.95, the consistent
> pattern was that -funroll-loops performed slightly better than
> unrolling by hand.
well, that's interesting ...
there are three different factors for speeding up things:
- using SIMD instructions (single instruction multiple data)
- using aligned memory operations (movaps, which needs 16-byte aligned
  addresses, instead of the unaligned movups)
- loop unrolling ... loop unrolling is necessary for simd instructions,
  since a parallel instruction works on several consecutive samples at
  once and the loop body has to expose them (although i've never seen
  a compiler produce that specific code on its own)
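
to illustrate the unrolling point, here is a minimal sketch in plain c. the function names (scale_plain, scale_unrolled4) and the gain example are made up for this mail, they are not taken from sickle or any other external. the unrolled body holds four independent multiplies on consecutive samples, which is the shape a single packed sse multiply would cover, assuming 4 floats per register:

```c
#include <stddef.h>

/* hypothetical example functions, not from any existing external */

/* plain loop: one multiply per iteration */
void scale_plain(float *out, const float *in, float gain, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        out[i] = in[i] * gain;
}

/* unrolled by four: the four statements are independent of each
   other, so they could be replaced by one packed multiply */
void scale_unrolled4(float *out, const float *in, float gain, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        out[i]     = in[i]     * gain;
        out[i + 1] = in[i + 1] * gain;
        out[i + 2] = in[i + 2] * gain;
        out[i + 3] = in[i + 3] * gain;
    }
    /* leftover samples when n is not a multiple of four */
    for (; i < n; i++)
        out[i] = in[i] * gain;
}
```

both functions give the same result. whether the hand-unrolled one is faster than the plain loop built with -funroll-loops depends on the compiler and has to be measured.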

