alsamm (was Re: [PD-dev] Re: [PD] RME hammerfall)

Tue Apr 19 20:24:16 CEST 2005

Hello,

Thanks, I think we should change this in the code. I didn't do any
optimization on this; it is just a copy and paste from earlier code.

The main point is that the data transfer in send_dacs goes straight into the
memory-mapped region of the soundcard buffer, so no further copy is needed,
unlike (I think) in jack. The time measured in send_dacs therefore already
includes the copy (and add) loops that jack would burn.

Anyway, if we had a smarter way of recognizing which channels really have
corresponding dac/adc devices, we could zero out the other channels once and
would not need the copy loop for them. That would improve things a lot, since
with mmap we are forced to use all channels or none.
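
As a rough illustration of that idea (a sketch only, not the actual alsamm
code): the struct chan_t, the functions silence_unused and send_block, and the
value of SAMPLE_SCALE are all made up for this example. It assumes each
hardware channel has its own non-interleaved mapped region, as in the quoted
loops, and that nothing else writes into the regions of unused channels after
they have been cleared. Whether that second assumption holds for a given card
and driver would have to be checked.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed full-scale factor (0x7FFFFF00); the real code uses F32MAX. */
#define SAMPLE_SCALE 2147483392.0f

/* Hypothetical channel descriptor, not the real alsamm structure. */
typedef struct {
    int32_t *hw;  /* this channel's region of the mapped buffer */
    float   *pd;  /* Pd output block feeding it, NULL if unused */
} chan_t;

/* Run once after mapping: clear every channel Pd never writes to.
   Returns the number of channels cleared. */
static size_t silence_unused(chan_t *ch, size_t nch, size_t hw_frames)
{
    size_t cleared = 0;
    for (size_t c = 0; c < nch; c++) {
        if (ch[c].pd == NULL) {
            memset(ch[c].hw, 0, hw_frames * sizeof(int32_t));
            cleared++;
        }
    }
    return cleared;
}

/* Run per block: convert only the channels that carry a signal. */
static void send_block(chan_t *ch, size_t nch, size_t offset, size_t frames)
{
    for (size_t c = 0; c < nch; c++) {
        if (ch[c].pd == NULL)
            continue;  /* already silent, nothing to copy */
        for (size_t i = 0; i < frames; i++) {
            float s = ch[c].pd[i];
            /* clip to [-1, 1] so the int conversion cannot overflow */
            if (s > 1.0f) s = 1.0f;
            if (s < -1.0f) s = -1.0f;
            int32_t v = (int32_t)(s * SAMPLE_SCALE);
            ch[c].hw[offset + i] = v & ~(int32_t)0xFF;  /* drop low 8 bits */
            ch[c].pd[i] = 0.0f;  /* clear Pd's buffer, as the original does */
        }
    }
}
```

With, say, 26 hardware channels and 2 in use, the inner loop would then run
for 2 channels per block instead of 26.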

Best regards, winfried

> after some profiling, i figured out that the alsamm driver is burning a
> lot of cpu during the alsamm_send_dacs ... output of "opreport -l
> /usr/local/bin/pd"
>
>
> CPU: CPU with timer interrupt, speed 0 MHz (estimated)
> Profiling through timer interrupt
> samples  %        symbol name
> 29630    38.6436  alsamm_send_dacs
> 5847      7.6257  tabosc4_tilde_perform
> 5578      7.2749  block_prolog
> 4451      5.8050  copyvec_simd
> 4362      5.6889  testaddvec_simd
> 3119      4.0678  oss_send_dacs
> 2019      2.6332  peakvec_simd
> 1577      2.0567  sighip_perform
> 1560      2.0346  dsp_tick
> 1410      1.8389  testcopyvec_simd
> 978       1.2755  sigthrow_perfsimd
> 973       1.2690  env_tilde_accum_simd
> 834       1.0877  zerovec_simd
> 780       1.0173  sys_getrealtime
> 698       0.9103  sys_domicrosleep
> 659       0.8595  plus_perf_simd
> <snip>
>
>
> there are two loops that slow down the thing:
>
>   5313  4.8734 :        for (i = 0, fp2 = fp1 + chn*sys_dacblocksize; i < oframes; i++,fp2++)
>                :          {
>   2296  2.1060 :            float s1 = *fp2 * F32MAX;
>                :            /* better but slower, better never clip ;-)
>                :               buf[i]= CLIP32(s1); */
>   3278  3.0068 :            buf[i]= ((int) s1 & 0xFFFFFF00);
>   1052  0.9650 :            *fp2 = 0.0;
>                :          }
>                :      }
>
> and
>
>    253  0.2321 :      for (chn = 0; chn < ichannels; chn++) {
>     60  0.0550 :        t_alsa_sample32 *buf = (t_alsa_sample32 *) dev->a_addr[chn];
>  17254 15.8265 :        for (i = 0, fp2 = fp1 + chn*sys_dacblocksize; i < iframes; i++,fp2++)
>                :          {
>                :            /* mask the lowest bits, since subchannels info
>                :               can make zero samples nonzero */
>  10438  9.5744 :            *fp2 = (float) ((t_alsa_sample32) (buf[i] & 0xFFFFFF00))
>                :              * (1.0 / (float) INT32_MAX);
>                :          }
>                :      }
>
> the problem is that the samples have to be transferred from the sse
> registers to the general purpose registers to do the bitmask operations:
>
>                : 80ba444:       movaps %xmm2,%xmm1
>    845  0.7751 : 80ba447:       movss  (%edx),%xmm0
>   1451  1.3309 : 80ba44b:       mulss  %xmm1,%xmm0
>    311  0.2853 : 80ba44f:       cvttss2si %xmm0,%eax
>   1262  1.1576 : 80ba453:       xor    %al,%al
>   1705  1.5639 : 80ba455:       mov    %eax,(%esi,%ecx,4)
>   1052  0.9650 : 80ba458:       movl   $0x0,(%edx)
>   4581  4.2020 : 80ba45e:       add    $0x1,%ecx
>      2  0.0018 : 80ba461:       mov    0xffffffe8(%ebp),%ebx
>    664  0.6091 : 80ba464:       add    $0x4,%edx
>                : 80ba467:       cmp    %ebx,%ecx
>      4  0.0037 : 80ba469:       jl     80ba447 <alsamm_send_dacs+0x12c>
>
> and
>
>                : 80ba68e:       movaps %xmm2,%xmm1
>   4652  4.2671 : 80ba691:       mov    (%esi,%ecx,4),%eax
>  12579 11.5382 : 80ba694:       add    $0x1,%ecx
>                : 80ba697:       xor    %al,%al
>     70  0.0642 : 80ba699:       cvtsi2ss %eax,%xmm0
>   3665  3.3618 : 80ba69d:       mulss  %xmm1,%xmm0
>   2051  1.8813 : 80ba6a1:       movss  %xmm0,(%edx)
>   3737  3.4278 : 80ba6a5:       add    $0x4,%edx
>    888  0.8145 : 80ba6a8:       mov    0xffffffe0(%ebp),%ebx
>      3  0.0028 : 80ba6ab:       cmp    %ebx,%ecx
>                : 80ba6ad:       jl     80ba691 <alsamm_send_dacs+0x376>
>
> i think the better way would be to hardcode these two loops with sse
> instructions, at least for x86 ... not sure, if this is also a problem on
> the ppc platform ...
>
> cheers... tim
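
For reference, a minimal sketch of what hardcoding the two loops with SSE
could look like, assuming SSE2 is available (the packed float/int conversions
and the integer AND on xmm registers are SSE2 instructions). With these, the
masking can stay in the vector registers instead of going through the general
purpose registers. The function names, the SAMPLE_SCALE value, and the use of
unaligned loads and stores are assumptions for this example, not taken from
the alsamm source. Like the quoted output loop, it does not clip, so input is
expected to be within [-1, 1]. A plain C loop handles the leftover frames and
serves as a fallback on non-SSE2 builds.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Assumed full-scale factor (0x7FFFFF00); the real code uses F32MAX. */
#define SAMPLE_SCALE 2147483392.0f

/* Output direction: float -> 32-bit int with the low 8 bits cleared. */
static void float_to_s32_masked(const float *src, int32_t *dst, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    const __m128  scale = _mm_set1_ps(SAMPLE_SCALE);
    const __m128i mask  = _mm_set1_epi32(~0xFF);
    for (; i + 4 <= n; i += 4) {
        __m128  f = _mm_mul_ps(_mm_loadu_ps(src + i), scale);
        __m128i v = _mm_cvttps_epi32(f);          /* truncating convert */
        _mm_storeu_si128((__m128i *)(dst + i), _mm_and_si128(v, mask));
    }
#endif
    for (; i < n; i++) {                          /* tail / fallback */
        int32_t v = (int32_t)(src[i] * SAMPLE_SCALE);
        dst[i] = v & ~(int32_t)0xFF;
    }
}

/* Input direction: masked 32-bit int -> float in [-1, 1). */
static void s32_masked_to_float(const int32_t *src, float *dst, size_t n)
{
    const float inv = 1.0f / 2147483648.0f;       /* 1 / 2^31 */
    size_t i = 0;
#if defined(__SSE2__)
    const __m128  vinv = _mm_set1_ps(inv);
    const __m128i mask = _mm_set1_epi32(~0xFF);
    for (; i + 4 <= n; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        __m128  f = _mm_cvtepi32_ps(_mm_and_si128(v, mask));
        _mm_storeu_ps(dst + i, _mm_mul_ps(f, vinv));
    }
#endif
    for (; i < n; i++)                            /* tail / fallback */
        dst[i] = (float)(src[i] & ~(int32_t)0xFF) * inv;
}
```

Each iteration of the vector loops handles four frames, so the per-sample
loop overhead seen in the profile above would be spread over four samples.
Whether this actually helps on a given machine would need to be profiled
again.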