alsamm (was Re: [PD-dev] Re: [PD] RME hammerfall)
Winfried Ritsch
ritsch at iem.at
Tue Apr 19 20:24:16 CEST 2005
Hello,
Thanks, I think we should change this in the code. I didn't do any optimization
on this; it is just a copy and paste from previous code.
The main point is that the data transfer in send_dacs goes directly to the
memory-mapped region of the soundcard buffer, so no other copy is needed,
unlike (I think) what is done in jack. Its cost therefore includes the copy
(and add) loops that would otherwise be burned in jack.
Anyway, if we had a smarter way to recognize which channels really have
corresponding dac/adc devices, we could zero out the other channels once and
would not need the copy loop for them. That would improve things a lot, since
with mmap we are forced to use all channels or none.
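
A minimal sketch of that idea, not the actual alsamm code: the struct, the
function name and the ring-buffer layout below are all hypothetical, and it
assumes nothing else ever writes into the mapped buffers of the unused
channels. The unused channels are zeroed over their whole ring once, and the
per-block conversion loop then only touches the channels that are in use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical output state; none of these names exist in Pd. */
typedef struct {
    int32_t **ring;       /* ring[chn]: start of that channel's mapped ring */
    size_t ring_frames;   /* length of each ring in frames */
    int hw_channels;      /* channels the card forces us to map */
    int used_channels;    /* channels that really have a dac behind them */
    int primed;           /* nonzero once unused channels were zeroed */
} mm_out_t;

/* Write one block of `frames` samples per used channel at `offset`.
   `in` holds used_channels consecutive blocks of `frames` floats. */
static void mm_write_block(mm_out_t *o, float *in, size_t offset,
                           size_t frames, float scale)
{
    assert(offset + frames <= o->ring_frames);

    if (!o->primed) {
        /* one-time silence for channels without a dac */
        for (int chn = o->used_channels; chn < o->hw_channels; chn++)
            memset(o->ring[chn], 0, o->ring_frames * sizeof(int32_t));
        o->primed = 1;
    }

    for (int chn = 0; chn < o->used_channels; chn++) {
        int32_t *buf = o->ring[chn] + offset;
        float *fp = in + (size_t)chn * frames;
        for (size_t i = 0; i < frames; i++) {
            /* same scale-and-mask conversion as in the profiled loop */
            buf[i] = (int32_t)(fp[i] * scale) & ~0xFF;
            fp[i] = 0.0f;
        }
    }
}
```

Whether zeroing once is really enough depends on how the driver hands out the
mapped areas, so this would need checking against the real hardware.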
mfg winfried
> after some profiling, i figured out that the alsamm driver is burning a
> lot of cpu during the alsamm_send_dacs ... output of "opreport -l
> /usr/local/bin/pd"
>
>
> CPU: CPU with timer interrupt, speed 0 MHz (estimated)
> Profiling through timer interrupt
> samples % symbol name
> 29630 38.6436 alsamm_send_dacs
> 5847 7.6257 tabosc4_tilde_perform
> 5578 7.2749 block_prolog
> 4451 5.8050 copyvec_simd
> 4362 5.6889 testaddvec_simd
> 3119 4.0678 oss_send_dacs
> 2019 2.6332 peakvec_simd
> 1577 2.0567 sighip_perform
> 1560 2.0346 dsp_tick
> 1410 1.8389 testcopyvec_simd
> 978 1.2755 sigthrow_perfsimd
> 973 1.2690 env_tilde_accum_simd
> 834 1.0877 zerovec_simd
> 780 1.0173 sys_getrealtime
> 698 0.9103 sys_domicrosleep
> 659 0.8595 plus_perf_simd
> <snip>
>
>
> there are two loops that slow down the thing:
>
> 5313 4.8734 : for (i = 0, fp2 = fp1 + chn*sys_dacblocksize; i < oframes; i++,fp2++)
>
> : {
>
> 2296 2.1060 : float s1 = *fp2 * F32MAX;
>
> : /* better but slower, better never clip ;-)
> : buf[i]= CLIP32(s1); */
>
> 3278 3.0068 : buf[i]= ((int) s1 & 0xFFFFFF00);
> 1052 0.9650 : *fp2 = 0.0;
>
> : }
> : }
>
> and
>
> 253 0.2321 : for (chn = 0; chn < ichannels; chn++) {
>
> 60 0.0550 : t_alsa_sample32 *buf = (t_alsa_sample32 *) dev->a_addr[chn];
>
> 17254 15.8265 : for (i = 0, fp2 = fp1 + chn*sys_dacblocksize; i < iframes; i++,fp2++)
>
> : {
> : /* mask the lowest bits, since subchannels info
> : can make zero samples nonzero */
>
> 10438 9.5744 : *fp2 = (float) ((t_alsa_sample32) (buf[i] & 0xFFFFFF00))
>
> : * (1.0 / (float) INT32_MAX);
> : }
> : }
>
> the problem is that the samples have to be transferred from the sse
> registers to the general purpose registers to do the bitmask operations:
> : 80ba444: movaps %xmm2,%xmm1
>
> 845 0.7751 : 80ba447: movss (%edx),%xmm0
> 1451 1.3309 : 80ba44b: mulss %xmm1,%xmm0
> 311 0.2853 : 80ba44f: cvttss2si %xmm0,%eax
> 1262 1.1576 : 80ba453: xor %al,%al
> 1705 1.5639 : 80ba455: mov %eax,(%esi,%ecx,4)
> 1052 0.9650 : 80ba458: movl $0x0,(%edx)
> 4581 4.2020 : 80ba45e: add $0x1,%ecx
> 2 0.0018 : 80ba461: mov 0xffffffe8(%ebp),%ebx
> 664 0.6091 : 80ba464: add $0x4,%edx
>
> : 80ba467: cmp %ebx,%ecx
>
> 4 0.0037 : 80ba469: jl 80ba447 <alsamm_send_dacs+0x12c>
>
> and
>
> : 80ba68e: movaps %xmm2,%xmm1
>
> 4652 4.2671 : 80ba691: mov (%esi,%ecx,4),%eax
> 12579 11.5382 : 80ba694: add $0x1,%ecx
>
> : 80ba697: xor %al,%al
>
> 70 0.0642 : 80ba699: cvtsi2ss %eax,%xmm0
> 3665 3.3618 : 80ba69d: mulss %xmm1,%xmm0
> 2051 1.8813 : 80ba6a1: movss %xmm0,(%edx)
> 3737 3.4278 : 80ba6a5: add $0x4,%edx
> 888 0.8145 : 80ba6a8: mov 0xffffffe0(%ebp),%ebx
> 3 0.0028 : 80ba6ab: cmp %ebx,%ecx
>
> : 80ba6ad: jl 80ba691 <alsamm_send_dacs+0x376>
>
> i think the better way would be to hardcode these two loops with sse
> instructions, at least for x86 ... not sure, if this is also a problem on
> the ppc platform ...
>
> cheers... tim
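
As a rough sketch of what tim suggests for the output loop, the conversion and
the bitmask can both stay in the SSE registers when SSE2 is available. The
function name and the `scale` parameter are hypothetical (the value of F32MAX
is not shown in this thread), unaligned loads and stores are used because
nothing here says the buffers are aligned, and a plain C loop handles the tail
and non-x86 builds such as ppc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper, not part of Pd: scale n floats, truncate them to
   32-bit integers, clear the low 8 bits, and zero the source buffer,
   mirroring what the scalar output loop does. */
static void floats_to_s32_masked(float *src, int32_t *dst, size_t n,
                                 float scale)
{
    size_t i = 0;
    assert(src != NULL && dst != NULL);

#if defined(__SSE2__)
    const __m128 vscale = _mm_set1_ps(scale);
    const __m128i vmask = _mm_set1_epi32(~0xFF);
    const __m128 vzero = _mm_setzero_ps();

    /* four samples per iteration, no trip through general purpose registers */
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_mul_ps(_mm_loadu_ps(src + i), vscale);
        __m128i s = _mm_and_si128(_mm_cvttps_epi32(v), vmask);
        _mm_storeu_si128((__m128i *)(dst + i), s);
        _mm_storeu_ps(src + i, vzero);
    }
#endif

    /* scalar tail, and the whole loop when SSE2 is not available */
    for (; i < n; i++) {
        dst[i] = (int32_t)(src[i] * scale) & ~0xFF;
        src[i] = 0.0f;
    }
}
```

On 32-bit x86 this needs -msse2 (or similar) at compile time. The input loop
could be handled the same way in the other direction, masking first and then
converting with _mm_cvtepi32_ps.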