[PD] variance from mapping library

Mathieu Bouchard matju at artengine.ca
Thu Mar 19 17:50:26 CET 2009


On Thu, 19 Mar 2009, Oded Ben-Tal wrote:

>> abs(a)+abs(b), or something like that. But 100000*100000 = 10000000000, and 
>> if you divide that by 2^24 = 16777216, you get about 596, which is an upper 
>> bound for the amount of error: so, the error is surely between -596 and 
>> +596.
> I trust your math here but just notice that your example converges to -691.

That's because 100000*100000 is only one value. The second [mean_n] then 
has to process 90000*90000 + 80000*80000 + 70000*70000 + ... as well, so 
the theoretical error maximum is much more than 596 but much less than 
596*10. In practice, most of the individual errors are not that big, and 
perhaps some of them cancel each other out.
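
Here's a minimal sketch of that accumulation, in plain C rather than as a 
Pd patch (the loop and the step of 10000 are made up for illustration): it 
sums the same squares in 32-bit floats, as Pd does internally, and 
compares against a 64-bit reference sum.

    /* accumulate squares of 100000, 90000, ..., 10000 in 32-bit floats
       and compare against a 64-bit reference accumulator */
    #include <stdio.h>

    int main(void) {
        float  sum32 = 0.0f;   /* single precision, like Pd's numbers */
        double sum64 = 0.0;    /* reference */
        for (int x = 100000; x >= 10000; x -= 10000) {
            sum32 += (float)x * (float)x;
            sum64 += (double)x * (double)x;
        }
        /* the difference is the accumulated rounding error */
        printf("float sum:  %.1f\n", (double)sum32);
        printf("double sum: %.1f\n", sum64);
        printf("error:      %.1f\n", (double)sum32 - sum64);
        return 0;
    }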

But finding the reason why it is -691 precisely would take a long time and 
would not be very insightful.

> But if I understand you correctly 'filtering' the input data through [int] 
> should make variance error free (we hope).

No, it won't, because all of the other objects still process floats. The 
reason ints wouldn't have that problem is that they have fixed precision: 
the step between two adjacent numbers is 1, whereas for floats it's 
roughly proportional to the numbers themselves. With integers you will hit 
an overflow problem quite quickly instead. For example, if you remake that 
abstraction using 32-bit integers (say, using the GridFlow library), you 
can already get an overflow with random 5-digit numbers. But at least it 
goes back to normal when given a more modest sequence, whereas with floats 
it gets stuck and needs to be reset (recreated).
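
A sketch of that difference, again in standalone C rather than GridFlow 
code: a running total that a huge square enters and later leaves. Wrapping 
unsigned integer arithmetic is modular, so the total recovers exactly once 
the big value is subtracted back out; the float total keeps the damage.

    /* a running sum gains and then loses one huge square */
    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t isum = 123;            /* some small running total */
        float    fsum = 123.0f;

        uint32_t x = 99999;
        uint32_t ibig = x * x;          /* 99999^2 wraps modulo 2^32 */
        float    fbig = 99999.0f * 99999.0f;  /* ~1e10, already rounded */

        isum += ibig;  fsum += fbig;    /* big value enters the window */
        isum -= ibig;  fsum -= fbig;    /* ... and leaves it again */

        /* the int total is 123 again, exactly; the float total has
           swallowed the 123 entirely and prints 0 */
        printf("int32 sum after add/remove: %u\n", (unsigned)isum);
        printf("float sum after add/remove: %f\n", (double)fsum);
        return 0;
    }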

Using int64 you could get perfect results for most uses: the square of a 
5-digit number is at most about 10^10, and 2^63 is about 9.2*10^18, so you 
could sum hundreds of millions of such squares before overflowing. But I 
don't recall whether the bugs in GridFlow's int64 support were fixed or 
not... I never quite had a use for int64 in the end.

For the "mapping" library, there isn't much of a choice but to remake it 
with a slower algorithm, unless someone knows a magic trick for cancelling 
almost all of the error while not running so slow. Actually, it already 
runs in linear time, so it wouldn't be such a big loss if the complete sum 
was recomputed at every step, because it would still be linear. It would 
be a big loss if it could run in constant time (e.g. using an array for 
the queue) and had to be switched to linear time.
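
Here's a sketch of that recompute-everything approach in plain C (the 
names and the window size are made up; this isn't the mapping library's 
actual code): keep the last n values in a ring buffer and recompute mean 
and variance from scratch for every new input. It's O(n) per step, but no 
rounding error can carry over from one step to the next, and computing the 
variance as a sum of squared deviations avoids the huge intermediate 
squares in the first place.

    /* recompute variance over the whole window at every step */
    #include <stdio.h>

    #define N 10                       /* window size, chosen arbitrarily */

    static float queue[N];
    static int head = 0, count = 0;

    /* push a new value, then recompute the variance from scratch */
    float variance_push(float x) {
        queue[head] = x;               /* overwrite the oldest value */
        head = (head + 1) % N;
        if (count < N) count++;

        float mean = 0.0f;
        for (int i = 0; i < count; i++) mean += queue[i];
        mean /= count;

        float var = 0.0f;              /* sum of squared deviations */
        for (int i = 0; i < count; i++) {
            float d = queue[i] - mean;
            var += d * d;
        }
        return var / count;
    }

    int main(void) {
        for (int x = 100000; x >= 10000; x -= 10000)
            printf("var = %f\n", (double)variance_push((float)x));
        return 0;
    }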

  _ _ __ ___ _____ ________ _____________ _____________________ ...
| Mathieu Bouchard - tél:+1.514.383.3801, Montréal, Québec


More information about the Pd-list mailing list