[PD] standard encoding of pd files in each system?

Sat Jan 29 22:01:08 CET 2011

moin Mathieu, moin all,

On 2011-01-29 17:12:02, Mathieu Bouchard <matju at artengine.ca> appears to
have written:
> On Fri, 28 Jan 2011, Bryan Jurish wrote:
> 
>> iirc, Miller has indicated in the past that he feels this sort of
>> thing should be done using arrays.
> 
> But a feeling is but a feeling. Now, how about a justification ?
> But that's not the sort of thing one gets from Miller often.
> 
>>  (B) you must scale all size attributes (e.g. for re-allocation) by
>> 1.0/sizeof(t_float), so to get an accurate byte length that is not a
>> multiple of sizeof(t_float), you need to actually store that length
>> additionally somewhere else
> 
> sizeof(t_float) is always a power of two, isn't it ? I haven't heard of
> anyone using 80-bit or 96-bit floats as t_float or t_sample.
> 
> thus a size stored as float will be accurate up to 16777216.
> 
> This is regardless of whether you store size*4, size, or size/4 : floats
> are quite scale-independent, but are perfectly so when the scalings are
> powers of two (provided you don't overflow by scaling by pow(2,128) or so)
> 
> I think you could read a bit about the IEEE-754 standard :
>   http://en.wikipedia.org/wiki/IEEE_754
> 
> But especially some kind of short, direct tutorial that will make it
> obvious what won't be rounded and what will be :
>   http://kipirvine.com/asm/workbook/floating_tut.htm

Yup, all freely stipulated.  My issue was not so much with the use of
floats qua floats to store size data, rather the necessity of storing
size data *in addition to* the size reported by the array itself.  In
this scenario, we're blatantly re-casting the array's (t_float*) into a
(char*) and reading/writing raw bytes.  But maybe we don't want C-style
NUL-terminated strings, but rather perl-ish (or Berkeley DB-ish) strings
which admit embedded NULs and store their length in an additional
dedicated attribute (usually an unsigned int, but sure, we could use a
float if we wanted).  The problem is that if we (ab)use the existing
garray API (garray_getfloatarray(), garray_npoints(), garray_resize())
to do this, then the sizes reported for the array may be longer than the
size of the string.  My system uses 32-bit floats; say I want to store
the string "foo" (without terminating NUL).  "foo" takes up less than
the space of 1 float (3 bytes < 4 bytes), but garray_npoints() for a
float array of size 1 is going to give me 1, and 1*sizeof(float) = 4 >
3, so if I want to implement strings this way, I've got to fiddle around
with some additional convention for storing their actual length.
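
A minimal sketch of the length problem described above, written without
the Pd headers so it stands alone.  The `bytestring` type and both helper
functions are hypothetical names invented for illustration; they are not
part of Pd or of any existing external.  The plain `float *` stands in
for the vector the garray API would hand back.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical type (not part of Pd): raw bytes stored in float-sized
 * cells, with the true byte length kept in a separate field. */
typedef struct {
    float  *cells;   /* stands in for the vector owned by the array */
    size_t  ncells;  /* what the array itself could report as its size */
    size_t  nbytes;  /* actual string length; the array cannot express it */
} bytestring;

/* Number of float cells needed to hold nbytes raw bytes, rounded up. */
static size_t cells_for_bytes(size_t nbytes)
{
    return (nbytes + sizeof(float) - 1) / sizeof(float);
}

/* Copy nbytes from src into freshly allocated, zero-filled cells.
 * Returns 0 on success, -1 if allocation fails. */
static int bytestring_set(bytestring *bs, const char *src, size_t nbytes)
{
    size_t ncells = cells_for_bytes(nbytes);
    float *cells;

    assert(bs != NULL && src != NULL);
    /* Allocate at least one cell so an empty string is not a 0-size request. */
    cells = calloc(ncells ? ncells : 1, sizeof(float));
    if (cells == NULL)
        return -1;
    memcpy(cells, src, nbytes);
    bs->cells  = cells;
    bs->ncells = ncells;
    bs->nbytes = nbytes;
    return 0;
}
```

With 4-byte floats, storing "foo" this way occupies one cell, so the
cell count alone says "4 bytes"; only the extra `nbytes` field records
that the string is 3 bytes long.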

It looks as if the whole garray stuff is defined abstractly enough to
handle more than just "plain" float arrays, but I haven't dug deep
enough to figure out what exactly those (t_template*)s are all about or
how I might be able to (ab)use them...

>>  (C) saving array data with a patch and re-loading can cause data loss
>> (float truncation may mess up raw byte values)
> 
> for integers, all values from -1000000 to 1000000 will be correctly
> saved (those two bounds will be encoded as -1e+6 and 1e+6, and all the
> rest will look like plain integers).

Yes.  See above re:

 char *s;
 int size;
 garray_getfloatarray(a, &size, (t_float **)&s);
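
To make the difference between the two cases concrete, here is a small
stand-alone sketch.  It assumes that array values saved with a patch
pass through a "%g"-style text representation (about 6 significant
digits) and that float is a 32-bit IEEE-754 single; both are
assumptions of the sketch, and the function names are made up for
illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Treat 4 raw bytes (given as a 32-bit pattern) as a float, write it
 * as "%g" text, parse the text back, and return the resulting bits. */
static uint32_t roundtrip_bits_via_text(uint32_t bits)
{
    float f, g;
    char buf[64];
    uint32_t out;

    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&f, &bits, sizeof f);          /* reinterpret, no conversion */
    snprintf(buf, sizeof buf, "%g", f);   /* text form, 6 significant digits */
    g = strtof(buf, NULL);
    memcpy(&out, &g, sizeof out);
    return out;
}

/* Store one byte value per float and send it through the same text
 * form; returns 1 if the value comes back unchanged. */
static int byte_survives_text(unsigned char b)
{
    char buf[64];

    snprintf(buf, sizeof buf, "%g", (float)b);
    return strtof(buf, NULL) == (float)b;
}
```

Under those assumptions, the pattern 0x3F800001 (1 + 2^-23 as a float)
is written as "1" and comes back as 0x3F800000, so one of the four raw
bytes has changed.  Byte values 0..255 stored one per float all survive,
which is the integer case Mathieu describes.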

>>  (D) it's not really portable (byte order problems with load/save)
> 
> byte order problems won't happen with floats saved as text. they will
> happen with floats saved as binary. they will also happen with UCS-2
> text saved as two floats per code point (no matter how you save the
> floats), but if you use UTF-8 instead, or if you use
> one-float-per-codepoint, that aspect will be safe.

No.  See above.  Messing about with typecasts is very
implementation-dependent, and afaik IEEE-754 doesn't define how its
components are to be implemented, only the formal criteria an
implementation must satisfy.
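
One way to sidestep the layout question when raw bytes have to leave
the machine is to pick a byte order explicitly instead of relying on
the in-memory cast.  The following is only a sketch: the function names
are hypothetical, and it assumes a 32-bit IEEE-754 float whose
endianness matches that of 32-bit integers on the host.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Write a float's bit pattern as 4 bytes, most significant byte first,
 * so the stored bytes do not depend on the host's byte order. */
static void float_to_be_bytes(float f, unsigned char out[4])
{
    uint32_t bits;

    assert(sizeof f == sizeof bits);
    memcpy(&bits, &f, sizeof bits);
    out[0] = (unsigned char)(bits >> 24);
    out[1] = (unsigned char)(bits >> 16);
    out[2] = (unsigned char)(bits >> 8);
    out[3] = (unsigned char)bits;
}

/* Inverse of float_to_be_bytes(). */
static float float_from_be_bytes(const unsigned char in[4])
{
    uint32_t bits = ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16)
                  | ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
    float f;

    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Returns 1 if a plain (char *) view of a float in memory already has
 * the byte order produced by float_to_be_bytes(), 0 otherwise. */
static int raw_cast_matches_be(float f)
{
    unsigned char be[4];

    float_to_be_bytes(f, be);
    return memcmp(&f, be, sizeof be) == 0;
}
```

raw_cast_matches_be() gives different answers on little-endian and
big-endian hosts for most values, which is the portability problem with
the (char *) cast; the explicit conversion gives the same bytes on both.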

>> 2) If otoh you let the array remain a t_float* and just assign the
>> floats byte ((unsigned) char) or even wide character (wchar) values,
>> then:
>>  (A) you potentially waste a lot of memory
>> (strlen(str)*(sizeof(float)-1) bytes)
> 
> In 2011, wasting a lot of RAM is not a problem. Wasting too much RAM can
> be a problem, and that's very relative, as quite often, the solution is
> to wait until RAM is less expensive. I like the idea of not wasting any
> RAM, but I recognise that this is because I got used to think about ways
> to reduce waste, not because it's always good to worry about it.
> 
> Text is usually a lot smaller than video. It's not uncommon for me to
> store a buffer of 64 frames of video in colour. In 640x480, that's over
> 55 megs, and that's tiny compared to the total amount of RAM the
> computer has. How often do you need that much text at once in RAM ?

Stipulated for most purposes.  Taking ratts as an example, the CMU
dictionary is only 3.5M, the beep dictionary is 7.6M.  The non-free
German dictionary BOMP is still only 9.1M.  I agree that none of this is
going to "make the cabbage any fatter", as a saying here goes.  In other
work, I need much more data.  The morphology transducer I use is 153M
stored offline (and more at runtime).  A simple word trigram model
bootstrapped from a decent sized corpus can run into the hundreds of MB
(the little one I have on hand is only 26MB) ... have a look at the
google n-grams for an idea of where that leads when you add lots more
data.  Basically, the moral is: mixing Zipf's law and polynomial growth
with respect to vocabulary size (e.g. n-gram models) can get you in a
deep hole very very quickly.  fwiw, the raw text of the whole corpus I
work with these days runs about 1G.  A single file with all intermediate
data can easily run over 400MB.  I really wouldn't want to go to N*4
there...

>>  (C) if you really want to store your string data in an array, you can
>> use [str] or [pdstring] together with e.g. [tabdump] and [tabset] from
>> zexy, which just makes the conversion overhead explicit.
> 
> GridFlow's grids support the byte format (unsigned char). This is one of
> the six allowed grid formats, and perhaps the 2nd most used (after
> signed int).

But GridFlow isn't vanilla either.

>> I think there are workarounds for both techniques, but not without
>> patching the pd core code, and if we're going to patch the core code,
>> we might as well take a patch that does the job "right" (i.e.
>> Martin's)...
> 
> If all of this can work as externals without hairy workarounds, then you
> don't need to be obsessing about patching pd's core code, and that's a
> good thing, especially if you aim to be patching vanilla's.

I think it probably can, but it's likely to amount to dependence on more
than is offered by pd-vanilla alone (e.g. GridFlow, Martin's patch, etc.).
Also, I have yet to come up with a satisfying way to make an easily
extensible string-handling library for pd -- anyone wanting to handle
strings in other externals should ideally also have access to some
common string handling API, without having to resort to the pd-patch
level, and the byte-values-in-floats approach just doesn't handle that
well.  I suppose I could make an array-like external specifically for
handling string buffers, but that's essentially what Martin's patch
does, so why re-invent the wheel?

marmosets,
	Bryan

-- 
Bryan Jurish                       "There is *always* one more bug."
jurish at uni-potsdam.de       -Lubarsky's Law of Cybernetic Entomology