[PD] standard encoding of pd files in each system?

Tue Feb 1 01:03:22 CET 2011

moin again,

On 2011-01-30 17:07:44, Mathieu Bouchard <matju at artengine.ca> appears to
have written:
> On Sat, 29 Jan 2011, Bryan Jurish wrote:
>> In this scenario, we're blatantly re-casting the array's (t_float*)
>> into a (char*) and reading/writing raw bytes.
> 
> Ok, I thought you were going to write one codepoint per float and not do
> any reinterpret_cast.
> 
> I'd advise against relying on reinterpret_cast hacks like this, and
> instead, add support for other number types in the t_array struct and
> supporting functions. In that case it'd become significantly closer to
> struct Grid, but without the support for multiple dimensions of indices.

I agree that sounds like the best way within pd-vanilla as it appears
currently to be constructed, but I'm not at all sure about how to go
about it.  And up until ca. 11:14 AM this morning, I had semi-major
post-catholic guilt attacks every time I even thought about doing
something involving a computer keyboard that didn't directly involve
either my dissertation or my day job.  Happily, the former is out of my
hands now (fanfare, please :-)

> This change could also prevent wasting half of the t_array memory when
> storing floats on 64-bit computers, which is currently a good source of
> ridicule.

Agreed.

>> Yes.  See above re:
>> char *s;
>> garray_getfloatarray(a,size,(float **)&s);
> 
> This is "deprecated" since 0.41, but really, this is a function call
> that never ever worked properly in 64-bit mode. You need to use
> garray_getfloatwords instead, which returns a t_word.

Yes, I did see that the underlying data was stored as a t_word; I only
briefly re-grepped the sources when composing my mail... I haven't
really dedicated a whole lot of time to this attempt, as I think you can
probably tell (bad hacker, no biscuit...)
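
For the record, here is a minimal sketch of why stepping through the word array with a (float *) goes wrong.  `word_t` is only a stand-in I made up for pd's t_word (the real union has more members), and `words_to_floats()` is a hypothetical helper, not anything in the pd API:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for pd's t_word (assumption: what matters here is only that
 * the union holds a float alongside at least one pointer member). */
typedef union {
    float w_float;
    void *w_ptr;
} word_t;

/* Copy n floats out of a word array, one element at a time.  Walking the
 * same memory through a (float *) would use the wrong stride whenever
 * sizeof(word_t) != sizeof(float). */
void words_to_floats(const word_t *src, float *dst, size_t n)
{
    size_t i;
    assert(src != NULL && dst != NULL);
    for (i = 0; i < n; i++)
        dst[i] = src[i].w_float;
}
```

On a typical 64-bit build sizeof(word_t) comes out as 8 against 4 for a float, which is also where the "half of the memory" figure above comes from.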

>> No.  See above.  Messing about with typecasts is very
>> implementation-dependent, and afaik IEEE-754 doesn't define how its
>> components are to be implemented, only the formal criteria an
>> implementation must satisfy.
> 
> If you (or pd) never actually read the contents as float values in your
> use of reinterpret_cast to store, it doesn't matter, as you're doing
> nothing with float* that may depend on the difference between, say,
> float* and char*.

True.  But while I can guarantee this for "string-like" operations, I
can't seem to finagle it for pd, which insists on treating arrays (at
least those defined at the patch level) as t_float[]s.  It looks like the
culprit here is garray_save() calling binbuf_addv() to buffer the array
data, which presumably gets dumped to the file at some point via
binbuf_gettext(), which calls atom_string(), which ends up dumping the
float with the sprintf() "%g" format, i.e. truncating... but we all knew
that already, right?
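
A minimal sketch of the truncation, outside of pd entirely: take a 32-bit pattern, view it as a float, print it with "%g" and parse it back.  `survives_g_roundtrip()` is a name I made up for illustration, and the whole thing assumes float is a 32-bit IEEE-754 single:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 if the bit pattern, viewed as a float, comes back
 * bit-identical after being printed with "%g" and parsed again,
 * 0 otherwise.  Assumes a 32-bit IEEE-754 float. */
int survives_g_roundtrip(uint32_t bits)
{
    float in, out;
    uint32_t back;
    char text[64];

    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&in, &bits, sizeof in);          /* raw bits -> float, no cast */
    snprintf(text, sizeof text, "%g", in);  /* 6 significant digits */
    out = strtof(text, NULL);
    memcpy(&back, &out, sizeof back);
    return back == bits;
}
```

The pattern 0x64636261 (the bytes 'a','b','c','d' on a little-endian machine) does not survive, since six significant digits are not enough to pin down a 24-bit mantissa; something tame like 0x3f800000 (1.0) does.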

> But you're not supposed to be using float* anymore, just t_word*, if you
> want to continue with the reinterpret_cast hack.

reinterpret_cast<> doesn't exist.

there's only (char*)expr  (cf. Kernighan & Ritchie, 1988)

:-P

sorry for being pedantic; you are of course correct that
reinterpret_cast<> is the C++ equivalent for what I'm doing; I just
think it's too much to type, which is yet another thing I dislike about
C++ ...

anyhoo, ok: I can (ab)use typecasts using (t_word*) instead of
(t_float*).  My original gripes (1C) and (1D) still hold: this breaks on
save/load of patches, if "string" data is to be saved with its array.  I
can of course say "not my problem" and leave it at that, but there are
still points (1A) and (1B).  (1A) is really just a convenience issue
(typecasting) -- if that was the only issue, I'd already have included
the code in [pdstring].  (1B) is the string-length issue, and is much
hairier.  We can't get string length and array length always to jibe,
and if I add an extra object to store string length (and maybe other
properties), then why don't I just dump the string data as a (char*)
into that object, and leave g_array out of it entirely?
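
The length mismatch in (1B) is just rounding, which a minimal sketch makes plain.  `elems_for_bytes()` is a hypothetical helper, and the element size is passed in rather than assumed:

```c
#include <assert.h>
#include <stddef.h>

/* Number of array elements needed to hold nbytes raw bytes when each
 * element is elem_size bytes wide (rounds up). */
size_t elems_for_bytes(size_t nbytes, size_t elem_size)
{
    assert(elem_size > 0);
    return (nbytes + elem_size - 1) / elem_size;
}
```

With 8-byte elements, every string of 1 to 8 bytes needs exactly one element, so the array length alone cannot tell you how long the string is.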

Martin's patch does just this: he adds a (length,data) pair struct
t_string and a t_string* to the t_word struct.  I suppose I could
divorce the underlying struct from t_word, wrap it into a new object,
bind a symbol, etc etc.... I still think the idea of using arrays for
strings is intriguing, not least for the sheer amount of abuse potential
arising from combining text bytes and audio signals in the same
arrays... so no, I guess I don't want to drop the g_array idea entirely
(apologies for answering my own question)...
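
For comparison, a (length,data) pair along those lines might look roughly like the sketch below.  This is not the struct from Martin's patch; `lstring`, `lstring_init()` and `lstring_free()` are names invented here purely for illustration:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical (length, data) pair; NOT the struct from Martin's patch. */
typedef struct {
    size_t         len;   /* number of bytes in data, NULs allowed */
    unsigned char *data;  /* heap-allocated, owned by this struct */
} lstring;

/* Initialise s with a copy of len bytes from src.
 * Returns 0 on success, -1 if the allocation fails. */
int lstring_init(lstring *s, const void *src, size_t len)
{
    unsigned char *buf;
    assert(s != NULL && (src != NULL || len == 0));
    buf = malloc(len ? len : 1);   /* avoid a zero-size malloc */
    if (buf == NULL)
        return -1;
    if (len)
        memcpy(buf, src, len);
    s->len = len;
    s->data = buf;
    return 0;
}

/* Release the buffer and reset the pair to empty. */
void lstring_free(lstring *s)
{
    assert(s != NULL);
    free(s->data);
    s->data = NULL;
    s->len = 0;
}
```

Since the length is stored explicitly, embedded NUL bytes are no problem, and getting at the data as (unsigned char*) costs nothing.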

>> deep hole very very quickly.  fwiw, the raw text of the whole corpus I
>> work with these days runs about 1G.  A single file with all intermediate
>> data can easily run over 400MB.  I really wouldn't want to go to N*4
>> there...
> 
> 4096 MB of DDR3 RAM is currently 37,99 $, going downhill. So, even with
> N*8, it doesn't look like the end of the world. (l)

grr.... yes yes, point taken, but I find it horribly nasty to waste more
than half of the memory allocated (ok, the strings-as-lists-of-floats
waste even more, but that's explicit and open about its hackery; putting
byte values into floats under the hood and calling the result "string"
would be cunning, devious, and underhanded hackery... or something like
that)

Now we're on to method (2).  The show-stopper for me here is argument
(2B): external APIs.  Under this method, every time I want my pd
"string" as a C string, I have to explicitly convert it, and vice versa,
which takes additional buffers, possibly (re-)allocations, and O(N)
time.  This is all likely to happen only at the control level, so maybe
that's not system critical either.  Being able to easily incorporate
external string-processing APIs (e.g. the C library string handling
routines) is a pretty big desideratum for a string handling mechanism,
im(ns)ho, so I'd have to build in some shared conversion routines,
buffer structures, etc. ... but also to export them beyond the confines
of a single external, which I haven't tested at all yet, and am not even
sure is realistic to think about, except maybe as a static library or
code base... so the whole method (2) begins to look pretty baroque as
soon as it passes out of "proof-of-concept" and towards "usable API".
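
To show what the conversion overhead in (2B) amounts to, a minimal sketch of both directions follows.  The function names are hypothetical, the caller is assumed to supply the destination buffers, and the float values are assumed to lie in 0..255 already:

```c
#include <assert.h>
#include <stddef.h>

/* Method (2) in miniature: one byte value per float.  Both directions
 * are O(n) and need a separate destination buffer. */

/* dst must have room for n floats. */
void bytes_to_floats(const unsigned char *src, float *dst, size_t n)
{
    size_t i;
    assert(src != NULL && dst != NULL);
    for (i = 0; i < n; i++)
        dst[i] = (float)src[i];
}

/* dst must have room for n + 1 bytes; the result is NUL-terminated.
 * Values are assumed to be in 0..255 (no range checking here). */
void floats_to_cstring(const float *src, char *dst, size_t n)
{
    size_t i;
    assert(src != NULL && dst != NULL);
    for (i = 0; i < n; i++)
        dst[i] = (char)(unsigned char)src[i];
    dst[n] = '\0';
}
```

Every call out to a C string routine would need one such pass first (and another one coming back if the routine modifies the string), plus somewhere to keep the scratch buffer.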

> But I'd still advocate adding multiple number type support to arrays.

Sounds like the best approach, agreed.

>>> GridFlow's grids support the byte format (unsigned char). This is one of
>>> the six allowed grid formats, and perhaps the 2nd most used (after
>>> signed int).
>> But GridFlow isn't vanilla either.
> 
> How many solutions do you want to reject ?

n-1, for some natural number n.  sorry, can't be more specific yet.

Honestly, if I had a pressing need for handling large-ish amounts of
text data in pd, I would probably look to GridFlow.  As it is, I usually
wind up trying to get all my string processing done outside of pd, and
passing the data back and forth via OSC or (brace yourself) the
filesystem, where the "strings" wind up as symbols, and put a good deal
of stress on pd's symbol table, but hey... it explodes only very rarely...

>> I think it probably can, but it's likely to amount to dependence on more
>> than offered by just pd-vanilla (e.g. GridFlow, Martin's patch, etc.).
> 
> Duh. Now do you really think you're saving yourself so much work by not
> installing easy-to-install software ? On Ubuntu and OSX, you can install
> both Martin's blobs and GridFlow with not much more than two clicks.

I have installed Martin's blobs.  It involved only a single patch to the
pd core and a re-compilation.  No big deal.  Last time I tried to
install GridFlow (this was years ago), I was bitten by many (potential)
dependencies and an old system, and gave up.  I should give it a whirl
again; many of its features I've heard you mention on the list at one
time or another sound very useful indeed.

I am not really concerned about the number of clicks it takes to install
software (I'm happy with (./configure; make; make install) myself); what
I am concerned about is the *portability* of my own software.  If I were
to work on yet another string handler for pd, I'd like to make sure that
it's got as wide a potential user base as possible.  Not everyone uses
pd-extended.

> But I was mentioning GridFlow just to tell you what's in there. From
> there, not only you can decide to use GridFlow, but if you decide to
> instead modify Pd, you can look at how GridFlow does it : isn't that
> interesting ?

It is, although I usually try to avoid mucking about in other people's code.

>> I suppose I could make an array-like external specifically for
>> handling string buffers, but that's essentially what Martin's patch
>> does, so why re-invent the wheel?
> 
> I think that how solutions can differ in the most notable way, is about
> where the string is actually stored, how you can keep one, how long you
> can keep it (when does it get deleted) and can you make a list of
> strings without having one [str] object per element. Do you agree ?

I'd say those are good differentia, yes.  From where I'm standing, I'd
put memory footprint and compatibility with existing 3rd-party APIs
(e.g. conversion to/from (char*)) at the top of the list, and Martin's
strings fulfill those criteria admirably.  Persistence is worth thinking
about, but if you can get to/from (char*), it's pretty easy to roll your
own persistence code if need be.

The list-of-strings issue you brought up is very interesting indeed; I'm
almost tempted to push that into a general discussion of nested data
structures, but I think we've already drawn this thread far enough OT ;-)

marmosets,
	Bryan

-- 
Bryan Jurish                       "There is *always* one more bug."
jurish at uni-potsdam.de       -Lubarsky's Law of Cybernetic Entomology