[PD] request for objections: any2string -> unsigned char
moocow at ling.uni-potsdam.de
Fri Jan 16 10:48:43 CET 2009
moin Mathieu, moin all,
On 2009-01-15 20:45:13, Mathieu Bouchard <matju at artengine.ca> appears to
have written:
> On Thu, 15 Jan 2009, Bryan Jurish wrote:
>> byte-strings are IMHO the more basic representation (a
>> char* is still a char*, even in this post-unicode world).
> What happened is that people switched to UTF-8 instead of some
> fixed-size encoding because many apps that assume that a character is a
> byte will work anyway.
UTF-8 also does a pretty good job of compactly representing Latin
character sets for natural language data, where non-ASCII characters
tend to be relatively infrequent anyway. UTF-16 and UTF-32 are pretty
wasteful in these cases. (Of course, I'm biting my own tail with this
point, since the [pdstring] representation is even more wasteful than
> Just don't ask those apps to say how many
> characters there are in a string though. You have to pretend that all
> the "special" characters are pairs of characters instead (when they are
> not triplets).
Indeed. Ugly but true.
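
To make the byte-vs-character mismatch concrete, here is a minimal C sketch.
It assumes the input is valid, NUL-terminated UTF-8; the function name
utf8_codepoints() is made up for this example and is not part of Pd or of
[pdstring]. It counts characters by skipping continuation bytes, which is
exactly the extra pass a byte-oriented app does not make:

```c
#include <stddef.h>

/* Count code points in a NUL-terminated UTF-8 string.
 * Assumption: the input is valid UTF-8. Continuation bytes have the
 * bit pattern 10xxxxxx, so every byte that does NOT match that pattern
 * starts a new code point. */
static size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the 10-byte string "na" "\xc3\xaf" "ve " "\xe2\x82\xac" (an i-diaeresis
encoded as a byte pair, a euro sign as a byte triplet), strlen() reports 10
while utf8_codepoints() reports 7.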
> I gather that it'll take a long time before Pd gets unicode support...
I suspect you're right.
>> ... except if you're building rsp. reading a persistent index for a
>> large file, in which case tell() & seek() are likely to be a wee bit
>> faster than parsing and counting variable-length-encoded characters ...
... or calling malloc(), or doing pretty much any other low-level fiddly
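
The indexing case quoted above might look roughly like the following C
sketch, using the stdio equivalents ftell() and fseek(). The helper names
build_line_index() and fetch_line() are hypothetical, and it assumes a
newline-delimited file opened in binary mode. The point is that the index
stores plain byte offsets, so fetching a record never requires decoding or
counting variable-length characters:

```c
#include <stdio.h>
#include <string.h>

/* Record the byte offset at which each line starts.
 * Returns the number of offsets stored (at most max_lines). */
static size_t build_line_index(FILE *fp, long *offsets, size_t max_lines)
{
    size_t n = 0;

    rewind(fp);
    while (n < max_lines) {
        long pos = ftell(fp);           /* offset of the line about to be read */
        int c = fgetc(fp);
        if (c == EOF)
            break;
        offsets[n++] = pos;
        while (c != '\n' && c != EOF)   /* skip the rest of this line */
            c = fgetc(fp);
    }
    return n;
}

/* Jump straight to a stored offset and read one line into buf,
 * stripping the trailing newline. Returns 0 on success, -1 on error. */
static int fetch_line(FILE *fp, long offset, char *buf, int buflen)
{
    if (fseek(fp, offset, SEEK_SET) != 0)
        return -1;
    if (fgets(buf, buflen, fp) == NULL)
        return -1;
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}
```

Typical use would be to call build_line_index() once, keep the offsets
array around (or write it to disk), and then call fetch_line() with
offsets[i] whenever record i is needed.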
Bryan Jurish "There is *always* one more bug."
jurish at ling.uni-potsdam.de -Lubarsky's Law of Cybernetic Entomology