[PD] request for objections: any2string -> unsigned char

Bryan Jurish moocow at ling.uni-potsdam.de
Fri Jan 16 10:48:43 CET 2009


moin Mathieu, moin all,

On 2009-01-15 20:45:13, Mathieu Bouchard <matju at artengine.ca> appears to
have written:
> On Thu, 15 Jan 2009, Bryan Jurish wrote:
> 
>> byte-strings are IMHO the more basic representation (a
>> char* is still a char*, even in this post-unicode world).
> 
> What happened is that people switched to UTF-8 instead of some
> fixed-size encoding because many apps that assume that a character is a
> byte will work anyway. 

UTF-8 also does a pretty good job of compactly representing Latin
character sets for natural language data, where non-ASCII characters
tend to be relatively infrequent anyway.  UTF-16 and UTF-32 are pretty
wasteful in these cases, at 2 and 4 bytes per character against 1 byte
per ASCII character in UTF-8.  (Of course, I'm biting my own tail with
this point, since the [pdstring] representation is even more wasteful
than UTF-32 ;-)
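
A minimal sketch of the size difference, assuming Python 3.  The sample
phrase and the helper name encoded_sizes are made up for illustration;
the "-le" codec variants are used only so that no byte-order mark is
counted.

```python
# Hypothetical sample: mostly ASCII with two non-ASCII letters
# ("Käse und Brötchen"), written with escapes to stay encoding-proof.
sample = "K\u00e4se und Br\u00f6tchen"

def encoded_sizes(text):
    """Return the encoded length in bytes for each encoding."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}

sizes = encoded_sizes(sample)
for enc, nbytes in sizes.items():
    print(f"{enc:10s} {nbytes:3d} bytes for {len(sample)} characters")
```

For 17 characters, two of them non-ASCII, UTF-8 needs 19 bytes where
UTF-16 needs 34 and UTF-32 needs 68.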

> Just don't ask those apps to say how many
> characters there are in a string though. You have to pretend that all
> the "special" characters are pairs of characters instead (when they are
> not triplets).

Indeed.  Ugly but true.
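
To make the mismatch concrete, here is a small sketch, assuming Python 3
and a made-up sample string.  A byte-oriented app sees len(data); the
real character count comes from skipping UTF-8 continuation bytes (those
of the form 10xxxxxx).  The function name count_code_points is
hypothetical.

```python
# Hypothetical sample: "naïve €5" (one 2-byte and one 3-byte character).
text = "na\u00efve \u20ac5"
data = text.encode("utf-8")

def count_code_points(utf8_bytes):
    """Count characters by skipping continuation bytes (0b10xxxxxx)."""
    return sum(1 for b in utf8_bytes if (b & 0xC0) != 0x80)

byte_count = len(data)                 # what a char-is-a-byte app reports
char_count = count_code_points(data)   # what a UTF-8-aware app reports
print(byte_count, char_count)
```

Here the byte count is 11 while the character count is 8: the "ï" shows
up as a pair and the "€" as a triplet.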

> I gather that it'll take a long time before Pd gets unicode support...

I suspect you're right.

>> ... except if you're building or reading a persistent index for a
>> large file, in which case tell() & seek() are likely to be a wee bit
>> faster than parsing and counting variable-length-encoded characters ...
> 
> right.

... or calling malloc(), or doing pretty much any other low-level fiddly
stuff ...
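
A rough sketch of the index case, assuming Python 3 and an in-memory
stand-in for the large file; build_line_index and read_line are
hypothetical names, not anything from Pd or [pdstring].  The index
stores byte offsets from tell(), so a lookup is a single seek() and no
variable-length characters need to be counted.

```python
import io

def build_line_index(f):
    """Return the byte offset at which each line of binary file f starts."""
    offsets = []
    f.seek(0)
    while True:
        pos = f.tell()          # byte offset, independent of encoding
        if not f.readline():    # empty bytes object means end of file
            break
        offsets.append(pos)
    return offsets

def read_line(f, offsets, n):
    """Jump straight to line n via its byte offset and decode only that line."""
    f.seek(offsets[n])
    return f.readline().decode("utf-8").rstrip("\n")

# Stand-in for a large UTF-8 file; the second line has two 2-byte characters.
f = io.BytesIO("erste Zeile\nzw\u00f6lf B\u00e4ren\ndritte\n".encode("utf-8"))
index = build_line_index(f)
print(index, read_line(f, index, 1))
```

The offsets come out as [0, 12, 26]: they are byte positions, so the
second line accounts for 14 bytes even though it holds only 11
characters plus the newline.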

marmosets,
	Bryan

-- 
Bryan Jurish                           "There is *always* one more bug."
jurish at ling.uni-potsdam.de      -Lubarsky's Law of Cybernetic Entomology




