[PD] request for objections: any2string -> unsigned char

Fri Jan 16 11:01:14 CET 2009

moin again all,

On 2009-01-15 20:37:12, Mathieu Bouchard <matju at artengine.ca> appears to
have written:
> On Thu, 15 Jan 2009, IOhannes m zmoelnig wrote:
> 
>> so does anybody object to using an "unsigned" type rather than a
>> signed one?
>> expanding "uchar" to "uint" or whatever is no-work on the Pd-side of
>> things.
> 
> It's not that, it's that if you have ü (u umlaut) taken from a UTF-8
> file, then do you treat it as 195 188, or as 252? That is, is it
> predominantly two bytes, or predominantly one character?

To clarify: my position is that the most fundamental representation is
the raw byte string, so a UTF-8 'ü' would be represented as
bytes(encode("utf8",'ü'))={195,188}.  Nothing stands in the way of
parsing Unicode codepoint values from such a representation, to get
unicode_chars("utf8",{195,188})={252}.
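
As a rough sketch of that round trip (Python used purely for
illustration; raw_bytes() and codepoints() are made-up helper names,
not anything in Pd or any2string):

```python
# Hypothetical helpers, for illustration only.

def raw_bytes(text, encoding):
    """Encode text and return its bytes as a list of unsigned values (0..255)."""
    return list(text.encode(encoding))

def codepoints(byte_values, encoding):
    """Decode a list of unsigned byte values and return Unicode codepoints."""
    return [ord(ch) for ch in bytes(byte_values).decode(encoding)]

u_umlaut = "\u00fc"  # 'ü'

print(raw_bytes(u_umlaut, "utf-8"))     # [195, 188]
print(codepoints([195, 188], "utf-8"))  # [252]
```

The byte list is the primary data here; the codepoint list is derived
from it only once an encoding has been named.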

OTOH, if the file were known to be encoded in Latin-1, we'd have
bytes(encode("latin1",'ü'))={252}.  Without knowing the encoding of the
source, there's no reliable way to determine which Unicode codepoints
its raw bytes represent (or even whether the powers that be have seen
fit to define such codepoints), so I would argue against making Unicode
the *only* available internal representation.
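
To illustrate the ambiguity, here is a small sketch (again Python, and
try_decode() is a hypothetical helper of my own, not an existing API)
of what happens when the same bytes are read under the wrong
assumption:

```python
# Hypothetical helper, for illustration only.

def try_decode(data, encoding):
    """Return the codepoints for data under encoding, or None if invalid."""
    try:
        return [ord(ch) for ch in data.decode(encoding)]
    except UnicodeDecodeError:
        return None

as_utf8 = "\u00fc".encode("utf-8")      # bytes 195 188
as_latin1 = "\u00fc".encode("latin-1")  # byte 252

print(try_decode(as_utf8, "utf-8"))      # [252]
print(try_decode(as_latin1, "latin-1"))  # [252]

# Wrong guesses: Latin-1 accepts any byte, so the UTF-8 bytes silently
# come out as two unrelated characters; the lone byte 252 is not valid
# UTF-8, so decoding it fails outright.
print(try_decode(as_utf8, "latin-1"))    # [195, 188]
print(try_decode(as_latin1, "utf-8"))    # None
```

Either way, the raw bytes alone don't say which reading was intended.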

marmosets,
	Bryan

-- 
Bryan Jurish                           "There is *always* one more bug."
jurish at ling.uni-potsdam.de      -Lubarsky's Law of Cybernetic Entomology