[PD] request for objections: any2string -> unsigned char

Mon Jan 19 16:15:22 CET 2009

morning again all,

On 2009-01-19 15:19:04, Martin Peach <martin.peach at sympatico.ca> appears
to have written:
> Bryan Jurish wrote:
>> well, without wanting to be trite, I have to say that I think that "data
>> transmission" and "linguistic processing" are pretty much synonymous.
> 
> Pretty much but linguistic processing is happening at a higher level
> than data transmission, and the 'character' used in language may be
> represented by more than one data transmission 'character'.

Well, speaking as a linguist, I guess I have to say that I don't think
meaningful "linguistic processing" can really happen at the character
level (even though that's almost invariably the starting point for NLP
programs), but that's just me getting pedantically OT.  Sorry.

Of course you're right, and "characters" are intended to represent
linguistically salient units ("graphemes").

> With ASCII and its 8-bit relatives it's almost the same because the two
> kinds of character are the same, but unicode uses more than one data
> character per linguistic character.

Yup (well, sometimes it does... it's the curse of those darned
variable-length encodings again...)
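
To make the "sometimes" concrete, here is a minimal sketch in Python (my
choice for illustration, not anything from the Pd sources) that counts
how many bytes UTF-8 needs for a few sample characters.  The sample
characters and the name `utf8_lengths` are just assumptions for the
example:

```python
# Sketch: UTF-8 is variable-length, so the number of bytes per
# character depends on the code point.
samples = ["a", "\u00e9", "\u20ac", "\U0001d11e"]  # a, e-acute, euro sign, G clef

# Map each character to the number of bytes in its UTF-8 encoding.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in utf8_lengths.items():
    print(f"U+{ord(ch):04X} -> {n} byte(s)")
```

Plain ASCII stays at one byte per character, which is why the two kinds
of "character" look the same until something outside ASCII shows up.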

> That's why I think there needs to be a distinction between the two types
> of 'string', and maybe two levels of objects to deal with them, like
> [unicode2byte].

I fully agree: we should distinguish between "byte strings" and
"(Unicode) character strings".  As for converting between character- and
byte-strings, there are a whole slew of encoding pitfalls to watch out
for.  Converting between (say) a UCS-4 Unicode character string and a
UTF-8 byte string is easy, but things get uglier if we want an old-style
8-bit encoding (other than Latin-1) for the byte string...
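
Roughly what I mean, as a hedged sketch in Python rather than anything
resembling actual external code: the helper names
`codepoints_to_utf8` and `utf8_to_codepoints` are hypothetical, and the
list of integers just stands in for a UCS-4 character string.

```python
def codepoints_to_utf8(codepoints):
    """Encode a list of Unicode code points (UCS-4 style) as UTF-8 bytes."""
    return "".join(chr(cp) for cp in codepoints).encode("utf-8")

def utf8_to_codepoints(data):
    """Decode UTF-8 bytes back into a list of Unicode code points."""
    return [ord(ch) for ch in data.decode("utf-8")]

# "Gruesse" spelled with u-umlaut (U+00FC) and sharp s (U+00DF).
cps = [0x47, 0x72, 0xFC, 0xDF, 0x65]

# The easy case: the round trip through UTF-8 is lossless.
utf8 = codepoints_to_utf8(cps)
roundtrip = utf8_to_codepoints(utf8)

# Latin-1 is also easy: for code points 0..255 the byte value is simply
# the code point itself.
latin1 = bytes(cps)

# Any other 8-bit encoding needs a lookup table and can fail outright.
# ISO-8859-7 (Greek) has no u-umlaut, so encoding raises an error.
text = "".join(chr(cp) for cp in cps)
try:
    text.encode("iso-8859-7")
    greek_ok = True
except UnicodeEncodeError:
    greek_ok = False

# Going the other way, the same byte means different characters
# depending on which 8-bit encoding we assume.
as_latin1 = bytes([0xFC]).decode("latin-1")
as_greek = bytes([0xFC]).decode("iso-8859-7")
```

The uglier part is deciding what to do in the failing case (drop the
character, substitute something, or refuse), and that is a policy
question the sketch deliberately leaves open.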

marmosets,
	Bryan

-- 
Bryan Jurish                           "There is *always* one more bug."
jurish at ling.uni-potsdam.de      -Lubarsky's Law of Cybernetic Entomology