[PD] locales for Pd WAS: japanese encoded chars in PD

Thu Feb 12 10:40:23 CET 2009

moin Hans, moin all,

On 2009-02-12 06:24:44, Hans-Christoph Steiner <hans at eds.org> appears to
have written:
> On Feb 11, 2009, at 6:34 AM, Bryan Jurish wrote:
>> for me, pd *does* display utf-8
>> strings correctly in message boxes (tested with umlauts äöü, as well as
>> Greek πδ
> 
> Hmm, I am not sure that UTF-8 really is well supported.  Some chars get
> thru, but many don't.  Here's an example.  I typed these chars in a
> UTF-8 text editor as an png and a pd patch.  Not quite the same.

... I'm not really sure what (if anything) we can conclude from this.
Maybe the text editor is making UTF-8 out of the keyboard input?  The Pd
patch itself is most certainly not UTF-8 encoded, which makes me suspect
that either (a) Pd is dropping non-printing shift bytes (IOhannes has
pointed out similar goofiness in t_binbuf, but I thought it was only
restricted to NUL bytes) or (b) Tk isn't receiving UTF-8 character codes
at all (whether this is Tk's fault or a system configuration issue is
another question).  At least the latter should be testable with a few
quick wish hacks...

>> Setting LC_CTYPE=en_US.UTF-8 and re-loading "unibarf.pd" got me an odd
>> error message from Pd though:
>>
>> Pd: buffer space wasn't sufficient for long GUI string
>> (repeated 3 times)
> 
> I am guessing that the above error comes from the fact that Pd is
> written for latin1 where every char is always 1 byte, so sending UTF-8
> could confuse things, since UTF-8 can have multi-byte chars.

Kinda; but why is it only the presence of *latin-1* message boxes that
causes complaints about "long GUI strings"?  (Try deleting the utf-8
message box & reloading: the error disappears.)  I think an error is
certainly justified in this case (we're feeding a latin-1 encoded
message box to a Pd using a UTF-8 locale); I was just surprised by the
form the error took ;-)
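
To make the byte-level mismatch concrete, here is a small Python sketch.
It only illustrates how the same three umlauts look in each encoding and
what a UTF-8 reader would expect from the first latin-1 byte.  It does
not explain the "buffer space" message itself, and the helper name
utf8_sequence_length is made up for this example (it checks bit patterns
only and ignores overlong or out-of-range sequences).

```python
def utf8_sequence_length(lead_byte):
    """Return how many bytes a UTF-8 sequence starting with lead_byte claims.

    Simplified: looks only at the leading bit pattern.  Returns 0 for a
    byte that cannot start a sequence (e.g. a continuation byte 10xxxxxx).
    """
    if lead_byte < 0x80:            # 0xxxxxxx: plain ASCII
        return 1
    if lead_byte >> 5 == 0b110:     # 110xxxxx: start of a 2-byte sequence
        return 2
    if lead_byte >> 4 == 0b1110:    # 1110xxxx: start of a 3-byte sequence
        return 3
    if lead_byte >> 3 == 0b11110:   # 11110xxx: start of a 4-byte sequence
        return 4
    return 0


umlauts = "\u00e4\u00f6\u00fc"              # the string "äöü"
latin1_bytes = umlauts.encode("latin-1")    # one byte per character
utf8_bytes = umlauts.encode("utf-8")        # two bytes per character here

print(len(latin1_bytes), len(utf8_bytes))
# The latin-1 byte for "ä" (0xE4) looks like the start of a 3-byte
# UTF-8 sequence, but the bytes after it are not continuation bytes.
print(utf8_sequence_length(latin1_bytes[0]))
print(utf8_sequence_length(latin1_bytes[1]))
```

So a consumer that assumes UTF-8 will see latin-1 umlauts as truncated
multi-byte sequences rather than as single characters.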

>> I don't know for sure, but I suspect one problem might be in the
>> interpretation of user input
> 
> I don't know about the pd side, but Tcl/Tk is all UTF-8 natively, so
> that is no problem.

Hmm... not sure what you mean by "natively" here... I mean, Perl uses
UTF-8 as its "native" string encoding, but you can still manipulate byte
strings, read & write files etc in other encodings too.  If we're
talking about user input and the Pd GUI, I think the main issue is how
keyboard input is captured by Tk and passed on to Pd.  If the keyboard
input is being grabbed by Tk bind()ing KeyPress events, then maybe we
just need to edit that bind() call... looks like the KeyPress-relevant
"%"-substitutions are (from the Tk bind() manpage):

 %k - The keycode field from the event. Valid only for KeyPress and
KeyRelease events.

 %A - Substitutes the UNICODE character corresponding to the event, or
the empty string if the event does not correspond to a UNICODE character
(e.g. the shift key was pressed). XmbLookupString (or XLookupString when
input method support is turned off) does all the work of translating
from the event to a UNICODE character. Valid only for KeyPress and
KeyRelease events.

 %K - The keysym corresponding to the event, substituted as a textual
string. Valid only for KeyPress and KeyRelease events.

 %N - The keysym corresponding to the event, substituted as a decimal
number. Valid only for KeyPress and KeyRelease events.
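
For anyone who wants to poke at these four substitutions outside of Pd,
Python's tkinter exposes them as event attributes (%k as keycode, %A as
char, %K as keysym, %N as keysym_num).  The following is only a sketch
in Python, not Pd's actual Tcl code; describe_key and run_demo are names
invented here, and the keycode in the fake event is an arbitrary
illustrative value since keycodes are platform-dependent.

```python
from types import SimpleNamespace


def describe_key(event):
    """Map the Tk bind() %-substitutions to the matching event fields."""
    return {
        "%k": event.keycode,     # raw keycode (platform-dependent)
        "%A": event.char,        # Unicode character, or "" for e.g. Shift
        "%K": event.keysym,      # keysym as text
        "%N": event.keysym_num,  # keysym as a decimal number
    }


def run_demo():
    """Open a window that shows the four fields for each key pressed."""
    import tkinter as tk  # imported here so describe_key works without a display

    root = tk.Tk()
    label = tk.Label(root, text="press a key", width=60)
    label.pack()
    root.bind("<KeyPress>", lambda e: label.config(text=str(describe_key(e))))
    root.mainloop()


# A stand-in event for "ä"; the keycode value is hypothetical.
fake_event = SimpleNamespace(keycode=48, char="\u00e4",
                             keysym="adiaeresis", keysym_num=228)
print(describe_key(fake_event))
```

Calling run_demo() on a machine with a display and typing an umlaut
shows whether Tk hands over a real Unicode character or only a keycode.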

... so if we're lucky, we can just replace "%k" with "%A" and all will
be good... except for file I/O, which will likely still be done at a raw
byte level.  At this point, all "pure" latin-1 patches will proceed to
break (maybe just display problems, maybe more serious).  If we say
we're going whole-hog utf-8, we can say that it's the user's problem to
recode any such files (e.g. with iconv or recode; I'm happy to help out
with a few scripts); otherwise we might want to do something paranoid
and try to guess a patch's encoding when it's loaded.  Or we use
locale-dependent functions, but that makes sharing patches harder
between people using different locales.  Or we use the XML-style
solution and just save the encoding to use in the patch header ;-)
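
The "paranoid" guess-at-load-time option could look roughly like the
Python sketch below.  It assumes there are only two candidates, UTF-8
and latin-1, and treats anything that decodes cleanly as UTF-8 as UTF-8.
The function names are made up for illustration, and nothing here is
what Pd currently does.

```python
def guess_encoding(data):
    """Guess "utf-8" if the bytes decode as UTF-8, else assume "latin-1".

    Pure ASCII counts as UTF-8.  Latin-1 text that happens to form valid
    UTF-8 sequences would be misclassified; that is rare in practice but
    possible.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "latin-1"
    return "utf-8"


def recode_to_utf8(data):
    """Return the same text as UTF-8 bytes, whatever guess_encoding says."""
    return data.decode(guess_encoding(data)).encode("utf-8")


latin1_patch = "f\u00fcr".encode("latin-1")   # "für" as latin-1 bytes
utf8_patch = recode_to_utf8(latin1_patch)
print(guess_encoding(latin1_patch), guess_encoding(utf8_patch))
```

Because already-converted data is detected as UTF-8 and left alone,
running the conversion twice is harmless, which matters if people recode
a directory of patches more than once.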

>> bash$ export LC_CTYPE=en_DK.UTF-8
>> bash$ pd uselocale.pd barf-both.pd   ##-- latin-1 displays incorrectly
>>
>> bash$ export LC_CTYPE=en_DK.ISO-8859-1
>> bash$ pd uselocale.pd barf-both.pd   ##-- all displays ok
>>
>> If it turns out to work well, we can of course make a trivial "dummy"
>> external out of it for use with "-lib" ...
> 
> Hmm, I tried this on Mac OS X and it didn't seem to make a difference. 
> Perhaps it's a platform issue, though on this level, Mac OS X is very
> much BSD, so I think it should work.

The locale strategy also depends on what locales your system has
installed.  Here (linux/debian), I can see which locales are installed with:

   bash$ locale -a

... I would expect goofiness trying to use "en_DK.UTF-8" if it's not
been installed ...
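
The same check can be scripted.  Here is a hedged Python sketch that
asks the C library whether a given LC_CTYPE locale can be selected; the
function name locale_available is invented for this example, and some C
libraries accept names they do not really support, so a True result is
weaker evidence than a False one.

```python
import locale


def locale_available(name):
    """Return True if setlocale(LC_CTYPE, name) succeeds on this system."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, name)
    except locale.Error:
        return False
    else:
        return True
    finally:
        # Always restore whatever was active before the probe.
        locale.setlocale(locale.LC_CTYPE, saved)


for candidate in ("C", "en_DK.UTF-8", "en_DK.ISO-8859-1"):
    print(candidate, locale_available(candidate))
```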

marmosets,
	Bryan

-- 
Bryan Jurish                           "There is *always* one more bug."
jurish at ling.uni-potsdam.de      -Lubarsky's Law of Cybernetic Entomology