[PD] locales for Pd WAS: japanese encoded chars in PD

Thu Feb 12 20:22:22 CET 2009

On Feb 12, 2009, at 4:40 AM, Bryan Jurish wrote:

> moin Hans, moin all,
>
> On 2009-02-12 06:24:44, Hans-Christoph Steiner <hans at eds.org>
> appears to have written:
>> On Feb 11, 2009, at 6:34 AM, Bryan Jurish wrote:
>>> for me, pd *does* display utf-8
>>> strings correctly in message boxes (tested with umlauts äöü, as
>>> well as Greek πδ
>>
>> Hmm, I am not sure that UTF-8 really is well supported.  Some chars
>> get thru, but many don't.  Here's an example.  I typed these chars in
>> a UTF-8 text editor as a png and a pd patch.  Not quite the same.
>
> ... I'm not really sure what (if anything) we can conclude from this.
> Maybe the text editor is making UTF-8 out of the keyboard input?  The
> Pd patch itself is most certainly not UTF-8 encoded, which makes me
> suspect that either (a) Pd is dropping non-printing shift bytes
> (IOhannes has pointed out similar goofiness in t_binbuf, but I thought
> it was only restricted to NUL bytes) or (b) Tk isn't receiving UTF-8
> character codes at all (whether this is Tk's fault or a system
> configuration issue is another question).  At least the latter should
> be testable with a few quick wish hacks...

Pd does seem to measure the bytes of the string, counting the UTF-8  
shift bytes as chars.  For example, in barf-both.pd, the message box  
of the utf-8 example is much longer than the text inside, while with  
the latin1 example it is the correct size.
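
To make the byte-versus-char difference concrete, here is a minimal
sketch in Python (not Pd's code; the variable names are made up for
illustration).  It encodes the same three umlauts both ways and
compares the lengths, which is roughly the mismatch that would make a
box sized by byte count come out too wide for UTF-8 text:

```python
# Compare the character count with the encoded byte length of one string.
# The escapes below spell the three umlauts a-umlaut, o-umlaut, u-umlaut.
text = "\u00e4\u00f6\u00fc"

utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

char_count = len(text)          # number of characters
utf8_len = len(utf8_bytes)      # each of these umlauts takes two bytes in UTF-8
latin1_len = len(latin1_bytes)  # latin-1 uses one byte per character

print(char_count, utf8_len, latin1_len)
```

A box sized from utf8_len would be twice as wide as needed for this
text, while one sized from latin1_len would fit.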

I don't know if you have followed Pd-devel 0.41.4 at all, but I have  
gotten to the point where the GUI is 100% Tcl/Tk so playing with this  
stuff should be a lot easier.  Check out the branch, if you would like  
to try things.

>>> Setting LC_CTYPE=en_US.UTF-8 and re-loading "unibarf.pd" got me an
>>> odd error message from Pd though:
>>>
>>> Pd: buffer space wasn't sufficient for long GUI string
>>> (repeated 3 times)
>>
>> I am guessing that the above error comes from the fact that Pd is
>> written for latin1 where every char is always 1 byte, so sending
>> UTF-8 could confuse things, since UTF-8 can have multi-byte chars.
>
> Kinda; but why is it only the presence of *latin-1* message boxes that
> cause complaints about "long GUI strings" (try deleting the utf-8
> message box & reloading: the error disappears).  I think an error is
> certainly justified in this case (we're feeding a latin-1 encoded
> message box to a Pd using a UTF-8 locale); I was just surprised by the
> form the error took ;-)

I think that Tcl/Tk tries to guess the locale of the data coming in  
from the network socket, then translate it to UTF-8 and back.  Some of  
the weirdness we are seeing could be related to that.  In Pd-devel,  
it's much clearer, so it would be straightforward to play with this  
encoding translation stuff, and perhaps turn it off.  Ideally we could  
have UTF-8 coming from Pd so that Tk doesn't need to do any  
translation.  That could speed up things like array/graph redrawing.
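
As a rough illustration of what a wrong encoding guess does to the
data (a Python sketch only; I am not claiming this is what Tcl/Tk
actually does on the socket), the same bytes come out very differently
depending on which encoding the reader assumes:

```python
# Show the effect of decoding bytes with the wrong encoding assumption.
text = "\u00e4\u00f6\u00fc"  # the three umlauts, written as escapes

utf8_bytes = text.encode("utf-8")

# Wrong guess: read the UTF-8 bytes as latin-1, one character per byte.
misread = utf8_bytes.decode("latin-1")

# Right guess: the round trip gives back the original text.
roundtrip = utf8_bytes.decode("utf-8")

# The other direction: latin-1 bytes for these chars are not valid UTF-8.
latin1_bytes = text.encode("latin-1")
try:
    latin1_bytes.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False

print(len(text), len(misread), roundtrip == text, latin1_is_valid_utf8)
```

The misread string has twice as many characters as the original, and
the latin-1 bytes are rejected outright by a strict UTF-8 decoder.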

>>> I don't know for sure, but I suspect one problem might be in the
>>> interpretation of user input
>>
>> I don't know about the pd side, but Tcl/Tk is all UTF-8 natively, so
>> that is no problem.
>
> Hmm... not sure what you mean by "natively" here... I mean, Perl uses
> UTF-8 as its "native" string encoding, but you can still manipulate
> byte strings, read & write files etc in other encodings too.

Yes, same idea.  Internally, Tcl/Tk is using UTF-8, but it can freely  
translate between other encodings.

> If we're
> talking about user input and the Pd GUI, I think the main issue is how
> keyboard input is captured by Tk and passed on to Pd.  If the keyboard
> input is being grabbed by Tk bind()ing KeyPress events, then maybe we
> just need to edit that bind() call... looks like the KeyPress relevant
> "%"-substitutions are (from the Tk bind() manpage):
>
> %k - The keycode field from the event. Valid only for KeyPress and
> KeyRelease events.
>
> %A - Substitutes the UNICODE character corresponding to the event, or
> the empty string if the event does not correspond to a UNICODE
> character (e.g. the shift key was pressed). XmbLookupString (or
> XLookupString when input method support is turned off) does all the
> work of translating from the event to a UNICODE character. Valid only
> for KeyPress and KeyRelease events.
>
> %K - The keysym corresponding to the event, substituted as a textual
> string. Valid only for KeyPress and KeyRelease events.
>
> %N - The keysym corresponding to the event, substituted as a decimal
> number. Valid only for KeyPress and KeyRelease events.
>
> ... so if we're lucky, we can just replace "%k" with "%A" and all will
> be good... except for file I/O, which will likely still be done at a
> raw byte level.  At this point, all "pure" latin-1 patches will
> proceed to break (maybe just display problems, maybe more serious).
> If we say we're going whole-hog utf-8, we can say that it's the user's
> problem to recode any such files (e.g. with iconv or recode; I'm happy
> to help out with a few scripts); otherwise we might want to do
> something paranoid and try to guess a patch's encoding when it's
> loaded.  Or we use locale-dependent functions, but that makes sharing
> patches harder between people using different locales.  Or we use the
> XML-style solution and just save the encoding to use in the patch
> header ;-)

Yeah, this would be a good thing to rewrite.  The canvas_key code is  
definitely in need of refactoring anyway.  Pd has never really  
supported latin1 or any encoding besides ASCII, so I think we should  
just aim to make everything UTF-8, then make conversion utilities like  
you mentioned.
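
A conversion utility along those lines could be quite small.  Here is
a hedged Python sketch of the "guess, then recode" idea; the function
name recode_patch_bytes and the sample patch line are made up for
illustration, and the only two candidates it considers are UTF-8 and
latin-1:

```python
def recode_patch_bytes(raw: bytes) -> bytes:
    """Return raw re-encoded as UTF-8, guessing between UTF-8 and latin-1.

    Hypothetical helper: bytes that already decode as strict UTF-8 are
    returned unchanged; anything else is assumed to be latin-1.
    """
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # Every byte sequence is decodable as latin-1, so this cannot fail.
        return raw.decode("latin-1").encode("utf-8")


# Illustrative sample: a message box line holding one latin-1 a-umlaut (0xE4).
latin1_line = b"#X msg 10 10 \xe4;"
converted = recode_patch_bytes(latin1_line)
print(converted)
```

Pure ASCII patches pass through untouched, and running the function
twice gives the same result as running it once.  The guess is not
foolproof: latin-1 text whose bytes happen to form valid UTF-8 would
be left alone.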

>>> bash$ export LC_CTYPE=en_DK.UTF-8
>>> bash$ pd uselocale.pd barf-both.pd   ##-- latin-1 displays incorrectly
>>>
>>> bash$ export LC_CTYPE=en_DK.ISO-8859-1
>>> bash$ pd uselocale.pd barf-both.pd   ##-- all displays ok
>>>
>>> If it turns out to work well, we can of course make a trivial
>>> "dummy" external out of it for use with "-lib" ...
>>
>> Hmm, I tried this on Mac OS X and it didn't seem to make a
>> difference.  Perhaps it's a platform issue, though on this level, Mac
>> OS X is very much BSD, so I think it should work.
>
> The locale strategy also depends on what locales your system has
> installed.  Here (linux/debian), I can see which locales are installed
> with:
>
>   bash$ locale -a
>
> ... I would expect goofiness trying to use "en_DK.UTF-8" if it's not
> been installed ...

I was using en_US.UTF-8.  It seems to me that there is an extra dash  
in your locale.  On Mac OS X, 'locale -a' tells me: en_US.ISO8859-1.  
On debian/stable, it tells me en_US.iso88591.  Does every system have  
different names for latin1?  Argh...  I tried a bunch of  
variations of the locale, LANG, and LC_CTYPE on Mac OS X, but I  
couldn't get barf-both.pd to look any different.
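
For comparing those spellings by eye, a small Python sketch like the
following can help.  The helper codeset_key is hypothetical, and it
only normalizes names for comparison; it says nothing about which
spelling a given system's setlocale() will actually accept:

```python
def codeset_key(locale_name: str) -> str:
    """Return a comparison key for the codeset part of a locale name.

    Hypothetical helper: takes the text after the first ".", drops any
    "@modifier" suffix, lowercases it, and removes punctuation.
    """
    _, _, codeset = locale_name.partition(".")
    codeset = codeset.partition("@")[0]
    return "".join(ch for ch in codeset.lower() if ch.isalnum())


# The three latin1 spellings that came up in this thread.
names = ["en_US.ISO8859-1", "en_US.iso88591", "en_DK.ISO-8859-1"]
keys = {codeset_key(name) for name in names}
print(keys)
```

All three spellings collapse to the same key, so they at least name
the same codeset even though the strings differ between systems.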

.hc

>
>
> marmosets,
> 	Bryan
>
> -- 
> Bryan Jurish                           "There is *always* one more bug."
> jurish at ling.uni-potsdam.de      -Lubarsky's Law of Cybernetic Entomology

----------------------------------------------------------------------------

As we enjoy great advantages from inventions of others, we should be  
glad of an opportunity to serve others by any invention of ours; and  
this we should do freely and generously.         - Benjamin Franklin