[PD] locales for Pd WAS: japanese encoded chars in PD
Hans-Christoph Steiner
hans at eds.org
Thu Feb 12 20:22:22 CET 2009
On Feb 12, 2009, at 4:40 AM, Bryan Jurish wrote:
> moin Hans, moin all,
>
> On 2009-02-12 06:24:44, Hans-Christoph Steiner <hans at eds.org> appears to
> have written:
>> On Feb 11, 2009, at 6:34 AM, Bryan Jurish wrote:
>>> for me, pd *does* display utf-8
>>> strings correctly in message boxes (tested with umlauts äöü, as well as
>>> Greek πδ
>>
>> Hmm, I am not sure that UTF-8 really is well supported. Some chars get
>> thru, but many don't. Here's an example. I typed these chars in a
>> UTF-8 text editor (attached as a PNG) and in a Pd patch. Not quite the same.
>
> ... I'm not really sure what (if anything) we can conclude from this.
> Maybe the text editor is making UTF-8 out of the keyboard input? The Pd
> patch itself is most certainly not UTF-8 encoded, which makes me suspect
> that either (a) Pd is dropping non-printing shift bytes (IOhannes has
> pointed out similar goofiness in t_binbuf, but I thought it was
> restricted to NUL bytes) or (b) Tk isn't receiving UTF-8 character codes
> at all (whether this is Tk's fault or a system configuration issue is
> another question). At least the latter should be testable with a few
> quick wish hacks...
Pd does seem to measure the string in bytes, counting every byte of a
multi-byte UTF-8 character as its own char. For example, in
barf-both.pd, the message box of the utf-8 example is much wider than
the text inside, while the latin1 one is the correct size.
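
To make the byte-versus-character mismatch concrete, here is a minimal Python sketch. It is not Pd's code (Pd is written in C) and the variable names are made up for illustration; it only shows why a width computed from the byte count comes out too large for UTF-8 text and correct for latin1.

```python
# Illustrative only: this just shows the arithmetic, not how Pd sizes a box.
text = "\u00e4\u00f6\u00fc"  # the three umlauts a-umlaut, o-umlaut, u-umlaut

utf8_bytes = text.encode("utf-8")      # each umlaut takes 2 bytes in UTF-8
latin1_bytes = text.encode("latin-1")  # each umlaut takes 1 byte in latin1

char_count = len(text)
utf8_len = len(utf8_bytes)
latin1_len = len(latin1_bytes)

print(char_count, utf8_len, latin1_len)  # prints: 3 6 3
```

A box sized from the UTF-8 byte count would be twice as wide as these three characters need, while the latin1 byte count matches the character count.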
I don't know if you have followed Pd-devel 0.41.4 at all, but I have
gotten to the point where the GUI is 100% Tcl/Tk so playing with this
stuff should be a lot easier. Check out the branch, if you would like
to try things.
>>> Setting LC_CTYPE=en_US.UTF-8 and re-loading "unibarf.pd" got me an odd
>>> error message from Pd though:
>>>
>>> Pd: buffer space wasn't sufficient for long GUI string
>>> (repeated 3 times)
>>
>> I am guessing that the above error comes from the fact that Pd is
>> written for latin1 where every char is always 1 byte, so sending UTF-8
>> could confuse things, since UTF-8 can have multi-byte chars.
>
> Kinda; but why is it only the presence of *latin-1* message boxes that
> causes complaints about "long GUI strings"? (Try deleting the utf-8
> message box & reloading: the error disappears.) I think an error is
> certainly justified in this case (we're feeding a latin-1 encoded
> message box to a Pd using a UTF-8 locale); I was just surprised by the
> form the error took ;-)
I think that Tcl/Tk tries to guess the locale of the data coming in
from the network socket, then translates it to UTF-8 and back. Some of
the weirdness we are seeing could be related to that. In Pd-devel,
it's much clearer, so it would be straightforward to play with this
encoding translation stuff, and perhaps turn it off. Ideally we could
have UTF-8 coming from Pd so that Tk doesn't need to do any
translation. That could speed up things like array/graph redrawing.
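
As a rough illustration of what a wrong encoding guess does, here is a small Python sketch. It says nothing about what Tcl/Tk actually does on the socket; `misread` is a hypothetical helper that decodes the same bytes under an assumed encoding.

```python
def misread(raw: bytes, assumed: str) -> str:
    """Decode raw bytes under an assumed encoding; report failure as text."""
    try:
        return raw.decode(assumed)
    except UnicodeDecodeError:
        return "<undecodable as %s>" % assumed


a_umlaut = "\u00e4"                      # a single character
utf8_raw = a_umlaut.encode("utf-8")      # two bytes: 0xC3 0xA4
latin1_raw = a_umlaut.encode("latin-1")  # one byte: 0xE4

# UTF-8 bytes read as latin1 silently become two unrelated characters.
print(ascii(misread(utf8_raw, "latin-1")))
# latin1 bytes read as UTF-8 are not a valid sequence at all.
print(ascii(misread(latin1_raw, "utf-8")))
```

The first case is the quiet kind of breakage (garbled display), the second is the loud kind (a decode error), which may be why the two message boxes behave so differently.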
>>> I don't know for sure, but I suspect one problem might be in the
>>> interpretation of user input
>>
>> I don't know about the pd side, but Tcl/Tk is all UTF-8 natively, so
>> that is no problem.
>
> Hmm... not sure what you mean by "natively" here... I mean, Perl uses
> UTF-8 as its "native" string encoding, but you can still manipulate byte
> strings, read & write files etc in other encodings too.
Yes, same idea. Internally, Tcl/Tk is using UTF-8, but it can freely
translate between other encodings.
> If we're
> talking about user input and the Pd GUI, I think the main issue is how
> keyboard input is captured by Tk and passed on to Pd. If the keyboard
> input is being grabbed by Tk bind()ing KeyPress events, then maybe we
> just need to edit that bind() call... looks like the KeyPress relevant
> "%"-substitutions are (from the Tk bind() manpage):
>
> %k - The keycode field from the event. Valid only for KeyPress and
> KeyRelease events.
>
> %A - Substitutes the UNICODE character corresponding to the event, or
> the empty string if the event does not correspond to a UNICODE character
> (e.g. the shift key was pressed). XmbLookupString (or XLookupString when
> input method support is turned off) does all the work of translating
> from the event to a UNICODE character. Valid only for KeyPress and
> KeyRelease events.
>
> %K - The keysym corresponding to the event, substituted as a textual
> string. Valid only for KeyPress and KeyRelease events.
>
> %N - The keysym corresponding to the event, substituted as a decimal
> number. Valid only for KeyPress and KeyRelease events.
>
> ... so if we're lucky, we can just replace "%k" with "%A" and all will
> be good... except for file I/O, which will likely still be done at a raw
> byte level. At this point, all "pure" latin-1 patches will proceed to
> break (maybe just display problems, maybe more serious). If we say
> we're going whole-hog utf-8, we can say that it's the user's problem to
> recode any such files (e.g. with iconv or recode; I'm happy to help out
> with a few scripts); otherwise we might want to do something paranoid
> and try to guess a patch's encoding when it's loaded. Or we use
> locale-dependent functions, but that makes sharing patches harder
> between people using different locales. Or we use the XML-style
> solution and just save the encoding to use in the patch header ;-)
Yeah, this would be a good thing to rewrite. The canvas_key code is
definitely in need of refactoring anyway. Pd has never really
supported latin1 or any encoding besides ASCII, so I think we should
just aim to make everything UTF-8, then make conversion utilities like
you mentioned.
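
A conversion utility along those lines could be as small as the following Python sketch. The function name `to_utf8` is hypothetical, and the guess it makes (valid UTF-8 is left alone, anything else is treated as latin1) is an assumption, not something Pd does today.

```python
def to_utf8(raw: bytes) -> bytes:
    """Return raw re-encoded as UTF-8, guessing between UTF-8 and latin1."""
    try:
        raw.decode("utf-8")  # already valid UTF-8 (this includes plain ASCII)
        return raw
    except UnicodeDecodeError:
        # latin1 maps every byte to a character, so this decode cannot fail.
        return raw.decode("latin-1").encode("utf-8")


latin1_patch_text = b"gr\xfc\xdf"          # hypothetical latin1 bytes
converted = to_utf8(latin1_patch_text)
print(converted == to_utf8(converted))     # converting twice changes nothing
```

The guess is not airtight: a latin1 file whose high bytes happen to form valid UTF-8 sequences would be left untouched, so an explicit encoding setting would still be the safer route for odd cases.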
>>> bash$ export LC_CTYPE=en_DK.UTF-8
>>> bash$ pd uselocale.pd barf-both.pd ##-- latin-1 displays incorrectly
>>>
>>> bash$ export LC_CTYPE=en_DK.ISO-8859-1
>>> bash$ pd uselocale.pd barf-both.pd ##-- all displays ok
>>>
>>> If it turns out to work well, we can of course make a trivial "dummy"
>>> external out of it for use with "-lib" ...
>>
>> Hmm, I tried this on Mac OS X and it didn't seem to make a difference.
>> Perhaps it's a platform issue, though on this level, Mac OS X is very
>> much BSD, so I think it should work.
>
> The locale strategy also depends on what locales your system has
> installed. Here (linux/debian), I can see which locales are
> installed with:
>
> bash$ locale -a
>
> ... I would expect goofiness trying to use "en_DK.UTF-8" if it's not
> been installed ...
I was using en_US.UTF-8. It seems to me that there is an extra dash
in your latin1 locale name. On Mac OS X, 'locale -a' tells me:
en_US.ISO8859-1. On debian/stable, it tells me en_US.iso88591. Does
every system have a different name for latin1? Arg.... I tried a bunch
of variations of the locale, LANG, and LC_CTYPE on Mac OS X, but I
couldn't get barf-both.pd to look different.
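
One way to compare these spellings is to reduce the codeset part to lowercase letters and digits before matching. This is only a sketch: `codeset_key` is a made-up helper, and it does not check whether the system actually has the locale installed.

```python
def codeset_key(locale_name: str) -> str:
    """Reduce the codeset part of a locale name to lowercase alphanumerics."""
    _, _, codeset = locale_name.partition(".")   # text after the first dot
    codeset = codeset.partition("@")[0]          # drop any @modifier suffix
    return "".join(ch for ch in codeset.lower() if ch.isalnum())


names = ["en_DK.ISO-8859-1", "en_US.ISO8859-1", "en_US.iso88591"]
print([codeset_key(n) for n in names])  # all three reduce to 'iso88591'
```

Matching on this key against the output of 'locale -a' would at least show which spelling a given system expects.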
.hc
>
>
> marmosets,
> Bryan
>
> --
> Bryan Jurish                      "There is *always* one more bug."
> jurish at ling.uni-potsdam.de      -Lubarsky's Law of Cybernetic Entomology
----------------------------------------------------------------------------
As we enjoy great advantages from inventions of others, we should be
glad of an opportunity to serve others by any invention of ours; and
this we should do freely and generously. - Benjamin Franklin