[PD] locales for Pd WAS: japanese encoded chars in PD

Hans-Christoph Steiner hans at eds.org
Fri Feb 13 03:14:20 CET 2009


On Thu, 12 Feb 2009, Bryan Jurish wrote:

> morning all,
>
> On 2009-02-12 20:22:22, Hans-Christoph Steiner <hans at eds.org> appears to
> have written:
>>> On 2009-02-12 06:24:44, Hans-Christoph Steiner <hans at eds.org> appears to
>>> have written:
>>>> On Feb 11, 2009, at 6:34 AM, Bryan Jurish wrote:
>>>>> for me, pd *does* display utf-8
>>>>> strings correctly in message boxes (tested with umlauts äöü, as well as
>>>>> Greek πδ)
>>>>
>>>> Hmm, I am not sure that UTF-8 really is well supported.  Some chars get
>>>> thru, but many don't.  Here's an example.  I typed these chars in a
>>>> UTF-8 text editor, attached as a png and a pd patch.  Not quite the same.
>>>
>>> ... I'm not really sure what (if anything) we can conclude from this.
>>> Maybe the text editor is making UTF-8 out of the keyboard input?  The Pd
>>> patch itself is most certainly not UTF-8 encoded, which makes me suspect
>>> that either (a) Pd is dropping non-printing shift bytes (IOhannes has
>>> pointed out similar goofiness in t_binbuf, but I thought it was only
>>> restricted to NUL bytes) or (b) Tk isn't receiving UTF-8 character codes
>>> at all (whether this is Tk's fault or a system configuration issue is
>>> another question).  At least the latter should be testable with a few
>>> quick wish hacks...
>>
>> Pd does seem to measure the bytes of the string, measuring the UTF-8
>> shift bytes as chars.  For example, in barf-both.pd, the message box of
>> the utf-8 example is much longer than the text inside, while with the
>> latin1, it is the correct size.
>
> yup.
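
(To make the byte-vs-character thing concrete: a minimal Tcl sketch of
what I think is going on -- untested, and none of it is lifted from the
actual Pd sources:

    set s "\u00e4\u00f6\u00fc"       ;# "äöü": 3 characters
    puts [string length $s]          ;# -> 3  (characters)
    puts [string bytelength $s]      ;# -> 6  (bytes in the utf-8 rep)

If the box width is computed from the byte count rather than the
character count, a utf-8 message box comes out roughly twice as wide as
its text, which matches what barf-both.pd shows.)
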
>
>> I don't know if you have followed Pd-devel 0.41.4 at all, but I have
>> gotten to the point where the GUI is 100% Tcl/Tk so playing with this
>> stuff should be a lot easier.  Check out the branch, if you would like
>> to try things.
>
> soon.
>
>>>>> Setting LC_CTYPE=en_US.UTF-8 and re-loading "unibarf.pd" got me an odd
>>>>> error message from Pd though:
>>>>>
>>>>> Pd: buffer space wasn't sufficient for long GUI string
>>>>> (repeated 3 times)
>>>>
>>>> I am guessing that the above error comes from the fact that Pd is
>>>> written for latin1 where every char is always 1 byte, so sending UTF-8
>>>> could confuse things, since UTF-8 can have multi-byte chars.
>>>
>>> Kinda; but why is it only the presence of *latin-1* message boxes that
>>> causes complaints about "long GUI strings" (try deleting the utf-8
>>> message box & reloading: the error disappears).  I think an error is
>>> certainly justified in this case (we're feeding a latin-1 encoded
>>> message box to a Pd using a UTF-8 locale); I was just surprised by the
>>> form the error took ;-)
>>
>> I think that Tcl/Tk tries to guess the locale of the data coming in from
>> the network socket, then translate it to UTF-8 and back.  Some of the
>> weirdness we are seeing could be related to that.  In Pd-devel, it's much
>> clearer, so it would be straightforward to play with this encoding
>> translation stuff, and perhaps turn it off.  Ideally we could have UTF-8
>> coming from Pd so that Tk doesn't need to do any translation.  That
>> could speed up things like array/graph redrawing.
>
> Are we certain that Tk is actually translating at all, and not just
> using some 8-bit default like latin-1 when it finds non-UTF-8 input?  I
> ask because that's what Perl does by default, a behavior which continues
> to give me headaches.  In Perl, each string has its own internal "utf8"
> flag which tells you whether Perl is currently thinking of that string
> as a raw byte-string in some unknown encoding or as a "native" (utf8)
> character string... I assume Tcl/Tk does something similar, but don't
> know how to test for this property there.

Here's the doc that I read on this topic, but it probably doesn't have the 
level of detail that you require:

http://tcl.tk/man/tcl8.5/TclCmd/fconfigure.htm#M8
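
The short version, as far as I can tell: Tcl does the translation at the
channel boundary, using each channel's -encoding option, which defaults
to whatever `encoding system` reports for your locale.  So once we know
which channel Pd talks to the GUI over, we could pin it down instead of
letting Tcl guess.  Untested sketch, with $pd_socket standing in for
whatever the real variable is called:

    # force the Pd<->GUI socket to a fixed encoding
    fconfigure $pd_socket -encoding utf-8
    # or, while Pd itself still thinks in single bytes:
    # fconfigure $pd_socket -encoding iso8859-1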

As for Tk hacking for Pd, a big part of the pd-devel effort is to make the 
Tk GUI code readable, and even extendable!  Feel free to hit me with 
questions, either here, or I am in #dataflow quite a bit these days.

.hc


>
>>>>> I don't know for sure, but I suspect one problem might be in the
>>>>> interpretation of user input
>>>>
>>>> I don't know about the pd side, but Tcl/Tk is all UTF-8 natively, so
>>>> that is no problem.
>>>
>>> Hmm... not sure what you mean by "natively" here... I mean, Perl uses
>>> UTF-8 as its "native" string encoding, but you can still manipulate byte
>>> strings, read & write files etc in other encodings too.
>>
>> Yes, same idea.  Internally, Tcl/Tk is using UTF-8, but it can freely
>> translate between other encodings.
>
> see above.
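
(For the record, the explicit conversions are there too:

    # bytes in a known encoding <-> Tcl's internal string form
    set text  [encoding convertfrom iso8859-1 $bytes]
    set bytes [encoding convertto utf-8 $text]

which is roughly what a latin1-to-utf-8 patch conversion utility could be
built on, if we go that route.  Untested, and $bytes/$text are just
placeholders.)
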
>
>>> If we're
>>> talking about user input and the Pd GUI, I think the main issue is how
>>> keyboard input is captured by Tk and passed on to Pd.  If the keyboard
>>> input is being grabbed by Tk bind()ing KeyPress events, then maybe we
>>> just need to edit that bind() call... looks like the KeyPress-relevant
>>> "%"-substitutions are (from the Tk bind() manpage):
> [snip]
>
> ... I'm curious enough to try these out now... just have to dust off my
> long unused Tcl/Tk skills a bit ;-)
>
>>> ... so if we're lucky, we can just replace "%k" with "%A" and all will
>>> be good... except for file I/O, which will likely still be done at a raw
>>> byte level.  At this point, all "pure" latin-1 patches will proceed to
>>> break (maybe just display problems, maybe more serious).  If we say
>>> we're going whole-hog utf-8, we can say that it's the user's problem to
>>> recode any such files (e.g. with iconv or recode; I'm happy to help out
>>> with a few scripts); otherwise we might want to do something paranoid
>>> and try to guess a patch's encoding when it's loaded.  Or we use
>>> locale-dependent functions, but that makes sharing patches harder
>>> between people using different locales.  Or we use the XML-style
>>> solution and just save the encoding to use in the patch header ;-)
>>
>> Yeah, this would be a good thing to rewrite.  The canvas_key code is
>> definitely in need of refactoring anyway.  Pd has never really supported
>> latin1 or any encoding besides ASCII, so I think we should just aim to
>> make everything UTF-8, then make conversion utilities like you mentioned.
>
> I'll have a look, but always in the past I've been scared off whenever
> I've tried to look deeper into Pd's Tk side.
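
For the keyboard part, the change we're talking about is roughly this; a
hand-waving sketch, not the actual pd-devel code, and the proc name is
made up:

    bind all <KeyPress> {
        # %k is the raw keycode; %A is the character the keysym maps to,
        # which is what we'd want to hand to Pd for text entry
        send_key_to_pd %A 1
    }

... plus whatever canvas_key needs on the C side to cope with more than
one byte per character.
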
>
>>>>> bash$ export LC_CTYPE=en_DK.UTF-8
>>>>> bash$ pd uselocale.pd barf-both.pd   ##-- latin-1 displays incorrectly
>>>>>
>>>>> bash$ export LC_CTYPE=en_DK.ISO-8859-1
>>>>> bash$ pd uselocale.pd barf-both.pd   ##-- all displays ok
>>>>>
>>>>> If it turns out to work well, we can of course make a trivial "dummy"
>>>>> external out of it for use with "-lib" ...
>>>>
>>>> Hmm, I tried this on Mac OS X and it didn't seem to make a difference.
>>>> Perhaps it's a platform issue, though on this level, Mac OS X is very
>>>> much BSD, so I think it should work.
>>>
>>> The locale strategy also depends on what locales your system has
>>> installed.  Here (linux/debian), I can see which locales are installed
>>> with:
>>>
>>>   bash$ locale -a
>>>
>>> ... I would expect goofiness trying to use "en_DK.UTF-8" if it's not
>>> been installed ...
>>
>> I was using en_US.UTF-8.  It seems to me that there is an extra dash in
>> your locale.  On Mac OS X, 'locale -a' tells me: en_US.ISO8859-1  On
>> debian/stable, it tells me en_US.iso88591.  Does every system have
>> different names for the latin1?  Arg....  I tried a bunch of variations
>> of the locale and LANG and LC_CTYPE on Mac OS X, but I couldn't get the
>> barf-both.pd to look different.
>
> curioser and curioser.  I think on debian both "iso88591" and
> "ISO-8859-1" should work as charmaps.  Similary, both "utf8" and "UTF-8"
> ought to work.  The locale(1) manpage says:
>
>  FILES
>    /usr/share/i18n/SUPPORTED
>        List of supported values (and their associated encoding) for
>        the locale name.  This representation is recommended over
>        --all-locales one, due being the system wide supported values.
>
> ... /usr/share/i18n/SUPPORTED (and /etc/locale.gen) includes for example
> "ISO-8859-1", but not "iso88591".  `locale -a` on the other hand outputs
> "iso88591" but not "ISO-8859-1".  I'm not sure whether the relevant
> standard (ISO/IEC 9945 aka POSIX?) says anything about the form that
> charmap names have to take.  Looking at
> http://faqs.cs.uu.nl/na-dir/internationalization/iso-8859-1-charset.html,
> I find:
>
>  "Currently, each system vendor has his own set of locale names, which
> makes portability a bit problematic."
>
> Bummer.
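
Yeah, the locale names look like a portability mess.  One small
consolation on the Tcl side (stock Tcl commands, though I haven't checked
what they report on Mac OS X):

    encoding system    ;# the encoding Tcl picked up from the environment
    encoding names     ;# the encoding names Tcl itself understands

... so at least the GUI half shouldn't have to care what each platform
calls its latin1 locale.
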
>
> marmosets,
> 	Bryan
> -- 
> Bryan Jurish                           "There is *always* one more bug."
> jurish at ling.uni-potsdam.de      -Lubarsky's Law of Cybernetic Entomology
>

 	zen
 	   \
 	    \
 	     \

