[PD] locales for Pd WAS: japanese encoded chars in PD

Thu Feb 12 22:44:46 CET 2009

morning all,

On 2009-02-12 20:22:22, Hans-Christoph Steiner <hans at eds.org> appears to
have written:
>> On 2009-02-12 06:24:44, Hans-Christoph Steiner <hans at eds.org> appears to
>> have written:
>>> On Feb 11, 2009, at 6:34 AM, Bryan Jurish wrote:
>>>> for me, pd *does* display utf-8
>>>> strings correctly in message boxes (tested with umlauts äöü, as well as
>>>> Greek πδ
>>>
>>> Hmm, I am not sure that UTF-8 really is well supported.  Some chars get
>>> thru, but many don't.  Here's an example.  I typed these chars in a
>>> UTF-8 text editor as a png and a pd patch.  Not quite the same.
>>
>> ... I'm not really sure what (if anything) we can conclude from this.
>> Maybe the text editor is making UTF-8 out of the keyboard input?  The Pd
>> patch itself is most certainly not UTF-8 encoded, which makes me suspect
>> that either (a) Pd is dropping non-printing shift bytes (IOhannes has
>> pointed out similar goofiness in t_binbuf, but I thought it was only
>> restricted to NUL bytes) or (b) Tk isn't receiving UTF-8 character codes
>> at all (whether this is Tk's fault or a system configuration issue is
>> another question).  At least the latter should be testable with a few
>> quick wish hacks...
> 
> Pd does seem to measure the bytes of the string, measuring the UTF-8
> shift bytes as chars.  For example, in barf-both.pd, the message box of
> the utf-8 example is much longer than the text inside, while with the
> latin1, it is the correct size.

yup.
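
To make the size mismatch concrete, here is a minimal sketch in Python (used only for illustration; nothing in Pd or Tk is written this way).  It compares the character count with the byte count for the same three umlauts in both encodings.  If a box width is derived from the byte count, the UTF-8 box comes out wider than its text, while the latin-1 box fits:

```python
# The three umlauts from the test patch, written as escapes so the
# example does not depend on how this file itself is encoded.
text = "\u00e4\u00f6\u00fc"  # a-umlaut, o-umlaut, u-umlaut

utf8_bytes = text.encode("utf-8")      # two bytes per umlaut
latin1_bytes = text.encode("latin-1")  # one byte per umlaut

char_count = len(text)            # number of characters
utf8_len = len(utf8_bytes)        # number of bytes in UTF-8
latin1_len = len(latin1_bytes)    # number of bytes in latin-1

print(char_count, utf8_len, latin1_len)
```

Only the latin-1 byte count matches the character count, which fits what barf-both.pd shows.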

> I don't know if you have followed Pd-devel 0.41.4 at all, but I have
> gotten to the point where the GUI is 100% Tcl/Tk so playing with this
> stuff should be a lot easier.  Check out the branch, if you would like
> to try things.

soon.

>>>> Setting LC_CTYPE=en_US.UTF-8 and re-loading "unibarf.pd" got me an odd
>>>> error message from Pd though:
>>>>
>>>> Pd: buffer space wasn't sufficient for long GUI string
>>>> (repeated 3 times)
>>>
>>> I am guessing that the above error comes from the fact that Pd is
>>> written for latin1 where every char is always 1 byte, so sending UTF-8
>>> could confuse things, since UTF-8 can have multi-byte chars.
>>
>> Kinda; but why is it only the presence of *latin-1* message boxes that
>> cause complaints about "long GUI strings" (try deleting the utf-8
>> message box & reloading: the error disappears).  I think an error is
>> certainly justified in this case (we're feeding a latin-1 encoded
>> message box to a Pd using a UTF-8 locale); I was just surprised by the
>> form the error took ;-)
> 
> I think that Tcl/Tk tries to guess the locale of the data coming in from
> the network socket, then translate it to UTF-8 and back.  Some of the
> weirdness we are seeing could be related to that.  In Pd-devel, it's much
> clearer, so it would be straightforward to play with this encoding
> translation stuff, and perhaps turn it off.  Ideally we could have UTF-8
> coming from Pd so that Tk doesn't need to do any translation.  That
> could speed up things like array/graph redrawing.

Are we certain that Tk is actually translating at all, and not just
using some 8-bit default like latin-1 when it finds non-UTF-8 input?  I
ask because that's what Perl does by default, a behavior which continues
to give me headaches.  In Perl, each string has its own internal "utf8"
flag which tells you whether Perl is currently thinking of that string
as a raw byte-string in some unknown encoding or as a "native" (utf8)
character string... I assume Tcl/Tk does something similar, but don't
know how to test for this property there.
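
The two possibilities look different from the outside.  As a rough illustration (in Python, not Tcl; how Tk actually behaves is exactly the open question), reading UTF-8 bytes with an 8-bit default like latin-1 never fails but produces two garbage characters per umlaut, whereas strictly decoding latin-1 bytes as UTF-8 raises an error instead of silently guessing:

```python
a_umlaut = "\u00e4"

# Case 1: UTF-8 bytes read with a latin-1 default.
# Every byte maps to some latin-1 character, so this cannot fail.
utf8_bytes = a_umlaut.encode("utf-8")          # b'\xc3\xa4'
misread = utf8_bytes.decode("latin-1")         # two characters, not one

# Case 2: latin-1 bytes decoded strictly as UTF-8.
# A lone 0xE4 byte is an incomplete multi-byte sequence in UTF-8.
latin1_bytes = a_umlaut.encode("latin-1")      # b'\xe4'
try:
    latin1_bytes.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

print(repr(misread), strict_decode_ok)
```

So a doubled-up, garbled display points to an 8-bit fallback, while an outright error points to strict UTF-8 handling somewhere in the chain.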

>>>> I don't know for sure, but I suspect one problem might be in the
>>>> interpretation of user input
>>>
>>> I don't know about the pd side, but Tcl/Tk is all UTF-8 natively, so
>>> that is no problem.
>>
>> Hmm... not sure what you mean by "natively" here... I mean, Perl uses
>> UTF-8 as its "native" string encoding, but you can still manipulate byte
>> strings, read & write files etc in other encodings too.
> 
> Yes, same idea.  Internally, Tcl/Tk is using UTF-8, but it can freely
> translate between other encodings.

see above.

>> If we're
>> talking about user input and the Pd GUI, I think the main issue is how
>> keyboard input is captured by Tk and passed on to Pd.  If the keyboard
>> input is being grabbed by Tk bind()ing KeyPress events, then maybe we
>> just need to edit that bind() call... looks like the KeyPress relevant
>> "%"-substitutions are (from the Tk bind() manpage):
[snip]

... I'm curious enough to try these out now... just have to dust off my
long unused Tcl/Tk skills a bit ;-)

>> ... so if we're lucky, we can just replace "%k" with "%A" and all will
>> be good... except for file I/O, which will likely still be done at a raw
>> byte level.  At this point, all "pure" latin-1 patches will proceed to
>> break (maybe just display problems, maybe more serious).  If we say
>> we're going whole-hog utf-8, we can say that it's the user's problem to
>> recode any such files (e.g. with iconv or recode; I'm happy to help out
>> with a few scripts); otherwise we might want to do something paranoid
>> and try to guess a patch's encoding when it's loaded.  Or we use
>> locale-dependent functions, but that makes sharing patches harder
>> between people using different locales.  Or we use the XML-style
>> solution and just save the encoding to use in the patch header ;-)
> 
> Yeah, this would be a good thing to rewrite.  The canvas_key code is
> definitely in need of refactoring anyway.  Pd has never really supported
> latin1 or any encoding besides ASCII, so I think we should just aim to
> make everything UTF-8, then make conversion utilities like you mentioned.

I'll have a look, but in the past I've always been scared off whenever
I've tried to look deeper into Pd's Tk side.
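
The "paranoid" option quoted above (guessing a patch's encoding when it's loaded) could be sketched roughly as follows.  This is Python for illustration only, and decode_patch_bytes is a hypothetical helper, not anything in Pd.  It assumes that only UTF-8 and latin-1 ever occur: try strict UTF-8 first, and fall back to latin-1 if that fails.

```python
from typing import Tuple


def decode_patch_bytes(raw: bytes) -> Tuple[str, str]:
    """Return (text, encoding_used) for the raw bytes of a patch file.

    Hypothetical helper: assumes the file is either UTF-8 or latin-1.
    Pure ASCII is valid UTF-8, so it is reported as "utf-8".
    """
    try:
        # Strict decoding rejects byte sequences that are not valid UTF-8.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so this cannot fail.
        return raw.decode("latin-1"), "latin-1"


print(decode_patch_bytes(b"#X msg 10 10 \xc3\xa4;"))
print(decode_patch_bytes(b"#X msg 10 10 \xe4;"))
```

The guess can be wrong in principle: some latin-1 byte sequences are also valid UTF-8.  That is one argument for the header-declaration approach instead.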

>>>> bash$ export LC_CTYPE=en_DK.UTF-8
>>>> bash$ pd uselocale.pd barf-both.pd   ##-- latin-1 displays incorrectly
>>>>
>>>> bash$ export LC_CTYPE=en_DK.ISO-8859-1
>>>> bash$ pd uselocale.pd barf-both.pd   ##-- all displays ok
>>>>
>>>> If it turns out to work well, we can of course make a trivial "dummy"
>>>> external out of it for use with "-lib" ...
>>>
>>> Hmm, I tried this on Mac OS X and it didn't seem to make a difference.
>>> Perhaps it's a platform issue, though on this level, Mac OS X is very
>>> much BSD, so I think it should work.
>>
>> The locale strategy also depends on what locales your system has
>> installed.  Here (linux/debian), I can see which locales are installed
>> with:
>>
>>   bash$ locale -a
>>
>> ... I would expect goofiness trying to use "en_DK.UTF-8" if it's not
>> been installed ...
> 
> I was using en_US.UTF-8.  It seems to me that there is an extra dash in
> your locale.  On Mac OS X, 'locale -a' tells me: en_US.ISO8859-1.  On
> debian/stable, it tells me en_US.iso88591.  Does every system have
> different names for the latin1?  Arg....  I tried a bunch of variations
> of the locale and LANG and LC_CTYPE on Mac OS X, but I couldn't get the
> barf-both.pd to look different.

curiouser and curiouser.  I think on debian both "iso88591" and
"ISO-8859-1" should work as charmaps.  Similarly, both "utf8" and "UTF-8"
ought to work.  The locale(1) manpage says:

  FILES
    /usr/share/i18n/SUPPORTED
        List of supported values (and their associated encoding) for
        the locale name.  This representation is recommended over
        --all-locales one, due being the system wide supported values.

... /usr/share/i18n/SUPPORTED (and /etc/locale.gen) includes for example
"ISO-8859-1", but not "iso88591".  `locale -a` on the other hand outputs
"iso88591" but not "ISO-8859-1".  I'm not sure whether the relevant
standard (ISO/IEC 9945 aka POSIX?) says anything about the form that
charmap names have to take.  Looking at
http://faqs.cs.uu.nl/na-dir/internationalization/iso-8859-1-charset.html,
I find:

  "Currently, each system vendor has his own set of locale names, which
makes portability a bit problematic."

Bummer.
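
If Pd ever needs to compare locale names across systems, one workable convention is to compare charmap names only after lower-casing them and dropping punctuation.  The sketch below is Python for illustration.  normalize_codeset and split_locale are hypothetical helpers, and the rule is an assumption about what would be practical, not something taken from POSIX:

```python
import re
from typing import Tuple


def normalize_codeset(name: str) -> str:
    """Lower-case a charmap name and keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def split_locale(locale_name: str) -> Tuple[str, str]:
    """Split 'lang_TERRITORY.codeset' into (lang_TERRITORY, normalized codeset).

    A name without a '.codeset' part gets an empty codeset string.
    """
    base, _, codeset = locale_name.partition(".")
    return base, normalize_codeset(codeset)


# The spellings mentioned in this thread all collapse to the same key.
for spelling in ("en_US.ISO8859-1", "en_US.iso88591", "en_US.ISO-8859-1"):
    print(split_locale(spelling))
```

With this rule "UTF-8" and "utf8" also compare equal, which matches the behaviour I'd expect on debian.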

marmosets,
	Bryan
-- 
Bryan Jurish                           "There is *always* one more bug."
jurish at ling.uni-potsdam.de      -Lubarsky's Law of Cybernetic Entomology