[PD] UTF-8 for pd-devel WAS: locales for Pd WAS: japanese encoded chars in PD

Bryan Jurish moocow at ling.uni-potsdam.de
Tue Feb 17 23:53:49 CET 2009


morning Hans, morning list,

So I've tried to get the pd-devel 0.41.4 branch to use UTF-8 across the
board.  The Tk side was easy (as Hans predicted); really just a call to
{fconfigure} in ::pd_connect::configure_socket.  I also set the output
encoding to UTF-8 on Tk's stdout and stderr, for debugging purposes;
it's probably wisest to leave those encodings at the default (user's
current locale LC_CTYPE) for a release-like version.

The C side is much hairier.  I think I've got things basically working
(at least for message boxes and comments), but it has so far required
changes in:

FILE: g_editor.c
+ changed handling of <Key> events as passed to the C side to generate
UTF-8 symbol-strings rather than single-byte stringlets.

+ currently use sprintf("%C") to get the UTF-8 string for the codepoint
passed from Tk; a safer (and not too hard) way would be to pass the
actual UTF-8 string from Tk and just copy that: this would avoid the
m_pd.c hacks forcing LC_CTYPE=en_US.UTF-8 (see below).  Another option
would be actually just writing (or borrowing) the code to generate UTF-8
strings from Unicode codepoints.  It's pretty simple stuff; I've still
got the guts of it somewhere (only written for latin-1 so far, but the
principle is the same for all codepoints).
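
For reference, the "write the code ourselves" option could look roughly like the sketch below.  It is not taken from the attached diff: the name utf8_encode() and its buffer convention are made up for illustration.  It converts one Unicode codepoint into a NUL-terminated UTF-8 string without touching the locale at all.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of Pd): encode one Unicode codepoint as
 * UTF-8 into buf, which must hold at least 5 bytes.  Returns the number
 * of bytes written (1 to 4), or 0 if cp is not encodable (UTF-16
 * surrogates, values above U+10FFFF).  buf is always NUL-terminated. */
size_t utf8_encode(uint32_t cp, char *buf)
{
    size_t n;

    if (cp < 0x80) {                        /* 0xxxxxxx */
        buf[0] = (char)cp;
        n = 1;
    } else if (cp < 0x800) {                /* 110xxxxx 10xxxxxx */
        buf[0] = (char)(0xC0 | (cp >> 6));
        buf[1] = (char)(0x80 | (cp & 0x3F));
        n = 2;
    } else if (cp < 0x10000) {              /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF) { /* surrogates are not characters */
            buf[0] = '\0';
            return 0;
        }
        buf[0] = (char)(0xE0 | (cp >> 12));
        buf[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (char)(0x80 | (cp & 0x3F));
        n = 3;
    } else if (cp <= 0x10FFFF) {            /* 11110xxx + 3 continuation bytes */
        buf[0] = (char)(0xF0 | (cp >> 18));
        buf[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (char)(0x80 | (cp & 0x3F));
        n = 4;
    } else {                                /* beyond the Unicode range */
        buf[0] = '\0';
        return 0;
    }
    buf[n] = '\0';
    return n;
}
```

Something like this could stand in for the sprintf("%C") call, assuming the keynum arriving from Tk really is a plain Unicode codepoint; the caller would then build the symbol from buf as before.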

FILE: m_pd.c
+ added calls to setlocale() to set LC_CTYPE to en_US.UTF-8; this is an
ugly stinky nasty hack to get sprintf("%C") to output a UTF-8 encoded
string from a Unicode codepoint int, as called by canvas_key() in
g_editor.c

FILE: g_rtext.c
+ added an 'else if' clause in rtext_key() to handle Unicode codepoints
as values of the 'keynum' parameter.  Should also be safe for any 8-bit
fixed-width encoding.

FILE: pd.tk
+ set system encoding, also output encoding for stdout, stderr to UTF-8

Attached is a screenshot and a test patch.  UTF-8 input from the
keyboard works with the test patch, and gets carried through properly to
the .pd file (and back on load).

I'd like to get symbol atoms working too (haven't tried yet), but there
are still some nasty buglets with comments and message boxes, mostly
that editing any multibyte characters is very tricky: looks like the Tk
point (cursor) and selection are expressed in characters, and Pd's C
side is still thinking in bytes, though I'm totally ignorant of where or
how that can be changed.  A non-critical buglet with the same cause
(probably) is that the C side is computing the required width for
message boxes based on byte lengths, not character lengths, so message
boxes containing multibyte characters look too wide.  I could live with
that, but the editing thing is a real pain...
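
To make the bytes-versus-characters mismatch concrete, here is a minimal sketch of the two conversions the C side would presumably need.  Both function names are hypothetical and nothing here comes from the Pd sources.  The sketch assumes the buffer holds valid UTF-8, and it counts codepoints, not display columns or combined characters.

```c
#include <stddef.h>

/* Hypothetical helper: number of characters (codepoints) in the first
 * nbytes bytes of s.  Every byte that is not a continuation byte
 * (bit pattern 10xxxxxx) starts a new character. */
size_t utf8_charcount(const char *s, size_t nbytes)
{
    size_t i, n = 0;

    for (i = 0; i < nbytes; i++)
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            n++;
    return n;
}

/* Hypothetical helper: byte offset at which character number char_index
 * (0-based) starts.  Returns nbytes if char_index is at or past the end,
 * which is the natural position for a cursor after the last character. */
size_t utf8_byte_offset(const char *s, size_t nbytes, size_t char_index)
{
    size_t i;

    for (i = 0; i < nbytes; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80) {
            if (char_index == 0)
                return i;
            char_index--;
        }
    }
    return nbytes;
}
```

With helpers along these lines, a character-based cursor or selection index coming from Tk could be mapped to a byte offset before the C side edits its buffer, and box widths could be computed from utf8_charcount() rather than from the byte length.  Where exactly such calls would belong in g_rtext.c is still an open question.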

I've attached a diff of my changes against branches/pd-devel/0.41.4/src
(please excuse commented-out debugging code), in case anyone wants to
try this stuff out.  Since it's not working, I'm reluctant to check
anything into the pd-devel/0.41.4 branch yet -- should I branch again
for a work in progress, or do we just pass diffs around for now?

marmosets,
	Bryan

On 2009-02-12 06:24:44, Hans-Christoph Steiner <hans at eds.org> appears to
have written:
> On Feb 11, 2009, at 6:34 AM, Bryan Jurish wrote:
>> On 2009-02-11 03:04:34, Hans-Christoph Steiner <hans at eds.org> appears to
>> have written:
>>> This is something that I would really like to have working properly in
>>> Pd-devel.  Tcl/Tk is natively UTF-8, so it seems that we should support
>>> UTF-8 in Pd.  Anyone feel like trying to fix it?

-- 
Bryan Jurish                           "There is *always* one more bug."
jurish at ling.uni-potsdam.de      -Lubarsky's Law of Cybernetic Entomology
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test-utf8.pd
Type: application/puredata
Size: 567 bytes
Desc: not available
URL: <http://lists.puredata.info/pipermail/pd-list/attachments/20090217/f7c45d02/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test-utf8.png
Type: image/png
Size: 11789 bytes
Desc: not available
URL: <http://lists.puredata.info/pipermail/pd-list/attachments/20090217/f7c45d02/attachment.png>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: pd-devel-0.41.4-src.utf8-moo-2009-02-17.diff
URL: <http://lists.puredata.info/pipermail/pd-list/attachments/20090217/f7c45d02/attachment.txt>

