[PD] UTF-8 for pd-devel WAS: locales for Pd WAS: japanese encoded chars in PD
Hans-Christoph Steiner
hans at eds.org
Thu Feb 19 18:43:49 CET 2009
This is good news! While the C changes aren't dead simple, they are
not bad. I think they could be slightly simplified. One thing that
would make it much easier to read the diff is if you create it without
whitespace changes. So like this:
svn diff -x -w
As for the Tcl changes, I think we can include those in Pd-devel now,
as long as they work OK with unchanged C code. Then once the new Tcl
GUI is included, we can refactor the C side with changes like
this. One other thing: it seems that ASCII chars are handled
differently from UTF-8 chars in g_rtext.c. I think you could instead
use wcswidth(), mbstowcs() or other UTF-8 functions as described
in the UTF-8 FAQ:
http://www.cl.cam.ac.uk/~mgk25/unicode.html#mod
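
To make the suggestion above concrete, here is a rough sketch of counting
characters rather than bytes with the standard multibyte functions. It uses
mbsrtowcs(), the restartable relative of mbstowcs(). The helper names and the
locale names tried are assumptions for illustration, not anything in Pd, and
the result depends on LC_CTYPE being a UTF-8 locale.

```c
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Try to switch LC_CTYPE to some UTF-8 locale. The locale names are
   assumptions: which ones exist depends on the system. Returns 1 on
   success, 0 if none of them could be set. */
int try_utf8_ctype(void)
{
    static const char *names[] = { "C.UTF-8", "en_US.UTF-8" };
    size_t i;
    for (i = 0; i < sizeof(names) / sizeof(names[0]); i++)
        if (setlocale(LC_CTYPE, names[i]) != NULL)
            return 1;
    return 0;
}

/* Number of characters (not bytes) in the NUL-terminated multibyte
   string s, decoded according to the current LC_CTYPE. Returns -1 if
   s contains an invalid multibyte sequence. */
long mb_char_count(const char *s)
{
    mbstate_t st;
    const char *p = s;
    size_t n;
    memset(&st, 0, sizeof st);          /* start in the initial shift state */
    n = mbsrtowcs(NULL, &p, 0, &st);    /* NULL dst: count only, store nothing */
    return n == (size_t)-1 ? -1 : (long)n;
}
```

With a UTF-8 LC_CTYPE in effect, a string such as "h\xc3\xa9llo" is 6 bytes
to strlen() but 5 characters to mb_char_count(). Note that this approach
still needs a UTF-8 locale to be set somewhere, so it does not remove the
setlocale() dependency discussed below.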
.hc
On Feb 17, 2009, at 5:53 PM, Bryan Jurish wrote:
> morning Hans, morning list,
>
> So I've tried to get the pd-devel 0.41.4 branch to use UTF-8 across the
> board. The Tk side was easy (as Hans predicted); really just a call to
> {fconfigure} in ::pd_connect::configure_socket. I also set the output
> encoding to UTF-8 on Tk's stdout and stderr, for debugging purposes;
> it's probably wisest to leave those encodings at the default (user's
> current locale LC_CTYPE) for a release-like version.
>
> The C side is much hairier. I think I've got things basically working
> (at least for message boxes and comments), but it has so far required
> changes in:
>
> FILE: g_editor.c
> + changed handling of <Key> events as passed to the C side to generate
> UTF-8 symbol-strings rather than single-byte stringlets.
>
> + currently use sprintf("%C") to get the UTF-8 string for the codepoint
> passed from Tk; a safer (and not too hard) way would be to pass the
> actual UTF-8 string from Tk and just copy that: this would avoid the
> m_pd.c hacks forcing LC_CTYPE=en_US.UTF-8 (see below). Another option
> would be actually just writing (or borrowing) the code to generate UTF-8
> strings from Unicode codepoints. It's pretty simple stuff; I've still
> got the guts of it somewhere (only written for latin-1 so far, but the
> principle is the same for all codepoints).
>
> FILE: m_pd.c
> + added calls to setlocale() to set LC_CTYPE to en_US.UTF-8; this is an
> ugly stinky nasty hack to get sprintf("%C") to output a UTF-8 encoded
> string from a Unicode codepoint int, as called by canvas_key() in
> g_editor.c
>
> FILE: g_rtext.c
> + added an 'else if' clause in rtext_key() to handle Unicode codepoints
> as values of the 'keynum' parameter. Should also be safe for any 8-bit
> fixed-width encoding.
>
> FILE: pd.tk
> + set system encoding, and output encoding for stdout and stderr, to UTF-8
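
On the g_editor.c option of generating the UTF-8 string directly from the
codepoint (instead of sprintf("%C") plus the forced locale), a minimal
sketch could look like this. The function name utf8_encode() is made up for
illustration and is not part of Pd.

```c
#include <stddef.h>

/* Encode Unicode code point cp as UTF-8 into out, which must have room
   for 4 bytes plus a terminating NUL. Returns the number of bytes
   written (not counting the NUL), or 0 if cp is not a valid code point,
   in which case out is left untouched. No locale is involved. */
size_t utf8_encode(unsigned long cp, char out[5])
{
    size_t n;
    if (cp < 0x80) {                    /* 0xxxxxxx */
        out[0] = (char)cp;
        n = 1;
    } else if (cp < 0x800) {            /* 110xxxxx 10xxxxxx */
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        n = 2;
    } else if (cp < 0x10000) {          /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                   /* UTF-16 surrogates are not characters */
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        n = 3;
    } else if (cp <= 0x10FFFF) {        /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (char)(0xF0 | (cp >> 18));
        out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (char)(0x80 | (cp & 0x3F));
        n = 4;
    } else {
        return 0;                       /* beyond the Unicode range */
    }
    out[n] = '\0';
    return n;
}
```

In canvas_key() this would presumably replace the sprintf("%C") call: fill a
5-byte buffer from the float keycode and pass it to gensym(), falling back
to "?" when the function returns 0. That wiring is untested here.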
>
> Attached are a screenshot and a test patch. UTF-8 input from the
> keyboard works with the test patch, and gets carried through properly to
> the .pd file (and back on load).
>
> I'd like to get symbol atoms working too (haven't tried yet), but there
> are still some nasty buglets with comments and message boxes, mostly
> that editing any multibyte characters is very tricky: looks like the Tk
> point (cursor) and selection are expressed in characters, and Pd's C
> side is still thinking in bytes, though I'm totally ignorant of where or
> how that can be changed. A non-critical buglet with the same cause
> (probably) is that the C side is computing the required width for
> message boxes based on byte lengths, not character lengths, so message
> boxes containing multibyte characters look too wide. I could live with
> that, but the editing thing is a real pain...
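
The character-versus-byte mismatch described above can be bridged without
any locale, since in valid UTF-8 every byte that is not a continuation byte
(10xxxxxx) starts a new character. The following is only a sketch: the
function names are hypothetical, and the buffer is passed with an explicit
length on the assumption that it is not NUL-terminated, as the
x_buf/x_bufsize pair in the diff suggests.

```c
#include <stddef.h>

/* A UTF-8 continuation byte has the bit pattern 10xxxxxx. */
static int utf8_is_continuation(unsigned char c)
{
    return (c & 0xC0) == 0x80;
}

/* Count the characters in the first nbytes bytes of buf, assuming buf
   holds valid UTF-8: each non-continuation byte starts one character. */
size_t utf8_count_chars(const char *buf, size_t nbytes)
{
    size_t i, n = 0;
    for (i = 0; i < nbytes; i++)
        if (!utf8_is_continuation((unsigned char)buf[i]))
            n++;
    return n;
}

/* Byte offset at which character number charidx (0-based) starts.
   Returns nbytes if charidx is at or past the end of the text. */
size_t utf8_char_to_byte(const char *buf, size_t nbytes, size_t charidx)
{
    size_t i, n = 0;
    for (i = 0; i < nbytes; i++) {
        if (!utf8_is_continuation((unsigned char)buf[i])) {
            if (n == charidx)
                return i;
            n++;
        }
    }
    return nbytes;
}
```

A cursor or selection position arriving from Tk in characters could be
mapped to a byte offset with utf8_char_to_byte() before touching the buffer,
and utf8_count_chars() gives a character count for sizing boxes. A character
count is still only an approximation of display width (wide and combining
characters differ), which is where wcswidth() would come in.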
>
> I've attached a diff of my changes against branches/pd-devel/0.41.4/src
> (please excuse commented-out debugging code), in case anyone wants to
> try this stuff out. Since it's not working, I'm reluctant to check
> anything into the pd-devel/0.41.4 branch yet -- should I branch again
> for a work in progress, or do we just pass diffs around for now?
>
> marmosets,
> Bryan
>
> On 2009-02-12 06:24:44, Hans-Christoph Steiner <hans at eds.org>
> appears to have written:
>> On Feb 11, 2009, at 6:34 AM, Bryan Jurish wrote:
>>> On 2009-02-11 03:04:34, Hans-Christoph Steiner <hans at eds.org>
>>> appears to
>>>> This is something that I would really like to have working properly in
>>>> Pd-devel. Tcl/Tk is natively UTF-8, so it seems that we should support
>>>> UTF-8 in Pd. Anyone feel like trying to fix it?
>
> --
> Bryan Jurish                     "There is *always* one more bug."
> jurish at ling.uni-potsdam.de    -Lubarsky's Law of Cybernetic Entomology
> <test-utf8.pd><test-utf8.png>
> Index: m_pd.c
> ===================================================================
> --- m_pd.c (revision 10779)
> +++ m_pd.c (working copy)
> @@ -295,6 +295,18 @@
> void glob_init(void);
> void garray_init(void);
>
> +/*--BEGIN moo--*/
> +#include <locale.h>
> +void locale_init(void) {
> + setlocale(LC_ALL,"");
> + setlocale(LC_NUMERIC,"C");
> + setlocale(LC_CTYPE,"en_US.UTF-8");
> + /*
> + printf("moo: locale=%s\n", setlocale(LC_ALL,NULL));
> + printf("moo: LC_CTYPE=%s\n", setlocale(LC_CTYPE,NULL));
> + */
> +}
> +
> void pd_init(void)
> {
> mess_init();
> @@ -302,5 +314,5 @@
> conf_init();
> glob_init();
> garray_init();
> + locale_init(); /*-- moo --*/
> }
> -
> Index: g_editor.c
> ===================================================================
> --- g_editor.c (revision 10779)
> +++ g_editor.c (working copy)
> @@ -1468,9 +1468,16 @@
> gotkeysym = av[1].a_w.w_symbol;
> else if (av[1].a_type == A_FLOAT)
> {
> + /*-- moo: old
> char buf[3];
> - sprintf(buf, "%c", (int)(av[1].a_w.w_float));
> + sprintf(buf, "%c", (int)(av[1].a_w.w_float));
> gotkeysym = gensym(buf);
> + --*/
> + char buf[8];
> + sprintf(buf, "%C", (int)(av[1].a_w.w_float));
> + /*printf("moo: charcode %%d=%d, %%c=%c, %%C=%C\n", (int)(av[1].a_w.w_float), (int)(av[1].a_w.w_float), (int)(av[1].a_w.w_float));*/
> + /*printf("moo: buf='%s'\n", buf);*/
> + gotkeysym = gensym(buf);
> }
> else gotkeysym = gensym("?");
> fflag = (av[0].a_type == A_FLOAT ? av[0].a_w.w_float : 0);
> Index: pd_connect.tcl
> ===================================================================
> --- pd_connect.tcl (revision 10779)
> +++ pd_connect.tcl (working copy)
> @@ -11,6 +11,10 @@
>
> proc ::pd_connect::configure_socket {sock} {
> fconfigure $sock -blocking 0 -buffering line
> +##--moo
> + fconfigure $sock -encoding utf-8
> +# puts "moo: fconfigure socket -encoding = [fconfigure $sock -encoding]"
> +##--/moo
> fileevent $sock readable {::pd_connect::pd_readsocket ""}
> }
>
> @@ -50,6 +54,11 @@
> proc ::pd_connect::pdsend {message} {
> variable pd_socket
> append message \;
> +##--moo
> +# if {[lindex $message 1] != {motion}} {
> +# puts "moo: pdsend enc={[fconfigure $pd_socket -encoding]} msg={$message}"
> +# }
> +##--/moo
> if {[catch {puts $pd_socket $message} errorname]} {
> puts stderr "pdsend errorname: >>$errorname<<"
> error "Not connected to 'pd' process"
> @@ -64,6 +73,9 @@
> exit
> }
> append cmd_from_pd [read $pd_socket]
> +##--moo
> +# puts "moo: pd_readsocket enc={[fconfigure $pd_socket -encoding]} cmd_from_pd={$cmd_from_pd}"
> +##--/moo
> if {[catch {uplevel #0 $cmd_from_pd} errorname]} {
> global errorInfo
> puts stderr "errorname: >>$errorname<<"
> Index: pd.tk
> ===================================================================
> --- pd.tk (revision 10779)
> +++ pd.tk (working copy)
> @@ -152,6 +152,15 @@
> # [string range \
> # [registry get {HKEY_CURRENT_USER\Control Panel\International} sLanguage] 0 1] ]
> #}
> +
> +##--moo
> + encoding system utf-8
> + fconfigure stderr -encoding utf-8
> + fconfigure stdout -encoding utf-8
> + puts "moo: encoding system = [encoding system]"
> + puts "moo: encoding stderr = [fconfigure stderr -encoding]"
> + puts "moo: encoding stdout = [fconfigure stdout -encoding]"
> +##--/moo
> }
>
> #
> ------------------------------------------------------------------------------
> Index: g_rtext.c
> ===================================================================
> --- g_rtext.c (revision 10779)
> +++ g_rtext.c (working copy)
> @@ -447,8 +447,13 @@
>
> /* at Guenter's suggestion, use 'n>31' to test wither a character might
> be printable in whatever 8-bit character set we find ourselves. */
> +/*-- moo: ... but test with "<" rather than "!=" in order to accommodate unicode codepoints for n
> + (which we get since Tk is sending the "%A" substitution for
> bind <Key>",
> + effectively reducing the coverage of this clause to 7 bits;
> case n>127
> + is covered by the next clause.
> + --*/
>
> - if (n == '\n' || (n > 31 && n != 127))
> + if (n == '\n' || (n > 31 /*&& n != 127*/ && n < 127)) /*-- moo --*/
> {
> newsize = x->x_bufsize+1;
> x->x_buf = resizebytes(x->x_buf, x->x_bufsize, newsize);
> @@ -457,7 +462,21 @@
> x->x_buf[x->x_selstart] = n;
> x->x_bufsize = newsize;
> x->x_selstart = x->x_selstart + 1;
> + }
> + /*--moo: check for 8-bit or unicode codepoints, assuming "keysym" is a correctly encoded (UTF-8) string--*/
> + else if (n > 127) {
> + int clen = strlen(keysym->s_name);
> + newsize = x->x_bufsize + clen;
> + x->x_buf = resizebytes(x->x_buf, x->x_bufsize, newsize);
> + for (i = x->x_bufsize; i > x->x_selstart; i--)
> + x->x_buf[i] = x->x_buf[i-1];
> + x->x_bufsize = newsize;
> + /*-- insert keysym->s_name, rather than decoding the unicode value here --*/
> + //strncpy(x->x_buf+x->x_selstart, keysym->s_name, clen);
> + strcpy(x->x_buf+x->x_selstart, keysym->s_name);
> + x->x_selstart = x->x_selstart + clen;
> }
> + /*--/moo--*/
> x->x_selend = x->x_selstart;
> x->x_glist->gl_editor->e_textdirty = 1;
> }
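
Reading the new else-if clause in the g_rtext.c hunk above, it looks as
though the loop still shifts the tail of the buffer by a single byte even
though clen bytes are inserted, and strcpy() additionally writes a
terminating NUL after the inserted bytes. If that reading is right, it could
account for some of the editing trouble with multibyte characters. A
standalone sketch of a multi-byte insert follows; buf_insert() is a
hypothetical helper, and realloc() stands in for Pd's resizebytes().

```c
#include <stdlib.h>
#include <string.h>

/* Insert inslen bytes from ins at byte offset pos into a heap buffer of
   *size bytes (not NUL-terminated). Returns 0 on success, -1 on a bad
   position or allocation failure. On success *buf and *size are
   updated; on failure they are left unchanged. */
int buf_insert(char **buf, size_t *size, size_t pos,
               const char *ins, size_t inslen)
{
    char *nb;
    if (pos > *size)
        return -1;
    if (inslen == 0)
        return 0;
    nb = realloc(*buf, *size + inslen);
    if (nb == NULL)
        return -1;
    /* open a gap of inslen bytes by moving the whole tail up at once */
    memmove(nb + pos + inslen, nb + pos, *size - pos);
    /* copy exactly inslen bytes; no terminating NUL is written */
    memcpy(nb + pos, ins, inslen);
    *buf = nb;
    *size += inslen;
    return 0;
}
```

In rtext_key() terms, pos would correspond to x_selstart, ins to
keysym->s_name and inslen to clen, with x_selstart advanced by clen
afterwards. That mapping is an assumption based on the diff, not tested
against Pd.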
----------------------------------------------------------------------------
'You people have such restrictive dress for women,' she said, hobbling
away in three inch heels and panty hose to finish out another
pink-collar temp pool day. - "Hijab Scene #2", by Mohja Kahf