[PD] UTF-8 for pd-devel WAS: locales for Pd WAS: japanese encoded chars in PD

Thu Feb 19 18:43:49 CET 2009

This is good news!  While the C changes aren't dead simple, they are
not bad, and I think they could be slightly simplified.  One thing that
would make the diff much easier to read is generating it without
whitespace changes, like this:

svn diff -x -w

As for the Tcl changes, I think we can include those in Pd-devel now,
as long as they work OK with the unchanged C code.  Then, once the new
Tcl GUI is included, we can refactor the C side with changes like
this.  One other thing: it seems that ASCII chars are handled
differently from UTF-8 chars in g_rtext.c.  I think you could instead
use wcswidth(), mbstowcs(), or the other UTF-8 functions described in
the UTF-8 FAQ:

http://www.cl.cam.ac.uk/~mgk25/unicode.html#mod
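
For what it's worth, here is a rough sketch of what the mbstowcs()
route could look like for counting characters rather than bytes.  The
helper names use_utf8_ctype() and mb_char_count() are made up for
illustration and are not part of Pd.  The sketch assumes that a UTF-8
locale named C.UTF-8 or en_US.UTF-8 is installed (not guaranteed on
every system) and that mbstowcs() accepts a NULL destination, which is
POSIX behaviour.

```c
#include <locale.h>
#include <stdlib.h>

/* Hypothetical helper: try to switch LC_CTYPE to some UTF-8 locale.
 * Returns 1 on success, 0 if none of the candidate names is available. */
int use_utf8_ctype(void)
{
    static const char *const names[] = { "C.UTF-8", "en_US.UTF-8" };
    size_t i;
    for (i = 0; i < sizeof names / sizeof names[0]; i++)
        if (setlocale(LC_CTYPE, names[i]) != NULL)
            return 1;
    return 0;
}

/* Hypothetical helper: count characters (not bytes) in a multibyte
 * string.  With a NULL destination, mbstowcs() only reports how many
 * wide characters the conversion would produce.
 * Returns -1 if the string is invalid in the current LC_CTYPE encoding. */
long mb_char_count(const char *s)
{
    size_t n = mbstowcs(NULL, s, 0);
    return (n == (size_t)-1) ? -1 : (long)n;
}
```

Since this depends on the process locale, it has the same portability
caveat as any other approach that relies on setlocale().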

.hc

On Feb 17, 2009, at 5:53 PM, Bryan Jurish wrote:

> morning Hans, morning list,
>
> So I've tried to get the pd-devel 0.41.4 branch to use UTF-8 across the
> board.  The Tk side was easy (as Hans predicted); really just a call to
> {fconfigure} in ::pd_connect::configure_socket.  I also set the output
> encoding to UTF-8 on Tk's stdout and stderr, for debugging purposes;
> it's probably wisest to leave those encodings at the default (user's
> current locale LC_CTYPE) for a release-like version.
>
> The C side is much hairier.  I think I've got things basically working
> (at least for message boxes and comments), but it has so far required
> changes in:
>
> FILE: g_editor.c
> + changed handling of <Key> events as passed to the C side to generate
> UTF-8 symbol-strings rather than single-byte stringlets.
>
> + currently use sprintf("%C") to get the UTF-8 string for the codepoint
> passed from Tk; a safer (and not too hard) way would be to pass the
> actual UTF-8 string from Tk and just copy that: this would avoid the
> m_pd.c hacks forcing LC_CTYPE=en_US.UTF-8 (see below).  Another option
> would be actually just writing (or borrowing) the code to generate UTF-8
> strings from Unicode codepoints.  It's pretty simple stuff; I've still
> got the guts of it somewhere (only written for latin-1 so far, but the
> principle is the same for all codepoints).
>
> FILE: m_pd.c
> + added calls to setlocale() to set LC_CTYPE to en_US.UTF-8; this is an
> ugly stinky nasty hack to get sprintf("%C") to output a UTF-8 encoded
> string from a Unicode codepoint int, as called by canvas_key() in
> g_editor.c
>
> FILE: g_rtext.c
> + added an 'else if' clause in rtext_key() to handle Unicode codepoints
> as values of the 'keynum' parameter.  Should also be safe for any 8-bit
> fixed-width encoding.
>
> FILE: pd.tk
> + set system encoding, also output encoding for stdout, stderr to UTF-8
>
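
The g_editor.c notes above mention writing the code to generate UTF-8
strings from Unicode codepoints instead of going through sprintf("%C").
A minimal sketch of such an encoder follows.  The name utf8_encode()
and its signature are assumptions made for illustration; this is not
the code from the patch, and nothing like it is claimed to exist in Pd.

```c
#include <stddef.h>

/* Hypothetical helper (not part of Pd): encode the Unicode codepoint `cp`
 * as UTF-8 into `out`, which must have room for at least 5 bytes.
 * Returns the number of bytes written (1..4), not counting the
 * terminating NUL, or 0 if `cp` is a surrogate or lies beyond U+10FFFF
 * (in which case `out` is left untouched). */
size_t utf8_encode(unsigned long cp, char *out)
{
    size_t n;
    if (cp < 0x80) {                    /* 0xxxxxxx */
        out[0] = (char)cp;
        n = 1;
    } else if (cp < 0x800) {            /* 110xxxxx 10xxxxxx */
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        n = 2;
    } else if (cp < 0x10000) {          /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                   /* UTF-16 surrogates are not characters */
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        n = 3;
    } else if (cp <= 0x10FFFF) {        /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (char)(0xF0 | (cp >> 18));
        out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (char)(0x80 | (cp & 0x3F));
        n = 4;
    } else {
        return 0;
    }
    out[n] = '\0';
    return n;
}
```

Because it does not consult the locale at all, an encoder along these
lines would not need LC_CTYPE to be forced to en_US.UTF-8.
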
> Attached is a screenshot and a test patch.  UTF-8 input from the
> keyboard works with the test patch, and gets carried through properly to
> the .pd file (and back on load).
>
> I'd like to get symbol atoms working too (haven't tried yet), but there
> are still some nasty buglets with comments and message boxes, mostly
> that editing any multibyte characters is very tricky: looks like the Tk
> point (cursor) and selection are expressed in characters, and Pd's C
> side is still thinking in bytes, though I'm totally ignorant of where or
> how that can be changed.  A non-critical buglet with the same cause
> (probably) is that the C side is computing the required width for
> message boxes based on byte lengths, not character lengths, so message
> boxes containing multibyte characters look too wide.  I could live with
> that, but the editing thing is a real pain...
>
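
To make the bytes-versus-characters mismatch described above concrete,
here is a small locale-independent sketch that counts characters in a
UTF-8 buffer and converts a character index (as Tk would report for
the cursor or selection) into a byte offset.  The function names are
hypothetical and not part of Pd, and both functions assume the buffer
holds valid UTF-8.

```c
#include <stddef.h>

/* Hypothetical helpers (not part of Pd).  Both assume `s` holds valid
 * UTF-8, where continuation bytes have the bit pattern 10xxxxxx. */

/* Number of characters (codepoints) in the first `nbytes` bytes of `s`. */
size_t utf8_char_count(const char *s, size_t nbytes)
{
    size_t i, count = 0;
    for (i = 0; i < nbytes; i++)
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            count++;            /* count only bytes that start a character */
    return count;
}

/* Byte offset at which character number `char_index` (0-based) starts.
 * If the buffer holds fewer characters than that, returns `nbytes`. */
size_t utf8_byte_offset(const char *s, size_t nbytes, size_t char_index)
{
    size_t i;
    for (i = 0; i < nbytes; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80) {
            if (char_index == 0)
                return i;       /* found the start of the requested character */
            char_index--;
        }
    }
    return nbytes;
}
```

A box width computed from utf8_char_count() rather than the byte
length would treat a two-byte character as one column, though that is
still only an approximation for wide (e.g. CJK) glyphs.
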
> I've attached a diff of my changes against branches/pd-devel/0.41.4/src
> (please excuse commented-out debugging code), in case anyone wants to
> try this stuff out.  Since it's not working, I'm reluctant to check
> anything into the pd-devel/0.41.4 branch yet -- should I branch again
> for a work in progress, or do we just pass diffs around for now?
>
> marmosets,
> 	Bryan
>
> On 2009-02-12 06:24:44, Hans-Christoph Steiner <hans at eds.org> appears to
> have written:
>> On Feb 11, 2009, at 6:34 AM, Bryan Jurish wrote:
>>> On 2009-02-11 03:04:34, Hans-Christoph Steiner <hans at eds.org> appears to
>>>> This is something that I would really like to have working properly in
>>>> Pd-devel.  Tcl/Tk is natively UTF-8, so it seems that we should support
>>>> UTF-8 in Pd.  Anyone feel like trying to fix it?
>
> -- 
> Bryan Jurish                           "There is *always* one more bug."
> jurish at ling.uni-potsdam.de      -Lubarsky's Law of Cybernetic Entomology
> <test-utf8.pd><test-utf8.png>
> Index: m_pd.c
> ===================================================================
> --- m_pd.c	(revision 10779)
> +++ m_pd.c	(working copy)
> @@ -295,6 +295,18 @@
> void glob_init(void);
> void garray_init(void);
>
> +/*--BEGIN moo--*/
> +#include <locale.h>
> +void locale_init(void) {
> +  setlocale(LC_ALL,"");
> +  setlocale(LC_NUMERIC,"C");
> +  setlocale(LC_CTYPE,"en_US.UTF-8");
> +  /*
> +  printf("moo: locale=%s\n", setlocale(LC_ALL,NULL));
> +  printf("moo: LC_CTYPE=%s\n", setlocale(LC_CTYPE,NULL));
> +  */
> +}
> +
> void pd_init(void)
> {
>     mess_init();
> @@ -302,5 +314,5 @@
>     conf_init();
>     glob_init();
>     garray_init();
> +    locale_init(); /*-- moo --*/
> }
> -
> Index: g_editor.c
> ===================================================================
> --- g_editor.c	(revision 10779)
> +++ g_editor.c	(working copy)
> @@ -1468,9 +1468,16 @@
>         gotkeysym = av[1].a_w.w_symbol;
>     else if (av[1].a_type == A_FLOAT)
>     {
> +	/*-- moo: old
>         char buf[3];
> -        sprintf(buf, "%c", (int)(av[1].a_w.w_float));
> +	sprintf(buf, "%c", (int)(av[1].a_w.w_float));
>         gotkeysym = gensym(buf);
> +	--*/
> +        char buf[8];
> +	sprintf(buf, "%C", (int)(av[1].a_w.w_float));
> +	/*printf("moo: charcode %%d=%d, %%c=%c, %%C=%C\n", (int)(av[1].a_w.w_float), (int)(av[1].a_w.w_float), (int)(av[1].a_w.w_float));*/
> +	/*printf("moo: buf='%s'\n", buf);*/
> +        gotkeysym = gensym(buf);
>     }
>     else gotkeysym = gensym("?");
>     fflag = (av[0].a_type == A_FLOAT ? av[0].a_w.w_float : 0);
> Index: pd_connect.tcl
> ===================================================================
> --- pd_connect.tcl	(revision 10779)
> +++ pd_connect.tcl	(working copy)
> @@ -11,6 +11,10 @@
>
> proc ::pd_connect::configure_socket {sock} {
> 	fconfigure $sock -blocking 0 -buffering line
> +##--moo
> +    fconfigure $sock -encoding utf-8
> +#    puts "moo: fconfigure socket -encoding = [fconfigure $sock -encoding]"
> +##--/moo
> 	fileevent $sock readable {::pd_connect::pd_readsocket ""}
> }
>
> @@ -50,6 +54,11 @@
> proc ::pd_connect::pdsend {message} {
> 	variable pd_socket
> 	append message \;
> +##--moo
> +#    if {[lindex $message 1] != {motion}} {
> +#      puts "moo: pdsend enc={[fconfigure $pd_socket -encoding]} msg={$message}"
> +#    }
> +##--/moo
> 	if {[catch {puts $pd_socket $message} errorname]} {
> 		puts stderr "pdsend errorname: >>$errorname<<"
> 		error "Not connected to 'pd' process"
> @@ -64,6 +73,9 @@
> 		exit
> 	}
> 	append cmd_from_pd [read $pd_socket]
> +##--moo
> +#    puts "moo: pd_readsocket enc={[fconfigure $pd_socket -encoding]} cmd_from_pd={$cmd_from_pd}"
> +##--/moo
> 	if {[catch {uplevel #0 $cmd_from_pd} errorname]} {
> 		global errorInfo
> 		puts stderr "errorname: >>$errorname<<"
> Index: pd.tk
> ===================================================================
> --- pd.tk	(revision 10779)
> +++ pd.tk	(working copy)
> @@ -152,6 +152,15 @@
> 	#		[string range \
> 	#		[registry get {HKEY_CURRENT_USER\Control Panel\International} sLanguage] 0 1] ]
> 	#}
> +
> +##--moo
> +    encoding system utf-8
> +    fconfigure stderr -encoding utf-8
> +    fconfigure stdout -encoding utf-8
> +    puts "moo: encoding system = [encoding system]"
> +    puts "moo: encoding stderr = [fconfigure stderr -encoding]"
> +    puts "moo: encoding stdout = [fconfigure stdout -encoding]"
> +##--/moo
> }
>
> # ------------------------------------------------------------------------------
> Index: g_rtext.c
> ===================================================================
> --- g_rtext.c	(revision 10779)
> +++ g_rtext.c	(working copy)
> @@ -447,8 +447,13 @@
>
> /* at Guenter's suggestion, use 'n>31' to test wither a character might
> be printable in whatever 8-bit character set we find ourselves. */
> +/*-- moo: ... but test with "<" rather than "!=" in order to accommodate unicode codepoints for n
> +     (which we get since Tk is sending the "%A" substitution for bind <Key>),
> +     effectively reducing the coverage of this clause to 7 bits; case n>127
> +     is covered by the next clause.
> +  --*/
>
> -        if (n == '\n' || (n > 31 && n != 127))
> +        if (n == '\n' || (n > 31 /*&& n != 127*/ && n < 127)) /*-- moo --*/
>         {
>             newsize = x->x_bufsize+1;
>             x->x_buf = resizebytes(x->x_buf, x->x_bufsize, newsize);
> @@ -457,7 +462,21 @@
>             x->x_buf[x->x_selstart] = n;
>             x->x_bufsize = newsize;
>             x->x_selstart = x->x_selstart + 1;
> +	}
> +	/*--moo: check for 8-bit or unicode codepoints, assuming "keysym" is a correctly encoded (UTF-8) string--*/
> +	else if (n > 127) {
> +	  int clen = strlen(keysym->s_name);
> +	  newsize = x->x_bufsize + clen;
> +	  x->x_buf = resizebytes(x->x_buf, x->x_bufsize, newsize);
> +	  /*-- shift the tail right by clen bytes (not just 1) to make room --*/
> +	  memmove(x->x_buf + x->x_selstart + clen, x->x_buf + x->x_selstart, x->x_bufsize - x->x_selstart);
> +	  x->x_bufsize = newsize;
> +	  /*-- insert keysym->s_name, rather than decoding the unicode value here --*/
> +	  /*-- memcpy, not strcpy: strcpy's trailing NUL would clobber the next byte or overrun the buffer --*/
> +	  memcpy(x->x_buf + x->x_selstart, keysym->s_name, clen);
> +	  x->x_selstart = x->x_selstart + clen;
>         }
> +	/*--/moo--*/
>         x->x_selend = x->x_selstart;
>         x->x_glist->gl_editor->e_textdirty = 1;
>     }

----------------------------------------------------------------------------

'You people have such restrictive dress for women,' she said, hobbling
away in three inch heels and panty hose to finish out another
pink-collar temp pool day.  - "Hijab Scene #2", by Mohja Kahf