[PD-dev] UTF-8 for pd-devel (again)

Wed Jan 20 04:56:28 CET 2010

Miller, how about the UTF-8 patch?

.hc

On Jan 19, 2010, at 10:15 PM, Miller Puckette wrote:

> 127 is 'delete' -- ascii all right, but not 'printable'.
>
> cheers
> Miller
>
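The point about 127 can be checked in a few lines of C. This is only a minimal sketch, not code from the patch: the helper names is_ascii_code and is_printable_ascii are made up for illustration. The second one uses the same range test as the patch line quoted below.

```c
/* Hypothetical helpers, not part of Pd. */

/* Any 7-bit ASCII code, control characters included (0..127). */
int is_ascii_code(int n)
{
    return n >= 0 && n < 128;
}

/* Printable 7-bit ASCII only (32..126): the same range test as the
   patch's "n > 31 && n < 127".  127 is DEL, a control character. */
int is_printable_ascii(int n)
{
    return n > 31 && n < 127;
}
```

With these definitions, 127 passes is_ascii_code() but fails is_printable_ascii(), which matches Miller's description.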
> On Tue, Jan 19, 2010 at 09:37:08PM -0500, Hans-Christoph Steiner wrote:
>>
>> Looks good to me. One comment, shouldn't this be n<128?  127 is an
>> ASCII char, AFAIK.
>>
>> +        if (n == '\n' || (n > 31 && n < 127))
>>
>> It looks worth checking to me, hopefully we can get Miller and others
>> to weigh in on it.
>>
>> .hc
>>
>> On Jan 19, 2010, at 4:16 PM, Bryan Jurish wrote:
>>
>>> morning all,
>>>
>>> attached is a UTF-8 support patch against
>>> branches/pd-gui-rewrite/0.43 revision 13051 (HEAD as of an hour or so
>>> ago).  most of the bulk is new files (s_utf8.c, s_utf8.h), most other
>>> changes are in g_rtext.c.  It's not too monstrous, and I've tested it
>>> again here briefly with some utf-8 test patches (see other
>>> attachment), and things appear to be working as expected.  if
>>> desired, I can check this in; otherwise feel free to do it for me ;-)
>>>
>>> 2 annoying things here during testing (I don't see how my patches
>>> could have caused this, but you never know):
>>>
>>> (1) all loaded patch windows appear at +0+0 (upper left corner),
>>> which with my wm (windowmaker) means the title bar is off the screen,
>>> and I have to resort to keyboard shortcuts to get them
>>> mouse-draggable, which is a major pain in the wazoo: is this a known
>>> bug?
>>>
>>> (2) I can't figure out how to get at the properties dialog for
>>> number, number2, or any other gui-atom objects: should these be
>>> working already?
>>>
>>> marmosets,
>>> 	Bryan
>>>
>>> On 2010-01-18 23:09:34, Hans-Christoph Steiner <hans at eds.org>
>>> appears to have written:
>>>>
>>>> Awesome!  If it's big and complicated, I say post it to the list
>>>> first; if it's not too bad, then just commit.
>>>>
>>>> .hc
>>>>
>>>> On Jan 18, 2010, at 4:47 AM, Bryan Jurish wrote:
>>>>
>>>>> moin Hans, moin list,
>>>>>
>>>>> I think perhaps I never actually did post the cleaned-up patch
>>>>> anywhere (bad programmer, no biscuit);  I guess I'll check out
>>>>> branches/pd-gui-rewrite/0.43 and try patching my changes in; then
>>>>> I can either commit or just post the (updated) patch.  Hopefully no
>>>>> major additional changes will be required, so it ought to go pretty
>>>>> fast.
>>>>>
>>>>> marmosets,
>>>>>  Bryan
>>>>>
>>>>> On 2010-01-17 22:57:33, Hans-Christoph Steiner <hans at eds.org>
>>>>> appears to have written:
>>>>>>
>>>>>> Hey Bryan,
>>>>>>
>>>>>> I'd like to try to get your UTF-8 code into pd-gui-rewrite.  You
>>>>>> mention in this posting back in May that you had the whole thing
>>>>>> working.  I couldn't find the diff/patch for this.  Is it posted
>>>>>> anywhere?  Do you want to try to check it in yourself directly to
>>>>>> the pd-gui-rewrite/0.43 branch?
>>>>>>
>>>>>> .hc
>>>>>>
>>>>>>
>>>>>> On Mar 20, 2009, at 6:16 PM, Bryan Jurish wrote:
>>>>>>
>>>>>>> morning all,
>>>>>>>
>>>>>>> Of course I never really like to see my code wither away in the
>>>>>>> bit bucket, but I personally don't have any pressing need for
>>>>>>> UTF-8 symbols, comments, etc. in Pd -- I'm a native English
>>>>>>> speaker, after all ;-)
>>>>>>>
>>>>>>> Also, my changes are by no means the only way to do it (or even
>>>>>>> the best way); we could gain a little speed by slapping on some
>>>>>>> more buffers (mostly and possibly only in rtext_senditup()), but
>>>>>>> since this seems to affect only GUI/editing stuff, I think we can
>>>>>>> live with a smidgeon of additional cpu time ... after all, it's
>>>>>>> all O(n) anyways.
>>>>>>>
>>>>>>> Really I just wanted to see how easy (or difficult) it would be
>>>>>>> to get Pd to use UTF-8 as its internal encoding... turned out to
>>>>>>> be harder than I had thought, but (ever so slightly) easier than
>>>>>>> I had feared :-/
>>>>>>>
>>>>>>> marmosets,
>>>>>>> Bryan
>>>>>>>
>>>>>>> On 2009-03-20 18:39:06, Hans-Christoph Steiner <hans at eds.org>
>>>>>>> appears to have written:
>>>>>>>>
>>>>>>>> I wonder what the best approach is to getting it included.  I
>>>>>>>> also think it's a very valuable contribution.  I think we need
>>>>>>>> to first get the Tcl/Tk-only changes done, since that was the
>>>>>>>> mandate of the pd-devel 0.41 effort.  Then, once Miller has
>>>>>>>> accepted those changes, we can start with the C modifications
>>>>>>>> there.  So how to proceed next, I think, depends on how eager
>>>>>>>> you are, Bryan, to get this into a regular build.
>>>>>>>>
>>>>>>>> One option is making a pd-devel-utf8 branch, another is posting
>>>>>>>> these patches to the patch tracker and waiting for Miller to
>>>>>>>> make his next update with the Pd-devel Tcl-Tk code.
>>>>>>>>
>>>>>>>> Maybe we can get Miller to chime in on this topic.
>>>>>>>>
>>>>>>>> .hc
>>>>>>>>
>>>>>>>> On Mar 13, 2009, at 12:00 AM, dmotd wrote:
>>>>>>>>
>>>>>>>>> hey bryan,
>>>>>>>>>
>>>>>>>>> just a quick note of appreciation for getting this one out..
>>>>>>>>> i hope it gets picked up in Miller's build soon.. a very useful
>>>>>>>>> and necessary modification.
>>>>>>>>>
>>>>>>>>> well done!
>>>>>>>>>
>>>>>>>>> dmotd
>>>>>>>>>
>>>>>>>>> On Thursday 12 March 2009 08:07:50 Bryan Jurish wrote:
>>>>>>>>>> moin folks,
>>>>>>>>>>
>>>>>>>>>> I believe I've finally got pd-devel 0.41-4 using UTF-8 across
>>>>>>>>>> the board.  So far, I've tested message boxes & comments
>>>>>>>>>> (g_rtext), as well as symbol atoms, and all seems good.  I
>>>>>>>>>> think we can still expect goofiness if someone names an
>>>>>>>>>> abstraction using a multibyte character when the filesystem
>>>>>>>>>> isn't UTF-8 encoded (raw 8-bit works for me here too), but I
>>>>>>>>>> really don't want to open that particular can of worms.
>>>>>>>>>>
>>>>>>>>>> So I guess I have 2 questions:
>>>>>>>>>>
>>>>>>>>>> (1) what should I call the generic UTF-8 source files? (see
>>>>>>>>>> my other post)
>>>>>>>>>>
>>>>>>>>>> (2) shall I commit these changes to pd-devel/0.41-4, or
>>>>>>>>>> somewhere else, or just post a diff (ca. 33k, ought to be
>>>>>>>>>> easier to read now; I've tried to follow the indentation
>>>>>>>>>> conventions of the source files I modified)?
>>>>>>>>>>
>>>>>>>>>> marmosets,
>>>>>>>>>> Bryan
>>>>>>>
>>>>>>> -- 
>>>>>>> Bryan Jurish                           "There is *always* one more bug."
>>>>>>> jurish at ling.uni-potsdam.de      -Lubarsky's Law of Cybernetic Entomology
>>>>>>
>>>>>
>>>>> -- 
>>>>> ***************************************************
>>>>>
>>>>> Bryan Jurish
>>>>> Deutsches Textarchiv
>>>>> Berlin-Brandenburgische Akademie der Wissenschaften
>>>>>
>>>>> Jägerstr. 22/23
>>>>> 10117 Berlin
>>>>>
>>>>> Tel.:      +49 (0)30 20370 539
>>>>> E-Mail:    jurish at bbaw.de
>>>>>
>>>>> ***************************************************
>>>>>
>>>>
>>>
>>> -- 
>>> Bryan Jurish                       "There is *always* one more bug."
>>> jurish at uni-potsdam.de       -Lubarsky's Law of Cybernetic Entomology
>>> Index: src/Makefile.am
>>> ===================================================================
>>> --- src/Makefile.am	(revision 13051)
>>> +++ src/Makefile.am	(working copy)
>>> @@ -24,6 +24,7 @@
>>>   m_conf.c m_glob.c m_sched.c \
>>>   s_main.c s_inter.c s_file.c s_print.c \
>>>   s_loader.c s_path.c s_entry.c s_audio.c s_midi.c \
>>> +    s_utf8.c \
>>>   d_ugen.c d_ctl.c d_arithmetic.c d_osc.c d_filter.c d_dac.c d_misc.c \
>>>   d_math.c d_fft.c d_array.c d_global.c \
>>>   d_delay.c d_resample.c \
>>> Index: src/g_editor.c
>>> ===================================================================
>>> --- src/g_editor.c	(revision 13051)
>>> +++ src/g_editor.c	(working copy)
>>> @@ -9,6 +9,7 @@
>>> #include "s_stuff.h"
>>> #include "g_canvas.h"
>>> #include <string.h>
>>> +#include "s_utf8.h" /*-- moo --*/
>>>
>>> void glist_readfrombinbuf(t_glist *x, t_binbuf *b, char *filename,
>>>   int selectem);
>>> @@ -1666,8 +1667,9 @@
>>>       gotkeysym = av[1].a_w.w_symbol;
>>>   else if (av[1].a_type == A_FLOAT)
>>>   {
>>> -        char buf[3];
>>> -        sprintf(buf, "%c", (int)(av[1].a_w.w_float));
>>> +        /*-- moo: assume keynum is a Unicode codepoint; encode as UTF-8 --*/
>>> +        char buf[UTF8_MAXBYTES1];
>>> +        u8_wc_toutf8_nul(buf, (UCS4)(av[1].a_w.w_float));
>>>       gotkeysym = gensym(buf);
>>>   }
>>>   else gotkeysym = gensym("?");
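The g_editor.c hunk above swaps a one-byte sprintf for a call that encodes the key number as UTF-8 and NUL-terminates the result. The sketch below shows roughly what such a call has to do. It is a standalone illustration under stated assumptions, not the patch's u8_wc_toutf8_nul(): the name encode_utf8_nul and the use of uint32_t from <stdint.h> are choices made here, and surrogate code points are not rejected.

```c
#include <stdint.h>

/* Hypothetical stand-in for the patch's u8_wc_toutf8_nul().
   Encodes one code point as UTF-8 into dest, which must have room for
   at least 5 bytes, and NUL-terminates it.  Returns the number of bytes
   written before the NUL, or 0 if ch is above U+10FFFF. */
int encode_utf8_nul(char *dest, uint32_t ch)
{
    int n = 0;

    if (ch < 0x80) {                    /* 1 byte: 0xxxxxxx */
        dest[n++] = (char)ch;
    } else if (ch < 0x800) {            /* 2 bytes: 110xxxxx 10xxxxxx */
        dest[n++] = (char)((ch >> 6) | 0xC0);
        dest[n++] = (char)((ch & 0x3F) | 0x80);
    } else if (ch < 0x10000) {          /* 3 bytes */
        dest[n++] = (char)((ch >> 12) | 0xE0);
        dest[n++] = (char)(((ch >> 6) & 0x3F) | 0x80);
        dest[n++] = (char)((ch & 0x3F) | 0x80);
    } else if (ch < 0x110000) {         /* 4 bytes */
        dest[n++] = (char)((ch >> 18) | 0xF0);
        dest[n++] = (char)(((ch >> 12) & 0x3F) | 0x80);
        dest[n++] = (char)(((ch >> 6) & 0x3F) | 0x80);
        dest[n++] = (char)((ch & 0x3F) | 0x80);
    }
    dest[n] = '\0';                     /* empty string for unsupported input */
    return n;
}
```

For example, U+00E4 comes out as the two bytes C3 A4 followed by the terminator, which is why the buffer in the hunk grows from 3 bytes to UTF8_MAXBYTES1.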
>>> Index: src/s_utf8.c
>>> ===================================================================
>>> --- src/s_utf8.c	(revision 0)
>>> +++ src/s_utf8.c	(revision 0)
>>> @@ -0,0 +1,280 @@
>>> +/*
>>> +  Basic UTF-8 manipulation routines
>>> +  by Jeff Bezanson
>>> +  placed in the public domain Fall 2005
>>> +
>>> +  This code is designed to provide the utilities you need to manipulate
>>> +  UTF-8 as an internal string encoding. These functions do not perform the
>>> +  error checking normally needed when handling UTF-8 data, so if you happen
>>> +  to be from the Unicode Consortium you will want to flay me alive.
>>> +  I do this because error checking can be performed at the boundaries (I/O),
>>> +  with these routines reserved for higher performance on data known to be
>>> +  valid.
>>> +
>>> +  modified by Bryan Jurish (moo) March 2009
>>> +  + removed some unneeded functions (escapes, printf etc), added others
>>> +*/
>>> +#include <stdlib.h>
>>> +#include <stdio.h>
>>> +#include <string.h>
>>> +#include <stdarg.h>
>>> +#ifdef WIN32
>>> +#include <malloc.h>
>>> +#else
>>> +#include <alloca.h>
>>> +#endif
>>> +
>>> +#include "s_utf8.h"
>>> +
>>> +static const u_int32_t offsetsFromUTF8[6] = {
>>> +    0x00000000UL, 0x00003080UL, 0x000E2080UL,
>>> +    0x03C82080UL, 0xFA082080UL, 0x82082080UL
>>> +};
>>> +
>>> +static const char trailingBytesForUTF8[256] = {
>>> +    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  
>>> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
>>> +    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  
>>> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
>>> +    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  
>>> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
>>> +    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  
>>> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
>>> +    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  
>>> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
>>> +    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  
>>> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
>>> +    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,  
>>> 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
>>> +    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,  
>>> 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
>>> +};
>>> +
>>> +
>>> +/* returns length of next utf-8 sequence */
>>> +int u8_seqlen(char *s)
>>> +{
>>> +    return trailingBytesForUTF8[(unsigned int)(unsigned char)s[0]] + 1;
>>> +}
>>> +
>>> +/* conversions without error checking
>>> +   only works for valid UTF-8, i.e. no 5- or 6-byte sequences
>>> +   srcsz = source size in bytes, or -1 if 0-terminated
>>> +   sz = dest size in # of wide characters
>>> +
>>> +   returns # characters converted
>>> +   dest will always be L'\0'-terminated, even if there isn't enough room
>>> +   for all the characters.
>>> +   if sz = srcsz+1 (i.e. 4*srcsz+4 bytes), there will always be enough space.
>>> +*/
>>> +int u8_toucs(u_int32_t *dest, int sz, char *src, int srcsz)
>>> +{
>>> +    u_int32_t ch;
>>> +    char *src_end = src + srcsz;
>>> +    int nb;
>>> +    int i=0;
>>> +
>>> +    while (i < sz-1) {
>>> +        nb = trailingBytesForUTF8[(unsigned char)*src];
>>> +        if (srcsz == -1) {
>>> +            if (*src == 0)
>>> +                goto done_toucs;
>>> +        }
>>> +        else {
>>> +            if (src + nb >= src_end)
>>> +                goto done_toucs;
>>> +        }
>>> +        ch = 0;
>>> +        switch (nb) {
>>> +            /* these fall through deliberately */
>>> +#if UTF8_SUPPORT_FULL_UCS4
>>> +        case 5: ch += (unsigned char)*src++; ch <<= 6;
>>> +        case 4: ch += (unsigned char)*src++; ch <<= 6;
>>> +#endif
>>> +        case 3: ch += (unsigned char)*src++; ch <<= 6;
>>> +        case 2: ch += (unsigned char)*src++; ch <<= 6;
>>> +        case 1: ch += (unsigned char)*src++; ch <<= 6;
>>> +        case 0: ch += (unsigned char)*src++;
>>> +        }
>>> +        ch -= offsetsFromUTF8[nb];
>>> +        dest[i++] = ch;
>>> +    }
>>> + done_toucs:
>>> +    dest[i] = 0;
>>> +    return i;
>>> +}
>>> +
>>> +/* srcsz = number of source characters, or -1 if 0-terminated
>>> +   sz = size of dest buffer in bytes
>>> +
>>> +   returns # characters converted
>>> +   dest will only be '\0'-terminated if there is enough space. this is
>>> +   for consistency; imagine there are 2 bytes of space left, but the next
>>> +   character requires 3 bytes. in this case we could NUL-terminate, but in
>>> +   general we can't when there's insufficient space. therefore this function
>>> +   only NUL-terminates if all the characters fit, and there's space for
>>> +   the NUL as well.
>>> +   the destination string will never be bigger than the source string.
>>> +*/
>>> +int u8_toutf8(char *dest, int sz, u_int32_t *src, int srcsz)
>>> +{
>>> +    u_int32_t ch;
>>> +    int i = 0;
>>> +    char *dest_end = dest + sz;
>>> +
>>> +    while (srcsz<0 ? src[i]!=0 : i < srcsz) {
>>> +        ch = src[i];
>>> +        if (ch < 0x80) {
>>> +            if (dest >= dest_end)
>>> +                return i;
>>> +            *dest++ = (char)ch;
>>> +        }
>>> +        else if (ch < 0x800) {
>>> +            if (dest >= dest_end-1)
>>> +                return i;
>>> +            *dest++ = (ch>>6) | 0xC0;
>>> +            *dest++ = (ch & 0x3F) | 0x80;
>>> +        }
>>> +        else if (ch < 0x10000) {
>>> +            if (dest >= dest_end-2)
>>> +                return i;
>>> +            *dest++ = (ch>>12) | 0xE0;
>>> +            *dest++ = ((ch>>6) & 0x3F) | 0x80;
>>> +            *dest++ = (ch & 0x3F) | 0x80;
>>> +        }
>>> +        else if (ch < 0x110000) {
>>> +            if (dest >= dest_end-3)
>>> +                return i;
>>> +            *dest++ = (ch>>18) | 0xF0;
>>> +            *dest++ = ((ch>>12) & 0x3F) | 0x80;
>>> +            *dest++ = ((ch>>6) & 0x3F) | 0x80;
>>> +            *dest++ = (ch & 0x3F) | 0x80;
>>> +        }
>>> +        i++;
>>> +    }
>>> +    if (dest < dest_end)
>>> +        *dest = '\0';
>>> +    return i;
>>> +}
>>> +
>>> +/* moo: get byte length of character number, or 0 if not supported */
>>> +int u8_wc_nbytes(u_int32_t ch)
>>> +{
>>> +  if (ch < 0x80) return 1;
>>> +  if (ch < 0x800) return 2;
>>> +  if (ch < 0x10000) return 3;
>>> +  if (ch < 0x200000) return 4;
>>> +#if UTF8_SUPPORT_FULL_UCS4
>>> +  /*-- moo: support full UCS-4 range? --*/
>>> +  if (ch < 0x4000000) return 5;
>>> +  if (ch < 0x7fffffffUL) return 6;
>>> +#endif
>>> +  return 0; /*-- bad input --*/
>>> +}
>>> +
>>> +int u8_wc_toutf8(char *dest, u_int32_t ch)
>>> +{
>>> +    if (ch < 0x80) {
>>> +        dest[0] = (char)ch;
>>> +        return 1;
>>> +    }
>>> +    if (ch < 0x800) {
>>> +        dest[0] = (ch>>6) | 0xC0;
>>> +        dest[1] = (ch & 0x3F) | 0x80;
>>> +        return 2;
>>> +    }
>>> +    if (ch < 0x10000) {
>>> +        dest[0] = (ch>>12) | 0xE0;
>>> +        dest[1] = ((ch>>6) & 0x3F) | 0x80;
>>> +        dest[2] = (ch & 0x3F) | 0x80;
>>> +        return 3;
>>> +    }
>>> +    if (ch < 0x110000) {
>>> +        dest[0] = (ch>>18) | 0xF0;
>>> +        dest[1] = ((ch>>12) & 0x3F) | 0x80;
>>> +        dest[2] = ((ch>>6) & 0x3F) | 0x80;
>>> +        dest[3] = (ch & 0x3F) | 0x80;
>>> +        return 4;
>>> +    }
>>> +    return 0;
>>> +}
>>> +
>>> +/*-- moo --*/
>>> +int u8_wc_toutf8_nul(char *dest, u_int32_t ch)
>>> +{
>>> +  int sz = u8_wc_toutf8(dest,ch);
>>> +  dest[sz] = '\0';
>>> +  return sz;
>>> +}
>>> +
>>> +/* charnum => byte offset */
>>> +int u8_offset(char *str, int charnum)
>>> +{
>>> +    int offs=0;
>>> +
>>> +    while (charnum > 0 && str[offs]) {
>>> +        (void)(isutf(str[++offs]) || isutf(str[++offs]) ||
>>> +               isutf(str[++offs]) || ++offs);
>>> +        charnum--;
>>> +    }
>>> +    return offs;
>>> +}
>>> +
>>> +/* byte offset => charnum */
>>> +int u8_charnum(char *s, int offset)
>>> +{
>>> +    int charnum = 0, offs=0;
>>> +
>>> +    while (offs < offset && s[offs]) {
>>> +        (void)(isutf(s[++offs]) || isutf(s[++offs]) ||
>>> +               isutf(s[++offs]) || ++offs);
>>> +        charnum++;
>>> +    }
>>> +    return charnum;
>>> +}
>>> +
>>> +/* reads the next utf-8 sequence out of a string, updating an index */
>>> +u_int32_t u8_nextchar(char *s, int *i)
>>> +{
>>> +    u_int32_t ch = 0;
>>> +    int sz = 0;
>>> +
>>> +    do {
>>> +        ch <<= 6;
>>> +        ch += (unsigned char)s[(*i)++];
>>> +        sz++;
>>> +    } while (s[*i] && !isutf(s[*i]));
>>> +    ch -= offsetsFromUTF8[sz-1];
>>> +
>>> +    return ch;
>>> +}
>>> +
>>> +/* number of characters */
>>> +int u8_strlen(char *s)
>>> +{
>>> +    int count = 0;
>>> +    int i = 0;
>>> +
>>> +    while (u8_nextchar(s, &i) != 0)
>>> +        count++;
>>> +
>>> +    return count;
>>> +}
>>> +
>>> +void u8_inc(char *s, int *i)
>>> +{
>>> +    (void)(isutf(s[++(*i)]) || isutf(s[++(*i)]) ||
>>> +           isutf(s[++(*i)]) || ++(*i));
>>> +}
>>> +
>>> +void u8_dec(char *s, int *i)
>>> +{
>>> +    (void)(isutf(s[--(*i)]) || isutf(s[--(*i)]) ||
>>> +           isutf(s[--(*i)]) || --(*i));
>>> +}
>>> +
>>> +/*-- moo --*/
>>> +void u8_inc_ptr(char **sp)
>>> +{
>>> +  (void)(isutf(*(++(*sp))) || isutf(*(++(*sp))) ||
>>> +	 isutf(*(++(*sp))) || ++(*sp));
>>> +}
>>> +
>>> +/*-- moo --*/
>>> +void u8_dec_ptr(char **sp)
>>> +{
>>> +  (void)(isutf(*(--(*sp))) || isutf(*(--(*sp))) ||
>>> +	 isutf(*(--(*sp))) || --(*sp));
>>> +}
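The conversions between character numbers and byte offsets (u8_offset and u8_charnum above) are what the rest of the patch leans on. Below is a minimal standalone sketch of the same idea, assuming valid, NUL-terminated UTF-8. The names IS_LEAD, char_to_byte and byte_to_char are hypothetical, and unlike the patch's versions these loops skip continuation bytes one at a time instead of unrolling up to three steps.

```c
#include <stddef.h>

/* True if byte c is not a 10xxxxxx continuation byte, i.e. it starts a
   sequence.  Same bit test as the isutf() macro in the patch. */
#define IS_LEAD(c) ((((unsigned char)(c)) & 0xC0) != 0x80)

/* Hypothetical helper: character index -> byte offset.
   Stops early at the terminating NUL. */
size_t char_to_byte(const char *s, size_t charnum)
{
    size_t offs = 0;

    while (charnum > 0 && s[offs] != '\0') {
        offs++;                         /* step over the lead byte */
        while (s[offs] != '\0' && !IS_LEAD(s[offs]))
            offs++;                     /* skip its continuation bytes */
        charnum--;
    }
    return offs;
}

/* Hypothetical helper: byte offset -> character index. */
size_t byte_to_char(const char *s, size_t offset)
{
    size_t chars = 0, offs = 0;

    while (offs < offset && s[offs] != '\0') {
        offs++;
        while (s[offs] != '\0' && !IS_LEAD(s[offs]))
            offs++;
        chars++;
    }
    return chars;
}
```

Both functions walk the string from the start, so each call is O(n) in the length of the text, which is the cost Bryan mentions for GUI and editing operations.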
>>> Index: src/g_rtext.c
>>> ===================================================================
>>> --- src/g_rtext.c	(revision 13051)
>>> +++ src/g_rtext.c	(working copy)
>>> @@ -13,6 +13,7 @@
>>> #include "m_pd.h"
>>> #include "s_stuff.h"
>>> #include "g_canvas.h"
>>> +#include "s_utf8.h"
>>>
>>>
>>> #define LMARGIN 2
>>> @@ -32,10 +33,10 @@
>>>
>>> struct _rtext
>>> {
>>> -    char *x_buf;
>>> -    int x_bufsize;
>>> -    int x_selstart;
>>> -    int x_selend;
>>> +    char *x_buf;    /*-- raw byte string, assumed UTF-8 encoded (moo) --*/
>>> +    int x_bufsize;  /*-- byte length --*/
>>> +    int x_selstart; /*-- byte offset --*/
>>> +    int x_selend;   /*-- byte offset --*/
>>>   int x_active;
>>>   int x_dragfrom;
>>>   int x_height;
>>> @@ -119,6 +120,15 @@
>>>
>>> /* LATER deal with tcl-significant characters */
>>>
>>> +/* firstone(), lastone()
>>> + *  + returns byte offset of (first|last) occurrence of 'c' in 's[0..n-1]', or
>>> + *    -1 if none was found
>>> + *  + 's' is a raw byte string
>>> + *  + 'c' is a byte value
>>> + *  + 'n' is the length (in bytes) of the prefix of 's' to be searched.
>>> + *  + we could make these functions work on logical characters in utf8 strings,
>>> + *    but we don't really need to...
>>> + */
>>> static int firstone(char *s, int c, int n)
>>> {
>>>   char *s2 = s + n;
>>> @@ -155,6 +165,16 @@
>>>   of the entire text in pixels.
>>>   */
>>>
>>> +   /*-- moo:
>>> +    * + some variables from the original version have been renamed
>>> +    * + variables with a "_b" suffix are raw byte strings, lengths, or offsets
>>> +    * + variables with a "_c" suffix are logical character lengths or offsets
>>> +    *   (assuming valid UTF-8 encoded byte string in x->x_buf)
>>> +    * + a fair amount of O(n) computations required to convert between raw byte
>>> +    *   offsets (needed by the C side) and logical character offsets (needed by
>>> +    *   the GUI)
>>> +    */
>>> +
>>>   /* LATER get this and sys_vgui to work together properly,
>>>       breaking up messages as needed.  As of now, there's
>>>       a limit of 1950 characters, imposed by sys_vgui(). */
>>> @@ -171,14 +191,16 @@
>>> {
>>>   t_float dispx, dispy;
>>>   char smallbuf[200], *tempbuf;
>>> -    int outchars = 0, nlines = 0, ncolumns = 0,
>>> +    int outchars_b = 0, nlines = 0, ncolumns = 0,
>>>       pixwide, pixhigh, font, fontwidth, fontheight, findx, findy;
>>>   int reportedindex = 0;
>>>   t_canvas *canvas = glist_getcanvas(x->x_glist);
>>> -    int widthspec = x->x_text->te_width;
>>> -    int widthlimit = (widthspec ? widthspec : BOXWIDTH);
>>> -    int inindex = 0;
>>> -    int selstart = 0, selend = 0;
>>> +    int widthspec_c = x->x_text->te_width;
>>> +    int widthlimit_c = (widthspec_c ? widthspec_c : BOXWIDTH);
>>> +    int inindex_b = 0;
>>> +    int inindex_c = 0;
>>> +    int selstart_b = 0, selend_b = 0;
>>> +    int x_bufsize_c = u8_charnum(x->x_buf, x->x_bufsize);
>>>       /* if we're a GOP (the new, "goprect" style) borrow the font size
>>>       from the inside to preserve the spacing */
>>>   if (pd_class(&x->x_text->te_pd) == canvas_class &&
>>> @@ -193,65 +215,76 @@
>>>   if (x->x_bufsize >= 100)
>>>        tempbuf = (char *)t_getbytes(2 * x->x_bufsize + 1);
>>>   else tempbuf = smallbuf;
>>> -    while (x->x_bufsize - inindex > 0)
>>> +    while (x_bufsize_c - inindex_c > 0)
>>>   {
>>> -        int inchars = x->x_bufsize - inindex;
>>> -        int maxindex = (inchars > widthlimit ? widthlimit : inchars);
>>> +        int inchars_b  = x->x_bufsize - inindex_b;
>>> +        int inchars_c  = x_bufsize_c  - inindex_c;
>>> +        int maxindex_c = (inchars_c > widthlimit_c ? widthlimit_c : inchars_c);
>>> +        int maxindex_b = u8_offset(x->x_buf + inindex_b, maxindex_c);
>>>       int eatchar = 1;
>>> -        int foundit = firstone(x->x_buf + inindex, '\n', maxindex);
>>> -        if (foundit < 0)
>>> +        int foundit_b  = firstone(x->x_buf + inindex_b, '\n', maxindex_b);
>>> +        int foundit_c;
>>> +        if (foundit_b < 0)
>>>       {
>>> -            if (inchars > widthlimit)
>>> +            if (inchars_c > widthlimit_c)
>>>           {
>>> -                foundit = lastone(x->x_buf + inindex, ' ', maxindex);
>>> -                if (foundit < 0)
>>> +                foundit_b = lastone(x->x_buf + inindex_b, ' ', maxindex_b);
>>> +                if (foundit_b < 0)
>>>               {
>>> -                    foundit = maxindex;
>>> +                    foundit_b = maxindex_b;
>>> +                    foundit_c = maxindex_c;
>>>                   eatchar = 0;
>>>               }
>>> +                else
>>> +                    foundit_c = u8_charnum(x->x_buf + inindex_b, foundit_b);
>>>           }
>>>           else
>>>           {
>>> -                foundit = inchars;
>>> +                foundit_b = inchars_b;
>>> +                foundit_c = inchars_c;
>>>               eatchar = 0;
>>>           }
>>>       }
>>> +        else
>>> +            foundit_c = u8_charnum(x->x_buf + inindex_b, foundit_b);
>>> +
>>>       if (nlines == findy)
>>>       {
>>>           int actualx = (findx < 0 ? 0 :
>>> -                (findx > foundit ? foundit : findx));
>>> -            *indexp = inindex + actualx;
>>> +                (findx > foundit_c ? foundit_c : findx));
>>> +            *indexp = inindex_b + u8_offset(x->x_buf + inindex_b, actualx);
>>>           reportedindex = 1;
>>>       }
>>> -        strncpy(tempbuf+outchars, x->x_buf + inindex, foundit);
>>> -        if (x->x_selstart >= inindex &&
>>> -            x->x_selstart <= inindex + foundit + eatchar)
>>> -                selstart = x->x_selstart + outchars - inindex;
>>> -        if (x->x_selend >= inindex &&
>>> -            x->x_selend <= inindex + foundit + eatchar)
>>> -                selend = x->x_selend + outchars - inindex;
>>> -        outchars += foundit;
>>> -        inindex += (foundit + eatchar);
>>> -        if (inindex < x->x_bufsize)
>>> -            tempbuf[outchars++] = '\n';
>>> -        if (foundit > ncolumns)
>>> -            ncolumns = foundit;
>>> +        strncpy(tempbuf+outchars_b, x->x_buf + inindex_b, foundit_b);
>>> +        if (x->x_selstart >= inindex_b &&
>>> +            x->x_selstart <= inindex_b + foundit_b + eatchar)
>>> +                selstart_b = x->x_selstart + outchars_b - inindex_b;
>>> +        if (x->x_selend >= inindex_b &&
>>> +            x->x_selend <= inindex_b + foundit_b + eatchar)
>>> +                selend_b = x->x_selend + outchars_b - inindex_b;
>>> +        outchars_b += foundit_b;
>>> +        inindex_b += (foundit_b + eatchar);
>>> +        inindex_c += (foundit_c + eatchar);
>>> +        if (inindex_b < x->x_bufsize)
>>> +            tempbuf[outchars_b++] = '\n';
>>> +        if (foundit_c > ncolumns)
>>> +            ncolumns = foundit_c;
>>>       nlines++;
>>>   }
>>>   if (!reportedindex)
>>> -        *indexp = outchars;
>>> +        *indexp = outchars_b;
>>>   dispx = text_xpix(x->x_text, x->x_glist);
>>>   dispy = text_ypix(x->x_text, x->x_glist);
>>>   if (nlines < 1) nlines = 1;
>>> -    if (!widthspec)
>>> +    if (!widthspec_c)
>>>   {
>>>       while (ncolumns < 3)
>>>       {
>>> -            tempbuf[outchars++] = ' ';
>>> +            tempbuf[outchars_b++] = ' ';
>>>           ncolumns++;
>>>       }
>>>   }
>>> -    else ncolumns = widthspec;
>>> +    else ncolumns = widthspec_c;
>>>   pixwide = ncolumns * fontwidth + (LMARGIN + RMARGIN);
>>>   pixhigh = nlines * fontheight + (TMARGIN + BMARGIN);
>>>
>>> @@ -259,31 +292,32 @@
>>>       sys_vgui("pdtk_text_new .x%lx.c {%s %s text} %f %f {%.*s} %d %s\n",
>>>           canvas, x->x_tag, rtext_gettype(x)->s_name,
>>>           dispx + LMARGIN, dispy + TMARGIN,
>>> -            outchars, tempbuf, sys_hostfontsize(font),
>>> +            outchars_b, tempbuf, sys_hostfontsize(font),
>>>           (glist_isselected(x->x_glist,
>>>               &x->x_glist->gl_gobj)? "blue" : "black"));
>>>   else if (action == SEND_UPDATE)
>>>   {
>>>       sys_vgui("pdtk_text_set .x%lx.c %s {%.*s}\n",
>>> -            canvas, x->x_tag, outchars, tempbuf);
>>> +            canvas, x->x_tag, outchars_b, tempbuf);
>>>       if (pixwide != x->x_drawnwidth || pixhigh != x->x_drawnheight)
>>>           text_drawborder(x->x_text, x->x_glist, x->x_tag,
>>>               pixwide, pixhigh, 0);
>>>       if (x->x_active)
>>>       {
>>> -            if (selend > selstart)
>>> +            if (selend_b > selstart_b)
>>>           {
>>>               sys_vgui(".x%lx.c select from %s %d\n", canvas,
>>> -                    x->x_tag, selstart);
>>> +                    x->x_tag, u8_charnum(x->x_buf, selstart_b));
>>>               sys_vgui(".x%lx.c select to %s %d\n", canvas,
>>> -                    x->x_tag, selend + (sys_oldtclversion ? 0 : -1));
>>> +                    x->x_tag, u8_charnum(x->x_buf, selend_b)
>>> +			      + (sys_oldtclversion ? 0 : -1));
>>>               sys_vgui(".x%lx.c focus \"\"\n", canvas);
>>>           }
>>>           else
>>>           {
>>>               sys_vgui(".x%lx.c select clear\n", canvas);
>>>               sys_vgui(".x%lx.c icursor %s %d\n", canvas, x->x_tag,
>>> -                    selstart);
>>> +                    u8_charnum(x->x_buf, selstart_b));
>>>               sys_vgui(".x%lx.c focus %s\n", canvas, x->x_tag);
>>>           }
>>>       }
>>> @@ -448,12 +482,12 @@
>>>               ....
>>>           } */
>>>           if (x->x_selstart && (x->x_selstart == x->x_selend))
>>> -                x->x_selstart--;
>>> +                u8_dec(x->x_buf, &x->x_selstart);
>>>       }
>>>       else if (n == 127)      /* delete */
>>>       {
>>>           if (x->x_selend < x->x_bufsize && (x->x_selstart == x->x_selend))
>>> -                x->x_selend++;
>>> +                u8_inc(x->x_buf, &x->x_selend);
>>>       }
>>>
>>>       ndel = x->x_selend - x->x_selstart;
>>> @@ -466,7 +500,13 @@
>>> /* at Guenter's suggestion, use 'n>31' to test wither a character might
>>> be printable in whatever 8-bit character set we find ourselves. */
>>>
>>> -        if (n == '\n' || (n > 31 && n != 127))
>>> +/*-- moo:
>>> +  ... but test with "<" rather than "!=" in order to accommodate unicode
>>> +  codepoints for n (which we get since Tk is sending the "%A" substitution
>>> +  for bind <Key>), effectively reducing the coverage of this clause to 7
>>> +  bits.  Case n>127 is covered by the next clause.
>>> +*/
>>> +        if (n == '\n' || (n > 31 && n < 127))
>>>       {
>>>           newsize = x->x_bufsize+1;
>>>           x->x_buf = resizebytes(x->x_buf, x->x_bufsize, newsize);
>>> @@ -476,20 +516,39 @@
>>>           x->x_bufsize = newsize;
>>>           x->x_selstart = x->x_selstart + 1;
>>>       }
>>> +	/*--moo: check for unicode codepoints beyond 7-bit ASCII --*/
>>> +	else if (n > 127)
>>> +        {
>>> +            int ch_nbytes = u8_wc_nbytes(n);
>>> +            newsize = x->x_bufsize + ch_nbytes;
>>> +            x->x_buf = resizebytes(x->x_buf, x->x_bufsize, newsize);
>>> +            for (i = newsize - 1; i >= x->x_selstart + ch_nbytes; i--)
>>> +                x->x_buf[i] = x->x_buf[i - ch_nbytes];
>>> +            x->x_bufsize = newsize;
>>> +            /*-- moo: assume canvas_key() has encoded keysym as UTF-8 */
>>> +            strncpy(x->x_buf+x->x_selstart, keysym->s_name, ch_nbytes);
>>> +            x->x_selstart = x->x_selstart + ch_nbytes;
>>> +        }
>>>       x->x_selend = x->x_selstart;
>>>       x->x_glist->gl_editor->e_textdirty = 1;
>>>   }
>>>   else if (!strcmp(keysym->s_name, "Right"))
>>>   {
>>>       if (x->x_selend == x->x_selstart && x->x_selstart < x->x_bufsize)
>>> -            x->x_selend = x->x_selstart = x->x_selstart + 1;
>>> +        {
>>> +            u8_inc(x->x_buf, &x->x_selstart);
>>> +            x->x_selend = x->x_selstart;
>>> +        }
>>>       else
>>>           x->x_selstart = x->x_selend;
>>>   }
>>>   else if (!strcmp(keysym->s_name, "Left"))
>>>   {
>>>       if (x->x_selend == x->x_selstart && x->x_selstart > 0)
>>> -            x->x_selend = x->x_selstart = x->x_selstart - 1;
>>> +        {
>>> +            u8_dec(x->x_buf, &x->x_selstart);
>>> +            x->x_selend = x->x_selstart;
>>> +        }
>>>       else
>>>           x->x_selend = x->x_selstart;
>>>   }
>>> @@ -497,18 +556,18 @@
>>>   else if (!strcmp(keysym->s_name, "Up"))
>>>   {
>>>       if (x->x_selstart)
>>> -            x->x_selstart--;
>>> +            u8_dec(x->x_buf, &x->x_selstart);
>>>       while (x->x_selstart > 0 && x->x_buf[x->x_selstart] != '\n')
>>> -            x->x_selstart--;
>>> +            u8_dec(x->x_buf, &x->x_selstart);
>>>       x->x_selend = x->x_selstart;
>>>   }
>>>   else if (!strcmp(keysym->s_name, "Down"))
>>>   {
>>>       while (x->x_selend < x->x_bufsize &&
>>>           x->x_buf[x->x_selend] != '\n')
>>> -            x->x_selend++;
>>> +            u8_inc(x->x_buf, &x->x_selend);
>>>       if (x->x_selend < x->x_bufsize)
>>> -            x->x_selend++;
>>> +            u8_inc(x->x_buf, &x->x_selend);
>>>       x->x_selstart = x->x_selend;
>>>   }
>>>   rtext_senditup(x, SEND_UPDATE, &w, &h, &indx);
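Inserting a multibyte character at the cursor means opening a gap as wide as the whole encoded sequence: every byte after the cursor has to move up by that many bytes before the sequence is copied in. The sketch below shows that step in isolation. It is not Pd code: insert_bytes is a hypothetical helper, realloc stands in for Pd's resizebytes(), and the caller is assumed to pass 0 <= pos <= *sizep and nbytes > 0.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: insert the nbytes-long sequence seq into the heap
   buffer *bufp (length *sizep bytes, not NUL-terminated) at byte offset
   pos.  Returns the byte offset just past the inserted sequence, or -1
   if the reallocation fails, in which case the buffer is unchanged. */
int insert_bytes(char **bufp, int *sizep, int pos, const char *seq, int nbytes)
{
    int newsize = *sizep + nbytes;
    char *buf = realloc(*bufp, (size_t)newsize);

    if (buf == NULL)
        return -1;
    /* Open a gap of nbytes at pos; memmove copes with the overlap. */
    memmove(buf + pos + nbytes, buf + pos, (size_t)(*sizep - pos));
    /* Drop the encoded character into the gap. */
    memcpy(buf + pos, seq, (size_t)nbytes);
    *bufp = buf;
    *sizep = newsize;
    return pos + nbytes;
}
```

The return value is the new cursor position as a byte offset, matching the patch's convention that x_selstart and x_selend count bytes rather than characters.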
>>> Index: src/s_utf8.h
>>> ===================================================================
>>> --- src/s_utf8.h	(revision 0)
>>> +++ src/s_utf8.h	(revision 0)
>>> @@ -0,0 +1,88 @@
>>> +#ifndef S_UTF8_H
>>> +#define S_UTF8_H
>>> +
>>> +/*--moo--*/
>>> +#ifndef u_int32_t
>>> +# define u_int32_t unsigned int
>>> +#endif
>>> +
>>> +#ifndef UCS4
>>> +# define UCS4 u_int32_t
>>> +#endif
>>> +
>>> +/* UTF8_SUPPORT_FULL_UCS4
>>> + *  define this to support the full potential range of UCS-4 codepoints
>>> + *  (in anticipation of a future UTF-8 standard)
>>> + */
>>> +/*#define UTF8_SUPPORT_FULL_UCS4 1*/
>>> +#undef UTF8_SUPPORT_FULL_UCS4
>>> +
>>> +/* UTF8_MAXBYTES
>>> + *   maximum number of bytes required to represent a single character in UTF-8
>>> + *
>>> + * UTF8_MAXBYTES1 = UTF8_MAXBYTES+1
>>> + *  maximum bytes per character including NUL terminator
>>> + */
>>> +#ifdef UTF8_SUPPORT_FULL_UCS4
>>> +# ifndef UTF8_MAXBYTES
>>> +#  define UTF8_MAXBYTES  6
>>> +# endif
>>> +# ifndef UTF8_MAXBYTES1
>>> +#  define UTF8_MAXBYTES1 7
>>> +# endif
>>> +#else
>>> +# ifndef UTF8_MAXBYTES
>>> +#  define UTF8_MAXBYTES  4
>>> +# endif
>>> +# ifndef UTF8_MAXBYTES1
>>> +#  define UTF8_MAXBYTES1 5
>>> +# endif
>>> +#endif
>>> +/*--/moo--*/
>>> +
>>> +/* is c the start of a utf8 sequence? */
>>> +#define isutf(c) (((c)&0xC0)!=0x80)
>>> +
>>> +/* convert UTF-8 data to wide character */
>>> +int u8_toucs(u_int32_t *dest, int sz, char *src, int srcsz);
>>> +
>>> +/* the opposite conversion */
>>> +int u8_toutf8(char *dest, int sz, u_int32_t *src, int srcsz);
>>> +
>>> +/* moo: get byte length of character number, or 0 if not supported */
>>> +int u8_wc_nbytes(u_int32_t ch);
>>> +
>>> +/* moo: compute required storage for UTF-8 encoding of 's[0..n-1]' */
>>> +int u8_wcs_nbytes(u_int32_t *ucs, int size);
>>> +
>>> +/* single character to UTF-8, no NUL termination */
>>> +int u8_wc_toutf8(char *dest, u_int32_t ch);
>>> +
>>> +/* moo: single character to UTF-8, with NUL termination */
>>> +int u8_wc_toutf8_nul(char *dest, u_int32_t ch);
>>> +
>>> +/* character number to byte offset */
>>> +int u8_offset(char *str, int charnum);
>>> +
>>> +/* byte offset to character number */
>>> +int u8_charnum(char *s, int offset);
>>> +
>>> +/* return next character, updating an index variable */
>>> +u_int32_t u8_nextchar(char *s, int *i);
>>> +
>>> +/* move to next character */
>>> +void u8_inc(char *s, int *i);
>>> +
>>> +/* move to previous character */
>>> +void u8_dec(char *s, int *i);
>>> +
>>> +/* moo: move pointer to next character */
>>> +void u8_inc_ptr(char **sp);
>>> +
>>> +/* moo: move pointer to previous character */
>>> +void u8_dec_ptr(char **sp);
>>> +
>>> +/* returns length of next utf-8 sequence */
>>> +int u8_seqlen(char *s);
>>> +
>>> +#endif /* S_UTF8_H */
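The isutf() macro in the header relies on one property of UTF-8: continuation bytes always look like 10xxxxxx, so any other byte starts a new character. As a rough illustration, the sketch below derives a sequence length from the lead byte's high bits and counts characters by skipping continuation bytes. The names seq_len_from_lead and count_chars are hypothetical, the input is assumed to be valid UTF-8, and only sequences of up to 4 bytes are recognised (the patch's lookup table also lists 5- and 6-byte forms).

```c
#include <stddef.h>

/* Hypothetical helper: length of the sequence introduced by lead byte c,
   read off its high bits.  Returns 0 for a continuation byte or for a
   lead byte outside the 4-byte limit. */
int seq_len_from_lead(unsigned char c)
{
    if (c < 0x80)           return 1;   /* 0xxxxxxx */
    if ((c & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((c & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((c & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    return 0;                           /* 10xxxxxx or 11111xxx */
}

/* Hypothetical helper: number of characters in NUL-terminated UTF-8,
   counted as the number of bytes that are not continuation bytes. */
size_t count_chars(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++)
        if ((((unsigned char)*s) & 0xC0) != 0x80)
            n++;
    return n;
}
```

Counting lead bytes gives the logical character count that the GUI works with, while strlen() on the same string gives the byte count that the C side stores.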
>>> <test-utf8.pd>
>>
>>
>> _______________________________________________
>> Pd-dev mailing list
>> Pd-dev at iem.at
>> http://lists.puredata.info/listinfo/pd-dev
