[PD-dev] UTF-8 for pd-devel (again)
Miller Puckette
mpuckett at imusic1.ucsd.edu
Wed Jan 20 04:15:04 CET 2010
127 is 'delete' -- ascii all right, but not 'printable'.
cheers
Miller
On Tue, Jan 19, 2010 at 09:37:08PM -0500, Hans-Christoph Steiner wrote:
>
> Looks good to me. One comment, shouldn't this be n<128? 127 is an
> ASCII char, AFAIK.
>
> + if (n == '\n' || (n > 31 && n < 127))
>
> It looks worth checking to me, hopefully we can get Miller and others
> to weigh in on it.
>
> .hc
>
> On Jan 19, 2010, at 4:16 PM, Bryan Jurish wrote:
>
> >morning all,
> >
> >attached is a UTF-8 support patch against branches/pd-gui-rewrite/0.43
> >revision 13051 (HEAD as of an hour or so ago). most of the bulk is
> >new
> >files (s_utf8.c, s_utf8.h), most other changes are in g_rtext.c. It's
> >not too monstrous, and I've tested it again here briefly with some
> >utf-8
> >test patches (see other attachment), and things appear to be working
> >as
> >expected. if desired, I can check this in; otherwise feel free to
> >do it
> >for me ;-)
> >
> >2 annoying things here during testing (I don't see how my patches
> >could
> >have caused this, but you never know):
> >
> >(1) all loaded patch windows appear at +0+0 (upper left corner), which
> >with my wm (windowmaker) means the title bar is off the screen, and I
> >have to resort to keyboard shortcuts to get them mouse-draggable,
> >which
> >is a major pain in the wazoo: is this a known bug?
> >
> >(2) I can't figure out how to get at the properties dialog for number,
> >number2, or any other gui-atom objects: should these be working
> >already?
> >
> >marmosets,
> > Bryan
> >
> >On 2010-01-18 23:09:34, Hans-Christoph Steiner <hans at eds.org>
> >appears to
> >have written:
> >>
> >>Awesome! If its big and complicated, I say post it to the list
> >>first,
> >>if not too bad, then just commit.
> >>
> >>.hc
> >>
> >>On Jan 18, 2010, at 4:47 AM, Bryan Jurish wrote:
> >>
> >>>moin Hans, moin list,
> >>>
> >>>I think perhaps I never actually did post the cleaned-up patch
> >>>anywhere
> >>>(bad programmer, no biscuit); I guess I'll check out
> >>>branches/pd-gui-rewrite/0.43 and try patching my changes in; then
> >>>I can
> >>>either commit or just post the (updated) patch. Hopefully no major
> >>>additional changes will be required, so it ought to go pretty fast.
> >>>
> >>>marmosets,
> >>> Bryan
> >>>
> >>>On 2010-01-17 22:57:33, Hans-Christoph Steiner <hans at eds.org>
> >>>appears to
> >>>have written:
> >>>>
> >>>>Hey Bryan,
> >>>>
> >>>>I'd like to try to get your UTF-8 code into pd-gui-rewrite. You
> >>>>mention
> >>>>in this posting back in May that you had the whole thing
> >>>>working. I
> >>>>couldn't find the diff/patch for this. Is it posted anywhere?
> >>>>Do you
> >>>>want to try to check it in yourself directly to the pd-gui-
> >>>>rewrite/0.43
> >>>>branch?
> >>>>
> >>>>.hc
> >>>>
> >>>>
> >>>>On Mar 20, 2009, at 6:16 PM, Bryan Jurish wrote:
> >>>>
> >>>>>morning all,
> >>>>>
> >>>>>Of course I never really like to see my code wither away in the
> >>>>>bit
> >>>>>bucket, but I personally don't have any pressing need for UTF-8
> >>>>>symbols,
> >>>>>comments, etc. in Pd -- I'm a native English speaker, after
> >>>>>all ;-)
> >>>>>
> >>>>>Also, my changes are by no means the only way to do it (or even
> >>>>>the
> >>>>>best
> >>>>>way); we could gain a little speed by slapping on some more
> >>>>>buffers
> >>>>>(mostly and possibly only in rtext_senditup()), but since this
> >>>>>seems to
> >>>>>effect only GUI/editing stuff, I think we can live with a
> >>>>>smidgeon of
> >>>>>additional cpu time ... after all, it's all O(n) anyways.
> >>>>>
> >>>>>Really I just wanted to see how easy (or difficult) it would be
> >>>>>to get
> >>>>>Pd to use UTF-8 as its internal encoding... turned out to be
> >>>>>harder
> >>>>>than
> >>>>>I had thought, but (ever so slightly) easier than I had feared :-/
> >>>>>
> >>>>>marmosets,
> >>>>> Bryan
> >>>>>
> >>>>>On 2009-03-20 18:39:06, Hans-Christoph Steiner <hans at eds.org>
> >>>>>appears to
> >>>>>have written:
> >>>>>>
> >>>>>>I wonder what the best approach is to getting it included. I
> >>>>>>also
> >>>>>>think
> >>>>>>its a very valuable contribution. I think we need to first get
> >>>>>>the
> >>>>>>Tcl/Tk only changes done, since that was the mandate of the pd-
> >>>>>>devel
> >>>>>>0.41 effort. Then once Miller has accepted those changes, then
> >>>>>>we can
> >>>>>>start with the C modifications there. So how to proceed next,
> >>>>>>I think
> >>>>>>is based on how eager you are, Bryan, to getting this in a
> >>>>>>regular
> >>>>>>build.
> >>>>>>
> >>>>>>One option is making a pd-devel-utf8 branch, another is posting
> >>>>>>these
> >>>>>>patches to the patch tracker and waiting for Miller to make his
> >>>>>>next
> >>>>>>update with the Pd-devel Tcl-Tk code.
> >>>>>>
> >>>>>>Maybe we can get Miller to chime in on this topic.
> >>>>>>
> >>>>>>.hc
> >>>>>>
> >>>>>>On Mar 13, 2009, at 12:00 AM, dmotd wrote:
> >>>>>>
> >>>>>>>hey bryan,
> >>>>>>>
> >>>>>>>just a quick note of a appreciation for getting this one out..
> >>>>>>>i hope
> >>>>>>>it gets
> >>>>>>>picked up in millers build soon.. a very useful and necessary
> >>>>>>>modification.
> >>>>>>>
> >>>>>>>well done!
> >>>>>>>
> >>>>>>>dmotd
> >>>>>>>
> >>>>>>>On Thursday 12 March 2009 08:07:50 Bryan Jurish wrote:
> >>>>>>>>moin folks,
> >>>>>>>>
> >>>>>>>>I believe I've finally got pd-devel 0.41-4 using UTF-8 across
> >>>>>>>>the
> >>>>>>>>board.
> >>>>>>>>So far, I've tested message boxes & comments (g_rtext), as
> >>>>>>>>well as
> >>>>>>>>symbol atoms, and all seems good. I think we can still expect
> >>>>>>>>goofiness
> >>>>>>>>if someone names an abstraction using a multibyte character
> >>>>>>>>when the
> >>>>>>>>filesystem isn't UTF-8 encoded (raw 8-bit works for me here
> >>>>>>>>too),
> >>>>>>>>but I
> >>>>>>>>really don't want to open that particular can of worms.
> >>>>>>>>
> >>>>>>>>So I guess I have 2 questions:
> >>>>>>>>
> >>>>>>>>(1) what should I call the generic UTF-8 source files? (see
> >>>>>>>>my other
> >>>>>>>>post)
> >>>>>>>>
> >>>>>>>>(2) shall I commit these changes to pd-devel/0.41-4, or
> >>>>>>>>somewhere
> >>>>>>>>else,
> >>>>>>>>or just post a diff (ca. 33k, ought to be easier to read now;
> >>>>>>>>I've
> >>>>>>>>tried
> >>>>>>>>to follow the indentation conventions of the source files I
> >>>>>>>>modified)?
> >>>>>>>>
> >>>>>>>>marmosets,
> >>>>>>>> Bryan
> >>>>>
> >>>>>--
> >>>>>Bryan Jurish "There is *always* one more
> >>>>>bug."
> >>>>>jurish at ling.uni-potsdam.de -Lubarsky's Law of Cybernetic
> >>>>>Entomology
> >>>>
> >>>>
> >>>>
> >>>>----------------------------------------------------------------------------
> >>>>
> >>>>
> >>>>
> >>>>The arc of history bends towards justice. - Dr. Martin Luther
> >>>>King, Jr.
> >>>>
> >>>>
> >>>
> >>>--
> >>>***************************************************
> >>>
> >>>Bryan Jurish
> >>>Deutsches Textarchiv
> >>>Berlin-Brandenburgische Akademie der Wissenschaften
> >>>
> >>>J?gerstr. 22/23
> >>>10117 Berlin
> >>>
> >>>Tel.: +49 (0)30 20370 539
> >>>E-Mail: jurish at bbaw.de
> >>>
> >>>***************************************************
> >>>
> >>
> >>
> >>
> >>----------------------------------------------------------------------------
> >>
> >>
> >>As we enjoy great advantages from inventions of others, we should be
> >>glad of an opportunity to serve others by any invention of ours; and
> >>this we should do freely and generously. - Benjamin Franklin
> >>
> >>
> >>
> >
> >--
> >Bryan Jurish "There is *always* one more bug."
> >jurish at uni-potsdam.de -Lubarsky's Law of Cybernetic Entomology
> >Index: src/Makefile.am
> >===================================================================
> >--- src/Makefile.am (revision 13051)
> >+++ src/Makefile.am (working copy)
> >@@ -24,6 +24,7 @@
> > m_conf.c m_glob.c m_sched.c \
> > s_main.c s_inter.c s_file.c s_print.c \
> > s_loader.c s_path.c s_entry.c s_audio.c s_midi.c \
> >+ s_utf8.c \
> > d_ugen.c d_ctl.c d_arithmetic.c d_osc.c d_filter.c d_dac.c
> >d_misc.c \
> > d_math.c d_fft.c d_array.c d_global.c \
> > d_delay.c d_resample.c \
> >Index: src/g_editor.c
> >===================================================================
> >--- src/g_editor.c (revision 13051)
> >+++ src/g_editor.c (working copy)
> >@@ -9,6 +9,7 @@
> >#include "s_stuff.h"
> >#include "g_canvas.h"
> >#include <string.h>
> >+#include "s_utf8.h" /*-- moo --*/
> >
> >void glist_readfrombinbuf(t_glist *x, t_binbuf *b, char *filename,
> > int selectem);
> >@@ -1666,8 +1667,9 @@
> > gotkeysym = av[1].a_w.w_symbol;
> > else if (av[1].a_type == A_FLOAT)
> > {
> >- char buf[3];
> >- sprintf(buf, "%c", (int)(av[1].a_w.w_float));
> >+ /*-- moo: assume keynum is a Unicode codepoint; encode as
> >UTF-8 --*/
> >+ char buf[UTF8_MAXBYTES1];
> >+ u8_wc_toutf8_nul(buf, (UCS4)(av[1].a_w.w_float));
> > gotkeysym = gensym(buf);
> > }
> > else gotkeysym = gensym("?");
> >Index: src/s_utf8.c
> >===================================================================
> >--- src/s_utf8.c (revision 0)
> >+++ src/s_utf8.c (revision 0)
> >@@ -0,0 +1,280 @@
> >+/*
> >+ Basic UTF-8 manipulation routines
> >+ by Jeff Bezanson
> >+ placed in the public domain Fall 2005
> >+
> >+ This code is designed to provide the utilities you need to
> >manipulate
> >+ UTF-8 as an internal string encoding. These functions do not
> >perform the
> >+ error checking normally needed when handling UTF-8 data, so if
> >you happen
> >+ to be from the Unicode Consortium you will want to flay me alive.
> >+ I do this because error checking can be performed at the
> >boundaries (I/O),
> >+ with these routines reserved for higher performance on data known
> >to be
> >+ valid.
> >+
> >+ modified by Bryan Jurish (moo) March 2009
> >+ + removed some unneeded functions (escapes, printf etc), added
> >others
> >+*/
> >+#include <stdlib.h>
> >+#include <stdio.h>
> >+#include <string.h>
> >+#include <stdarg.h>
> >+#ifdef WIN32
> >+#include <malloc.h>
> >+#else
> >+#include <alloca.h>
> >+#endif
> >+
> >+#include "s_utf8.h"
> >+
> >+static const u_int32_t offsetsFromUTF8[6] = {
> >+ 0x00000000UL, 0x00003080UL, 0x000E2080UL,
> >+ 0x03C82080UL, 0xFA082080UL, 0x82082080UL
> >+};
> >+
> >+static const char trailingBytesForUTF8[256] = {
> >+ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> >+ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> >+ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> >+ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> >+ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> >+ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> >+ 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
> >+ 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
> >+};
> >+
> >+
> >+/* returns length of next utf-8 sequence */
> >+int u8_seqlen(char *s)
> >+{
> >+ return trailingBytesForUTF8[(unsigned int)(unsigned char)s[0]]
> >+ 1;
> >+}
> >+
> >+/* conversions without error checking
> >+ only works for valid UTF-8, i.e. no 5- or 6-byte sequences
> >+ srcsz = source size in bytes, or -1 if 0-terminated
> >+ sz = dest size in # of wide characters
> >+
> >+ returns # characters converted
> >+ dest will always be L'\0'-terminated, even if there isn't enough
> >room
> >+ for all the characters.
> >+ if sz = srcsz+1 (i.e. 4*srcsz+4 bytes), there will always be
> >enough space.
> >+*/
> >+int u8_toucs(u_int32_t *dest, int sz, char *src, int srcsz)
> >+{
> >+ u_int32_t ch;
> >+ char *src_end = src + srcsz;
> >+ int nb;
> >+ int i=0;
> >+
> >+ while (i < sz-1) {
> >+ nb = trailingBytesForUTF8[(unsigned char)*src];
> >+ if (srcsz == -1) {
> >+ if (*src == 0)
> >+ goto done_toucs;
> >+ }
> >+ else {
> >+ if (src + nb >= src_end)
> >+ goto done_toucs;
> >+ }
> >+ ch = 0;
> >+ switch (nb) {
> >+ /* these fall through deliberately */
> >+#if UTF8_SUPPORT_FULL_UCS4
> >+ case 5: ch += (unsigned char)*src++; ch <<= 6;
> >+ case 4: ch += (unsigned char)*src++; ch <<= 6;
> >+#endif
> >+ case 3: ch += (unsigned char)*src++; ch <<= 6;
> >+ case 2: ch += (unsigned char)*src++; ch <<= 6;
> >+ case 1: ch += (unsigned char)*src++; ch <<= 6;
> >+ case 0: ch += (unsigned char)*src++;
> >+ }
> >+ ch -= offsetsFromUTF8[nb];
> >+ dest[i++] = ch;
> >+ }
> >+ done_toucs:
> >+ dest[i] = 0;
> >+ return i;
> >+}
> >+
> >+/* srcsz = number of source characters, or -1 if 0-terminated
> >+ sz = size of dest buffer in bytes
> >+
> >+ returns # characters converted
> >+ dest will only be '\0'-terminated if there is enough space. this
> >is
> >+ for consistency; imagine there are 2 bytes of space left, but
> >the next
> >+ character requires 3 bytes. in this case we could NUL-terminate,
> >but in
> >+ general we can't when there's insufficient space. therefore this
> >function
> >+ only NUL-terminates if all the characters fit, and there's space
> >for
> >+ the NUL as well.
> >+ the destination string will never be bigger than the source
> >string.
> >+*/
> >+int u8_toutf8(char *dest, int sz, u_int32_t *src, int srcsz)
> >+{
> >+ u_int32_t ch;
> >+ int i = 0;
> >+ char *dest_end = dest + sz;
> >+
> >+ while (srcsz<0 ? src[i]!=0 : i < srcsz) {
> >+ ch = src[i];
> >+ if (ch < 0x80) {
> >+ if (dest >= dest_end)
> >+ return i;
> >+ *dest++ = (char)ch;
> >+ }
> >+ else if (ch < 0x800) {
> >+ if (dest >= dest_end-1)
> >+ return i;
> >+ *dest++ = (ch>>6) | 0xC0;
> >+ *dest++ = (ch & 0x3F) | 0x80;
> >+ }
> >+ else if (ch < 0x10000) {
> >+ if (dest >= dest_end-2)
> >+ return i;
> >+ *dest++ = (ch>>12) | 0xE0;
> >+ *dest++ = ((ch>>6) & 0x3F) | 0x80;
> >+ *dest++ = (ch & 0x3F) | 0x80;
> >+ }
> >+ else if (ch < 0x110000) {
> >+ if (dest >= dest_end-3)
> >+ return i;
> >+ *dest++ = (ch>>18) | 0xF0;
> >+ *dest++ = ((ch>>12) & 0x3F) | 0x80;
> >+ *dest++ = ((ch>>6) & 0x3F) | 0x80;
> >+ *dest++ = (ch & 0x3F) | 0x80;
> >+ }
> >+ i++;
> >+ }
> >+ if (dest < dest_end)
> >+ *dest = '\0';
> >+ return i;
> >+}
> >+
> >+/* moo: get byte length of character number, or 0 if not supported */
> >+int u8_wc_nbytes(u_int32_t ch)
> >+{
> >+ if (ch < 0x80) return 1;
> >+ if (ch < 0x800) return 2;
> >+ if (ch < 0x10000) return 3;
> >+ if (ch < 0x200000) return 4;
> >+#if UTF8_SUPPORT_FULL_UCS4
> >+ /*-- moo: support full UCS-4 range? --*/
> >+ if (ch < 0x4000000) return 5;
> >+ if (ch < 0x7fffffffUL) return 6;
> >+#endif
> >+ return 0; /*-- bad input --*/
> >+}
> >+
> >+int u8_wc_toutf8(char *dest, u_int32_t ch)
> >+{
> >+ if (ch < 0x80) {
> >+ dest[0] = (char)ch;
> >+ return 1;
> >+ }
> >+ if (ch < 0x800) {
> >+ dest[0] = (ch>>6) | 0xC0;
> >+ dest[1] = (ch & 0x3F) | 0x80;
> >+ return 2;
> >+ }
> >+ if (ch < 0x10000) {
> >+ dest[0] = (ch>>12) | 0xE0;
> >+ dest[1] = ((ch>>6) & 0x3F) | 0x80;
> >+ dest[2] = (ch & 0x3F) | 0x80;
> >+ return 3;
> >+ }
> >+ if (ch < 0x110000) {
> >+ dest[0] = (ch>>18) | 0xF0;
> >+ dest[1] = ((ch>>12) & 0x3F) | 0x80;
> >+ dest[2] = ((ch>>6) & 0x3F) | 0x80;
> >+ dest[3] = (ch & 0x3F) | 0x80;
> >+ return 4;
> >+ }
> >+ return 0;
> >+}
> >+
> >+/*-- moo --*/
> >+int u8_wc_toutf8_nul(char *dest, u_int32_t ch)
> >+{
> >+ int sz = u8_wc_toutf8(dest,ch);
> >+ dest[sz] = '\0';
> >+ return sz;
> >+}
> >+
> >+/* charnum => byte offset */
> >+int u8_offset(char *str, int charnum)
> >+{
> >+ int offs=0;
> >+
> >+ while (charnum > 0 && str[offs]) {
> >+ (void)(isutf(str[++offs]) || isutf(str[++offs]) ||
> >+ isutf(str[++offs]) || ++offs);
> >+ charnum--;
> >+ }
> >+ return offs;
> >+}
> >+
> >+/* byte offset => charnum */
> >+int u8_charnum(char *s, int offset)
> >+{
> >+ int charnum = 0, offs=0;
> >+
> >+ while (offs < offset && s[offs]) {
> >+ (void)(isutf(s[++offs]) || isutf(s[++offs]) ||
> >+ isutf(s[++offs]) || ++offs);
> >+ charnum++;
> >+ }
> >+ return charnum;
> >+}
> >+
> >+/* reads the next utf-8 sequence out of a string, updating an index
> >*/
> >+u_int32_t u8_nextchar(char *s, int *i)
> >+{
> >+ u_int32_t ch = 0;
> >+ int sz = 0;
> >+
> >+ do {
> >+ ch <<= 6;
> >+ ch += (unsigned char)s[(*i)++];
> >+ sz++;
> >+ } while (s[*i] && !isutf(s[*i]));
> >+ ch -= offsetsFromUTF8[sz-1];
> >+
> >+ return ch;
> >+}
> >+
> >+/* number of characters */
> >+int u8_strlen(char *s)
> >+{
> >+ int count = 0;
> >+ int i = 0;
> >+
> >+ while (u8_nextchar(s, &i) != 0)
> >+ count++;
> >+
> >+ return count;
> >+}
> >+
> >+void u8_inc(char *s, int *i)
> >+{
> >+ (void)(isutf(s[++(*i)]) || isutf(s[++(*i)]) ||
> >+ isutf(s[++(*i)]) || ++(*i));
> >+}
> >+
> >+void u8_dec(char *s, int *i)
> >+{
> >+ (void)(isutf(s[--(*i)]) || isutf(s[--(*i)]) ||
> >+ isutf(s[--(*i)]) || --(*i));
> >+}
> >+
> >+/*-- moo --*/
> >+void u8_inc_ptr(char **sp)
> >+{
> >+ (void)(isutf(*(++(*sp))) || isutf(*(++(*sp))) ||
> >+ isutf(*(++(*sp))) || ++(*sp));
> >+}
> >+
> >+/*-- moo --*/
> >+void u8_dec_ptr(char **sp)
> >+{
> >+ (void)(isutf(*(--(*sp))) || isutf(*(--(*sp))) ||
> >+ isutf(*(--(*sp))) || --(*sp));
> >+}
> >Index: src/g_rtext.c
> >===================================================================
> >--- src/g_rtext.c (revision 13051)
> >+++ src/g_rtext.c (working copy)
> >@@ -13,6 +13,7 @@
> >#include "m_pd.h"
> >#include "s_stuff.h"
> >#include "g_canvas.h"
> >+#include "s_utf8.h"
> >
> >
> >#define LMARGIN 2
> >@@ -32,10 +33,10 @@
> >
> >struct _rtext
> >{
> >- char *x_buf;
> >- int x_bufsize;
> >- int x_selstart;
> >- int x_selend;
> >+ char *x_buf; /*-- raw byte string, assumed UTF-8 encoded
> >(moo) --*/
> >+ int x_bufsize; /*-- byte length --*/
> >+ int x_selstart; /*-- byte offset --*/
> >+ int x_selend; /*-- byte offset --*/
> > int x_active;
> > int x_dragfrom;
> > int x_height;
> >@@ -119,6 +120,15 @@
> >
> >/* LATER deal with tcl-significant characters */
> >
> >+/* firstone(), lastone()
> >+ * + returns byte offset of (first|last) occurrence of 'c' in
> >'s[0..n-1]', or
> >+ * -1 if none was found
> >+ * + 's' is a raw byte string
> >+ * + 'c' is a byte value
> >+ * + 'n' is the length (in bytes) of the prefix of 's' to be
> >searched.
> >+ * + we could make these functions work on logical characters in
> >utf8 strings,
> >+ * but we don't really need to...
> >+ */
> >static int firstone(char *s, int c, int n)
> >{
> > char *s2 = s + n;
> >@@ -155,6 +165,16 @@
> > of the entire text in pixels.
> > */
> >
> >+ /*-- moo:
> >+ * + some variables from the original version have been renamed
> >+ * + variables with a "_b" suffix are raw byte strings, lengths,
> >or offsets
> >+ * + variables with a "_c" suffix are logical character lengths
> >or offsets
> >+ * (assuming valid UTF-8 encoded byte string in x->x_buf)
> >+ * + a fair amount of O(n) computations required to convert
> >between raw byte
> >+ * offsets (needed by the C side) and logical character
> >offsets (needed by
> >+ * the GUI)
> >+ */
> >+
> > /* LATER get this and sys_vgui to work together properly,
> > breaking up messages as needed. As of now, there's
> > a limit of 1950 characters, imposed by sys_vgui(). */
> >@@ -171,14 +191,16 @@
> >{
> > t_float dispx, dispy;
> > char smallbuf[200], *tempbuf;
> >- int outchars = 0, nlines = 0, ncolumns = 0,
> >+ int outchars_b = 0, nlines = 0, ncolumns = 0,
> > pixwide, pixhigh, font, fontwidth, fontheight, findx, findy;
> > int reportedindex = 0;
> > t_canvas *canvas = glist_getcanvas(x->x_glist);
> >- int widthspec = x->x_text->te_width;
> >- int widthlimit = (widthspec ? widthspec : BOXWIDTH);
> >- int inindex = 0;
> >- int selstart = 0, selend = 0;
> >+ int widthspec_c = x->x_text->te_width;
> >+ int widthlimit_c = (widthspec_c ? widthspec_c : BOXWIDTH);
> >+ int inindex_b = 0;
> >+ int inindex_c = 0;
> >+ int selstart_b = 0, selend_b = 0;
> >+ int x_bufsize_c = u8_charnum(x->x_buf, x->x_bufsize);
> > /* if we're a GOP (the new, "goprect" style) borrow the font
> >size
> > from the inside to preserve the spacing */
> > if (pd_class(&x->x_text->te_pd) == canvas_class &&
> >@@ -193,65 +215,76 @@
> > if (x->x_bufsize >= 100)
> > tempbuf = (char *)t_getbytes(2 * x->x_bufsize + 1);
> > else tempbuf = smallbuf;
> >- while (x->x_bufsize - inindex > 0)
> >+ while (x_bufsize_c - inindex_c > 0)
> > {
> >- int inchars = x->x_bufsize - inindex;
> >- int maxindex = (inchars > widthlimit ? widthlimit : inchars);
> >+ int inchars_b = x->x_bufsize - inindex_b;
> >+ int inchars_c = x_bufsize_c - inindex_c;
> >+ int maxindex_c = (inchars_c > widthlimit_c ? widthlimit_c :
> >inchars_c);
> >+ int maxindex_b = u8_offset(x->x_buf + inindex_b, maxindex_c);
> > int eatchar = 1;
> >- int foundit = firstone(x->x_buf + inindex, '\n', maxindex);
> >- if (foundit < 0)
> >+ int foundit_b = firstone(x->x_buf + inindex_b, '\n',
> >maxindex_b);
> >+ int foundit_c;
> >+ if (foundit_b < 0)
> > {
> >- if (inchars > widthlimit)
> >+ if (inchars_c > widthlimit_c)
> > {
> >- foundit = lastone(x->x_buf + inindex, ' ', maxindex);
> >- if (foundit < 0)
> >+ foundit_b = lastone(x->x_buf + inindex_b, ' ',
> >maxindex_b);
> >+ if (foundit_b < 0)
> > {
> >- foundit = maxindex;
> >+ foundit_b = maxindex_b;
> >+ foundit_c = maxindex_c;
> > eatchar = 0;
> > }
> >+ else
> >+ foundit_c = u8_charnum(x->x_buf + inindex_b,
> >foundit_b);
> > }
> > else
> > {
> >- foundit = inchars;
> >+ foundit_b = inchars_b;
> >+ foundit_c = inchars_c;
> > eatchar = 0;
> > }
> > }
> >+ else
> >+ foundit_c = u8_charnum(x->x_buf + inindex_b, foundit_b);
> >+
> > if (nlines == findy)
> > {
> > int actualx = (findx < 0 ? 0 :
> >- (findx > foundit ? foundit : findx));
> >- *indexp = inindex + actualx;
> >+ (findx > foundit_c ? foundit_c : findx));
> >+ *indexp = inindex_b + u8_offset(x->x_buf + inindex_b,
> >actualx);
> > reportedindex = 1;
> > }
> >- strncpy(tempbuf+outchars, x->x_buf + inindex, foundit);
> >- if (x->x_selstart >= inindex &&
> >- x->x_selstart <= inindex + foundit + eatchar)
> >- selstart = x->x_selstart + outchars - inindex;
> >- if (x->x_selend >= inindex &&
> >- x->x_selend <= inindex + foundit + eatchar)
> >- selend = x->x_selend + outchars - inindex;
> >- outchars += foundit;
> >- inindex += (foundit + eatchar);
> >- if (inindex < x->x_bufsize)
> >- tempbuf[outchars++] = '\n';
> >- if (foundit > ncolumns)
> >- ncolumns = foundit;
> >+ strncpy(tempbuf+outchars_b, x->x_buf + inindex_b, foundit_b);
> >+ if (x->x_selstart >= inindex_b &&
> >+ x->x_selstart <= inindex_b + foundit_b + eatchar)
> >+ selstart_b = x->x_selstart + outchars_b - inindex_b;
> >+ if (x->x_selend >= inindex_b &&
> >+ x->x_selend <= inindex_b + foundit_b + eatchar)
> >+ selend_b = x->x_selend + outchars_b - inindex_b;
> >+ outchars_b += foundit_b;
> >+ inindex_b += (foundit_b + eatchar);
> >+ inindex_c += (foundit_c + eatchar);
> >+ if (inindex_b < x->x_bufsize)
> >+ tempbuf[outchars_b++] = '\n';
> >+ if (foundit_c > ncolumns)
> >+ ncolumns = foundit_c;
> > nlines++;
> > }
> > if (!reportedindex)
> >- *indexp = outchars;
> >+ *indexp = outchars_b;
> > dispx = text_xpix(x->x_text, x->x_glist);
> > dispy = text_ypix(x->x_text, x->x_glist);
> > if (nlines < 1) nlines = 1;
> >- if (!widthspec)
> >+ if (!widthspec_c)
> > {
> > while (ncolumns < 3)
> > {
> >- tempbuf[outchars++] = ' ';
> >+ tempbuf[outchars_b++] = ' ';
> > ncolumns++;
> > }
> > }
> >- else ncolumns = widthspec;
> >+ else ncolumns = widthspec_c;
> > pixwide = ncolumns * fontwidth + (LMARGIN + RMARGIN);
> > pixhigh = nlines * fontheight + (TMARGIN + BMARGIN);
> >
> >@@ -259,31 +292,32 @@
> > sys_vgui("pdtk_text_new .x%lx.c {%s %s text} %f %f {%.*s} %d
> >%s\n",
> > canvas, x->x_tag, rtext_gettype(x)->s_name,
> > dispx + LMARGIN, dispy + TMARGIN,
> >- outchars, tempbuf, sys_hostfontsize(font),
> >+ outchars_b, tempbuf, sys_hostfontsize(font),
> > (glist_isselected(x->x_glist,
> > &x->x_glist->gl_gobj)? "blue" : "black"));
> > else if (action == SEND_UPDATE)
> > {
> > sys_vgui("pdtk_text_set .x%lx.c %s {%.*s}\n",
> >- canvas, x->x_tag, outchars, tempbuf);
> >+ canvas, x->x_tag, outchars_b, tempbuf);
> > if (pixwide != x->x_drawnwidth || pixhigh != x->x_drawnheight)
> > text_drawborder(x->x_text, x->x_glist, x->x_tag,
> > pixwide, pixhigh, 0);
> > if (x->x_active)
> > {
> >- if (selend > selstart)
> >+ if (selend_b > selstart_b)
> > {
> > sys_vgui(".x%lx.c select from %s %d\n", canvas,
> >- x->x_tag, selstart);
> >+ x->x_tag, u8_charnum(x->x_buf, selstart_b));
> > sys_vgui(".x%lx.c select to %s %d\n", canvas,
> >- x->x_tag, selend + (sys_oldtclversion ? 0 : -1));
> >+ x->x_tag, u8_charnum(x->x_buf, selend_b)
> >+ + (sys_oldtclversion ? 0 : -1));
> > sys_vgui(".x%lx.c focus \"\"\n", canvas);
> > }
> > else
> > {
> > sys_vgui(".x%lx.c select clear\n", canvas);
> > sys_vgui(".x%lx.c icursor %s %d\n", canvas, x->x_tag,
> >- selstart);
> >+ u8_charnum(x->x_buf, selstart_b));
> > sys_vgui(".x%lx.c focus %s\n", canvas, x->x_tag);
> > }
> > }
> >@@ -448,12 +482,12 @@
> > ....
> > } */
> > if (x->x_selstart && (x->x_selstart == x->x_selend))
> >- x->x_selstart--;
> >+ u8_dec(x->x_buf, &x->x_selstart);
> > }
> > else if (n == 127) /* delete */
> > {
> > if (x->x_selend < x->x_bufsize && (x->x_selstart == x-
> >>x_selend))
> >- x->x_selend++;
> >+ u8_inc(x->x_buf, &x->x_selend);
> > }
> >
> > ndel = x->x_selend - x->x_selstart;
> >@@ -466,7 +500,13 @@
> >/* at Guenter's suggestion, use 'n>31' to test wither a character
> >might
> >be printable in whatever 8-bit character set we find ourselves. */
> >
> >- if (n == '\n' || (n > 31 && n != 127))
> >+/*-- moo:
> >+ ... but test with "<" rather than "!=" in order to accomodate
> >unicode
> >+ codepoints for n (which we get since Tk is sending the "%A"
> >substitution
> >+ for bind <Key>), effectively reducing the coverage of this clause
> >to 7
> >+ bits. Case n>127 is covered by the next clause.
> >+*/
> >+ if (n == '\n' || (n > 31 && n < 127))
> > {
> > newsize = x->x_bufsize+1;
> > x->x_buf = resizebytes(x->x_buf, x->x_bufsize, newsize);
> >@@ -476,20 +516,39 @@
> > x->x_bufsize = newsize;
> > x->x_selstart = x->x_selstart + 1;
> > }
> >+ /*--moo: check for unicode codepoints beyond 7-bit ASCII --*/
> >+ else if (n > 127)
> >+ {
> >+ int ch_nbytes = u8_wc_nbytes(n);
> >+ newsize = x->x_bufsize + ch_nbytes;
> >+ x->x_buf = resizebytes(x->x_buf, x->x_bufsize, newsize);
> >+ for (i = x->x_bufsize; i > x->x_selstart; i--)
> >+ x->x_buf[i] = x->x_buf[i-1];
> >+ x->x_bufsize = newsize;
> >+ /*-- moo: assume canvas_key() has encoded keysym as
> >UTF-8 */
> >+ strncpy(x->x_buf+x->x_selstart, keysym->s_name,
> >ch_nbytes);
> >+ x->x_selstart = x->x_selstart + ch_nbytes;
> >+ }
> > x->x_selend = x->x_selstart;
> > x->x_glist->gl_editor->e_textdirty = 1;
> > }
> > else if (!strcmp(keysym->s_name, "Right"))
> > {
> > if (x->x_selend == x->x_selstart && x->x_selstart < x-
> >>x_bufsize)
> >- x->x_selend = x->x_selstart = x->x_selstart + 1;
> >+ {
> >+ u8_inc(x->x_buf, &x->x_selstart);
> >+ x->x_selend = x->x_selstart;
> >+ }
> > else
> > x->x_selstart = x->x_selend;
> > }
> > else if (!strcmp(keysym->s_name, "Left"))
> > {
> > if (x->x_selend == x->x_selstart && x->x_selstart > 0)
> >- x->x_selend = x->x_selstart = x->x_selstart - 1;
> >+ {
> >+ u8_dec(x->x_buf, &x->x_selstart);
> >+ x->x_selend = x->x_selstart;
> >+ }
> > else
> > x->x_selend = x->x_selstart;
> > }
> >@@ -497,18 +556,18 @@
> > else if (!strcmp(keysym->s_name, "Up"))
> > {
> > if (x->x_selstart)
> >- x->x_selstart--;
> >+ u8_dec(x->x_buf, &x->x_selstart);
> > while (x->x_selstart > 0 && x->x_buf[x->x_selstart] != '\n')
> >- x->x_selstart--;
> >+ u8_dec(x->x_buf, &x->x_selstart);
> > x->x_selend = x->x_selstart;
> > }
> > else if (!strcmp(keysym->s_name, "Down"))
> > {
> > while (x->x_selend < x->x_bufsize &&
> > x->x_buf[x->x_selend] != '\n')
> >- x->x_selend++;
> >+ u8_inc(x->x_buf, &x->x_selend);
> > if (x->x_selend < x->x_bufsize)
> >- x->x_selend++;
> >+ u8_inc(x->x_buf, &x->x_selend);
> > x->x_selstart = x->x_selend;
> > }
> > rtext_senditup(x, SEND_UPDATE, &w, &h, &indx);
> >Index: src/s_utf8.h
> >===================================================================
> >--- src/s_utf8.h (revision 0)
> >+++ src/s_utf8.h (revision 0)
> >@@ -0,0 +1,88 @@
> >+#ifndef S_UTF8_H
> >+#define S_UTF8_H
> >+
> >+/*--moo--*/
> >+#ifndef u_int32_t
> >+# define u_int32_t unsigned int
> >+#endif
> >+
> >+#ifndef UCS4
> >+# define UCS4 u_int32_t
> >+#endif
> >+
> >+/* UTF8_SUPPORT_FULL_UCS4
> >+ * define this to support the full potential range of UCS-4
> >codepoints
> >+ * (in anticipation of a future UTF-8 standard)
> >+ */
> >+/*#define UTF8_SUPPORT_FULL_UCS4 1*/
> >+#undef UTF8_SUPPORT_FULL_UCS4
> >+
> >+/* UTF8_MAXBYTES
> >+ * maximum number of bytes required to represent a single
> >character in UTF-8
> >+ *
> >+ * UTF8_MAXBYTES1 = UTF8_MAXBYTES+1
> >+ * maximum bytes per character including NUL terminator
> >+ */
> >+#ifdef UTF8_SUPPORT_FULL_UCS4
> >+# ifndef UTF8_MAXBYTES
> >+# define UTF8_MAXBYTES 6
> >+# endif
> >+# ifndef UTF8_MAXBYTES1
> >+# define UTF8_MAXBYTES1 7
> >+# endif
> >+#else
> >+# ifndef UTF8_MAXBYTES
> >+# define UTF8_MAXBYTES 4
> >+# endif
> >+# ifndef UTF8_MAXBYTES1
> >+# define UTF8_MAXBYTES1 5
> >+# endif
> >+#endif
> >+/*--/moo--*/
> >+
> >+/* is c the start of a utf8 sequence? */
> >+#define isutf(c) (((c)&0xC0)!=0x80)
> >+
> >+/* convert UTF-8 data to wide character */
> >+int u8_toucs(u_int32_t *dest, int sz, char *src, int srcsz);
> >+
> >+/* the opposite conversion */
> >+int u8_toutf8(char *dest, int sz, u_int32_t *src, int srcsz);
> >+
> >+/* moo: get byte length of character number, or 0 if not supported */
> >+int u8_wc_nbytes(u_int32_t ch);
> >+
> >+/* moo: compute required storage for UTF-8 encoding of 's[0..n-1]' */
> >+int u8_wcs_nbytes(u_int32_t *ucs, int size);
> >+
> >+/* single character to UTF-8, no NUL termination */
> >+int u8_wc_toutf8(char *dest, u_int32_t ch);
> >+
> >+/* moo: single character to UTF-8, with NUL termination */
> >+int u8_wc_toutf8_nul(char *dest, u_int32_t ch);
> >+
> >+/* character number to byte offset */
> >+int u8_offset(char *str, int charnum);
> >+
> >+/* byte offset to character number */
> >+int u8_charnum(char *s, int offset);
> >+
> >+/* return next character, updating an index variable */
> >+u_int32_t u8_nextchar(char *s, int *i);
> >+
> >+/* move to next character */
> >+void u8_inc(char *s, int *i);
> >+
> >+/* move to previous character */
> >+void u8_dec(char *s, int *i);
> >+
> >+/* moo: move pointer to next character */
> >+void u8_inc_ptr(char **sp);
> >+
> >+/* moo: move pointer to previous character */
> >+void u8_dec_ptr(char **sp);
> >+
> >+/* returns length of next utf-8 sequence */
> >+int u8_seqlen(char *s);
> >+
> >+#endif /* S_UTF8_H */
> ><test-utf8.pd>
>
>
>
>
>
> ----------------------------------------------------------------------------
>
> "[T]he greatest purveyor of violence in the world today [is] my own
> government." - Martin Luther King, Jr.
>
>
>
>
> _______________________________________________
> Pd-dev mailing list
> Pd-dev at iem.at
> http://lists.puredata.info/listinfo/pd-dev
More information about the Pd-dev
mailing list