[PD] UTF-8 for pd-devel WAS: locales for Pd WAS: japanese encoded chars in PD

Fri Feb 20 11:53:26 CET 2009

moin all,

On 2009-02-20 06:20:18, Hans-Christoph Steiner <hans at eds.org> appears to
have written:
> On Feb 19, 2009, at 4:13 PM, Bryan Jurish wrote:
>> moin Hans, moin list, On 2009-02-19 18:43:49, Hans-Christoph
>> Steiner <hans at eds.org> appears to have written:
>>> One other thing, it seems that the ASCII char are handled
>>> differently than the UTF-8 chars in g_rtext.c, I think you could
>>> use instead wcswidth(), mbstowcs() or other UTF-8 functions as
>>> described in the UTF-8 FAQ
>>> 
>>> http://www.cl.cam.ac.uk/~mgk25/unicode.html#mod
>> 
>> Certainly, but (A) we already have the UTF-8 byte string in keysym,
>> and we need to append that whole string to the buffer anyways, and
>> (B) using wcswidth() & co requires forcing the locale to have a
>> UTF-8 LC_CTYPE.  I know I did this in m_pd.c, but I think that was
>> a HACK and that using locale functions here is the Wrong Way To Do
>> It, because it's dangerous, unportable, and slow (warning: rant
>> follows):
>> 
>> __dangerous__: setting the locale is global for all threads of a 
>> process; in  forcing the locale, we could conceivably mess with
>> desired behavior elsewhere (e.g. in externals).
>> 
>> __unportable__: we don't even know if all users' machines *have* a
>> UTF-8 locale installed, and even if they do, we don't know what
>> it's called. If we don't force the encoding, we're stuck with
>> either "C" (e.g. ASCII; what we've got now in Pd-vanilla), or
>> whatever the user is currently employing (after
>> setlocale(LC_ALL,"")), which makes patches' appearance dependent on
>> the user's encoding (e.g. what we've got now in Pd-vanilla), and
>> doesn't even work in the case of variable-length encodings such as
>> UTF-8.
>> 
>> __slow__: many locale-based conversion functions are known to be
>> pretty darned slow.  if we assume we're always dealing with (valid)
>> UTF-8, we can speed things up considerably.  going straight to
>> wchar_t is another option, but would require many more changes on
>> the C side, likely break the C API, and wouldn't solve the
>> locale-dependency of patches' appearances, which I think is a
>> really good argument for UTF-8.
> 
> Isn't it pretty safe to assume these days that UTF-8 is supported?

Yes, but under what name?  Also, I believe the relevant locale variable
(LC_CTYPE) requires a language component prior to the charmap, and we
cannot guarantee that e.g. "en_US" is installed everywhere.  The only
locale guaranteed to be installed everywhere is "C", and that determines
language and charmap simultaneously.

Also, the "dangerous" property is impossible to get around, unless maybe
we treat the locale like a stack and only force
LC_CTYPE="(whatever).UTF-8" in code where we know we want/need UTF-8.  I
suspect this might slow things down enormously (although I haven't
tested exactly what kind of overhead is involved).  Adding threads to
the picture means that we would have to add locking on LC_CTYPE (or
similar) and that would only work if hypothetical locale-sensitive
externals respected the same locks.  All in all more trouble than it's
worth, IM(ns)HO.
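
For illustration only, here is a minimal sketch of the "locale as a stack" idea using nothing but standard setlocale().  The helper name with_ctype_locale() and the callback count_call() are hypothetical (they are not part of Pd or of any patch), and the sketch does no locking at all, so it has exactly the thread problem described above:

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of Pd): run fn(arg) with LC_CTYPE forced
 * to `name`, then restore whatever LC_CTYPE was active before.
 * Returns 0 on success, -1 if the locale could not be set (fn is then
 * not called).  NOT thread-safe: setlocale() is process-global. */
static int with_ctype_locale(const char *name, void (*fn)(void *), void *arg)
{
    const char *cur;
    char *saved;
    size_t n;
    int status = -1;

    assert(name != NULL && fn != NULL);

    /* Query the current setting; the returned string may be overwritten
     * by the next setlocale() call, so copy it first. */
    cur = setlocale(LC_CTYPE, NULL);
    if (!cur) return -1;
    n = strlen(cur) + 1;
    saved = malloc(n);
    if (!saved) return -1;
    memcpy(saved, cur, n);

    if (setlocale(LC_CTYPE, name)) {   /* "push" */
        fn(arg);
        status = 0;
    }
    setlocale(LC_CTYPE, saved);        /* "pop" */
    free(saved);
    return status;
}

/* Example callback: just counts how often it was invoked. */
static void count_call(void *arg)
{
    ++*(int *)arg;
}
```

A caller would still have to guess a locale name such as "en_US.UTF-8" and cope with the -1 case where that name is not installed, which is the portability problem all over again.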

> One thing I just found out is that Windows uses a 2-byte char
> natively (UCS-2?),

Probably.

> I think Mac OS X uses UTF-8 natively. 

... but not for wchar_t (which would be superfluous if sizeof(wchar_t)==1) !

> I think that most Linux tools should work with UTF-8 too, especially since it
> can work as ASCII.

Yes, but "working with" UTF-8 is by no means synonymous with supporting
a particular (and known) value of LC_CTYPE which happens to use UTF-8 as
its charmap.  Most text-processing tools "work with" UTF-8 because they
can get away with just churning bytes -- this is not the case for Pd
(which counts characters to move the selection, edit buffers, determine
box widths, and maybe more)...

> So you think we can have full UTF-8 support without using those
> functions?

In a word: yes.

Specifically, I think we can have full UTF-8 support without using those
functions *as provided by the C99 locale API*.  That amounts to rolling
our own versions of the same and/or similar functionality.  In
particular, the (utf8.c,utf8.h) code by Jeff Bezanson (see
http://www.cprogramming.com/tutorial/unicode.html) has some attractive
utilities for wrapping typical string-processing code (in particular,
u8_inc() and u8_dec() for adapting old byte-string processing code using
i++ and i--, respectively), in addition to wrappers for the usual
locale-style functionality:

 wcswidth() --> (trivial)   (I've written the code)
 mbstowcs() --> u8_toucs()  (I've actually got a version of this too)
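
To give an idea of how little is involved, here is a rough sketch of an mbstowcs()-like decoder.  It is not Bezanson's u8_toucs() and not the version mentioned above; the name utf8_to_ucs4() is made up for this example, the output type is assumed to be a 32-bit UCS-4 value, and the input is assumed to be valid UTF-8 (no validation is attempted):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative decoder (hypothetical name): decode at most `max` code
 * points from the NUL-terminated UTF-8 string `src` into `dst`.
 * Returns the number of code points written.
 * Assumes `src` is valid UTF-8. */
static size_t utf8_to_ucs4(uint32_t *dst, size_t max, const char *src)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t n = 0;

    assert(dst != NULL && src != NULL);

    while (*s && n < max) {
        uint32_t c = *s++;
        int extra;                     /* continuation bytes to read */

        if (c < 0x80)      { extra = 0; }              /* 0xxxxxxx */
        else if (c < 0xE0) { extra = 1; c &= 0x1F; }   /* 110xxxxx */
        else if (c < 0xF0) { extra = 2; c &= 0x0F; }   /* 1110xxxx */
        else               { extra = 3; c &= 0x07; }   /* 11110xxx */

        /* Each continuation byte (10xxxxxx) contributes 6 payload bits.
         * The second condition also stops us at a premature NUL. */
        while (extra-- > 0 && (*s & 0xC0) == 0x80)
            c = (c << 6) | (uint32_t)(*s++ & 0x3F);

        dst[n++] = c;
    }
    return n;
}
```

No locale is consulted anywhere, so the result is the same on every machine.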

Others of Bezanson's utilities (isutf8(), u8_offset(), u8_charnum(),
u8_nextchar()) are also potentially useful for adapting the C side, and
in some cases, I'm not even sure how to emulate them with the C locale
functions without converting the whole UTF-8 string to wchar_t, which I
think we can agree we do not want to do.  Presumably, Bezanson's
(public domain) code is safe for integration with anything, so I'll use
that for now, if no one objects.
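
As an illustration of the kind of helper meant here, the following is a small sketch in the spirit of u8_inc(), u8_dec(), u8_charnum() and u8_offset().  It is written from scratch for this example and is not Bezanson's code; the utf8_* names and signatures are assumptions, and the input is assumed to be valid, NUL-terminated UTF-8:

```c
#include <assert.h>
#include <stddef.h>

/* True if byte b is a UTF-8 continuation byte (10xxxxxx). */
#define IS_CONT(b) (((unsigned char)(b) & 0xC0) == 0x80)

/* Replacement for "i++": advance byte offset *i to the start of the
 * next character.  Must not be called at the terminating NUL. */
static void utf8_inc(const char *s, size_t *i)
{
    assert(s[*i] != '\0');
    do { ++*i; } while (IS_CONT(s[*i]));
}

/* Replacement for "i--": move byte offset *i back to the start of the
 * previous character.  Must not be called at offset 0. */
static void utf8_dec(const char *s, size_t *i)
{
    assert(*i > 0);
    do { --*i; } while (*i > 0 && IS_CONT(s[*i]));
}

/* Number of characters in the first nbytes bytes of s: every byte that
 * is not a continuation byte starts a new character. */
static size_t utf8_charnum(const char *s, size_t nbytes)
{
    size_t i, n = 0;
    for (i = 0; i < nbytes; ++i)
        if (!IS_CONT(s[i])) ++n;
    return n;
}

/* Byte offset of character number charnum (clamped to end of string). */
static size_t utf8_offset(const char *s, size_t charnum)
{
    size_t i = 0;
    while (charnum-- > 0 && s[i] != '\0')
        utf8_inc(s, &i);
    return i;
}
```

Old byte-oriented loops that step a cursor with i++ and i-- can then call utf8_inc(buf, &i) and utf8_dec(buf, &i) instead, without any conversion to wchar_t.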

>> That said, a faster implementation would probably result from
>> mixing (something like) wcswidth() and strncpy(...,keysym->s_name).
>> Functions like wcswidth() and mbstowcs() are pretty easy to cook up
>> if we assume wchar_t is UCS-4 and the multibyte encoding is UTF-8.
> 
> It seems to me that the wcswidth() would be used for measuring the 
> length of the text for display in boxes.  I suppose strlen() could
> still be used for allocating and freeing memory, but I think that we
> should aim for clean code.  If you think the current way in your diff
> is the best, that's fine by me.

Yep.  I suspect we won't get around adding an "x_bufchars" field (or
similar) to t_rtext (struct _rtext), to cache the length of the buffer
in logical characters, rather than bytes.  We can always compute the
former in O(n) by iterating over the buffer, but I think it will be
needed too often for that.
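
A minimal sketch of the caching idea, using a hypothetical stand-in struct rather than the real t_rtext (the struct name textbuf, its field names, and textbuf_append() are all invented for this example).  The point is only that the character count gets updated incrementally whenever bytes are appended, so nobody has to rescan the whole buffer:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the relevant part of an editable text box. */
typedef struct textbuf {
    char  *buf;       /* UTF-8 bytes, not NUL-terminated */
    size_t bufsize;   /* length of buf in bytes */
    size_t bufchars;  /* cached length of buf in characters */
} textbuf;

/* Count characters in nbytes bytes of valid UTF-8: every byte that is
 * not a continuation byte (10xxxxxx) starts a character. */
static size_t count_chars(const char *s, size_t nbytes)
{
    size_t i, n = 0;
    for (i = 0; i < nbytes; ++i)
        if (((unsigned char)s[i] & 0xC0) != 0x80) ++n;
    return n;
}

/* Append nbytes bytes of valid UTF-8 (whole characters only) and keep
 * the cached character count in sync.  Returns 0 on success, -1 if
 * memory allocation fails. */
static int textbuf_append(textbuf *tb, const char *s, size_t nbytes)
{
    char *p;

    assert(tb != NULL && s != NULL);
    if (nbytes == 0) return 0;

    p = realloc(tb->buf, tb->bufsize + nbytes);
    if (!p) return -1;
    memcpy(p + tb->bufsize, s, nbytes);

    tb->buf = p;
    tb->bufsize  += nbytes;                   /* O(nbytes) ...        */
    tb->bufchars += count_chars(s, nbytes);   /* ... not O(bufsize)   */
    return 0;
}
```

Deletion would need the mirror-image update, and any code that writes to the buffer directly would have to be taught to maintain the cached count as well.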

To clarify:
(1) I think my use of locale-dependent functions (sprintf("%C",...)) to
prepare a string for gensym() is sick bad ugly and wrong: it is only a
temporary solution, which should be replaced by locale-independent,
UTF-8-specific code analogous to wctomb(), such as Bezanson's
u8_wc_toutf8().  (The "_wc_" infix implies wchar_t, but the code actually
assumes that the "wc" parameter is UCS-4, which we cannot guarantee for
system-dependent wctomb() implementations.)  I only used the locale
functions because glibc appears to use UCS-4 as wchar_t, and I wanted to
get a clear picture of where the (other) problems lay.  The Tk bind()
manpage says that the "%A" substitution (which Pd is getting as 'keynum')
is replaced by "the UNICODE character corresponding to the event", but
afaik the C99 standard does not require that wchar_t contain Unicode
values: it can be any libc-dependent fixed-width wide-character
encoding... chalk that one up under "unportable".

(2) I think using strncpy(buf,keysym->s_name) is safe and portable and
unlikely to cause any difficulties, although it might be prettier to
replace it with (another) call to wctomb(): that's just an
aesthetic/efficiency issue, as far as I'm concerned.
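
For reference, a locale-independent replacement for the wctomb() step in (1) could look roughly like the sketch below.  It is not Bezanson's u8_wc_toutf8(); the name ucs4_to_utf8() and its signature are invented for this example, and the input is assumed to be a UCS-4 code point (e.g. the key number delivered by Tk), independent of whatever wchar_t happens to be on the platform:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative encoder (hypothetical name): write code point c as UTF-8
 * into dst, which must have room for at least 4 bytes.  No NUL is
 * appended.  Returns the number of bytes written, or 0 if c is a
 * surrogate or lies beyond U+10FFFF. */
static size_t ucs4_to_utf8(char *dst, uint32_t c)
{
    unsigned char *d = (unsigned char *)dst;

    assert(dst != NULL);

    if (c < 0x80) {                       /* 1 byte: plain ASCII */
        d[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {                      /* 2 bytes */
        d[0] = (unsigned char)(0xC0 | (c >> 6));
        d[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    if (c < 0x10000) {                    /* 3 bytes */
        if (c >= 0xD800 && c <= 0xDFFF)   /* surrogates are not characters */
            return 0;
        d[0] = (unsigned char)(0xE0 | (c >> 12));
        d[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        d[2] = (unsigned char)(0x80 | (c & 0x3F));
        return 3;
    }
    if (c <= 0x10FFFF) {                  /* 4 bytes */
        d[0] = (unsigned char)(0xF0 | (c >> 18));
        d[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
        d[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        d[3] = (unsigned char)(0x80 | (c & 0x3F));
        return 4;
    }
    return 0;
}
```

The caller would NUL-terminate the result itself before handing it to something like gensym().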

marmosets,
	Bryan

-- 
Bryan Jurish                           "There is *always* one more bug."
jurish at ling.uni-potsdam.de      -Lubarsky's Law of Cybernetic Entomology