[PD] UTF-8 for pd-devel WAS: locales for Pd WAS: japanese encoded chars in PD

Thu Feb 19 22:13:41 CET 2009

moin Hans, moin list,

On 2009-02-19 18:43:49, Hans-Christoph Steiner <hans at eds.org> appears to
have written:
> 
> This is good news!  While the C changes aren't dead simple, they are not
> bad.  I think they could be slightly simplified.  One thing that would
> make it much easier to read the diff is if you create it without
> whitespace changes.  So like this:
> 
> svn diff -x -w

oops, sorry... duly noted for future diffs ... I also set my emacs'
tcl-indent-width to 8 ... sorry sorry sorry ...

> As for the Tcl changes, I think we can include those now in Pd-devel, as
> long they can work ok with unchanged C code.

Done.

> Then once the new Tcl GUI
> is included we can refactor the C side of things with things like this. 

> One other thing, it seems that the ASCII char are handled differently
> than the UTF-8 chars in g_rtext.c, I think you could use instead
> wcswidth(), mbstowcs() or other UTF-8 functions as described in the
> UTF-8 FAQ
>
> http://www.cl.cam.ac.uk/~mgk25/unicode.html#mod

Certainly, but (A) we already have the UTF-8 byte string in keysym, and
we need to append that whole string to the buffer anyways, and (B)
using wcswidth() & co requires forcing the locale to have a UTF-8
LC_CTYPE.  I know I did this in m_pd.c, but I think that was a HACK and
that using locale functions here is the Wrong Way To Do It, because it's
dangerous, unportable, and slow (warning: rant follows):

__dangerous__: setting the locale is global for all threads of a
process; by forcing the locale, we could conceivably mess with desired
behavior elsewhere (e.g. in externals).

__unportable__: we don't even know if all users' machines *have* a UTF-8
locale installed, and even if they do, we don't know what it's called.
If we don't force the encoding, we're stuck with either "C" (i.e. ASCII;
what we've got now in Pd-vanilla), or whatever the user is currently
employing (after setlocale(LC_ALL,"")).  The latter makes patches'
appearance dependent on the user's encoding (i.e. what we've got now in
Pd-vanilla), and doesn't even work in the case of variable-length
encodings such as UTF-8.

__slow__: many locale-based conversion functions are known to be pretty
darned slow.  If we assume we're always dealing with (valid) UTF-8, we
can speed things up considerably.  Going straight to wchar_t is another
option, but would require many more changes on the C side, likely break
the C API, and wouldn't solve the locale-dependency of patches'
appearance; getting rid of that dependency is, I think, a really good
argument for UTF-8.

(rant finished now, sorry)

That said, a faster implementation would probably result from mixing
(something like) wcswidth() and strncpy(...,keysym).  Functions like
wcswidth() and mbstowcs() are pretty easy to cook up if we assume
wchar_t is UCS-4 and the multibyte encoding is UTF-8.  There are a
number of libraries and code snippets floating about on the net making
just such assumptions.  In this context: are there any licensing
restrictions on code included in pd-devel?  So far, I've found one
useful-looking (.c,.h) pair in the public domain, as well as some LGPL
code from gnulib, which could be linked in statically.  There's also
code from the Unicode Consortium themselves, but it's pretty monstrous
(read "pedantic") and limited to string-to-string conversions.
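
To give a rough idea of what "easy to cook up" means here, a minimal
sketch in C follows.  It assumes the input is always valid UTF-8 and
uses a 32-bit integer as a stand-in for a UCS-4 wchar_t.  The names
u8_charcount() and u8_decode() are made up for illustration; they are
not taken from Pd or from any of the libraries mentioned above.  Note
also that counting code points is only an approximation of what
wcswidth() does, since it ignores double-width and zero-width
characters.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count code points in a NUL-terminated string.
 * ASSUMES valid UTF-8: every byte that is not a continuation byte
 * (bit pattern 10xxxxxx) starts a new character.  No locale needed. */
static size_t u8_charcount(const char *s)
{
    size_t n = 0;
    for (; *s; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Hypothetical helper: decode the code point starting at s.
 * ASSUMES valid UTF-8 (no checks for truncated or overlong sequences).
 * Stores the UCS-4 value in *out and returns the bytes consumed (1..4). */
static int u8_decode(const char *s, uint32_t *out)
{
    const unsigned char *p = (const unsigned char *)s;

    if (p[0] < 0x80) {                       /* 0xxxxxxx: plain ASCII */
        *out = p[0];
        return 1;
    }
    if ((p[0] & 0xE0) == 0xC0) {             /* 110xxxxx 10xxxxxx */
        *out = ((uint32_t)(p[0] & 0x1F) << 6)
             |  (uint32_t)(p[1] & 0x3F);
        return 2;
    }
    if ((p[0] & 0xF0) == 0xE0) {             /* 1110xxxx + 2 continuation bytes */
        *out = ((uint32_t)(p[0] & 0x0F) << 12)
             | ((uint32_t)(p[1] & 0x3F) << 6)
             |  (uint32_t)(p[2] & 0x3F);
        return 3;
    }
    /* 11110xxx + 3 continuation bytes */
    *out = ((uint32_t)(p[0] & 0x07) << 18)
         | ((uint32_t)(p[1] & 0x3F) << 12)
         | ((uint32_t)(p[2] & 0x3F) << 6)
         |  (uint32_t)(p[3] & 0x3F);
    return 4;
}
```

Something like u8_charcount(keysym) would then tell us how many
characters a key event adds, while the bytes themselves can still be
appended to the buffer as they are, without ever touching the locale.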

marmosets,
	Bryan

> On Feb 17, 2009, at 5:53 PM, Bryan Jurish wrote:
> 
>> So I've tried to get the pd-devel 0.41.4 branch to use UTF-8 across the
>> board.  The TK side was easy (as Hans predicted);
[snip]
>> The C side is much hairier.
[snip]

-- 
Bryan Jurish                           "There is *always* one more bug."
jurish at ling.uni-potsdam.de      -Lubarsky's Law of Cybernetic Entomology