[PD] UTF-8 for pd-devel WAS: locales for Pd WAS: japanese encoded chars in PD

Fri Feb 20 06:20:18 CET 2009

On Feb 19, 2009, at 4:13 PM, Bryan Jurish wrote:

> moin Hans, moin list,
>
> On 2009-02-19 18:43:49, Hans-Christoph Steiner <hans at eds.org>
> appears to have written:
>>
>> This is good news!  While the C changes aren't dead simple, they are
>> not bad.  I think they could be slightly simplified.  One thing that
>> would make it much easier to read the diff is if you create it
>> without whitespace changes.  So like this:
>>
>> svn diff -x -w
>
> oops, sorry... duly noted for future diffs ... I also set my emacs'
> tcl-indent-width to 8 ... sorry sorry sorry ...
>
>> As for the Tcl changes, I think we can include those now in
>> Pd-devel, as long as they can work ok with unchanged C code.
>
> Done.
>
>> Then once the new Tcl GUI is included we can refactor the C side of
>> things with things like this.
>
>> One other thing, it seems that the ASCII chars are handled differently
>> than the UTF-8 chars in g_rtext.c.  I think you could instead use
>> wcswidth(), mbstowcs() or other UTF-8 functions as described in the
>> UTF-8 FAQ:
>>
>> http://www.cl.cam.ac.uk/~mgk25/unicode.html#mod
>
> Certainly, but (A) we already have the UTF-8 byte string in keysym,
> and we need to append that whole string to the buffer anyways, and (B)
> using wcswidth() & co requires forcing the locale to have a UTF-8
> LC_CTYPE.  I know I did this in m_pd.c, but I think that was a HACK
> and that using locale functions here is the Wrong Way To Do It,
> because it's dangerous, unportable, and slow (warning: rant follows):
>
> __dangerous__: setting the locale is global for all threads of a
> process; in forcing the locale, we could conceivably mess with
> desired behavior elsewhere (e.g. in externals).
>
> __unportable__: we don't even know if all users' machines *have* a
> UTF-8 locale installed, and even if they do, we don't know what it's
> called.  If we don't force the encoding, we're stuck with either "C"
> (i.e. ASCII; what we've got now in Pd-vanilla), or whatever the user
> is currently employing (after setlocale(LC_ALL,"")), which makes
> patches' appearance dependent on the user's encoding (e.g. what we've
> got now in Pd-vanilla), and doesn't even work in the case of
> variable-length encodings such as UTF-8.
>
> __slow__: many locale-based conversion functions are known to be
> pretty darned slow.  If we assume we're always dealing with (valid)
> UTF-8, we can speed things up considerably.  Going straight to wchar_t
> is another option, but would require many more changes on the C side,
> likely break the C API, and wouldn't solve the locale-dependency of
> patches' appearances, which I think is a really good argument for
> UTF-8.

Isn't it pretty safe to assume these days that UTF-8 is supported?
One thing I just found out is that Windows uses a 2-byte char natively
(UCS-2?), and I think Mac OS X uses UTF-8 natively.  I think that most
Linux tools should work with UTF-8 too, especially since plain ASCII
text is already valid UTF-8.

So you think we can have full UTF-8 support without using those  
functions?
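
For illustration, here is a minimal sketch of what "without those
functions" could look like: a locale-independent decoder for a single
UTF-8 sequence, written directly against the UTF-8 bit layout.  The name
u8_decode() and its signature are hypothetical (they are not taken from
Bryan's diff or from Pd), and the sketch assumes code points are held in
a 32-bit integer rather than in wchar_t.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of Pd: decode one UTF-8 sequence from s
 * (at most n bytes available) into *out.  Returns the number of bytes
 * consumed, or 0 if the sequence is truncated or malformed.  No locale
 * functions are involved. */
static size_t u8_decode(const unsigned char *s, size_t n, uint32_t *out)
{
    /* smallest code point that legitimately needs 1, 2, 3 or 4 bytes */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    uint32_t cp;
    size_t len, i;

    if (n == 0) return 0;

    /* the lead byte determines the sequence length */
    if (s[0] < 0x80)                { cp = s[0];        len = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; len = 4; }
    else return 0;            /* stray continuation or invalid lead byte */

    if (len > n) return 0;    /* truncated input */

    /* every following byte must look like 10xxxxxx */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }

    if (cp < min_cp[len]) return 0;               /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* UTF-16 surrogate */
    if (cp > 0x10FFFF) return 0;                  /* beyond Unicode range */

    *out = cp;
    return len;
}
```

Calling this in a loop over a buffer would give roughly what mbstowcs()
does under a UTF-8 locale, without touching setlocale().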

> (rant finished now, sorry)
>
> That said, a faster implementation would probably result from mixing
> (something like) wcswidth() and strncpy(...,keysym).  Functions like
> wcswidth() and mbstowcs() are pretty easy to cook up if we assume
> wchar_t is UCS-4 and the multibyte encoding is UTF-8.

It seems to me that wcswidth() would be used for measuring the
length of the text for display in boxes.  I suppose strlen() could
still be used for allocating and freeing memory, but I think that we
should aim for clean code.  If you think the current way in your diff
is the best, that's fine by me.
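
To make the bytes-versus-characters distinction concrete, here is a
small sketch that counts code points in a UTF-8 string without any
locale calls.  The name u8_charcount() is hypothetical (not part of
Pd), and the input is assumed to be valid, NUL-terminated UTF-8.  Note
that a code-point count is only an approximation of display width:
unlike wcswidth(), it does not account for double-width or combining
characters.

```c
#include <stddef.h>

/* Hypothetical helper, not part of Pd: count the code points in a
 * NUL-terminated string that is assumed to be valid UTF-8.  Every byte
 * that is not a continuation byte (10xxxxxx) starts a new code point. */
static size_t u8_charcount(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For a three-character Japanese string encoded in UTF-8, strlen() reports
9 bytes (the right number for buffer sizing) while u8_charcount()
reports 3 (closer to what a text box needs to lay out).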

> There are a
> number of libraries and code snippets floating about in the net making
> just such assumptions. In this context: are there any licensing
> restrictions on code included in pd-devel?  So far, I've found one
> useful-looking (.c,.h) pair in the public domain, as well as some LGPL
> code from gnulib, which could be linked in statically.  There's also
> code from the Unicode Consortium themselves, but it's pretty monstrous
> (read "pedantic") and limited to string-to-string conversions.

Well, Pd-vanilla is BSD licensed, and Pd-extended is GPL'ed.  For this  
stage of Pd-devel, it would be good to keep it to something that can  
be BSD licensed.

.hc

>
>
> marmosets,
> 	Bryan
>
>> On Feb 17, 2009, at 5:53 PM, Bryan Jurish wrote:
>>
>>> So I've tried to get the pd-devel 0.41.4 branch to use UTF-8  
>>> across the
>>> board.  The TK side was easy (as Hans predicted);
> [snip]
>>> The C side is much hairier.
> [snip]
>
> -- 
> Bryan Jurish                           "There is *always* one more bug."
> jurish at ling.uni-potsdam.de      -Lubarsky's Law of Cybernetic Entomology

----------------------------------------------------------------------------

Access to computers should be unlimited and total.  - the hacker ethic