[PD] UTF-8 for pd-devel WAS: locales for Pd WAS: japanese encoded chars in PD
Hans-Christoph Steiner
hans at eds.org
Fri Feb 20 06:20:18 CET 2009
On Feb 19, 2009, at 4:13 PM, Bryan Jurish wrote:
> moin Hans, moin list,
>
> On 2009-02-19 18:43:49, Hans-Christoph Steiner <hans at eds.org>
> appears to have written:
>>
>> This is good news! While the C changes aren't dead simple, they are
>> not bad. I think they could be slightly simplified. One thing that
>> would make it much easier to read the diff is if you create it
>> without whitespace changes. So like this:
>>
>> svn diff -x -w
>
> oops, sorry... duly noted for future diffs ... I also set my emacs'
> tcl-indent-width to 8 ... sorry sorry sorry ...
>
>> As for the Tcl changes, I think we can include those now in
>> Pd-devel, as long as they can work ok with unchanged C code.
>
> Done.
>
>> Then once the new Tcl GUI is included we can refactor the C side of
>> things with things like this.
>
>> One other thing, it seems that the ASCII chars are handled differently
>> than the UTF-8 chars in g_rtext.c. I think you could instead use
>> wcswidth(), mbstowcs() or other UTF-8 functions as described in the
>> UTF-8 FAQ
>>
>> http://www.cl.cam.ac.uk/~mgk25/unicode.html#mod
>
> Certainly, but (A) we already have the UTF-8 byte string in keysym, and
> we need to append that whole string to the buffer anyways, and (B)
> using wcswidth() & co requires forcing the locale to have a UTF-8
> LC_CTYPE. I know I did this in m_pd.c, but I think that was a HACK and
> that using locale functions here is the Wrong Way To Do It, because
> it's dangerous, unportable, and slow (warning: rant follows):
>
> __dangerous__: setting the locale is global for all threads of a
> process; in forcing the locale, we could conceivably mess with desired
> behavior elsewhere (e.g. in externals).
>
> __unportable__: we don't even know if all users' machines *have* a
> UTF-8 locale installed, and even if they do, we don't know what it's
> called. If we don't force the encoding, we're stuck with either "C"
> (i.e. ASCII; what we've got now in Pd-vanilla), or whatever the user is
> currently employing (after setlocale(LC_ALL,"")), which makes patches'
> appearance dependent on the user's encoding (e.g. what we've got now
> in Pd-vanilla), and doesn't even work in the case of variable-length
> encodings such as UTF-8.
>
> __slow__: many locale-based conversion functions are known to be pretty
> darned slow. If we assume we're always dealing with (valid) UTF-8, we
> can speed things up considerably. Going straight to wchar_t is another
> option, but would require many more changes on the C side, likely
> break the C API, and wouldn't solve the locale-dependency of patches'
> appearances, which I think is a really good argument for UTF-8.
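To illustrate the portability concern above (not knowing whether a UTF-8 locale is installed, or what it is called), here is a rough sketch of what probing for one might look like. The function name try_utf8_ctype() and the list of candidate locale names are assumptions made up for this example; they are not taken from Pd or from the diff under discussion, and the names that actually exist vary from system to system.

```c
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper (not part of Pd): try a few commonly seen names
 * for a UTF-8 locale and return the first one setlocale() accepts, or
 * NULL if none is available.  The candidate names are guesses.
 * Note that a successful setlocale() call changes LC_CTYPE for the
 * whole process, which is exactly the "dangerous" part of the rant. */
static const char *try_utf8_ctype(void)
{
    static const char *const names[] = {
        "C.UTF-8", "en_US.UTF-8", "en_US.utf8", "UTF-8"
    };
    size_t i;

    for (i = 0; i < sizeof names / sizeof names[0]; i++) {
        if (setlocale(LC_CTYPE, names[i]) != NULL)
            return names[i];   /* LC_CTYPE is now switched globally */
    }
    return NULL;               /* no candidate locale is installed */
}
```

If this returns NULL, locale-based functions such as wcswidth() and mbstowcs() cannot be relied on to treat the input as UTF-8 on that machine.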
Isn't it pretty safe to assume these days that UTF-8 is supported?
One thing I just found out is that Windows uses a 2-byte char natively
(UCS-2 or UTF-16?); I think Mac OS X uses UTF-8 natively. I think that
most Linux tools should work with UTF-8 too, especially since it can
work as ASCII.
So you think we can have full UTF-8 support without using those
functions?
> (rant finished now, sorry)
>
> That said, a faster implementation would probably result from mixing
> (something like) wcswidth() and strncpy(...,keysym). Functions like
> wcswidth() and mbstowcs() are pretty easy to cook up if we assume
> wchar_t is UCS-4 and the multibyte encoding is UTF-8.
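As a rough illustration of the "easy to cook up" claim above, here is a minimal sketch of an mbstowcs()-like decoder under exactly those assumptions: the input is valid UTF-8 and the output is UCS-4. The names u8_decode() and u8_to_ucs4() are hypothetical and do not come from the diff; uint32_t stands in for a UCS-4 wchar_t, and no validation of malformed input is attempted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of Pd): decode one UTF-8 sequence at s
 * into *cp and return the number of bytes consumed (1 to 4).
 * Assumes s points at the first byte of a sequence in a valid,
 * NUL-terminated UTF-8 string; malformed input is NOT detected. */
static size_t u8_decode(const char *s, uint32_t *cp)
{
    const unsigned char *p = (const unsigned char *)s;

    assert(s != NULL && cp != NULL);
    if (p[0] < 0x80) {                       /* 0xxxxxxx: plain ASCII */
        *cp = p[0];
        return 1;
    }
    if ((p[0] & 0xE0) == 0xC0) {             /* 110xxxxx 10xxxxxx */
        *cp = ((uint32_t)(p[0] & 0x1F) << 6) | (uint32_t)(p[1] & 0x3F);
        return 2;
    }
    if ((p[0] & 0xF0) == 0xE0) {             /* 1110xxxx + 2 continuation bytes */
        *cp = ((uint32_t)(p[0] & 0x0F) << 12)
            | ((uint32_t)(p[1] & 0x3F) << 6)
            |  (uint32_t)(p[2] & 0x3F);
        return 3;
    }
    /* 11110xxx + 3 continuation bytes */
    *cp = ((uint32_t)(p[0] & 0x07) << 18)
        | ((uint32_t)(p[1] & 0x3F) << 12)
        | ((uint32_t)(p[2] & 0x3F) << 6)
        |  (uint32_t)(p[3] & 0x3F);
    return 4;
}

/* mbstowcs()-like: decode at most n code points from src into dst and
 * return how many were written.  dst is 0-terminated only if there is
 * room left, mirroring the behaviour of mbstowcs(). */
static size_t u8_to_ucs4(uint32_t *dst, const char *src, size_t n)
{
    size_t i = 0;

    assert(dst != NULL && src != NULL);
    while (i < n && *src != '\0') {
        src += u8_decode(src, &dst[i]);
        i++;
    }
    if (i < n)
        dst[i] = 0;
    return i;
}
```

Nothing here touches the locale, so the result does not depend on the user's LC_CTYPE. The trade-off is that robustness against invalid byte sequences would have to be added by hand.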
It seems to me that wcswidth() would be used for measuring the
length of the text for display in boxes. I suppose strlen() could
still be used for allocating and freeing memory, but I think that we
should aim for clean code. If you think the current way in your diff
is the best, that's fine by me.
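The split between the two measurements can be sketched roughly as follows: strlen() gives the byte count needed for allocation, while a separate count gives the number of characters. The function u8_charcount() is a hypothetical name for this example, not something from the diff, and it assumes the input is valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of Pd): count the code points in a
 * valid, NUL-terminated UTF-8 string by counting every byte that is
 * not a continuation byte (continuation bytes look like 10xxxxxx). */
static size_t u8_charcount(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For a string such as "caf\xC3\xA9" (the word "café" encoded as UTF-8), strlen() reports 5 bytes while u8_charcount() reports 4 characters. Note that a code point count is still only an approximation of what wcswidth() computes: double-width and combining characters occupy two or zero display columns respectively.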
> There are a
> number of libraries and code snippets floating about in the net making
> just such assumptions. In this context: are there any licensing
> restrictions on code included in pd-devel? So far, I've found one
> useful-looking (.c,.h) pair in the public domain, as well as some LGPL
> code from gnulib, which could be linked in statically. There's also
> code from the Unicode Consortium themselves, but it's pretty monstrous
> (read "pedantic") and limited to string-to-string conversions.
Well, Pd-vanilla is BSD licensed, and Pd-extended is GPL'ed. For this
stage of Pd-devel, it would be good to keep it to something that can
be BSD licensed.
.hc
>
>
> marmosets,
> Bryan
>
>> On Feb 17, 2009, at 5:53 PM, Bryan Jurish wrote:
>>
>>> So I've tried to get the pd-devel 0.41.4 branch to use UTF-8 across
>>> the board. The TK side was easy (as Hans predicted);
> [snip]
>>> The C side is much hairier.
> [snip]
>
> --
> Bryan Jurish "There is *always* one more bug."
> jurish at ling.uni-potsdam.de -Lubarsky's Law of Cybernetic Entomology
----------------------------------------------------------------------------
Access to computers should be unlimited and total. - the hacker ethic