[PD] locales for Pd WAS: japanese encoded chars in PD

Thu Feb 12 06:24:44 CET 2009

On Feb 11, 2009, at 6:34 AM, Bryan Jurish wrote:

> morning all,
>
> On 2009-02-11 03:04:34, Hans-Christoph Steiner <hans at eds.org>
> appears to have written:
>> On Feb 10, 2009, at 3:14 PM, august wrote:
>>>> august wrote:
>>>>> hey'aw.
>>> are there also objects for handling conversions between character
>>> encodings?   Or, an object to convert between utf8 or UCS-2 and the
>>> unicode
>>> char code numbers that GEM takes?
>
> Well, there are [bytes2wchars] and [wchars2bytes] in the newest
> [pdstring] library, which convert between multibyte encodings such as
> utf8 and your C library's wchar_t, which if I'm not entirely mistaken is
> a system-dependent encoding, but at least here (linux, glibc), it looks
> a heckuva lot like UCS-4.
>
>>> Is there a default character encoding for PD messages? I assume it is
>>> LATIN1 because I have seen umlauts in comments before (I think).   It
>>> doesn't look like I can make comments in UTF8 encoded chars.
>>>
>>> I have my char problems solved right now, but now as I discover more
>>> about the difficulties of character encodings and the treachery that
>>> ASCII has caused....I am just curious.
>>
>> It's a weird bastard mix currently of Latin1 and UTF-8.  The Tk GUI can
>> handle UTF-8 and uses UTF-8 natively.  The C side is basically Latin1
>> but doesn't really check:
>
> Out of curiosity, I just checked with a variant of 'unibarf.pd'
> (attached as "barf-both.pd"), and for me, pd *does* display utf-8
> strings correctly in message boxes (tested with umlauts äöü, as well as
> Greek πδ -- other characters can be tested with the [pdstring]
> help patches).  Surprisingly (to me), I don't have to do anything
> special to get UTF-8 characters displayed correctly, but setting
> LC_CTYPE=en_US.UTF-8 causes a latin-1 message to be displayed improperly
> (characters disappear, but are still passed and present in raw byte form).

Hmm, I am not sure that UTF-8 really is well supported.  Some chars
get through, but many don't.  Here's an example.  I typed these chars in
a UTF-8 text editor (attached as a PNG) and in a Pd patch.  Not quite the
same.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Picture 1.png
Type: image/png
Size: 5334 bytes
Desc: not available
URL: <http://lists.puredata.info/pipermail/pd-list/attachments/20090212/4f1a98b1/attachment.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sometext.pd
Type: application/octet-stream
Size: 58 bytes
Desc: not available
URL: <http://lists.puredata.info/pipermail/pd-list/attachments/20090212/4f1a98b1/attachment.obj>
-------------- next part --------------

> Setting LC_CTYPE=en_US.UTF-8 and re-loading "unibarf.pd" got me an odd
> error message from Pd though:
>
> Pd: buffer space wasn't sufficient for long GUI string
> (repeated 3 times)
>
> ... this appears on stderr, rather than the console.  I get the same
> message once for "barf-both.pd"; presumably due to mis-parsing of the
> latin-1 message box(es).

I am guessing that the above error comes from the fact that Pd is  
written for latin1 where every char is always 1 byte, so sending UTF-8  
could confuse things, since UTF-8 can have multi-byte chars.
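
To make that guess concrete, here is a minimal C sketch (not Pd's actual
code; the function names are made up for illustration) that compares the
byte length of a string with the number of characters it holds when read
as UTF-8.  It assumes the input is valid UTF-8 and does no validation.

```c
#include <stddef.h>
#include <string.h>

/* Count UTF-8 code points by counting every byte that is NOT a
 * continuation byte (continuation bytes look like 10xxxxxx).
 * Assumption: the input is valid UTF-8. */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* How many bytes the string occupies beyond one byte per character.
 * For pure ASCII (and for latin-1 style thinking) this is 0. */
size_t utf8_extra_bytes(const char *s)
{
    return strlen(s) - utf8_codepoints(s);
}
```

For the umlauts äöü encoded as UTF-8, strlen() reports 6 bytes while
utf8_codepoints() reports 3, so code that equates bytes with chars would
be off by 3 for that string.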

>> This is something that I would really like to have working properly in
>> Pd-devel.  Tcl/Tk is natively UTF-8, so it seems that we should support
>> UTF-8 in Pd.  Anyone feel like trying to fix it?  I don't understand
>> encodings so well.
>
> I don't know for sure, but I suspect one problem might be in the
> interpretation of user input -- I use latin-1 myself, so I can't judge
> whether the Tk GUI accepts UTF-8 input or not (I use [pdstring] or just
> hack the .pd file for my tests).  If we want to be paranoid about
> things, we're likely to run into problems with symbols too; symbol
> identity (hash value and raw byte string) can change depending on
> whether the C internals use UTF-8 strings or not: this depends not only
> on what they get from the GUI, but also on how file data is interpreted,
> netsend/netreceive, etc etc... (mostly t_binbuf, I guess).  UTF-8 should
> be largely safe for pd symbols, although I'm not sure whether backslash
> or brackets can appear as shift bytes for any characters: that could
> certainly cause problems.
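
On the backslash/bracket worry: UTF-8 is laid out so that every byte of
a multibyte sequence has its high bit set (lead bytes are 0xC2 and up,
continuation bytes are 0x80 to 0xBF), so an ASCII byte such as '\', '['
or ']' cannot show up inside one.  That holds for UTF-8 only; other
multibyte encodings can behave differently.  A brute-force check, as a
sketch (utf8_encode and utf8_multibyte_is_ascii_free are illustrative
names, not part of Pd or pdstring):

```c
#include <assert.h>
#include <stddef.h>

/* Encode one code point as UTF-8; returns the number of bytes written
 * (1 to 4).  Purely arithmetic: surrogate values are not rejected. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    assert(cp <= 0x10FFFFUL);
    if (cp < 0x80UL) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800UL) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000UL) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Returns 1 if no multibyte encoding (code points 0x80 and above)
 * contains a byte in the ASCII range, 0 otherwise. */
int utf8_multibyte_is_ascii_free(void)
{
    unsigned char buf[4];
    unsigned long cp;
    for (cp = 0x80UL; cp <= 0x10FFFFUL; cp++) {
        size_t i, n = utf8_encode(cp, buf);
        for (i = 0; i < n; i++) {
            if (buf[i] < 0x80)
                return 0;
        }
    }
    return 1;
}
```

So a byte-wise scan for Pd's special characters should not be fooled by
UTF-8 data, whatever else may go wrong with symbol identity.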

I don't know about the pd side, but Tcl/Tk is all UTF-8 natively, so  
that is no problem.

> As an experiment, you could try calling the following on Pd startup:
>
>  #include <locale.h>
>
>  setlocale(LC_ALL,"");      /*-- set locale from environment --*/
>  setlocale(LC_NUMERIC,"C"); /*-- ... but leave floats alone! --*/
>
> ... and see what breaks (or doesn't) ;-)  Alternatively, you can achieve
> pretty much the same effect with the "locale" external in userspace (see
> attached "uselocale.pd").  Of course, to test UTF-8 you should have your
> environment variables set accordingly (in particular LC_CTYPE,
> potentially via LANG):
>
> bash$ export LC_CTYPE=en_DK.UTF-8
> bash$ pd uselocale.pd barf-both.pd   ##-- latin-1 displays incorrectly
>
> bash$ export LC_CTYPE=en_DK.ISO-8859-1
> bash$ pd uselocale.pd barf-both.pd   ##-- all displays ok
>
> If it turns out to work well, we can of course make a trivial "dummy"
> external out of it for use with "-lib" ...
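
For anyone who wants to try the startup experiment outside of Pd first,
here is a small C sketch of the two calls wrapped in a function.  The
name adopt_env_locale is made up, and the decimal-point check reflects
my assumption that "leave floats alone" is about keeping '.' as the
decimal separator when floats are parsed and printed.

```c
#include <locale.h>
#include <string.h>

/* Adopt the locale from the environment, then pin LC_NUMERIC to "C".
 * Returns 1 if the decimal separator is still "." afterwards. */
int adopt_env_locale(void)
{
    /* Take LC_* / LANG from the environment.  If the requested locale
     * is not installed, setlocale() returns NULL and nothing changes. */
    setlocale(LC_ALL, "");

    /* Force the numeric category back to the "C" locale. */
    setlocale(LC_NUMERIC, "C");

    return strcmp(localeconv()->decimal_point, ".") == 0;
}
```

Whether the rest of Pd copes with a non-"C" LC_CTYPE is exactly the open
question; this only shows that the numeric category can be kept apart.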

Hmm, I tried this on Mac OS X and it didn't seem to make a
difference.  Perhaps it's a platform issue, though on this level, Mac
OS X is very much BSD, so I think it should work.

.hc

>
>
> marmosets,
> 	Bryan
>
> -- 
> Bryan Jurish                           "There is *always* one more bug."
> jurish at ling.uni-potsdam.de      -Lubarsky's Law of Cybernetic Entomology
>
> <barf-both.pd><uselocale.pd>

----------------------------------------------------------------------------

News is what people want to keep hidden and everything else is  
publicity.          - Bill Moyers