[PD] japanese encoded chars in PD

Bryan Jurish moocow at ling.uni-potsdam.de
Wed Feb 11 12:34:09 CET 2009


morning all,

On 2009-02-11 03:04:34, Hans-Christoph Steiner <hans at eds.org> appears to
have written:
> On Feb 10, 2009, at 3:14 PM, august wrote:
>>> august wrote:
>>>> hey'aw.
>> are there also objects for handling conversions between character
>> encodings?   Or, an object to convert between utf8 or UCS-2 and the
>> unicode
>> char code numbers that GEM takes?

Well, there are [bytes2wchars] and [wchars2bytes] in the newest
[pdstring] library, which convert between multibyte encodings such as
UTF-8 and your C library's wchar_t.  If I'm not entirely mistaken,
wchar_t is a system-dependent encoding, but at least here (linux, glibc)
it looks a heckuva lot like UCS-4.
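
For the "unicode char code numbers" part of the question, here is a
minimal sketch of what a UTF-8 to code point conversion amounts to.  It
is NOT the [pdstring] implementation: the function names are made up for
illustration, and it skips the stricter checks (overlong forms,
surrogates) that a real decoder should do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of [pdstring]): decode one UTF-8
 * sequence starting at s (at most n bytes available) into *cp.
 * Returns the number of bytes consumed, or 0 on malformed or
 * truncated input.  Overlong forms and surrogates are not rejected. */
static size_t utf8_decode_one(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len, i;
    uint32_t c;
    assert(s != NULL && cp != NULL);
    if (n == 0) return 0;
    if (s[0] < 0x80)                { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { c = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { c = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { c = s[0] & 0x07; len = 4; }
    else return 0;                  /* stray continuation or invalid lead byte */
    if (n < len) return 0;          /* sequence is cut off */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;  /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    *cp = c;
    return len;
}

/* Hypothetical helper: decode a whole byte buffer into code points.
 * Returns the number of code points written, or (size_t)-1 on
 * malformed input or if out is too small. */
static size_t utf8_to_codepoints(const char *bytes, size_t nbytes,
                                 uint32_t *out, size_t outmax)
{
    const unsigned char *s = (const unsigned char *)bytes;
    size_t i = 0, count = 0;
    assert(bytes != NULL && out != NULL);
    while (i < nbytes) {
        uint32_t cp;
        size_t used = utf8_decode_one(s + i, nbytes - i, &cp);
        if (used == 0 || count == outmax) return (size_t)-1;
        out[count++] = cp;
        i += used;
    }
    return count;
}
```

Feeding it the UTF-8 bytes C3 A4 CF 80 should give the two code points
U+00E4 and U+03C0, while a lone latin-1 byte E4 is reported as an error.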

>> Is there a default character encoding for PD messages? I assume it is
>> LATIN1 because I have seen umlauts in comments before(I think).   It
>> doesn't look like I can make comments in UTF8 encoded chars.
>>
>> I have my char problems solved right now, but now as I discover more
>> about the difficulties of character encodings and the treachery that
>> ASCII has caused....I am just curious.
> 
> It's a weird bastard mix currently of Latin1 and UTF-8.  The Tk GUI can
> handle UTF-8 and uses UTF-8 natively.  The C side is basically Latin1
> but doesn't really check:

Out of curiosity, I just checked with a variant of 'unibarf.pd'
(attached as "barf-both.pd"), and for me, pd *does* display utf-8
strings correctly in message boxes (tested with umlauts äöü, as well as
Greek πδ -- other characters can be tested with the [pdstring]
help patches).  Surprisingly (to me), I don't have to do anything
special to get UTF-8 characters displayed correctly, but setting
LC_CTYPE=en_US.UTF-8 causes a latin-1 message to be displayed improperly
(characters disappear, but are still passed and present in raw byte form).
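
My guess at why the latin-1 characters vanish under a UTF-8 locale: a
lone byte such as 0xE4 is not a valid UTF-8 sequence, so whatever
decodes the string drops it.  If one wanted to repair such a message,
transcoding latin-1 to UTF-8 is simple, since every latin-1 byte is also
its own Unicode code point.  A minimal sketch (the function name is
hypothetical, nothing like this is in Pd as far as I know):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: transcode n Latin-1 bytes from in to UTF-8 in out.
 * Returns the number of bytes written (no terminator is added), or
 * (size_t)-1 if out (capacity outmax) is too small. */
static size_t latin1_to_utf8(const char *in, size_t n, char *out, size_t outmax)
{
    size_t i, o = 0;
    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        unsigned char b = (unsigned char)in[i];
        if (b < 0x80) {
            /* plain ASCII is identical in both encodings */
            if (o + 1 > outmax) return (size_t)-1;
            out[o++] = (char)b;
        } else {
            /* U+0080..U+00FF always needs exactly two UTF-8 bytes */
            if (o + 2 > outmax) return (size_t)-1;
            out[o++] = (char)(0xC0 | (b >> 6));
            out[o++] = (char)(0x80 | (b & 0x3F));
        }
    }
    return o;
}
```

The worst case output is twice the input length, so an output buffer of
2*n bytes is always enough.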

Setting LC_CTYPE=en_US.UTF-8 and re-loading "unibarf.pd" got me an odd
error message from Pd though:

 Pd: buffer space wasn't sufficient for long GUI string
 (repeated 3 times)

... this appears on stderr, rather than the console.  I get the same
message once for "barf-both.pd", presumably due to mis-parsing of the
latin-1 message box(es).

> This is something that I would really like to have working properly in
> Pd-devel.  Tcl/Tk is natively UTF-8, so it seems that we should support
> UTF-8 in Pd.  Anyone feel like trying to fix it?  I don't understand
> encodings so well.

I don't know for sure, but I suspect one problem might be in the
interpretation of user input -- I use latin-1 myself, so I can't judge
whether the Tk GUI accepts UTF-8 input or not (I use [pdstring] or just
hack the .pd file for my tests).  If we want to be paranoid about
things, we're likely to run into problems with symbols too; symbol
identity (hash value and raw byte string) can change depending on
whether the C internals use UTF-8 strings or not: this depends not only
on what they get from the GUI, but also on how file data is interpreted,
netsend/netreceive, etc etc... (mostly t_binbuf, I guess).  UTF-8 should
be largely safe for pd symbols, although I'm not sure whether backslash
or brackets can appear as shift bytes for any characters: that could
certainly cause problems.
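
On the backslash/bracket worry: as far as I can tell, UTF-8 is built so
that every byte of a multibyte sequence has its high bit set, which
would mean that '\' (0x5C), '[' and ']' can only ever occur as
themselves.  (Some other multibyte encodings, Shift-JIS for one, do
allow 0x5C as a trailing byte.)  The following sketch checks the UTF-8
claim by brute force; the function names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode code point cp as UTF-8 into out.
 * Returns the number of bytes written (1..4).  Surrogates are not
 * treated specially here. */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL && cp <= 0x10FFFF);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Returns 1 if no multibyte UTF-8 sequence contains a byte below 0x80
 * (i.e. nothing that could be mistaken for backslash, brackets,
 * semicolon, space, ...), 0 otherwise. */
static int utf8_multibyte_avoids_ascii(void)
{
    uint32_t cp;
    unsigned char buf[4];
    for (cp = 0x80; cp <= 0x10FFFF; cp++) {
        size_t i, len = utf8_encode(cp, buf);
        for (i = 0; i < len; i++)
            if (buf[i] < 0x80) return 0;
    }
    return 1;
}
```

This says nothing about symbol identity, of course: the latin-1 and
UTF-8 spellings of the same character are still different byte strings
and so would hash to different symbols.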

As an experiment, you could try calling the following on Pd startup:

  #include <locale.h>

  setlocale(LC_ALL,"");      /*-- set locale from environment --*/
  setlocale(LC_NUMERIC,"C"); /*-- ... but leave floats alone! --*/

... and see what breaks (or doesn't) ;-)  Alternatively, you can achieve
pretty much the same effect with the "locale" external in userspace (see
attached "uselocale.pd").  Of course, to test UTF-8 you should have your
environment variables set accordingly (in particular LC_CTYPE,
potentially via LANG):

 bash$ export LC_CTYPE=en_DK.UTF-8
 bash$ pd uselocale.pd barf-both.pd   ##-- latin-1 displays incorrectly

 bash$ export LC_CTYPE=en_DK.ISO-8859-1
 bash$ pd uselocale.pd barf-both.pd   ##-- all displays ok

If it turns out to work well, we can of course make a trivial "dummy"
external out of it for use with "-lib" ...
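
For reference, here are the two setlocale() calls wrapped up so they can
be tried outside of Pd.  This is only a sketch: use_env_locale() and
format_float() are names invented here, and the remark about Pd's float
handling is my assumption about why LC_NUMERIC should stay at "C" (many
locales use ',' as the decimal separator, which printf() and strtod()
would then honor).

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical wrapper: adopt the locale named by the environment
 * (LC_ALL, LC_CTYPE, LANG, ...) but pin LC_NUMERIC to "C" so that
 * floats keep '.' as decimal point.  Returns 1 if the LC_NUMERIC
 * reset succeeded, 0 otherwise. */
static int use_env_locale(void)
{
    /* If the environment names an unknown locale, this call returns
     * NULL and leaves the current locale unchanged. */
    setlocale(LC_ALL, "");
    return setlocale(LC_NUMERIC, "C") != NULL;
}

/* Hypothetical helper: format a float the way a text file format with
 * '.' decimals would expect.  Returns snprintf()'s result. */
static int format_float(double f, char *buf, size_t n)
{
    assert(buf != NULL && n > 0);
    return snprintf(buf, n, "%g", f);
}
```

With LC_NUMERIC left at "C", format_float(1.5, ...) should produce
"1.5" whatever LANG or LC_ALL say.  Dropping the second setlocale() call
and running under something like de_DE.UTF-8 is a quick way to see what
goes wrong otherwise.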

marmosets,
	Bryan

-- 
Bryan Jurish                           "There is *always* one more bug."
jurish at ling.uni-potsdam.de      -Lubarsky's Law of Cybernetic Entomology

-------------- next part --------------
A non-text attachment was scrubbed...
Name: barf-both.pd
Type: application/puredata
Size: 374 bytes
Desc: not available
URL: <http://lists.puredata.info/pipermail/pd-list/attachments/20090211/5cb7fd6f/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: uselocale.pd
Type: application/puredata
Size: 280 bytes
Desc: not available
URL: <http://lists.puredata.info/pipermail/pd-list/attachments/20090211/5cb7fd6f/attachment-0001.bin>

