[PD-dev] Shrink s_utf8.[ch]?

Hans-Christoph Steiner hans at at.or.at
Sun Oct 9 01:23:36 CEST 2011


We talked a bunch in #dataflow on this topic, so I thought I'd post  
the transcript here:

creamygoodness
IRC
11:16:06
mescalinum: I'm curious whether the patch to s_utf8.c I supplied  
eliminates the segfault you described the other day.

mescalinum
IRC
11:38:33
creamygoodness, I'm unable to build pd right now

creamygoodness
IRC
11:39:14
haha, I guess it will wait.
11:39:40
The errors corrected by that patch could potentially cause segfaults.

_hc
IRC
11:39:41
creamygoodness: good morning

creamygoodness
IRC
11:39:54
Good morning, Hans.

_hc
IRC
11:40:03
I read your email, does tcl.h provide everything needed?
11:40:26
I think its a good idea, but I think Miller thinks of pd core as not  
being dependent on Tcl

creamygoodness
IRC
11:40:30
I don't know -- I just glanced at it.
11:40:36
Ah, that was why I asked.
11:40:47
I noticed that tcl.h was not included in any file
11:40:54
though it was in the autconf.

_hc
IRC
11:41:03
right, Pd no longer uses the C interface to Tcl at all
11:41:15
it used to, because it had a C component of the GUI

creamygoodness
IRC
11:41:28
OK, then it seems fixing s_utf8.[ch] is the way to go.

_hc
IRC
11:41:28
but the core never needed it because it talks to Tcl thru a network  
socket
11:41:37
yeah, probably

creamygoodness
IRC
11:41:40
Make sense.
11:43:23
BTW, the mechanism for the potential segfault is that u8_offset can  
give the wrong answer and overshoot depending on what junk is after  
the end of the non-NUL-terminated buffer, locating the cursor too far  
ahead and potentially outside of a write buffer.

_hc
IRC
11:43:55
I guess the functions need a length arg then? How was this supposed to  
work?
11:44:22
what about leaving the functions as is, and making sure the strings  
are NULL terminated?

creamygoodness
IRC
11:44:47
I think Tcl uses non-nul-terminated strings?

_hc
IRC
11:46:01
hmm, internally in its implementation?

creamygoodness
IRC
11:46:19
I'm pretty sure the functions would be fine for NUL-terminated strings.

_hc
IRC
11:46:19
or when sending things via the network socket?

creamygoodness
IRC
11:47:00
Meaning that the UTF-8 representation of a Tcl_Obj doesn't necessarily  
have NUL-termination.
11:47:17
I'm not sure about that, but I saw it go past in #tcl the other day.

_hc
IRC
11:47:24
ok, but we're not talking to Tcl_Obj directly anywhere
11:47:35
the communications are thru a network socket
11:47:54
pd --> pd-gui is Tcl code
11:47:59
pd-gui --> is Pd messges
11:48:27
do you have an example of a message or string that is not null- 
terminated?

creamygoodness
IRC
11:49:24
The errors I was seeing under valgrind are were in g_rtext.c.
11:50:18
The object's x_buf was allocated to 1, contained a single meaningful  
character and was not NUL-terminated.
11:51:09
For instance, here:
11:51:10
     int x_bufsize_c = u8_charnum(x->x_buf, x->x_bufsize);

matju
IRC
11:51:28
creamygoodness: are you sure about the text representation of a  
Tcl_Obj ? afaik, all interpreters I read about, add a nul-terminator  
in the buffer but outside of the «official content», just to keep libc- 
style functions happy.

creamygoodness
IRC
11:52:04
No, I'm not sure about it.  I'd have to research it.

matju
IRC
11:52:12
creamygoodness: ... because then, calls that have to be done to libc  
don't need a malloc & memcpy just to produce the kind of string that  
libc wants.

creamygoodness
IRC
11:53:01
Yes, but the tradeoff is that your string operations must always  
insert nul termination that you don't need -- and won't easily detect  
when it's missing.

matju
IRC
11:53:31
creamygoodness: but i corroborate your report about g_rtext utf8  
problem... i saw an Invalid Read of size 1 in Valgrind the other day

creamygoodness
IRC
11:54:33
I was able to trigger it in three different routines.

matju
IRC
11:55:03
creamygoodness: if you have one common api for producing strings, and  
it's used all over, then you don't need to copy-paste all over the  
code something for writing the nul-terminator...

creamygoodness
IRC
11:57:23
Yes.  I've written an object model myself, and a string manipulation  
library, so I understand what you're getting at.
11:57:42
But nevertheless, I hold up Pd itself as an example of where NUL- 
termination is not always done religiously, but stuff still pretty  
much works.

_hc
IRC
11:58:37
creamygoodness: the utf-8 patch author, Bryan Jurish, says: http://lists.puredata.info/pipermail/pd-dev/2009-03/013165.html 
  + modified and trimmed versions of the Bezanson code.  I actually only
use 3 or 4 functions of these, so it could conceivably be trimmed down
even further; I just cut out the stuff I don't expect to need (escapes,
vprintf, ...).

creamygoodness
IRC
12:00:09
matju: My personal preference is to harden all components so that you  
do not have tight coupling. In this case that means making the UTF-8  
manipulation code function properly in either the presence or absence  
of NUL-termination.

_hc
IRC
12:00:49
creamygoodness: how can you trigger a non-Null string?

matju
IRC
12:00:51
creamygoodness: i only ever expected to fix u8_charnum, really.

matju
IRC
12:01:03
_hc: a what ?

creamygoodness
IRC
12:01:26
_hc: Just type text into an Object.

creamygoodness
IRC
12:01:58
The x_buf is sized to the text, and doesn't include NUL-termination.
12:02:24
If you want to see the memory errors, run under valgrind.

_hc
IRC
12:02:35
right, I'm new to valgrind, installing now
12:02:47
its been on my todo list to try for years

creamygoodness
IRC
12:02:50
It's a great tool.  I love it.
12:02:58
It's very easy to get started with.

_hc
IRC
12:03:24
matju: a a string without null temrination

matju
IRC
12:06:21
_hc: we're talking about g_rtext.c
12:08:06
_hc: note the use of %.*s in g_rtext.c because the string is not null- 
terminated, so it has to «be artificially truncated» at the point  
where it ends

creamygoodness
IRC
12:09:24
_hc: regarding Jurish's message... ok, shrinking seems like a good  
thing.  I also did a check of Pd-extended to see if anybody had used  
these routines and didn't turn anything up, so I think we're safe.

creamygoodness
IRC
12:10:14
That limits the scope of changes to the UTF-8 manipulation code.

_hc
IRC
12:10:50
creamygoodness: these routines are new in 0.43, so people haven't had  
time to keep them around
12:10:53
I mean use them

creamygoodness
IRC
12:11:14
Is s_utf8.h an official public API?

_hc
IRC
12:12:06
its in the semi-public, close to private, part
12:12:18
m_pd.h is the official public API

creamygoodness
IRC
12:12:26
Heh.

_hc
IRC
12:12:27
the rest are not fully public
12:12:45
but range in degree of privateness
12:12:57
maxster [~maxster at dsl-69-172-92-231.acanac.net] entered the room.

_hc
IRC
12:12:58
g_canvas.h is closer to public

creamygoodness
IRC
12:13:06
Messy, but I understand.

_hc
IRC
12:13:12
s_utf8.h is probably the closest to private of the headers

_hc
IRC
12:33:28
wow, valgrind makes pd run slooow

matju
IRC
12:33:30
creamygoodness: maybe that's what MSc/PhD does to people

creamygoodness
IRC
12:33:53
_hc Yeah, like 40x slower.

matju
IRC
12:34:07
_hc: that's normal... valgrind emulates much of your CPU, and makes  
*lots* of bugchecks, ALL of the time.

creamygoodness
IRC
12:34:16
Also, memory consumption is through the roof.

matju
IRC
12:34:42
creamygoodness: not nearly as bad as Bruce Perens' Electric Fence

creamygoodness
IRC
12:34:56
   Haven't used that one.

_hc
IRC
12:36:12
ok, so typing the first two chars seems to make an error
12:36:39
but not always

creamygoodness
IRC
12:36:39
Yes.  That's probably the u8_charnum line I quoted earlier.

matju
IRC
12:36:53
creamygoodness: i tried it at first, before trying valgrind...  
libefence is the kind of thing that allocates an extra 4k for every  
tiny malloc you do. In Pd it's often almost tolerable, but when trying  
to debug the Ruby interpreter, boom, it wouldn't end swapping

_hc
IRC
12:36:56
it was only the first object I created, now the other sseem fine
12:36:57
groolot [~groolot at 84.6.90.253] entered the room.

creamygoodness
IRC
12:37:50
It's going to depend on the overallocation of x_buf.
12:38:05
groolot left the room.

creamygoodness
IRC
12:38:05
That could be system dependent.

matju
IRC
12:38:46
valgrind assumes that there is no overallocation... that's the  
strictest assumption that leads to the safest code.
12:39:05
oh
12:40:02
well, there's possible overallocation in malloc(), but if you mean  
overallocation in the application code, this can't be trapped, because  
valgrind can only trap malloc() and what the app tells malloc.
12:41:09
there are often complicated, tricky reasons why the bug triggers only  
part of the time on valgrind or other, but it's not necessarily always  
worth looking at why it's the case...

creamygoodness
IRC
12:42:14
Not sure.  But I can produce errors seemingly more easily than Hans.   
IIRC, each character I type produces invalid read errors.

_hc
IRC
12:42:28
hmm, just found another error "==25249== Conditional jump or move  
depends on uninitialised value(s)", seems to be Mac OS X-specific

creamygoodness
IRC
12:43:12
My methodology for bughunting with valgrind is to use it in  
conjunction with debug printing to stderr.

matju
IRC
12:43:32
_hc: Conditional jump [...] happens in s_utf8.c, or is it elsewhere ?

_hc
IRC
12:43:56
elsehwere, looks like the CoreAudio driver

matju
IRC
12:44:02
creamygoodness: yeah, i do that too.

_hc
IRC
12:44:21
hmm, acutally in s_audio_pa.c

matju
IRC
12:44:49
_hc: several libraries do things that trigger warnings in Valgrind,  
though some of them are really harmless
12:45:29
_hc: on linux, libdl and libX11 both give lots of warnings, and there  
are others

creamygoodness
IRC
12:45:32
Right, you often need a "suppressions" file to quiet meaningless  
warnings.
12:45:59
Well, you could actually argue that u8_charnum  is such a routine.

matju
IRC
12:46:01
_hc: oh, if it's in s_audio_pa.c, then you have much more of an  
opportunity to fix it... otherwise, use suppressions for ignoring

creamygoodness
IRC
12:46:17
The invalid reads in u8_charnum do not affect the output.
12:46:33
It just reads too far after it should have stopped.

matju
IRC
12:48:20
creamygoodness: imho, invalid reads in u8_charnum *can* affect the  
output, by 1, sometimes. I don't say that by logging, just by reading  
the code... unless i read it wrong
12:48:50
creamygoodness: mmm wait

creamygoodness
IRC
12:48:53
Well, I could be wrong too!

matju
IRC
12:48:58
creamygoodness: you're right.

creamygoodness
IRC
12:49:04
I'm more concerned about u8_offset.
12:49:17
That one, the output can be affected.
12:52:02
The funny thing about u8_charnum is that it actually functions  
properly with a NUL-terminated string.
12:52:27
It checks the next byte to see if it's the start of a UTF-8 sequence.
12:52:49
A NUL byte *is* such a header byte, so the loop terminates.
12:53:03
sysreq left the room (quit: Ping timeout: 255 seconds).

creamygoodness
IRC
12:53:27
Or rather, the or-chain terminates.
12:53:54
sysreq [~sysreq at c-76-19-96-103.hsd1.ma.comcast.net] entered the room.
12:53:54
sysreq left the room (quit: Changing host).
12:53:54
sysreq [~sysreq at unaffiliated/sysreq] entered the room.

creamygoodness
IRC
12:53:58
I suppose we could add a valgrind suppression for that.
12:54:55
But I think the patch I supplied also does the job.

creamygoodness
IRC
12:58:54
_hc Here's some valgrind output from me: http://pastebin.com/mfYHuRgE

_hc
IRC
12:59:21
how do you get the function trace in it?
12:59:35
oh wait,I have that too

creamygoodness
IRC
12:59:57
That's what I get after firing up "valgrind ./src/pd" (for vanilla),  
creating a new doc, creating a new object, and typing "abcde".

_hc
IRC
1:00:18
I think that if changing the u8_* funciton will slow it down, then I  
think we should try to fix the code in g_rtext.c
1:00:23
so that it null terminates

creamygoodness
IRC
1:02:34
Well, I don't know how deep the assumption is in Pd that strings don't  
need NUL-termination.
1:03:09
This might be a local bug, or it might not.
1:05:00
I can probably find a way to crash Pd consistently based on u8_offset,  
though.
1:05:30
That would take work, though.
1:05:36
As far as speed
1:05:59
I don't have measurements
1:06:01
sysreq left the room (quit: Ping timeout: 252 seconds).

creamygoodness
IRC
1:07:16
I don't think the patched routines are likely to have added anything  
meaningful, though they do add ops.
1:08:08
How much raw throughput of UTF-8 does Pd do?
1:08:24
I'm guessing that it's less than the fulltext search engine I work on.
1:10:17
And if you want fast, i would probably use a different algorithm in  
the first place rather than go byte by byte.
1:10:56
Instead, you'd want to jump from header byte to header byte using the  
trailing-bytes table.

creamygoodness
IRC
1:18:03
_hc: When the text object shows up in rtext_senditup, it is already  
missing NUL-termination.

creamygoodness
IRC
1:18:20
Where would I look to find where the problem might arise?

_hc
IRC
1:18:24
creamygoodness: well, its a realtime engine that communicates thru the  
utf8 stuff, so its good to have it efficient

creamygoodness
IRC
1:19:11
Understood.

CIA-81
IRC
1:22:42
pure-data: eighthave * r15546 /trunk/externals/creb/modules/ (24  
files): converted float to t_float to support double-precision Pd,  
creb still needs to separate t_float and t_sample tho

creamygoodness
IRC
1:26:48
I can produce another variant of u8_charnum and u8_offset which is  
likely faster.
1:37:55
kk [~kareem at 41.233.110.109] entered the room.
1:38:21
kk is now known as Guest87522

_hc
IRC
1:38:23
that would be even better

creamygoodness
IRC
1:38:57
It would use the same technique as the Perl core (and I think Python  
too).
1:39:31
KareemK left the room (quit: Ping timeout: 260 seconds).

creamygoodness
IRC
1:39:42
The primary difference is that it will potentially produce different  
answers for malformed sequences.

matju
IRC
3:06:08
creamygoodness: what does your submitted fix cover ?
3:08:13
creamygoodness: and according to you, what can produce malformed  
sequences, at this point ?
3:08:42
creamygoodness: afaik, it would mostly only be loading iso-latin-1  
patches and similar
3:09:07
creamygoodness: this can be autodetected and autoconverted. i think  
XChat does this, but if not, some other IRC client(s) have source code  
for that.

On Oct 8, 2011, at 11:08 AM, Marvin Humphrey wrote:

> Greets,
>
> It looks like a lot of the symbols in s_utf8.h and s_utf8.c are not  
> used
> anywhere in Vanilla.
>
> These are used:
>
>    isutf
>    u8_wc_nbytes
>    u8_wc_toutf8
>    u8_wc_toutf8_nul
>    u8_offset
>    u8_charnum
>    u8_inc
>    u8_dec
>
> These are not:
>
>    u8_toucs
>    u8_toutf8
>    u8_wcs_nbytes         (only declaration in s_utf8.h)
>    u8_inc_ptr
>    u8_dec_ptr
>    u8_nextchar
>    u8_strlen             (only definition in s_utf8.c)
>    trailingBytesForUTF8  (static array in s_utf8.c)
>    offsetsFromUTF8       (static array in s_utf8.c)
>
> While trying to clean up memory errors from Object text editing, it  
> became
> apparent that the UTF-8 manipulation library that Pd uses is  
> mismatched: Pd's
> strings apparently do not have NUL-terminatation, but several of the  
> UTF-8
> manipulation routines malfunction without it.  I've supplied a patch  
> for three
> such routines, but it seems better to eliminate any that we don't  
> need and
> remove the maintenance burden rather than attempt to fix them.
>
> Would a patch that removes all the unused code from s_utf8.h and  
> s_utf8.c be
> well-received?  If a need arises in the future for any of the deleted
> functions, they can be resurrected from version control and vetted  
> at that
> time.
>
> Alternately, would it be worthwhile to explore the possibility of  
> relying on
> UTF-8 manipulation routines exposed via "tcl.h" and eliminating  
> s_utf8.h and
> s_utf8.c entirely?
>
> Marvin Humphrey
>
>
> _______________________________________________
> Pd-dev mailing list
> Pd-dev at iem.at
> http://lists.puredata.info/listinfo/pd-dev



----------------------------------------------------------------------------

There is no way to peace, peace is the way.       -A.J. Muste





More information about the Pd-dev mailing list