[PD-dev] Shrink s_utf8.[ch]?
Hans-Christoph Steiner
hans at at.or.at
Sun Oct 9 01:23:36 CEST 2011
We talked a bunch in #dataflow on this topic, so I thought I'd post
the transcript here:
creamygoodness
IRC
11:16:06
mescalinum: I'm curious whether the patch to s_utf8.c I supplied
eliminates the segfault you described the other day.
mescalinum
IRC
11:38:33
creamygoodness, I'm unable to build pd right now
creamygoodness
IRC
11:39:14
haha, I guess it will wait.
11:39:40
The errors corrected by that patch could potentially cause segfaults.
_hc
IRC
11:39:41
creamygoodness: good morning
creamygoodness
IRC
11:39:54
Good morning, Hans.
_hc
IRC
11:40:03
I read your email, does tcl.h provide everything needed?
11:40:26
I think its a good idea, but I think Miller thinks of pd core as not
being dependent on Tcl
creamygoodness
IRC
11:40:30
I don't know -- I just glanced at it.
11:40:36
Ah, that was why I asked.
11:40:47
I noticed that tcl.h was not included in any file
11:40:54
though it was in the autconf.
_hc
IRC
11:41:03
right, Pd no longer uses the C interface to Tcl at all
11:41:15
it used to, because it had a C component of the GUI
creamygoodness
IRC
11:41:28
OK, then it seems fixing s_utf8.[ch] is the way to go.
_hc
IRC
11:41:28
but the core never needed it because it talks to Tcl thru a network
socket
11:41:37
yeah, probably
creamygoodness
IRC
11:41:40
Make sense.
11:43:23
BTW, the mechanism for the potential segfault is that u8_offset can
give the wrong answer and overshoot depending on what junk is after
the end of the non-NUL-terminated buffer, locating the cursor too far
ahead and potentially outside of a write buffer.
_hc
IRC
11:43:55
I guess the functions need a length arg then? How was this supposed to
work?
11:44:22
what about leaving the functions as is, and making sure the strings
are NULL terminated?
creamygoodness
IRC
11:44:47
I think Tcl uses non-nul-terminated strings?
_hc
IRC
11:46:01
hmm, internally in its implementation?
creamygoodness
IRC
11:46:19
I'm pretty sure the functions would be fine for NUL-terminated strings.
_hc
IRC
11:46:19
or when sending things via the network socket?
creamygoodness
IRC
11:47:00
Meaning that the UTF-8 representation of a Tcl_Obj doesn't necessarily
have NUL-termination.
11:47:17
I'm not sure about that, but I saw it go past in #tcl the other day.
_hc
IRC
11:47:24
ok, but we're not talking to Tcl_Obj directly anywhere
11:47:35
the communications are thru a network socket
11:47:54
pd --> pd-gui is Tcl code
11:47:59
pd-gui --> is Pd messges
11:48:27
do you have an example of a message or string that is not null-
terminated?
creamygoodness
IRC
11:49:24
The errors I was seeing under valgrind are were in g_rtext.c.
11:50:18
The object's x_buf was allocated to 1, contained a single meaningful
character and was not NUL-terminated.
11:51:09
For instance, here:
11:51:10
int x_bufsize_c = u8_charnum(x->x_buf, x->x_bufsize);
matju
IRC
11:51:28
creamygoodness: are you sure about the text representation of a
Tcl_Obj ? afaik, all interpreters I read about, add a nul-terminator
in the buffer but outside of the «official content», just to keep libc-
style functions happy.
creamygoodness
IRC
11:52:04
No, I'm not sure about it. I'd have to research it.
matju
IRC
11:52:12
creamygoodness: ... because then, calls that have to be done to libc
don't need a malloc & memcpy just to produce the kind of string that
libc wants.
creamygoodness
IRC
11:53:01
Yes, but the tradeoff is that your string operations must always
insert nul termination that you don't need -- and won't easily detect
when it's missing.
matju
IRC
11:53:31
creamygoodness: but i corroborate your report about g_rtext utf8
problem... i saw an Invalid Read of size 1 in Valgrind the other day
creamygoodness
IRC
11:54:33
I was able to trigger it in three different routines.
matju
IRC
11:55:03
creamygoodness: if you have one common api for producing strings, and
it's used all over, then you don't need to copy-paste all over the
code something for writing the nul-terminator...
creamygoodness
IRC
11:57:23
Yes. I've written an object model myself, and a string manipulation
library, so I understand what you're getting at.
11:57:42
But nevertheless, I hold up Pd itself as an example of where NUL-
termination is not always done religiously, but stuff still pretty
much works.
_hc
IRC
11:58:37
creamygoodness: the utf-8 patch author, Bryan Jurish, says: http://lists.puredata.info/pipermail/pd-dev/2009-03/013165.html
+ modified and trimmed versions of the Bezanson code. I actually only
use 3 or 4 functions of these, so it could conceivably be trimmed down
even further; I just cut out the stuff I don't expect to need (escapes,
vprintf, ...).
creamygoodness
IRC
12:00:09
matju: My personal preference is to harden all components so that you
do not have tight coupling. In this case that means making the UTF-8
manipulation code function properly in either the presence or absence
of NUL-termination.
_hc
IRC
12:00:49
creamygoodness: how can you trigger a non-Null string?
matju
IRC
12:00:51
creamygoodness: i only ever expected to fix u8_charnum, really.
matju
IRC
12:01:03
_hc: a what ?
creamygoodness
IRC
12:01:26
_hc: Just type text into an Object.
creamygoodness
IRC
12:01:58
The x_buf is sized to the text, and doesn't include NUL-termination.
12:02:24
If you want to see the memory errors, run under valgrind.
_hc
IRC
12:02:35
right, I'm new to valgrind, installing now
12:02:47
its been on my todo list to try for years
creamygoodness
IRC
12:02:50
It's a great tool. I love it.
12:02:58
It's very easy to get started with.
_hc
IRC
12:03:24
matju: a a string without null temrination
matju
IRC
12:06:21
_hc: we're talking about g_rtext.c
12:08:06
_hc: note the use of %.*s in g_rtext.c because the string is not null-
terminated, so it has to «be artificially truncated» at the point
where it ends
creamygoodness
IRC
12:09:24
_hc: regarding Jurish's message... ok, shrinking seems like a good
thing. I also did a check of Pd-extended to see if anybody had used
these routines and didn't turn anything up, so I think we're safe.
creamygoodness
IRC
12:10:14
That limits the scope of changes to the UTF-8 manipulation code.
_hc
IRC
12:10:50
creamygoodness: these routines are new in 0.43, so people haven't had
time to keep them around
12:10:53
I mean use them
creamygoodness
IRC
12:11:14
Is s_utf8.h an official public API?
_hc
IRC
12:12:06
its in the semi-public, close to private, part
12:12:18
m_pd.h is the official public API
creamygoodness
IRC
12:12:26
Heh.
_hc
IRC
12:12:27
the rest are not fully public
12:12:45
but range in degree of privateness
12:12:57
maxster [~maxster at dsl-69-172-92-231.acanac.net] entered the room.
_hc
IRC
12:12:58
g_canvas.h is closer to public
creamygoodness
IRC
12:13:06
Messy, but I understand.
_hc
IRC
12:13:12
s_utf8.h is probably the closest to private of the headers
_hc
IRC
12:33:28
wow, valgrind makes pd run slooow
matju
IRC
12:33:30
creamygoodness: maybe that's what MSc/PhD does to people
creamygoodness
IRC
12:33:53
_hc Yeah, like 40x slower.
matju
IRC
12:34:07
_hc: that's normal... valgrind emulates much of your CPU, and makes
*lots* of bugchecks, ALL of the time.
creamygoodness
IRC
12:34:16
Also, memory consumption is through the roof.
matju
IRC
12:34:42
creamygoodness: not nearly as bad as Bruce Perens' Electric Fence
creamygoodness
IRC
12:34:56
Haven't used that one.
_hc
IRC
12:36:12
ok, so typing the first two chars seems to make an error
12:36:39
but not always
creamygoodness
IRC
12:36:39
Yes. That's probably the u8_charnum line I quoted earlier.
matju
IRC
12:36:53
creamygoodness: i tried it at first, before trying valgrind...
libefence is the kind of thing that allocates an extra 4k for every
tiny malloc you do. In Pd it's often almost tolerable, but when trying
to debug the Ruby interpreter, boom, it wouldn't end swapping
_hc
IRC
12:36:56
it was only the first object I created, now the other sseem fine
12:36:57
groolot [~groolot at 84.6.90.253] entered the room.
creamygoodness
IRC
12:37:50
It's going to depend on the overallocation of x_buf.
12:38:05
groolot left the room.
creamygoodness
IRC
12:38:05
That could be system dependent.
matju
IRC
12:38:46
valgrind assumes that there is no overallocation... that's the
strictest assumption that leads to the safest code.
12:39:05
oh
12:40:02
well, there's possible overallocation in malloc(), but if you mean
overallocation in the application code, this can't be trapped, because
valgrind can only trap malloc() and what the app tells malloc.
12:41:09
there are often complicated, tricky reasons why the bug triggers only
part of the time on valgrind or other, but it's not necessarily always
worth looking at why it's the case...
creamygoodness
IRC
12:42:14
Not sure. But I can produce errors seemingly more easily than Hans.
IIRC, each character I type produces invalid read errors.
_hc
IRC
12:42:28
hmm, just found another error "==25249== Conditional jump or move
depends on uninitialised value(s)", seems to be Mac OS X-specific
creamygoodness
IRC
12:43:12
My methodology for bughunting with valgrind is to use it in
conjunction with debug printing to stderr.
matju
IRC
12:43:32
_hc: Conditional jump [...] happens in s_utf8.c, or is it elsewhere ?
_hc
IRC
12:43:56
elsehwere, looks like the CoreAudio driver
matju
IRC
12:44:02
creamygoodness: yeah, i do that too.
_hc
IRC
12:44:21
hmm, acutally in s_audio_pa.c
matju
IRC
12:44:49
_hc: several libraries do things that trigger warnings in Valgrind,
though some of them are really harmless
12:45:29
_hc: on linux, libdl and libX11 both give lots of warnings, and there
are others
creamygoodness
IRC
12:45:32
Right, you often need a "suppressions" file to quiet meaningless
warnings.
12:45:59
Well, you could actually argue that u8_charnum is such a routine.
matju
IRC
12:46:01
_hc: oh, if it's in s_audio_pa.c, then you have much more of an
opportunity to fix it... otherwise, use suppressions for ignoring
creamygoodness
IRC
12:46:17
The invalid reads in u8_charnum do not affect the output.
12:46:33
It just reads too far after it should have stopped.
matju
IRC
12:48:20
creamygoodness: imho, invalid reads in u8_charnum *can* affect the
output, by 1, sometimes. I don't say that by logging, just by reading
the code... unless i read it wrong
12:48:50
creamygoodness: mmm wait
creamygoodness
IRC
12:48:53
Well, I could be wrong too!
matju
IRC
12:48:58
creamygoodness: you're right.
creamygoodness
IRC
12:49:04
I'm more concerned about u8_offset.
12:49:17
That one, the output can be affected.
12:52:02
The funny thing about u8_charnum is that it actually functions
properly with a NUL-terminated string.
12:52:27
It checks the next byte to see if it's the start of a UTF-8 sequence.
12:52:49
A NUL byte *is* such a header byte, so the loop terminates.
12:53:03
sysreq left the room (quit: Ping timeout: 255 seconds).
creamygoodness
IRC
12:53:27
Or rather, the or-chain terminates.
12:53:54
sysreq [~sysreq at c-76-19-96-103.hsd1.ma.comcast.net] entered the room.
12:53:54
sysreq left the room (quit: Changing host).
12:53:54
sysreq [~sysreq at unaffiliated/sysreq] entered the room.
creamygoodness
IRC
12:53:58
I suppose we could add a valgrind suppression for that.
12:54:55
But I think the patch I supplied also does the job.
creamygoodness
IRC
12:58:54
_hc Here's some valgrind output from me: http://pastebin.com/mfYHuRgE
_hc
IRC
12:59:21
how do you get the function trace in it?
12:59:35
oh wait,I have that too
creamygoodness
IRC
12:59:57
That's what I get after firing up "valgrind ./src/pd" (for vanilla),
creating a new doc, creating a new object, and typing "abcde".
_hc
IRC
1:00:18
I think that if changing the u8_* funciton will slow it down, then I
think we should try to fix the code in g_rtext.c
1:00:23
so that it null terminates
creamygoodness
IRC
1:02:34
Well, I don't know how deep the assumption is in Pd that strings don't
need NUL-termination.
1:03:09
This might be a local bug, or it might not.
1:05:00
I can probably find a way to crash Pd consistently based on u8_offset,
though.
1:05:30
That would take work, though.
1:05:36
As far as speed
1:05:59
I don't have measurements
1:06:01
sysreq left the room (quit: Ping timeout: 252 seconds).
creamygoodness
IRC
1:07:16
I don't think the patched routines are likely to have added anything
meaningful, though they do add ops.
1:08:08
How much raw throughput of UTF-8 does Pd do?
1:08:24
I'm guessing that it's less than the fulltext search engine I work on.
1:10:17
And if you want fast, i would probably use a different algorithm in
the first place rather than go byte by byte.
1:10:56
Instead, you'd want to jump from header byte to header byte using the
trailing-bytes table.
creamygoodness
IRC
1:18:03
_hc: When the text object shows up in rtext_senditup, it is already
missing NUL-termination.
creamygoodness
IRC
1:18:20
Where would I look to find where the problem might arise?
_hc
IRC
1:18:24
creamygoodness: well, its a realtime engine that communicates thru the
utf8 stuff, so its good to have it efficient
creamygoodness
IRC
1:19:11
Understood.
CIA-81
IRC
1:22:42
pure-data: eighthave * r15546 /trunk/externals/creb/modules/ (24
files): converted float to t_float to support double-precision Pd,
creb still needs to separate t_float and t_sample tho
creamygoodness
IRC
1:26:48
I can produce another variant of u8_charnum and u8_offset which is
likely faster.
1:37:55
kk [~kareem at 41.233.110.109] entered the room.
1:38:21
kk is now known as Guest87522
_hc
IRC
1:38:23
that would be even better
creamygoodness
IRC
1:38:57
It would use the same technique as the Perl core (and I think Python
too).
1:39:31
KareemK left the room (quit: Ping timeout: 260 seconds).
creamygoodness
IRC
1:39:42
The primary difference is that it will potentially produce different
answers for malformed sequences.
matju
IRC
3:06:08
creamygoodness: what does your submitted fix cover ?
3:08:13
creamygoodness: and according to you, what can produce malformed
sequences, at this point ?
3:08:42
creamygoodness: afaik, it would mostly only be loading iso-latin-1
patches and similar
3:09:07
creamygoodness: this can be autodetected and autoconverted. i think
XChat does this, but if not, some other IRC client(s) have source code
for that.
On Oct 8, 2011, at 11:08 AM, Marvin Humphrey wrote:
> Greets,
>
> It looks like a lot of the symbols in s_utf8.h and s_utf8.c are not
> used
> anywhere in Vanilla.
>
> These are used:
>
> isutf
> u8_wc_nbytes
> u8_wc_toutf8
> u8_wc_toutf8_nul
> u8_offset
> u8_charnum
> u8_inc
> u8_dec
>
> These are not:
>
> u8_toucs
> u8_toutf8
> u8_wcs_nbytes (only declaration in s_utf8.h)
> u8_inc_ptr
> u8_dec_ptr
> u8_nextchar
> u8_strlen (only definition in s_utf8.c)
> trailingBytesForUTF8 (static array in s_utf8.c)
> offsetsFromUTF8 (static array in s_utf8.c)
>
> While trying to clean up memory errors from Object text editing, it
> became
> apparent that the UTF-8 manipulation library that Pd uses is
> mismatched: Pd's
> strings apparently do not have NUL-terminatation, but several of the
> UTF-8
> manipulation routines malfunction without it. I've supplied a patch
> for three
> such routines, but it seems better to eliminate any that we don't
> need and
> remove the maintenance burden rather than attempt to fix them.
>
> Would a patch that removes all the unused code from s_utf8.h and
> s_utf8.c be
> well-received? If a need arises in the future for any of the deleted
> functions, they can be resurrected from version control and vetted
> at that
> time.
>
> Alternately, would it be worthwhile to explore the possibility of
> relying on
> UTF-8 manipulation routines exposed via "tcl.h" and eliminating
> s_utf8.h and
> s_utf8.c entirely?
>
> Marvin Humphrey
>
>
> _______________________________________________
> Pd-dev mailing list
> Pd-dev at iem.at
> http://lists.puredata.info/listinfo/pd-dev
----------------------------------------------------------------------------
There is no way to peace, peace is the way. -A.J. Muste
More information about the Pd-dev
mailing list