[PD-dev] [ pure-data-Patches-3420484 ] Invalid memory access in s_utf8.c

Sun Oct 9 22:04:47 CEST 2011

Patches item #3420484, was opened at 2011-10-07 20:10
Message generated for change (Comment added) made by creamygoodness
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=478072&aid=3420484&group_id=55736

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: puredata
Group: bugfix
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Marvin Humphrey (creamygoodness)
Assigned to: Nobody/Anonymous (nobody)
Summary: Invalid memory access in s_utf8.c

Initial Comment:
After hearing on IRC that segfaults could be produced while editing text in an
Object box under gdb, I tried running the Vanilla main git branch under
Valgrind:

    valgrind ./src/pd

Valgrind reported many "invalid read" memory errors, which on inspection, all 
arose from the same root cause: these strings are not NULL-terminated, but the 
UTF-8 handling code in s_utf8.c assumes NULL-termination and malfunctions in
its absence.

The attached patch suffices to eliminate the read errors. However, it does not 
address all the problems in s_utf8.c; other functions will require similar
treatment.

I have attempted make the new code as compatible with the old code as
possible, for instance preserving its dubious algorithm for handling missing
or excess continuation bytes.  Hopefully I got everything right, though it is
hard to be certain with bit-twiddling code like this in the absence of 
thorough unit tests.

----------------------------------------------------------------------

Comment By: Marvin Humphrey (creamygoodness)
Date: 2011-10-09 13:04

Message:
I have uploaded a new version of the patch for s_utf8.c, and also a new
version of the benchmarking app.  Highlights:

1. The benchmarking app now handles four different functions.
2. All changed functions have been sped up since the last patch,
especially u8_charnum().

u8_offset: 16% faster
u8_charnum: 22% faster
u8_inc: 10% faster

----------------------------------------------------------------------

Comment By: Hans-Christoph Steiner (eighthave)
Date: 2011-10-09 09:54

Message:
This sounds great, I'll include it in Pd-extended and we'll see how it does
there.  The only problem is that there are two patches included in this
tracker and they seem to have the same name.  Could you delete the older
one, so there is only one patch?

----------------------------------------------------------------------

Comment By: Marvin Humphrey (creamygoodness)
Date: 2011-10-08 17:08

Message:
In response to speed concerns raised in an IRC conversation ("well, its a
realtime engine that communicates thru the utf8 stuff, so its good to have
it
efficient"), I have prepared a benchmark program to test the performance
of
u8_offset().  It reads a file into RAM, then traverses it with u8_offset()
a
user-specified number of iterations.

Here are some results using the French wikipedia page on Pure Data as
source
material.

Without patch:

    marvin at smokey:~/projects/pure-data $ cc -Wall -Wextra -std=gnu99 -O3
bench.c src/s_utf8.c -o bench
    marvin at smokey:~/projects/pure-data $ time ./bench pd.html 10000
    pd.html (44683 bytes) 10000 iterations

    real	0m0.732s
    user	0m0.716s
    sys	0m0.003s

Running the test 10 times produced a range of 0.724s to 0.733s, so the
benchmark was pretty stable even on a Mac. :)

With patch:

    marvin at smokey:~/projects/pure-data $ cc -Wall -Wextra -std=gnu99 -O3
bench.c src/s_utf8.c -o bench
    marvin at smokey:~/projects/pure-data $ time ./bench pd.html 10000
    pd.html (44683 bytes) 10000 iterations

    real	0m0.618s
    user	0m0.612s
    sys	0m0.004s

So, in addition to eliminating the Valgrind warnings, the patched version
is
actually faster.  I speculate that this is because the current version of
u8_offset() needlessly traverses at least one extra byte when the header
byte of
the sequence is plain ASCII.

For what it's worth, if we remove the NUL-termination check from the loop
condition...

    -    while (charnum > 0 && str[offs]) {
    +    while (charnum > 0) {

We get even better:

    marvin at smokey:~/projects/pure-data $ time ./bench pd.html 10000
    pd.html (44683 bytes) 10000 iterations

    real	0m0.514s
    user	0m0.507s
    sys	0m0.004s

However, that could yield bogus results in the event that strlen(string)
differs from the buffer length held in a separate integer.  I would go
that
route in code which had been engineered to not need NUL-termination from
the
start, but wouldn't advocate for it here.

As a further test, I tried an implementation based on u8_seqlen() which
has
the advantage of being easier to understand, and is the algorithm
currently
used by the Perl core among others:

	while (charnum-- > 0 && str[offs]) {
		offs += u8_seqlen(str + offs);
	}

It performed substantially worse:

    marvin at smokey:~/projects/pure-data $ time ./bench pd.html 10000
    pd.html (44683 bytes) 10000 iterations

    real	0m1.906s
    user	0m1.891s
    sys	0m0.003s

In this case, I speculate that the lookup into the trailingBytesForUTF8
array
is costly, even though it doubtless gets into the hottest CPU cache and
stays
there.

The benchmarks were run on a 2007 Macbook Pro with a 2.4 GHz Core Duo
chipset running OS X 10.6 (Snow Leopard).

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=478072&aid=3420484&group_id=55736