[PD-dev] broken msgs cause Tcl traces, how to debug it?

Mon Jan 23 22:05:51 CET 2012

Ok, I think I nailed it with lots of help from #tcl on freenode.  It not
only seems to fix the intermittent problem, but also seems to be a bit
faster, due to using Tcl properly.  I attached the patch I just
committed to pd-extended.  The while {[info complete]} loop did a lot of
extra work because it was running again and again on the same chunk. 
That is mostly avoided by first checking if the block ends in a '\n'. 
In my tests, I never saw [info complete] get called or the "missing
close-brace" error get thrown, everything was caught by the test for
'\n' at the end.  Then, I removed all [read] calls and recursive calls
to [pd_readsocket] because they'll just be working on an empty buffer. 
Instead the code now waits for the next [fileevent], where Tcl tells us
that there is indeed more data in the buffer.

Here's a lightly edited transcript of the session:

(02:32:35 PM) _hc: thommey kbk Gotisch aku: ok, I changed the
implementation to use [gets] and the problem is the same (but in general
execution seems a little faster than my previous approach using [read])
(02:33:01 PM) _hc: I still get the occasional breaks at the wrong place
at the 48k mark
(02:34:23 PM) _hc: http://pastebin.com/trwpnkg7
(02:34:26 PM) _hc: thats' the proc
(02:34:39 PM) _hc: I tried -buffering none and line 
(02:34:45 PM) _hc: I tried -buffersize 1000000
14:35
(02:35:12 PM) Setok left the room (quit: Quit: Setok).
(02:38:44 PM) ijchain: <aku> As I said, the OS may not deliver the whole
thing to you in one packet / readable callback. After the loop 11-14 you
do not know of the command is complete yet, or not, without checking via
[info complete].
(02:39:22 PM) ijchain: <dgp> syntactical completeness and message
completeness are different things as well.
14:40
(02:41:05 PM) ijchain: <dgp> perhaps [gets] has already taken care of
that by delivering only compelte lines.
(02:41:20 PM) ijchain: <aku> The recursive calls of the handler do not
look good to me either, not really. Because we are at a point where
[gets] returned -1, so we should not expect that we immediately get more
data in the recursive call.
(02:41:21 PM) _hc: I think [gets] is supposed to do that
(02:41:49 PM) _hc: the recursive handler wasn't called at all in my
tests
(02:41:50 PM) ijchain: <dgp> Is the protocol defined so that only
newline is used to terminate commands, and never the semicolon?
(02:42:09 PM) _hc: the gets protocol?
(02:42:24 PM) ijchain: <dgp> the protocol by which these two programs
communicate over the socket.
(02:43:17 PM) ijchain: <dgp> If unconstrained Tcl scripts are the unit
of communication, be sure you handle semicolons correctly according to
the needs of your program.
(02:43:21 PM) _hc: the protocol from C->tcl is literally UTF-8 Tcl code
(02:43:28 PM) _hc: that is executed
(02:43:40 PM) _hc: see the uplevel
(02:44:38 PM) ijchain: <aku> Is it possible for a single command to span
multiple lines ?
(02:44:42 PM) _hc: semi-colons are handled.  Its an old program, started
in '95.  We just recently wrote the GUI side of it.  Previuosly we used
C to handle the networking on the Tcl side too, but I switched the code
to pure Tcl
(02:44:44 PM) ijchain: <dgp> yes
(02:44:58 PM) _hc: this bug has only arisen in the pure Tcl
implementation
14:45
(02:45:06 PM) ijchain: <dgp> set var {a
(02:45:06 PM) ijchain: <dgp> b}
(02:45:23 PM) ijchain: <dgp> or, more commonly...
(02:45:33 PM) _hc: the problematic breaks always come in the middle of
commands, let me post some examples, one sec
(02:45:41 PM) ijchain: <dgp> proc foo {} {
(02:45:41 PM) ijchain: <dgp> many
(02:45:41 PM) ijchain: <dgp> lines
(02:45:41 PM) ijchain: <dgp> here
(02:45:41 PM) ijchain: <dgp> }
(02:46:09 PM) ijchain: <dgp> "middle of commands" means not at a
newline?
(02:46:54 PM) Moryanta [Timmy at avalon.zpm.be] entered the room.
(02:48:47 PM) ijchain: <dgp> program written in '95 == pre-unicode
(02:48:54 PM) _hc: http://pastebin.com/1kuntv3m
(02:49:04 PM) _hc: yup, we ported it to UTF-8
(02:49:32 PM) awb [~awb at 12.206.216.226] entered the room.
(02:49:37 PM) _hc: this problem is intermittent, it doesn't happen in
the same place everytime, sometimes the same code will work fine
14:50
(02:50:06 PM) ijchain: <dgp> help me with the paste.
(02:50:07 PM) _hc: if you look at the pastebin, that's exactly the code
from C(pd) --> Tcl(pd-gui)
(02:50:12 PM) ijchain: <dgp> what should I be looking at?
(02:50:21 PM) _hc: if you scroll down to the long list of numners
(02:50:26 PM) _hc: that's a long Tcl command
(02:50:40 PM) _hc: the spot where there are some blank lines is where it
gets cut sometimes
(02:50:52 PM) ijchain: <aku> And where is the break ?
(02:50:56 PM) ijchain: <aku> x-ed, ok
(02:51:03 PM) _hc: so the Tcl stacktrace says "(Tcl) INVALID COMMAND
NAME: invalid command name "450"
(02:51:03 PM) _hc:     while executing"
(02:51:10 PM) ijchain: <dgp> what happened at line 119 ?
(02:51:16 PM) _hc: yup, line 119
(02:51:31 PM) _hc: and that mark always lines up to be a 48k chunk of
code
(02:51:45 PM) ijchain: <aku> Well, the loop lines 11-14 in your code has
obviously collected only the lines up to 118 and the next fileevent call
got 119+
(02:51:46 PM) _hc: in the test that I have right now at least
(02:51:47 PM) ijchain: <dgp> This is the data as sent? as received?
(02:52:07 PM) _hc: aku, it works most of the time, and only sometimes
gets cut there
(02:52:18 PM) _hc: dgp that is the data received by Tcl
(02:52:40 PM) ijchain: <dgp> our logic of splicing things together must
be faulty
(02:52:41 PM) ijchain: <aku> That is why I tell you three times now,
that you have to check your buffer with [info complete], and wait for
the next fileevent call to supply the rest of the command
(02:52:47 PM) ijchain: <dgp> not taking escaped newlines into account
(02:53:10 PM) _hc: aku: we had code that used [info complete] in it
before, but it slowed things down a lot
(02:53:26 PM) ijchain: <dgp> correctness, then speed
(02:53:44 PM) _hc: a lot, like in hampering usability
(02:53:52 PM) _hc: usability, then correctness, then speed
(02:53:57 PM) ijchain: <dgp> no
(02:54:00 PM) ijchain: <aku> What I see in your code is that your buffer
(cmd_from_pd) is not global/persistent between fileevents also, which
you need for that.
(02:54:02 PM) qru: http://pastebin.com/1sfNZiLq
(02:54:28 PM) ijchain: <dgp> Your solution will probably end up being
stop inserting the escaped newlines.
(02:54:31 PM) qru: That works to auth with a splunk server in http 2.5
but not 2.7. Any ideas?
(02:54:43 PM) ijchain: <dgp> That's handy for readable code for humans. 
Does nothing here.
(02:54:54 PM) _hc: at line 60, you can see our [info complete]
implementaion:
http://pure-data.git.sourceforge.net/git/gitweb.cgi?p=pure-data/pure-data;a=blob;f=tcl/pd_connect.tcl;h=5bb3f29f9962a6c59c667208e2b2ec00446b91db;hb=HEAD
14:55
(02:55:00 PM) ijchain: <dgp> % info complete "foo bar\\\n"
(02:55:01 PM) ijchain: <dgp> 0
(02:55:06 PM) ijchain: <aku> Are you checking info complate after each
line received, or after the full block has been adeed to the buffer ?
(02:55:31 PM) _hc: blocks
(02:55:36 PM) _hc: but that was with [read]
(02:55:39 PM) ijchain: <dgp> qru, just a sec
(02:57:22 PM) ijchain: <dgp> qru, what's the value of $::http::strict
when working, and when not working?
(02:57:59 PM) _hc: the thing about this issue is that its intermittent,
most of the time everything works fine
(02:58:16 PM) _hc: but when it fails, it seems to fail in a regular way
(02:58:25 PM) _hc: i.e. the 48k block
(02:58:32 PM) _hc: maybe that's misleading
(02:58:52 PM) qru: dgp: I've tried both 1 and 0 with -strict and
::http::geturl
(02:59:13 PM) ijchain: <aku> networking and the slicing and dicing of a
character stream into packets, and re-assembly etc is non-deterministic.
Sometimes things align, sometimes they don't.
(02:59:31 PM) qru: Auth logs from Splunk show the attempt and the
username is correctly parsed.
(02:59:46 PM) qru: And the encoding is identical for the passwd.
(02:59:52 PM) qru: In each case.
(02:59:53 PM) ijchain: <aku> Looking at the read/info complete code. ...
15:00
(03:00:13 PM) ijchain: <aku> I do not understand why you are trying
multiple reads in the pd_readsocket.
(03:02:00 PM) ijchain: <aku> I.e. If the data is not complete you should
from the handler. You have to save cmd_from_pd of course, so that you
can add to it on the next call of the handler.
(03:02:13 PM) ijchain: <aku> s/should from/should return from
(03:03:40 PM) ***ijchain ferrieux has become available
(03:04:33 PM) ijchain: <aku> With [gets] you can get away with reading
multiple lines in the handler, until [gets] signals that the line is not
complete. But with [read] without a limit as in your code you have the
whole data the OS provided at this moment and a second read will not
provide more.
15:05
(03:05:07 PM) ijchain: <dgp> The busy [read] loop might explan perceived
poor performance though.
(03:05:57 PM) ijchain: <dgp> aku, isn't the [eof] test before reading a
misstep as well?
(03:07:16 PM) ijchain: <aku> Not really, because if [read] generated eof
you see it at the _next_ handler call. It makes code easier to write,
with checks and exceptions first, then read and handling the result of
that and able to simply return
(03:07:51 PM) ijchain: <aku> in the processing, no need to bring
everything back to a single point where we check eof
(03:11:11 PM) _hc: ok, that bit about waiting for the next fileevent
before calling [read] is interesting, I'll try that
(03:11:27 PM) _hc: any thoughts on whether [read] or [gets] will be more
efficient?
(03:11:35 PM) ijchain: <dgp> get it right first
(03:12:15 PM) ijchain: <dgp> _hc, I think the logic is wrong
(03:12:31 PM) ijchain: <dgp> hmmm....not sure.
(03:12:32 PM) _hc: dgp: which logic?
(03:12:43 PM) ijchain: <dgp> the while loop condition
(03:13:29 PM) _hc: the [info complete] while loop?
(03:13:30 PM) ijchain: <dgp> First you test whether it's [info
complete].
(03:13:47 PM) ijchain: <dgp> And then you test whether it ends in \n.
(03:14:36 PM) ***ijchain ferrieux wonder whether [info complete $x\n]
has been mentioned for the escaped-newline issue
(03:14:40 PM) ijchain: <ferrieux> wonders
15:15
(03:15:00 PM) ijchain: <dgp> not mentioned.
(03:15:08 PM) ijchain: <dgp> we can't even accomplish simple things.
(03:15:13 PM) ijchain: <ferrieux> ah
(03:15:26 PM) ijchain: <ferrieux> where's the current code ?
(03:16:09 PM) ijchain: <dgp> _hc, I'd at least reverse the tests.
(03:19:09 PM) ijchain: <dgp> you need [read -nonewline] ?
15:20
(03:21:01 PM) ijchain: <dgp> I give up; would be lots faster to just
give you a working solution instead of tinker with what you have.
_hc: aku kbk dgp thommey Gotisch: I think we have a winner, here's my
new proc: http://pastebin.com/0ZzC4Vcz
03:31:52 PM) Gotisch: _hc: what happens if you get set a "1"; foreach b
[c] {
(03:32:00 PM) ijchain: <stu> Heh:
(03:32:00 PM) ijchain: <stu> C == Slide Rule
(03:32:00 PM) ijchain: <stu> C++ == Guy with a piece of paper in front
of him and a pencil up his nose
(03:32:04 PM) Gotisch: or better
(03:32:08 PM) Gotisch: _hc: what happens if you get set a "1";\n foreach
b [c] {
(03:32:13 PM) Gotisch: \n isnt at end
(03:32:18 PM) Gotisch: and command is not complete
(03:33:21 PM) Gotisch: then again when foreach is finished it will work
again
(03:34:50 PM) ijchain: <aku> _hc - looks good
15:35
(03:35:24 PM) CIA-52: tclapps: Commit by patthoyts :: [Bug 3381002]:
Applied patch from to fix the capabilities version string.
(03:35:24 PM) CIA-52: apps/tkchat/ (ChangeLog tkchat.tcl)
(03:36:41 PM) ***ijchain rz has become available
(03:38:09 PM) _hc: Gotisch: I just ran a many megabytes of Tcl code
through that proc, and no errors yet
15:40
(03:41:28 PM) ***ijchain ferrieux has left
(03:41:30 PM) ***ijchain ferrieux has become available
(03:42:54 PM) ijchain: <rz> Hi, do anyone have additional hints (beside
the manual) about the cvs error: cvs [checkout aborted]: end of file
from server (consult above messages if any)
(03:43:26 PM) _hc: aku kbk dgp thommey Gotisch: thanks all so much! 
this proc seems noticeably faster, and has fixed the error so far!
http://pastebin.com/0ZzC4Vcz
(03:43:29 PM) ***ijchain pat has become available
(03:43:50 PM) ijchain: <ferrieux> _hc, do you realize that your approach
, picking an aligned \n every now and then, has a quadratic [info
complete] test ?
(03:44:18 PM) ijchain: <ferrieux> (I mean the sequence of calls is
quadratic in the block size)
(03:44:34 PM) _hc: I don't understand the quadtradic part
(03:44:59 PM) ijchain: <ferrieux> you keep calling [info complete] on
larger and larger blocks
15:45
(03:45:09 PM) ijchain: <ferrieux> until you get a simultaneously aligned
and complete one
(03:45:20 PM) ijchain: <ferrieux> right ?
(03:45:45 PM) _hc: I ran a test, 'e' means executed the [read] call; "N"
means no new line, wait for more data; "I" means failed [info complete],
wait for new data; this is what it looked like at worst:
eNNNNNNNNNNNNeNeeNNeNeeNeNeeNNeeNeNeNeNeNeNeNeeNNeeNeeNeNeNeNeNeeNNeeNNeeNeNeNeNeNeeNNeeNNeeNeNeNeNeNeeNNeeNNeeNeNeNeNeNeeNeeNNeeNeNeNeNeNeNeNeeNNeeNeNeNeNeNeNeNeeNNeeNNeeNeNeNeNeNeeNNeeNNeeNeNeNeNeNeeNNeeNNeeNeNeNeNeNeeNNeeNNeeNeNeNeNeNeeNeeNNeeNeNeNeNeNeNeNeeNNeeNeeNeNeNeNeNeeNNeeNNeeNeNeNeeNeNeeNNeeNNeeNeNeNeNeNeNNeeNNeeNeNeNeeNeNeeNeeNNeeNeNeNeNeNeNeNeeNNeeNeNeNeNeNeNeNeeNNeeNeeNeNeNeNeNeeNNeeNNeeNeNeNeNeNeeeNNeeNNeeNeNeNeNeNeeNNeeNNeeNeNeNeNeNeNeNeeNNeeNeeNeNeNeNeNeNeeNNeeNeeNeNeNeNeNeeNNeeNNeeNeNeNeNeNeeNNeeNNeeNeNeNeNeNeeNNeeNNeeNeNeNeNeNeeNeeNNeeNeNeNeNeNeNeNeeNNeeNeNeNeNeNeNeNeeNNeeNeeNeNeNeNeNeeNNeeNNeeeNeNeNeNeNeeNNeeNNeeNeNeNeNeNeeNNeeNNeeNeNeNeNeNeNeNeeeNNeeNeNeNeNeNeNeNeeNNeeNeeNeNeNeNeNeeNNeeNNeeNeNeNeNeNeeNNeeNNeeNeNeNeNeNeeNNeeNNeeNeNeNeNeNeeNeeNNeeNeNeNeNeNeNeNeeNNeeNeNeNeNeNeNeNeeNNeeNeeNeNeNeNeNeeNNeeNeNeeNeNeeeNeeNeNeeNNeeNNeeNeNeNeNeNeeNNeeNNeeNNeNeeNeNNNNeNeNeNeNeeNNeeNeNeNeNeNeNeNeeNNeeNNeeNeNeNeNeNeeNNeeNNeeNeNeeNeNeNeeNNeeNeNeeNeeNeNeNeeNeeNeeNeNeeeee
(03:45:52 PM) _hc: I never saw an I
(03:46:04 PM) _hc: an "I"
(03:46:13 PM) Zarutian [~zarutian at 194-144-84-110.du.xdsl.is] entered the
room.
(03:46:19 PM) _hc: i.e. [info complete] is really just a fallback
(03:46:21 PM) ***ijchain patthoyts has become available
(03:46:30 PM) ijchain: <ferrieux> ok, on your data set it somehow works
(03:47:03 PM) ijchain: <ferrieux> but [gets]+[info complete] would have
done the job in amore robust way I think
(03:47:08 PM) _hc: I think the vast majority of the time, there is a
missing \n rather than more complicated missing things that info
complete would catch
(03:47:50 PM) _hc: [gets] reads line by line, I don't see how that
wouldn't be slower than [read]'s block b block
(03:48:42 PM) ***ijchain patthoyts has left
(03:48:52 PM) ijchain: <ferrieux> it's all buffered on input, so the
read() syscalls have the same granularity
(03:49:11 PM) ijchain: <ferrieux> and the extra work on gets replaces
the extra work on enormous [info complete]ß
15:50
(03:50:42 PM) ijchain: <ferrieux> but the trade-off depends on your data
of course
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-profiled-improved-pd_connect-pd_readsocket-to-elimin.patch
Type: text/x-diff
Size: 3289 bytes
Desc: not available
URL: <http://lists.puredata.info/pipermail/pd-dev/attachments/20120123/c1c86a99/attachment.patch>