[PD] pd and tcp: what to do against crashes?

Sat Feb 21 15:37:15 CET 2009

hi all

i've been working now quite some time with setups, where different
instances of pd spread over the world are connected with each other over
another instance of pd (i.e. serverpatch). i tried different classes for
establishing tcp connections between clients and servers, namely
[netclient]/[netserver], [tcpclient]/[tcpserver] or a mix of the two. no
matter, what configuration is used, server crashes are likely to happen
from time to time. the 'server' means here the instance of pd, that is
running the patch containing either [tcpserver] or [netserver]. crash
means: pd is still running, but not responding. when i start pd with gui
for debugging purposes, the gui is also still there, but doesn't
respond.

when i am testing on my own, running several instances of pd on my local
box (or on some more boxes, i have access to), everything runs fine,
even under heavy load of data being exchanged between the clients. at
most, there are some drop-outs, but never crashes. however, when having
a netpd-session with several people connected from everywhere, crashes
happen much more often. from my experience, i can tell, that those
crashes are more likely to happen, if one or more clients have an
unreliable internet connection (or weak wifi signal etc). since tcp is
connection-aware - tcp requires connection establishment (handshake) but
also connection termination - and some clients just disappear without
proper termination, the server still expects them to be there. this is
also indicated by the number of connected clients reported by the
server: when a client loses connection and then reconnects, the number
is higher than the real number of connected clients. if this happens
several times, the reported number of connected clients keeps rising,
because the old connections were never terminated correctly.
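
as a rough illustration of how a tcp server outside of pd can notice such silently vanished peers: the kernel can be asked to probe idle connections (tcp keepalive). the sketch below is python, not pd, and nothing here claims that [netserver] or [tcpserver] do or don't do this. the function name and the timing values are made up for the example, and the three TCP_KEEP* options are only set where the platform's socket module offers them (linux does).

```python
import socket

def enable_keepalive(sock, idle=30, interval=10, probes=3):
    """Ask the kernel to probe an idle connection (hypothetical helper).

    idle/interval/probes are example values, not recommendations.
    After `idle` seconds of silence the kernel sends a probe every
    `interval` seconds; after `probes` unanswered probes the connection
    is reported as broken on the next read or write.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The fine-grained knobs are platform specific, so only use them
    # when this Python build exposes them.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
```

a server would call this on every accepted socket. without the extra options, linux only starts probing after two hours of idle time by default, which is far too late for a live session.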

now, when another client is sending 'broadcast' messages (messages meant
to be sent to all connected clients), the server still tries to send the
messages to the disappeared clients. 
another situation: if the client, that disappeared, sent a dump request
to another client just before vanishing, the other client will try to
send the whole dump to the vanished client. i wonder now, what happens,
if all those messages cannot be delivered by the server. i suspect this
to be the cause of the crashes.
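
one possible explanation for "still running, but not responding" (an assumption, not verified against the [netserver]/[tcpserver] code): a blocking send() on a socket whose send buffer is full stalls the calling thread until the peer reads again. a server can avoid that by using non-blocking sockets, keeping a bounded backlog per client, and dropping clients whose backlog grows too large. the python sketch below shows the idea; Client, broadcast and MAX_QUEUED_BYTES are names invented for this example.

```python
import socket

MAX_QUEUED_BYTES = 64 * 1024  # hypothetical per-client backlog limit

class Client:
    """One connected peer with a bounded outgoing backlog."""

    def __init__(self, sock):
        sock.setblocking(False)  # send() must never stall the server
        self.sock = sock
        self.pending = bytearray()

    def queue(self, payload):
        """Append payload to the backlog; False if the limit is exceeded."""
        if len(self.pending) + len(payload) > MAX_QUEUED_BYTES:
            return False
        self.pending += payload
        return True

    def flush(self):
        """Hand the kernel as much as it accepts right now; False on error."""
        try:
            while self.pending:
                sent = self.sock.send(self.pending)
                del self.pending[:sent]
        except BlockingIOError:
            pass  # kernel buffer is full, the rest stays queued for later
        except OSError:
            return False  # connection reset, broken pipe, etc.
        return True

def broadcast(clients, payload):
    """Send payload to all clients; return the clients that are kept."""
    alive = []
    for client in clients:
        if client.queue(payload) and client.flush():
            alive.append(client)
        else:
            client.sock.close()  # too far behind or broken: drop it
    return alive
```

with this scheme a vanished client costs at most MAX_QUEUED_BYTES of memory before it gets dropped, and the other clients keep receiving data in the meantime. the count of connected clients then also stays closer to reality.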

from the pd user side, there seems to be no way to address this issue,
since there is no way for the server (i.e. the patch around
[netserver]/[tcpserver]) to tell, if a client silently disappeared. so
the server will still try to deliver all the messages. i am suspecting,
that some buffer overrun occurs here, but i cannot tell really without
understanding the code of [netserver] or [tcpserver]. also i don't know,
at which level those buffer overruns would happen: somewhere in the
external (netserver/tcpserver) code, in the pd code, or even in the
kernel/OS? the only thing, that i know, is that i haven't seen apache or
some other tcp server crashing because of clients having bad connection.
so there must be a solution to this problem, but i don't know where to
look for it. another problem is that, from a pd user perspective, one
has very little control over the things happening at tcp level. if you
need to send a big amount of data, there is no mechanism provided to
send the data at maximum available bandwidth. so you either send
everything at once, which fills the internal 4kb buffer of [net*] or
[tcp*], so that a long drop-out occurs, until the buffer is emptied
again. or the data is sent with time intervals between each message in
order to artificially reduce the bandwidth used. the latter approach has
the disadvantage of not using the whole available bandwidth. also, in
userspace you don't see if a message could be delivered or not. as
described in the situations above, this means that more messages will be
sent to a non-existing receiver, which might fill some buffer, which
_probably_ leads to a crash of pd.
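
for comparison, this is roughly how a program with direct access to the socket can send a large dump at whatever rate the connection allows, without fixed delays between messages: wait until the kernel reports the socket as writable, send one chunk, repeat. again a python sketch and not pd; send_paced, chunk_size and timeout are invented for the example, and the 4096 default merely echoes the 4kb buffer mentioned above.

```python
import select
import socket

def send_paced(sock, data, chunk_size=4096, timeout=5.0):
    """Send data as fast as the peer accepts it, without blocking forever.

    Returns the number of bytes handed to the kernel. A result smaller
    than len(data) means the peer did not accept data for `timeout`
    seconds, so the caller can decide to give up on that client.
    """
    sock.setblocking(False)
    view = memoryview(data)
    offset = 0
    while offset < len(view):
        # Sleep until the kernel has room in the send buffer again.
        _, writable, _ = select.select([], [sock], [], timeout)
        if not writable:
            return offset  # peer stalled for too long
        try:
            offset += sock.send(view[offset:offset + chunk_size])
        except BlockingIOError:
            continue  # buffer filled up in the meantime, wait again
    return offset
```

the return value is the kind of delivery feedback that is missing on the patch level: the sender learns that the receiver stopped accepting data instead of piling up more messages for it. note that it only tells how much the local kernel accepted, not how much the remote application actually read.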

because of the above problems, i came to the conclusion that it is
currently not possible to have several instances of pd connected with
each other without the system (i.e. one or more instances of pd)
crashing from time to time. i know that pd's main goal is computing
audio and not networking, but still it would be a big benefit if audio
and networking would work together reliably in pd.
currently, i don't know what the best approach to those issues is:
giving more control to the userspace or making the net classes of pd less
prone to clients not behaving 'correctly' at tcp level. i do know that
i will not be able to fix those issues myself, therefore i would like to
see if more people are interested in helping to work this out. or if
people think, that pd is the wrong tool to work with such setups, i
would like to know that as well. 

oops.. sorry for the long post..

roman
