how to handle socket timeout? [Unix Programming]


From: David Schwartz on 15 Jan 2008 11:03

On Jan 15, 6:50 am, Arkadiy <vertl...(a)gmail.com> wrote:

> Why do I have to use a 'VERSION' command to determine that the
> connection is dead rather than determine this based on the actual
> request?

Because you've already sent the request, so it's too late for it to
trigger a RST. Your side needs to be in the process of attempting to
send data in order to generate a packet that could trigger a RST.
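
To make the point above concrete, here is a minimal sketch in C of probing a connection by sending on it. The name probe_send is hypothetical, and MSG_NOSIGNAL is assumed to be available (it is on Linux). The probe payload would be something harmless such as the 'VERSION' command discussed above.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: push a small probe onto a connected socket.
 * Returns 0 if all bytes were handed to the kernel, -1 if send()
 * failed (errno is typically EPIPE or ECONNRESET on a dead connection).
 * Success only means the bytes were queued: a RST provoked by this
 * probe is reported by a later send() or recv(), not by this call. */
int probe_send(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        /* MSG_NOSIGNAL: get EPIPE as an error instead of a SIGPIPE signal. */
        ssize_t n = send(fd, p, len, MSG_NOSIGNAL);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before sending, try again */
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}
```

Note that on a blocking socket this call can itself block if the send buffer is full, so it does not replace a timeout on the reply.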

> > It seems like if the server is unreachable, nothing you try will
> > succeed anyway, so it doesn't matter particularly much what you do.
>
> Does this mean
>
> 1) nothing sent will produce a response, so anything can be sent to
> detect the condition, or
> 2) there is nothing we can do in the code to handle the situation?

Why does it matter what you do in this case? The server is
unreachable. Nothing will work. It doesn't matter what you do or try
to do to the connection. There will be no response, so you will have
no response to give. You can give up waiting whenever you want, but
with respect to the connection, what you send, and what you detect, it
doesn't matter.

DS

From: Arkadiy on 15 Jan 2008 11:59

On Jan 15, 11:03 am, David Schwartz <dav...(a)webmaster.com> wrote:

> Why does it matter what you do in this case? The server is
> unreachable. Nothing will work. It doesn't matter what you do or try
> to do to the connection. There will be no response, so you will have
> no response to give. You can give up waiting whenever you want, but
> with respect to the connection, what you send, and what you detect, it
> doesn't matter.

It does for me. I want to start attempts to reconnect, in a different
thread, at a configurable interval. Once the server stops being
unreachable, the connect will succeed, and I will have connections in
my pool, ready to use. Until then, all my requests immediately return
failure since there are no available connections. But, once the server
becomes reachable again, the requests start succeeding.

I think this model allows me to achieve the goal of "fast result or no
result". The notion of "fast" is configurable by the timeout value.
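
The "fast result or no result" read described above could be sketched in C with poll(). This is only an illustration: recv_with_timeout and its return codes are hypothetical names and conventions, not something from this thread. poll() takes its timeout in whole milliseconds, so 1 msec is the smallest nonzero value it can express.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: wait up to timeout_ms for data, then read it.
 * Returns the number of bytes read (> 0), 0 if the peer closed the
 * connection, -1 on error, and -2 if nothing arrived in time. */
ssize_t recv_with_timeout(int fd, void *buf, size_t cap, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int rc;

    do {
        /* If a signal interrupts poll(), the full timeout starts over. */
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;      /* poll() itself failed */
    if (rc == 0)
        return -2;      /* timed out: the caller decides what that means */
    return recv(fd, buf, cap, 0);
}
```

A return of -2 only says that no reply arrived in time. It cannot tell an unreachable server from a slow one.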

Let's say I set up the whole facility on the LAN, and the server's
average response time is 0.1 msec. If I set the timeout to, for
example, 1 msec, everything will go smoothly most of the time. When I
get a timeout, this may mean one of a few things:

1) The server became unreachable. I want to close the socket and
start attempts to reconnect;

2) Accidentally long response. Still nothing wrong with
reconnecting, since this happens rarely (how rarely -- can be
controlled by the value of the timeout);

3) Server is congested. This is the worst case. But, IMO, this is
the case where nothing can be done. Except adding another server
instance to split the load.

4) Network is congested. Again, I don't see what can be done in this
case.

So it seems to me that reconnect works OK for both cases where
anything can be done -- I just need to set a reasonably large timeout
-- sufficiently larger than the average response time under normal
conditions. For the two other cases, it seems that yes, I am adding a
bit to an already existing mess. But does this really matter?

Am I missing something?

Regards,
Arkadiy

From: David Schwartz on 15 Jan 2008 12:10

On Jan 15, 8:59 am, Arkadiy <vertl...(a)gmail.com> wrote:

> Let's say I set up the whole facility on the LAN, and the server's
> average response time is 0.1 msec. If I set the timeout to, for
> example, 1 msec, everything will go smoothly most of the time. When I
> get a timeout, this may mean one of a few things:
>
> 1) The server became unreachable. I want to close the socket and
> start attempts to reconnect;
>
> 2) Accidentally long response. Still nothing wrong with
> reconnecting, since this happens rarely (how rarely -- can be
> controlled by the value of the timeout);
>
> 3) Server is congested. This is the worst case. But, IMO, this is
> the case where nothing can be done. Except adding another server
> instance to split the load.

You can avoid adding to the server load so that it has a hope of
catching up.

> 4) Network is congested. Again, I don't see what can be done in this
> case.

You can avoid adding to network congestion so that it has a hope of
abating.

> So it seems to me that reconnect works OK for both cases where
> anything can be done -- I just need to set a reasonably large timeout
> -- sufficiently larger than the average response time under normal
> conditions. For the two other cases, it seems that yes, I am adding a
> bit to an already existing mess. But does this really matter?

It won't matter too much as long as you keep the rate under control
and the number of connections under control. You won't cause much
trouble because TCP has its own rate-limiting to protect the network.
You may cause the server some pain because of the rate of connection
establishment and teardown, but it shouldn't be horribly bad.

You really want to back off and retry though. And I'm not sure you want
to tear down the connection at the first sign of trouble.
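
As an illustration of backing off, here is a minimal sketch in C of a capped exponential delay between reconnect attempts. The function name and the choice of doubling are assumptions for the example. The thread does not prescribe a particular schedule.

```c
/* Hypothetical helper: delay in milliseconds before retry number
 * `attempt` (0-based). The delay starts at base_ms, doubles with each
 * attempt, and never exceeds max_ms. */
unsigned backoff_delay_ms(unsigned attempt, unsigned base_ms, unsigned max_ms)
{
    unsigned delay = base_ms;

    if (base_ms == 0)
        return 0;               /* doubling zero would never grow */

    while (attempt-- > 0 && delay < max_ms) {
        if (delay > max_ms / 2) {
            delay = max_ms;     /* doubling would pass the cap */
            break;
        }
        delay *= 2;
    }
    return delay < max_ms ? delay : max_ms;
}
```

With base_ms = 100 and max_ms = 1000 the successive delays are 100, 200, 400, 800, and then 1000 for every later attempt.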

You can also use connection establishment to verify that the server is
operational. If you can set up a new connection to it, it's not dead.
However, sending a 'version' command will have (approximately) the
same effect.

Make sure your write aggregation/buffering is sufficient to ensure
that Nagle doesn't bite you.

DS

From: Rainer Weikusat on 15 Jan 2008 12:47

Arkadiy <vertleyb(a)gmail.com> writes:
> On Jan 15, 11:03 am, David Schwartz <dav...(a)webmaster.com> wrote:

[...]

> I think this model allows me to achieve the goal of "fast result or no
> result". The notion of "fast" is configurable by the timeout value.
>
> Let's say I set up the whole facility on the LAN, and the server's
> average response time is 0.1 msec. If I set the timeout to, for
> example, 1 msec, everything will go smoothly most of the time. When I
> get a timeout, this may mean one of a few things:
>
> 1) The server became unreachable. I want to close the socket and
> start attempts to reconnect;

This will result in a FIN being transmitted to the server, after which
the kernel waits for a FIN-ACK coming from there and sends an ACK. If
the server is unreachable, this will not work, and if it is just
responding slowly, your close request will be processed after all
other requests you have sent and the connection will be closed after
you have processed all replies to these requests (this is the default
behaviour, hackarounds would be possible).
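
One workaround of that kind, presumably among those meant above, is an abortive close: with SO_LINGER enabled and a linger time of zero, close() drops any unsent data and resets the connection instead of performing the orderly FIN exchange. A minimal sketch in C. The name abortive_close is hypothetical.

```c
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: close a TCP socket abortively.
 * With l_onoff = 1 and l_linger = 0, close() discards unsent data and
 * typically sends a RST rather than a FIN.
 * The descriptor is closed in every case. Returns 0 on success, -1 if
 * setting the option or closing failed. */
int abortive_close(int fd)
{
    struct linger lg;

    lg.l_onoff = 1;     /* enable lingering ... */
    lg.l_linger = 0;    /* ... with a zero timeout, i.e. reset on close */

    if (setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof lg) < 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

This spares the client the orderly shutdown, but the server still pays for the teardown and the reconnect that follows.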

> 2) Accidentally long response. Still nothing wrong with
> reconnecting, since this happens rarely (how rarely -- can be
> controlled by the value of the timeout);

This is actually situation 3: The server cannot reply fast enough.

> 3) Server is congested. This is the worst case. But, IMO, this is
> the case where nothing can be done. Except adding another server
> instance to split the load.

The easy thing which could be done is to not increase the load on the
server by torturing it with completely useless connection teardown and
re-establishment requests.

> 4) Network is congested. Again, I don't see what can be done in this
> case.

Same as above. Avoid injecting more useless packets.

To repeat this again: TCP is a reliable bytestream protocol based on
persistent virtual circuits. Either you want a reliable bytestream
protocol, then you should just be using it, or you don't want a
reliable bytestream protocol and then don't use it.

For practical purposes, the daemon you are talking to is closely
similar to a network file system server and the way you want to use it
would lend itself to the 'traditional' way NFS worked: Assume the
server is stateless (gets around all possible issues with crashes
etc), send a request using UDP when you want to request some data,
possibly retransmit it a couple of times using a 'standard'
exponential backoff algorithm, and process eventual replies when and if
they arrive, dropping whatever you don't want anymore.

This is exactly the same procedure one would sensibly use for TCP,
except that it is not necessary to deal with connections anymore.
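
The request/retransmit procedure described above could be sketched in C roughly as follows. This assumes a datagram socket that has already been connect()ed to the server. The function name, parameters, and the simple doubling of the timeout are hypothetical choices for the example.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: send a request on a connected datagram socket
 * and wait for a reply, retransmitting with exponential backoff.
 * Returns the reply length, or -1 if no reply arrived after max_tries
 * transmissions or a socket call failed. */
ssize_t request_with_retry(int fd, const void *req, size_t req_len,
                           void *reply, size_t reply_cap,
                           int max_tries, int first_timeout_ms)
{
    int timeout_ms = first_timeout_ms;

    for (int attempt = 0; attempt < max_tries; attempt++) {
        if (send(fd, req, req_len, 0) < 0)
            return -1;

        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        int rc = poll(&pfd, 1, timeout_ms);
        if (rc < 0 && errno != EINTR)
            return -1;
        if (rc > 0)
            return recv(fd, reply, reply_cap, 0);

        /* Timed out (an interrupted poll is treated the same way):
         * wait twice as long after the next retransmission. */
        timeout_ms *= 2;
    }
    return -1;
}
```

A real client would also tag each request with an identifier and discard replies that do not match, since a late reply to an earlier transmission can arrive at any time.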

[...]

> Am I missing something?

An introduction to internetworking protocols, maybe?

From: Arkadiy on 15 Jan 2008 13:15

On Jan 15, 12:10 pm, David Schwartz <dav...(a)webmaster.com> wrote:

> You really want to back off and retry though. And I'm not sure you want
> to tear down the connection at the first sign of trouble.

My problem is I can't understand the purpose of this "retry". If my
timeout is 1 sec, and the first request timed out, and I retry it, why
not set the timeout to 2 sec in the first place?

Regards,
Arkadiy
