From: Michael Ober on
Let me clarify - in one post I said it was a disk crash and here I say it's
a VMWare guest. Both statements are true. We had a series of power hits
close enough together that our UPS systems couldn't get clean shutdowns on
the final power hit. The result was that the host file system was corrupted
which led to the VMWare files that comprise the guest OS filesystem being
damaged. The end result was that we lost two of our virtual servers to what
were essentially disk crashes. I had to rebuild and then restore services
from tape both servers. Fortunately the actual disks weren't physically
damaged - just the logical file system on the disk was damaged. Needless to
say it wasn't a fun weekend to recover. The application I posted the code
from has been running since last winter with no problems. I recompiled and
released and have been having problems since then.

>
> It's a VMWare guest. I just checked the guest OS from home and I'll check
> the host on Monday. Thanks for the idea.
>
> Mike.

From: Peter Duniho on
Michael Ober wrote:
> [...] The application I posted the code from has been running
> since last winter with no problems. I recompiled and released and have
> been having problems since then.

What happens when you go back to the previous version of the
application? What changed? If nothing, why recompile at all?

If the previous version, which was working fine for nearly a year,
exhibits the same failures as the new version, perhaps you simply need
to do a fresh install of the OS and VM guest.

I have one other quick comment in reply to your other post...I'll post a
separate message.

Pete
From: Peter Duniho on
Michael Ober wrote:
> [...]
> The IPServer class is actually used in two different server
> applications. In this particular instance, which runs on Windows Server
> 2003 R2, I can indeed wait for the TCP CLOSE_WAIT - it's faster to
> reboot.

Note: rebooting is bypassing the CLOSE_WAIT. The same issues exist when
you do that.

> In the other application, which runs on Windows XP, it appears
> that XP never closes the listening socket. I suspect the XP behavior
> has more to do with another server that must run on that particular box
> for which I have no source code than it does with XP itself.

Not sure what "the other application" is. But it is strange if a socket
remains in CLOSE_WAIT indefinitely, for an application that has crashed
(CLOSE_WAIT for a socket owned by a process that hasn't exited is
another matter entirely).

> [...]
> The other end is a currently a standard TCP client. It sends a request
> and waits for the response. The client applications do have to be
> coded so that if the server timeout expires they detect the dropped
> connection and reconnect if they need to send again.

A code example that can be used to reproduce the problem _must_ include
the client side. Saying it's "a standard TCP client" doesn't provide
the code needed to exercise the server. In addition, with any networked
application, it is possible that a problem is a consequence of the
interaction between client and server. There needs to be code for the
client side that is known to reproduce the same problem when used with
the server side.

>> That said, here are some highlights of what appear to me to be problem
>> areas. Some may even be related to the problem you're seeing:
>>
>> -- Use of "On Error Resume Next". Don't do this. You are
>> basically just ignoring exceptions. Bad. Even in VB.NET, it's bad.
>
> I only use this where I don't care about the error but do want each
> instruction to attempt execution. Is there a clean way of handling this
> with try ... catch ... end try?

You should only catch errors that you know you can do something with.
Sometimes, at the very highest level in your code, it may make sense to
catch any error and report it to the user, explaining that the operation
couldn't complete.

Even in that case, you need to ensure at all points in the call stack
that you haven't left any indeterminate state, and of course at the
lower levels in your code (deeper in the call stack), it is appropriate
to catch and handle exceptions only if there is a known reasonable,
correct algorithm for safely handling the exception and proceeding
without propagating the exception further up.

Catching and ignoring errors can leave your code open to any number of
corrupted data scenarios, because you obviously can't anticipate every
single error that might occur. And an error you haven't anticipated is
best handled by either not handling it at all (letting your program
crash, however painful that may be), or at least handling it only at the
top-most level and discarding any work that might have taken place.

> [...]
> > -- Use of ReaderWriterLockSlim.IsReadLockHeld and
>> .IsWriteLockHeld. It's bad enough that, contrary to the
>> documentation's instructions, you are using the property to control
>> program flow. But to do it in a loop?
>
> After static analisys of the code, a finally clause on each of the two
> try blocks was sufficient to handle this. Since OnReceive would be
> called in different threads and not be reentrant, the loops can safely
> be removed. The purpose of this was to ensure I didn't have a
> ReaderWriterLockSlim causing a connection to stall.

I inferred the purpose. But, if there's a perceived need to write the
loop, that implies some re-entrancy. And if there's some re-entrancy,
then having one level in your call stack remove _all_ the lock
acquisitions that thread obtained means that the higher levels in your
call stack no longer own the lock and are no longer protected.

Of course, in your case you have asserted there is no re-entrancy, and I
believe that is the case (based on the code posted), but that obviously
means that the loop is not helpful even in a broken way.

As you have probably done now, the proper way is to acquire the lock,
then release it explicitly. There should never be any need to check to
see whether you have the lock. The flow of your code should provide
that guarantee. At worst, you may have a local variable that records
whether you've successfully acquired the lock, and which is checked
later in the method before releasing.

>> -- Unsynchronized access to the _ReceiveResults and _SendResults
>> members. You should at a minimum be using Thread.VolatileRead() and
>> .VolatileWrite(). Even better would be to just use SyncLock or similar.
>>
>> -- That said, the mere presence of _ReceiveResults and
>> _SendResults is a red flag to me. It's possible that with a purely
>> transactional connection, this could work, and maybe that's what
>> you're dealing with. But it's bad form in any case, and if you have
>> _any_ possibility of overlapping send or receive operations, you've
>> got big trouble. This design is a maintenance problem, and
>> potentially a serious one.
>
> Do you have a suggestion for a better design?

Sure. Normally, asynchronous code just keeps the state information with
the operation. That is, each operation creates its own data structure
related to the operation. Typically, you only ever see this data
structure as something passed around (e.g. to a callback), and of course
that's the kind of data structure you have here. Your callback is
passed the data structure; there should be no need to store it within
the class itself.

> I'm trying to keep this
> as asynchronous as possible. The design concept was that if the routine
> MustOverride routine has to do a lot of work and send back large numbers
> of results, it can call the underlying send method multiple times before
> returning, although looking at the code, I don't provide that particular
> routine the ability to access the Send method.
>
> Like I said in my original post, this program seems to lock up when
> there are long periods (measured in hours) of no clients. If access to
> these two variables was a problem, I would expect problems to show up
> during heavy usage, not during idle periods. This server class is also
> used in another application running on an XP SP3 system that is under
> significantly heavier load with more multithreading going on than this
> particular application.

Based on the code you posted, one possible scenario for the application
getting stuck is that the variable containing the IAsyncResult for the
send is somehow optimized such that the assignment is never seen by the
thread waiting for it to become null.

Upon closer examination, it turns out that the variable that's
potentially the problem here is actually the _bufOut variable, not the
results variables (those are problematic for other reasons, but I don't
think they would lead to the code getting stuck). With no
synchronization for the array, you may set its length to 0 in one
thread, but have another thread never see the length as 0, and so get
stuck waiting for it to be 0.

Is this actually your problem? I don't know. But I do know that you've
got asynchronous code, where different threads may be sharing a
variable, and sharing a variable between threads without any form of
synchronization can lead to incorrect behavior in the code.

Pete
From: Michael Ober on

"Peter Duniho" <no.peted.spam(a)no.nwlink.spam.com> wrote in message
news:O3DIgZ8eKHA.4112(a)TK2MSFTNGP06.phx.gbl...
> Michael Ober wrote:
>> [...]
>> The IPServer class is actually used in two different server applications.
>> In this particular instance, which runs on Windows Server 2003 R2, I can
>> indeed wait for the TCP CLOSE_WAIT - it's faster to reboot.
>
> Note: rebooting is bypassing the CLOSE_WAIT. The same issues exist when
> you do that.
>

Yes it does, but since each client query is stateless, this turns out to be
a non-issue.

>> In the other application, which runs on Windows XP, it appears that XP
>> never closes the listening socket. I suspect the XP behavior has more to
>> do with another server that must run on that particular box for which I
>> have no source code than it does with XP itself.
>
> Not sure what "the other application" is. But it is strange if a socket
> remains in CLOSE_WAIT indefinitely, for an application that has crashed
> (CLOSE_WAIT for a socket owned by a process that hasn't exited is another
> matter entirely).

I have actually watched the XP box's socket status and the CLOSE_WAIT never
clears. I have strong suspicions that the other application is mucking with
the IP stack on that machine.

>
>> [...]
>> The other end is a currently a standard TCP client. It sends a request
>> and waits for the response. The client applications do have to be coded
>> so that if the server timeout expires they detect the dropped connection
>> and reconnect if they need to send again.
>
> A code example that can be used to reproduce the problem _must_ include
> the client side. Saying it's "a standard TCP client" doesn't provide the
> code needed to exercise the server. In addition, with any networked
> application, it is possible that a problem is a consequence of the
> interaction between client and server. There needs to be code for the
> client side that is known to reproduce the same problem when used with the
> server side.
>

I can duplicate this with a simple connect/disconnect. The problem occurs
on the connection.

>>> That said, here are some highlights of what appear to me to be problem
>>> areas. Some may even be related to the problem you're seeing:
>>>
>>> -- Use of "On Error Resume Next". Don't do this. You are basically
>>> just ignoring exceptions. Bad. Even in VB.NET, it's bad.
>>
>> I only use this where I don't care about the error but do want each
>> instruction to attempt execution. Is there a clean way of handling this
>> with try ... catch ... end try?
>
> You should only catch errors that you know you can do something with.
> Sometimes, at the very highest level in your code, it may make sense to
> catch any error and report it to the user, explaining that the operation
> couldn't complete.
>
> Even in that case, you need to ensure at all points in the call stack that
> you haven't left any indeterminate state, and of course at the lower
> levels in your code (deeper in the call stack), it is appropriate to catch
> and handle exceptions only if there is a known reasonable, correct
> algorithm for safely handling the exception and proceeding without
> propagating the exception further up.
>
> Catching and ignoring errors can leave your code open to any number of
> corrupted data scenarios, because you obviously can't anticipate every
> single error that might occur. And an error you haven't anticipated is
> best handled by either not handling it at all (letting your program crash,
> however painful that may be), or at least handling it only at the top-most
> level and discarding any work that might have taken place.
>
>> [...]
>> > -- Use of ReaderWriterLockSlim.IsReadLockHeld and
>>> .IsWriteLockHeld. It's bad enough that, contrary to the documentation's
>>> instructions, you are using the property to control program flow. But
>>> to do it in a loop?
>>
>> After static analisys of the code, a finally clause on each of the two
>> try blocks was sufficient to handle this. Since OnReceive would be
>> called in different threads and not be reentrant, the loops can safely be
>> removed. The purpose of this was to ensure I didn't have a
>> ReaderWriterLockSlim causing a connection to stall.
>
> I inferred the purpose. But, if there's a perceived need to write the
> loop, that implies some re-entrancy. And if there's some re-entrancy,
> then having one level in your call stack remove _all_ the lock
> acquisitions that thread obtained means that the higher levels in your
> call stack no longer own the lock and are no longer protected.
>
> Of course, in your case you have asserted there is no re-entrancy, and I
> believe that is the case (based on the code posted), but that obviously
> means that the loop is not helpful even in a broken way.
>
> As you have probably done now, the proper way is to acquire the lock, then
> release it explicitly. There should never be any need to check to see
> whether you have the lock. The flow of your code should provide that
> guarantee. At worst, you may have a local variable that records whether
> you've successfully acquired the lock, and which is checked later in the
> method before releasing.
>

Actually - the examples with the documentation shows the following

Try
Acquire lock
catch exception
finally
Release lock
end try

I have switched the code to this structure for the two locks.

>>> -- Unsynchronized access to the _ReceiveResults and _SendResults
>>> members. You should at a minimum be using Thread.VolatileRead() and
>>> .VolatileWrite(). Even better would be to just use SyncLock or similar.
>>>
>>> -- That said, the mere presence of _ReceiveResults and _SendResults
>>> is a red flag to me. It's possible that with a purely transactional
>>> connection, this could work, and maybe that's what you're dealing with.
>>> But it's bad form in any case, and if you have _any_ possibility of
>>> overlapping send or receive operations, you've got big trouble. This
>>> design is a maintenance problem, and potentially a serious one.
>>
>> Do you have a suggestion for a better design?
>
> Sure. Normally, asynchronous code just keeps the state information with
> the operation. That is, each operation creates its own data structure
> related to the operation. Typically, you only ever see this data
> structure as something passed around (e.g. to a callback), and of course
> that's the kind of data structure you have here. Your callback is passed
> the data structure; there should be no need to store it within the class
> itself.
>

Let me see if I understand. Dump the two class level variables and pass the
class (Me) through the Begin... method. Then pull the class out of the
AsyncState property for access to the buffers.

>> I'm trying to keep this as asynchronous as possible. The design concept
>> was that if the routine MustOverride routine has to do a lot of work and
>> send back large numbers of results, it can call the underlying send
>> method multiple times before returning, although looking at the code, I
>> don't provide that particular routine the ability to access the Send
>> method.
>>
>> Like I said in my original post, this program seems to lock up when there
>> are long periods (measured in hours) of no clients. If access to these
>> two variables was a problem, I would expect problems to show up during
>> heavy usage, not during idle periods. This server class is also used in
>> another application running on an XP SP3 system that is under
>> significantly heavier load with more multithreading going on than this
>> particular application.
>
> Based on the code you posted, one possible scenario for the application
> getting stuck is that the variable containing the IAsyncResult for the
> send is somehow optimized such that the assignment is never seen by the
> thread waiting for it to become null.
>

I'll take a look at that.

> Upon closer examination, it turns out that the variable that's potentially
> the problem here is actually the _bufOut variable, not the results
> variables (those are problematic for other reasons, but I don't think they
> would lead to the code getting stuck). With no synchronization for the
> array, you may set its length to 0 in one thread, but have another thread
> never see the length as 0, and so get stuck waiting for it to be 0.
>

I have made several changes based on your comments. I haven't touched the
_bufOut or the _SendResults variables, but the _ReceiveResults variable was
actually not used anywhere and is now gone. I've started the program inside
the VS debugger. I've confirmed that it is accepting connections and
returning results today and tomorrow morning I'll see what happens when I
try to connect.

> Is this actually your problem? I don't know. But I do know that you've
> got asynchronous code, where different threads may be sharing a variable,
> and sharing a variable between threads without any form of synchronization
> can lead to incorrect behavior in the code.
>

Thanks for that observation.

> Pete

I appreciate you taking the time to help.

Mike.

From: Goran Sliskovic on

"Peter Duniho" <no.peted.spam(a)no.nwlink.spam.com> wrote in message
news:OnE9wC5eKHA.1596(a)TK2MSFTNGP06.phx.gbl...
> Michael Ober wrote:
>> The code below appears to work, but it eventually freezes and I have to
>> reboot the server. Killing and restarting the program itself doesn't
>> work as the listener socket is still bound.
>
> For what it's worth, if you wait for the TCP CLOSE_WAIT to finish, you'll
> be able to restart the server. While it's possible to do so, you don't
> want to bypass the CLOSE_WAIT, because there is theoretically the
> possibility of network messages still in transit that, if received after
> you've restarted the server, could corrupt the state of your server.
>
....

CLOSE_WAIT means that the TCP connection has been terminated, but the socket
descriptor has not been closed (by calling close by application, of course).
There is no possibility that there are messages still in transit (at least
messages that could do any damage). TIME_WAIT state is there to ensure this.

The socket is bound in OP's case because it is not closed, for whatever
reason. Connection is closed, however socket is not.

Regards,
Goran