From: Peter Duniho on
Michael Ober wrote:
> [...]
> I can duplicate this with a simple connect/disconnect. The problem
> occurs on the connection.

According to your original problem statement, the problem occurs _after_
there has been some activity, followed by a period of idle time, and
then a new connection. Until you've completely isolated the problem,
it's not possible to rule out potential dependencies on the client with
respect to the problem.

If you _can_ in fact reproduce the problem with a simple connect,
perhaps after some specific period of being idle, even so...you need to
provide the code for that. If you expect people to reproduce the
problem, you can't expect them to write new code to do that.

If you are satisfied with conjecture, then don't provide a
concise-but-complete code example.

> [...]
> Let me see if I understand. Dump the two class level variables and pass
> the class (Me) through the Begin... method. Then pull the class out of
> the AsyncState property for access to the buffers.

I'm not sure I understand the above. Generally, an i/o system will
track two kinds of data structures: "per-connection", and
"per-operation". You seem to have a per-connection structure now. That
structure would not contain any per-operation data (i.e. buffers).
Instead, you'd have a separate data structure for that data.

> [...]
> I have made several changes based on your comments. I haven't touched
> the _bufOut or the _SendResults variables, but the _ReceiveResults
> variable was actually not used anywhere and is now gone. I've started
> the program inside the VS debugger. I've confirmed that it is accepting
> connections and returning results today and tomorrow morning I'll see
> what happens when I try to connect.

Just to be clear: I have not identified any "smoking gun" in the code
you posted. There are a number of things that I feel are potential
issues, at least with respect to maintainability if not correctness.
But none are obviously causing problems, never mind the problem you
describe.

I remain curious about the answers to the questions I asked in my other
reply, with respect to your comments that the problem started occurring
after you had other problems with the computer and had recompiled the
code that had been working fine for nearly a year.

In spite of all the "fishy" things about the code, it's entirely
possible that there's nothing wrong with the code that would lead to
incorrect behavior, and that the problem lies elsewhere.

Pete
From: Peter Duniho on
Goran Sliskovic wrote:
> CLOSE_WAIT means that the TCP connection has been terminated, but the
> socket descriptor has not been closed (by calling close by application,
> of course). There is no possibility that there are messages still in
> transit (at least messages that could do any damage). TIME_WAIT state is
> there to ensure this.

Sorry�I have confused the two. Thanks.

> The socket is bound in OP's case because it is not closed, for whatever
> reason. Connection is closed, however socket is not.

The OP mentions another case where the connection indefinitely remains
in CLOSE_WAIT even after the process exits. Can you think of a reason
that might happen? I couldn't.

Pete
From: Goran Sliskovic on

"Peter Duniho" <no.peted.spam(a)no.nwlink.spam.com> wrote in message
news:eUPItEJfKHA.5020(a)TK2MSFTNGP02.phx.gbl...
....
> The OP mentions another case where the connection indefinitely remains in
> CLOSE_WAIT even after the process exits. Can you think of a reason that
> might happen? I couldn't.
>
> Pete

Once I saw this with WebRequest class. Either .NET uses some other process
that cleans up the socket (or some other class) or process is not shown in
task manager but is still somehow active.But this would be internals of
windows/.net that I'm not familiar with. I'm pretty sure that exiting the
process should close all sockets.
The OP problem looks to me as a case of deadlock, possibly in the finalizer.
The only way to see what is going on is to attach debugger (eg. Windbg with
SOS.dll).

Regards,
Goran

From: Michael Ober on

"Goran Sliskovic" <gsliskov(a)yahoo.com> wrote in message
news:%23BVPjuQfKHA.5608(a)TK2MSFTNGP05.phx.gbl...
>
> "Peter Duniho" <no.peted.spam(a)no.nwlink.spam.com> wrote in message
> news:eUPItEJfKHA.5020(a)TK2MSFTNGP02.phx.gbl...
> ...
>> The OP mentions another case where the connection indefinitely remains in
>> CLOSE_WAIT even after the process exits. Can you think of a reason that
>> might happen? I couldn't.
>>
>> Pete
>
> Once I saw this with WebRequest class. Either .NET uses some other process
> that cleans up the socket (or some other class) or process is not shown in
> task manager but is still somehow active.But this would be internals of
> windows/.net that I'm not familiar with. I'm pretty sure that exiting the
> process should close all sockets.
> The OP problem looks to me as a case of deadlock, possibly in the
> finalizer. The only way to see what is going on is to attach debugger (eg.
> Windbg with SOS.dll).
>
> Regards,
> Goran

Unfortunately I can't attach a debugger to the XP system, but I'll try the
updated server code on Wednesday to see if the problem is corrected.

Mike.

From: Michael Ober on
Peter,

I think I found it this morning. It actually wasn't in the IPServer
namespace at all. Using Remote Desktop, I started the server inside the
debugger on my workstation yesterday and checked it a couple of times during
the day several hours apart. No problems - making the "long idle" period
unlikely. This morning I put a breakpoint on the OnAccept method and
connected, single stepping through it. Inside this method there is a call
to WriteLog, which creates consistent log files across multiple Windows
development environments (not dotNet logging) and during the change of day
processing an infinite recursive loop occurred resulting in a stack
overflow. I fixed that problem and tested by changing the clock on my
system. I'll check tomorrow morning to verify. What baffles me is why I
didn't get a Stack Overflow error on the console, even in a Debug release.
The My Namespace event handlers were a late addition to the application and
I would have expected a Stack Overflow error to be reported prior to my
adding them. That would have made it a lot quicker to find and fix.

I think I corrected all but the _SendResults and _bufOut issues that you
pointed out and I'm simply not sure how to fix that one, short of using
SyncLock. Assuming the server is working tomorrow morning, I'll post the
updated code tomorrow evening.

I also realized that creating a disconnect timer for each connection was
extremely inefficient so I recoded this feature to use a single timer in the
IPServer class and a decrement counter in each ClientInformation object.
Since timeouts are always measured in minutes and the exact timeout period
isn't critical, this change too hard to do. The decrement counter is set to
the max inactivity time passed to the IPServer.New method and decremented
once a minute by a timertick handler in the IPServer class itself. The Send
and Receive methods of the ClientInformation class reset this counter to the
time out value. Yes, there can be a race condition here, but if it occurs,
one of two things will happen. Either the server will break down the
connection, which still complies with the server's contract with its
clients, or the connection timeout will be reset, which will simply keep the
connection alive longer. In either case, if the client disconnects, the
EndRecieve method of the socket will return 0 bytes and the connection will
be shutdown by the server.

The events being raised were really for user feedback, so I combined them
into a new Event that better describes what it's for. This broke the
interface "contract" presented by the IPServer class to the application, but
as there were only two applications using this class, that wasn't a big
deal. If this had been for external use or there had been a lot of
applications using this interface, I would have added the new event and left
the old ones there.

The new code actually hides the Clients List and the ConnectionInformation
class by making them private to the IPServer class as well as flattening the
NameSpace. The original desired design was to inherit this class and
override the ProcessMessage method for the business logic, hiding all the
complexity of handling the TCP communications. There wasn't supposed to be
external access to the client connections. I was close previously, but not
quite there. This has the additional benefit of allowing multiple IPServers
to exist in the same application without the possibility of them interfering
with each other. Each IPServer override could then implement different
business logic, simplifying the ProcessMessage override. In the environment
I work in, this is useful as I have a server that collects information from
multiple sources and updates multiple external applications. This
particular application is the other one that uses this class. It uses the
IPServer class and the MSMQ interface to collect data and it updates three
different systems, providing an integrated view of the data in a near
real-time environment. The vendor of the external system would like to do
away with the MSMQ interface and go strictly TCP for updates.

As for the recompile - I'm actually surprised this application worked at all
during date changes. After the rebuild I did restore from tape and the
restored executable also crashed. The WriteLog stack overflow problem has
been in the code since it was ported from VC6 and VB6 to VB2005 several
years ago. This bug doesn't exist in the VC6 version as I have another
application server, writtin in VC++/MFC, running on the same server that
doesn't exhibit this problem. I don't know if it exists in the VB6 version
of the class and will probably never find out as all new development is in
C# and VB 2008. In my environment, there are no extended execution TCP
servers written in VB6. I used VC++/MFC for those as it was much easier for
me to handle TCP messages in VC++/MFC than in VB6.

Once again, thanks for all your feedback.

Mike Ober.

"Peter Duniho" <no.peted.spam(a)no.nwlink.spam.com> wrote in message
news:eXhLzCJfKHA.5292(a)TK2MSFTNGP06.phx.gbl...
> Michael Ober wrote:
>> [...]
>> I can duplicate this with a simple connect/disconnect. The problem
>> occurs on the connection.
>
> According to your original problem statement, the problem occurs _after_
> there has been some activity, followed by a period of idle time, and then
> a new connection. Until you've completely isolated the problem, it's not
> possible to rule out potential dependencies on the client with respect to
> the problem.
>
> If you _can_ in fact reproduce the problem with a simple connect, perhaps
> after some specific period of being idle, even so...you need to provide
> the code for that. If you expect people to reproduce the problem, you
> can't expect them to write new code to do that.
>
> If you are satisfied with conjecture, then don't provide a
> concise-but-complete code example.
>
>> [...]
>> Let me see if I understand. Dump the two class level variables and pass
>> the class (Me) through the Begin... method. Then pull the class out of
>> the AsyncState property for access to the buffers.
>
> I'm not sure I understand the above. Generally, an i/o system will track
> two kinds of data structures: "per-connection", and "per-operation". You
> seem to have a per-connection structure now. That structure would not
> contain any per-operation data (i.e. buffers). Instead, you'd have a
> separate data structure for that data.
>
>> [...]
>> I have made several changes based on your comments. I haven't touched
>> the _bufOut or the _SendResults variables, but the _ReceiveResults
>> variable was actually not used anywhere and is now gone. I've started
>> the program inside the VS debugger. I've confirmed that it is accepting
>> connections and returning results today and tomorrow morning I'll see
>> what happens when I try to connect.
>
> Just to be clear: I have not identified any "smoking gun" in the code you
> posted. There are a number of things that I feel are potential issues, at
> least with respect to maintainability if not correctness. But none are
> obviously causing problems, never mind the problem you describe.
>
> I remain curious about the answers to the questions I asked in my other
> reply, with respect to your comments that the problem started occurring
> after you had other problems with the computer and had recompiled the code
> that had been working fine for nearly a year.
>
> In spite of all the "fishy" things about the code, it's entirely possible
> that there's nothing wrong with the code that would lead to incorrect
> behavior, and that the problem lies elsewhere.
>
> Pete