TCP connection break / re-connect issue

CoastalMaineBird · ‎12-01-2014

LV2013 on Win7, LVRT2013 on PXI-8196

Well, I thought I knew all there is to know about TCP connections - I’ve been using them for years, but I have found a new wrinkle.

I’m using a pattern I’ve used before, except this time I’m trying to automatically re-establish a broken connection:

ConnID = NaN
repeat
    if ConnID = NaN                           // if there is no connection...
        ConnID = TCP OPEN CONNECTION (Address, Port, Timeout = 1000)
        if ConnID = NaN                       // could not connect
            Flag = FALSE                      // make sure light is off
        else
            Issue CONNECT event (ConnID)      // internal event
            CONNECTED LIGHT.ObjHighlight( );  // draw attention to the change in status
            Flag = TRUE                       // make sure light is on
        end if
    else
        Command = RECEIVE MESSAGE (ConnID, Timeout = Infinite)
        if Command = Disconnect
            Issue DISCONNECT event            // internal event
          * Wait (N) mSec
            ConnID = NaN
            Flag = FALSE
        else
            Handle other command
            Flag = TRUE
        end if
    end if
    CONNECTED LIGHT = Flag
while RUNNING = TRUE

The *WAIT was inserted just for purposes of this question.

The basic logic is:

    If there is no connection
        Try for 1000 mSec to make one
    else
        Receive messages
    end if

The pattern worked fine, without the ObjHighlight( ), for quite a while.

I decided I wanted to draw user attention to the fact that the connection was made, so I inserted the ObjHighlight( ) where the connection was first made.

That works fine. Regardless of whether the host (this code) or the other end starts first, this code makes the connection, and now highlights the light when connecting.

But what I don’t understand is that the highlight ALSO happens when the other end BREAKS the connection.

If I kill the other end, then the light highlights, and then is OFF.

I have verified that there is one and only one ObjHighlight( ) node, and that it is indeed where I have listed it above. A breakpoint verifies that it is indeed called when the connection is BROKEN.

The RECEIVE MESSAGE subVI waits on a message on the given ConnID, and turns an error into a DISCONNECT command. It also sets the ConnID to NaN, in case of error.

So, this means that, when the other end breaks:

— The RECEIVE MESSAGE subVI sees the break, quits waiting, and reports a DISCONNECT command

— The DISCONNECT command turns off the light.

— The loop cycles

— We try to connect again

— We SUCCEED in connecting again ( why? )

— We highlight the light and turn it on, because we connected.

— We try to receive

— We fail, with a DISCONNECT command

— We turn off the light.

What I don’t understand is why the re-try succeeds. I break the other end by ABORTing the program that is running (on the PXI).

So, how is it that a stopped program can allow a new connection?

Now if I use the WAIT function with N = 0..125, nothing changes.

But if N = 150 or more, it does not exhibit this behavior. The light does NOT highlight when the connection is broken.

I can infer from that, that the system on the other end hasn’t fully reacted to the ABORT and still has the “ringer turned on” so that it can still hear an incoming call.

Somewhere between 125 and 150 mSec, it gets around to shutting off the ringer and all behaves as expected.

Oddly enough, if I insert an ObjHighlight( ) call in the DISCONNECT case, that delays things enough that it works right, but I don't want to depend on that.

--------------------

So, is this a bug in the PXI’s OpSys? Or just a natural consequence of the fact that it cannot do everything at once? Or is there something I’m missing?

I don’t mind highlighting the light on connection breakage; it’s the extra CONNECT/DISCONNECT events that I want to avoid.

And if I have to wait a while, how long do I have to wait? Is the 150 mSec a property of this particular model (PXI-8196), or LVRT2013, or what?

Steve Bird
Culverson Software - Elegant software that is a pleasure to use.
Culverson.com

Blog for (mostly LabVIEW) programmers: Tips And Tricks

nathand · ‎12-01-2014

CoastalMaineBird wrote:

What I don’t understand is why the re-try succeeds. I break the other end by ABORTing the program that is running (on the PXI).
So, how is it that a stopped program can allow a new connection?

--------------------

So, is this a bug in the PXI’s OpSys? Or just a natural consequence of the fact that it cannot do everything at once? Or is there something I’m missing?

I don’t mind highlighting the light on connection breakage; it’s the extra CONNECT/DISCONNECT events that I want to avoid.

And if I have to wait a while, how long do I have to wait? Is the 150 mSec a property of this particular model (PXI-8196), or LVRT2013, or what?

I suspect the correct solution here is not to ABORT the program on the RT side, but rather to provide a way for it to exit cleanly. It's not the program that's accepting a new connection, it's the operating system. My guess is that it takes a while for the operating system to clean up everything when you stop the program. Have you considered running Wireshark to capture the TCP data and watch the sequence that occurs when you abort the RT program? Try it, and post the log - I think it will help explain what's happening.
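To illustrate the point that the operating system, not the program, completes the TCP handshake, here is a minimal Python sketch (not LabVIEW, and not code from this thread). A listener is opened but the program never calls accept(), yet the client's connect still succeeds. The function name connect_without_accept is made up for illustration. Whether the PXI's RT operating system keeps the listener alive in this way for a short time after an abort is an assumption that only a capture would confirm.

```python
import socket


def connect_without_accept() -> bool:
    """Return True if a client can connect to a listener that never calls accept()."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)               # from here on the OS queues incoming connections
    port = listener.getsockname()[1]

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(1.0)
    try:
        # The application never calls accept(); the OS answers the handshake
        # on behalf of the listening socket.
        client.connect(("127.0.0.1", port))
        return True
    except OSError:
        return False
    finally:
        client.close()
        listener.close()


if __name__ == "__main__":
    print(connect_without_accept())
```

If the listening socket on the RT side outlives the aborted program by even a short time, a fast reconnect from the host would be accepted in the same way and then fail on the first read.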

CoastalMaineBird · ‎12-01-2014

I suspect the correct solution here is not to ABORT the program on the RT side, but rather to provide a way for it to exit cleanly.

Well, yes, but I need to guard against contingencies.

Have you considered running Wireshark to capture the TCP data and watch the sequence that occurs when you abort the RT program?

OK, I haven't used Wireshark and am not used to chasing things to that level of detail, but attached is the log you mentioned.

There's a ton of data there, but only an ounce of information, and I don't know how to dig it out.

I suspect that it's just that the OpSys on the PXI doesn't get notified that the program has quit until 130+ mSec later, but the host is able to notice the connection death and retry it before then.

The only solution I see is to wait for something longer than any reasonable (heh!) system would take (I have to handle multiple models of controllers) before attempting to reconnect.

Not difficult to do, but I need to know what I'm dealing with.
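The wait-before-reconnect idea can be sketched roughly as follows, in Python rather than LabVIEW. This is only an illustration of the loop structure under stated assumptions: reconnect_loop, open_conn, receive, keep_running and holdoff_s are hypothetical names, and the 0.5 s default is an arbitrary placeholder, not a measured or recommended value for any controller.

```python
import time
from typing import Callable, List, Optional


def reconnect_loop(
    open_conn: Callable[[], Optional[object]],
    receive: Callable[[object], Optional[str]],
    keep_running: Callable[[], bool],
    holdoff_s: float = 0.5,  # assumed value; tune per target
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Run the connect/receive loop and return the CONNECT/DISCONNECT events issued."""
    events: List[str] = []
    conn = None
    while keep_running():
        if conn is None:
            conn = open_conn()        # returns None if the attempt fails
            if conn is not None:
                events.append("CONNECT")
        else:
            command = receive(conn)   # None stands for the Disconnect command
            if command is None:
                events.append("DISCONNECT")
                conn = None
                sleep(holdoff_s)      # hold off before the next connect attempt
            # any other command would be handled here
    return events
```

The hold-off sits only in the disconnect branch, so normal message handling is not slowed down. The connect, receive and sleep functions are passed in so the loop can be exercised without a real network.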

nathand · ‎12-01-2014

You're right, there's too much information in that log - I can't quickly see what's going on and don't have time to dig into it. If possible I'd suggest that you not start the log until just before you abort the RT application, and perhaps that's what you've already done - although looking at the log I can see a transfer of what looks like the ni-rt.ini file as well as an XML configuration file, so I'm guessing that you started the log earlier.

CoastalMaineBird · ‎12-01-2014

You're right, there's too much information in that log

What I said was there's too much data and not enough information. 😉

Yeah, don't mess with it.

I forgot to detail what I had done:

--- I reverted the code to show the problem again.

--- I started the host

--- I started the WireShark log

--- I started the PXI program.

--- I aborted the PXI program.

--- I stopped the log.

"Data is not information, information is not knowledge, knowledge is not understanding, understanding is not wisdom." --- Clifford Stoll

GregFreeman · ‎12-01-2014

Not sure, but could the issues in this LAVA thread with regard to the NaN/Path/Ref have something to do with it? My guess is no, just glancing at things here (nothing looks to be in parallel), but I wanted to bring it up just in case.

Edit: read things a little closer and now I see what you're saying. This, then, isn't the issue. But, I will leave the link anyways.

CoastalMaineBird · ‎12-01-2014

Thanks for the link, GregFreeman, but I don't think that is involved. I don't have a race condition, in the classic sense. I have no parallel thread to look at the RefNum.

Although it is worth knowing that the Not-A-Refnum operator actually validates the thing (taking time), it's not just testing for 0, like I thought.

Thanks.

GregFreeman · ‎12-01-2014

@CoastalMaineBird wrote:

Thanks for the link, GregFreeman, but I don't think that is involved. I don't have a race condition, in the classic sense. I have no parallel thread to look at the RefNum.

Although it is worth knowing that the Not-A-Refnum operator actually validates the thing (taking time), it's not just testing for 0, like I thought.

Thanks.

Yes, and there is also the fact that, between the time you check the refnum and the time your case structure executes, the refnum could have gone invalid.

Oligarlicky · ‎12-01-2014

Not a refnum just tells you if you have a valid refnum, it doesn't tell you you have a valid connection. From memory, doing a TCP read is a pretty good indicator and will give you an error on bad connections.
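As a rough analogue of using a read to detect a dead connection, here is a small Python sketch (plain sockets, not the LabVIEW TCP Read node). The helper name receive_or_disconnect is hypothetical. It treats an empty read or a socket error as a disconnect, and deliberately lets a timeout propagate, on the assumption that a timeout only means no data has arrived yet.

```python
import socket
from typing import Optional


def receive_or_disconnect(conn: socket.socket, nbytes: int = 4096) -> Optional[bytes]:
    """Return received bytes, or None if the connection is closed or broken."""
    try:
        data = conn.recv(nbytes)
    except socket.timeout:
        raise            # a timeout means "no data yet", not a dead link
    except OSError:
        return None      # reset or other socket error: treat as a disconnect
    # An empty bytes object means the peer closed the connection.
    return data if data else None
```

A valid handle says nothing about the link itself, which is why the check has to be an actual read: the socket object stays usable as a Python object even after the peer has gone away.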

CoastalMaineBird · ‎12-01-2014

Not a refnum just tells you if you have a valid refnum, it doesn't tell you you have a valid connection.

Sure, I understand that. What I didn't know was that the test for Not-a-RefNum is NOT a simple test for zero, but a search thru some table somewhere.

It doesn't pertain to my original problem, but it's interesting to know nonetheless.
