LabVIEW

TCP connection break / re-connect issue

Solved!

LV2013 on Win7, LVRT2013 on PXI-8196


Well, I thought I knew all there is to know about TCP connections - I’ve been using them for years, but I have found a new wrinkle.
 
I’m using a pattern I’ve used before, except this time I’m trying to automatically re-establish a broken connection:
 
ConnID = NaN
repeat
    if ConnID = NaN                     // if there is no connection...
        ConnID = TCP OPEN CONNECTION (Address, Port, Timeout = 1000)
        if ConnID = NaN                 // could not connect
            Flag = FALSE                // make sure light is off
        else
            Issue CONNECT event (ConnID)        // internal event
            CONNECTED LIGHT.ObjHighlight( )     // draw attention to the change in status
            Flag = TRUE                 // make sure light is on
        end if
    else
        Command = RECEIVE MESSAGE (ConnID, Timeout = Infinite)
        if Command = Disconnect
            Issue DISCONNECT event      // internal event
            *  Wait (N) mSec
            ConnID = NaN
            Flag = FALSE
        else
            Handle other command
            Flag = TRUE
        end if
    end if
    CONNECTED LIGHT = Flag
while RUNNING = TRUE
 
The *WAIT was inserted just for purposes of this question.
 
The basic logic is:
If there is no connection
    Try for 1000 mSec to make one
else
     Receive messages
end if
 
The pattern worked fine, without the ObjHighlight( ), for quite a while.
 
I decided I wanted to draw user attention to the fact that the connection was made, so I inserted the ObjHighlight( ) where the connection was first made.
 
That works fine. Regardless of whether the host (this code) or the other end starts first, this code makes the connection, and now highlights the light when connecting.
 
But what I don’t understand is that the highlight ALSO happens when the other end BREAKS the connection.
 
If I kill the other end, then the light highlights, and then is OFF.
 
I have verified that there is one and only one ObjHighlight( ) node, and that it is indeed where I have listed it above.  A breakpoint verifies that it is indeed called when the connection is BROKEN.
 
The RECEIVE MESSAGE subVI waits on a message on the given ConnID, and turns an error into a DISCONNECT command. It also sets the ConnID to NaN, in case of error.
 
So, this means that, when the other end breaks:
— The RECEIVE MESSAGE subVI sees the break, quits waiting, and reports a DISCONNECT command
— The DISCONNECT command turns off the light.
— The loop cycles
— We try to connect again
— We SUCCEED in connecting again ( why? )
— We highlight the light and turn it on, because we connected.
— We try to receive
— We fail, with a DISCONNECT command
— We turn off the light.
 
What I don’t understand is why the re-try succeeds.  I break the other end by ABORTing the program that is running  (on the PXI).
So, how is it that a stopped program can allow a new connection?
 
Now if I use the WAIT function with N = 0..125, nothing changes.
But if N = 150 or more, it does not exhibit this behavior.  The light does NOT highlight when the connection is broken.
 
From that I infer that the system on the other end hasn’t fully reacted to the ABORT and still has the “ringer turned on”, so it can still hear an incoming call.
 
Somewhere between 125 and 150 mSec, it gets around to shutting off the ringer and all behaves as expected.
 
[Attached images: Linkage 1.PNG, Linkage 2.PNG]
 
 
Oddly enough, if I insert an ObjHighlight( ) call in the DISCONNECT case, that delays things enough that it works right, but I don't want to depend on that.

--------------------
 
So, is this a bug in the PXI’s OpSys?  Or just a natural consequence of the fact that it cannot do everything at once? Or is there something I’m missing?
 
I don’t mind highlighting the light on connection breakage; it’s the extra CONNECT/DISCONNECT events that I want to avoid.
 
And if I have to wait a while, how long do I have to wait?  Is the 150 mSec a property of this particular model (PXI-8196), or LVRT2013, or what?
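A rough way to put a number on the "how long" question for a given controller is to measure it from the host: right after the break, keep probing the port and note when connection attempts start failing. The Python sketch below illustrates the idea under stated assumptions: `time_until_refused` is a hypothetical helper, the address and port are placeholders, and Python's `socket` module stands in for the LabVIEW TCP primitives. Each successful probe opens a real connection on the target, so this is a diagnostic, not something to leave in production code.

```python
import socket
import time


def time_until_refused(address, port, limit_s=5.0, probe_timeout_s=0.2, interval_s=0.01):
    """Probe (address, port) repeatedly and return the number of seconds
    until a connection attempt first fails, or None if attempts were still
    succeeding when limit_s expired. Hypothetical helper for illustration.
    """
    start = time.monotonic()
    while time.monotonic() - start < limit_s:
        try:
            probe = socket.create_connection((address, port), timeout=probe_timeout_s)
        except OSError:
            # Refused, reset, or timed out: the far end has stopped listening.
            return time.monotonic() - start
        probe.close()           # the connection was accepted; drop it and retry
        time.sleep(interval_s)  # small pause so the loop does not spin
    return None


# Example use, called immediately after the DISCONNECT is detected
# (placeholder address and port):
#     elapsed = time_until_refused("192.168.1.10", 5000)
```

Running this a number of times on each controller model would show whether the 125 to 150 mSec window is stable or varies with load, which is the information needed before picking a fixed delay.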
 
 
Steve Bird
Culverson Software - Elegant software that is a pleasure to use.
Culverson.com


Blog for (mostly LabVIEW) programmers: Tips And Tricks

Message 1 of 19

CoastalMaineBird wrote:
What I don’t understand is why the re-try succeeds.  I break the other end by ABORTing the program that is running  (on the PXI).
So, how is it that a stopped program can allow a new connection?
--------------------
So, is this a bug in the PXI’s OpSys?  Or just a natural consequence of the fact that it cannot do everything at once? Or is there something I’m missing?
 
I don’t mind highlighting the light on connection breakage; it’s the extra CONNECT/DISCONNECT events that I want to avoid.
 
And if I have to wait a while, how long do I have to wait?  Is the 150 mSec a property of this particular model (PXI-8196), or LVRT2013, or what?

I suspect the correct solution here is not to ABORT the program on the RT side, but rather to provide a way for it to exit cleanly. It's not the program that's accepting a new connection, it's the operating system. My guess is that it takes a while for the operating system to clean up everything when you stop the program. Have you considered running Wireshark to capture the TCP data and watch the sequence that occurs when you abort the RT program? Try it, and post the log - I think it will help explain what's happening.
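The point that the operating system, not the program, accepts the connection can be seen on an ordinary desktop TCP stack. In the minimal Python sketch below (the function name is hypothetical, and whether the RT operating system on the PXI behaves identically is an assumption, not something verified here), a listener that never calls `accept()` still lets a client connect, because the stack completes the handshake and queues the connection on the listener's behalf.

```python
import socket


def connect_without_accept() -> bool:
    """Return True if a client can connect to a listener that never calls accept()."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)               # from here on the OS queues incoming connections
    port = listener.getsockname()[1]

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(2.0)
    try:
        # The application never calls accept(); the OS completes the
        # handshake and parks the connection in the listener's queue.
        client.connect(("127.0.0.1", port))
        return True
    except OSError:
        return False
    finally:
        client.close()
        listener.close()


if __name__ == "__main__":
    print(connect_without_accept())
```

So as long as the listening socket has not yet been torn down, a connect attempt can succeed even though no code is servicing it, which matches the symptom of a reconnect that succeeds and then immediately fails on the first read.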

Message 2 of 19

"I suspect the correct solution here is not to ABORT the program on the RT side, but rather to provide a way for it to exit cleanly."

Well, yes, but I need to guard against contingencies.

"Have you considered running Wireshark to capture the TCP data and watch the sequence that occurs when you abort the RT program?"

OK, I haven't used Wireshark and am not used to chasing things to that level of detail, but attached is the log you mentioned.

 

There's a ton of data there, but only an ounce of information, and I don't know how to dig it out.

 

I suspect that it's just that the OpSys on the PXI doesn't get notified that the program has quit until 130+ mSec later, but the host is able to notice the connection death and retry it before then. 

 

The only solution I see is to wait for something longer than any reasonable (heh!) system would take (I have to handle multiple models of controllers ) before attempting to reconnect.

 

Not difficult to do, but I need to know what I'm dealing with.
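The wait-before-reconnect idea can be sketched as follows. This is only an illustration in Python, with the `socket` module standing in for the LabVIEW TCP primitives; `client_loop`, `should_run`, and `on_event` are hypothetical names, and the 0.5 s hold-off is an assumed value (the only data point in this thread is that 150 mSec was enough on one PXI-8196).

```python
import socket
import time

HOLDOFF_S = 0.5  # assumed value, not a documented figure for any controller


def client_loop(address, port, should_run, on_event, holdoff_s=HOLDOFF_S):
    """Connect, receive until the link breaks, then wait before reconnecting.

    should_run: callable returning False when the loop should stop.
    on_event:   callable receiving "CONNECT", "DATA", or "DISCONNECT".
    """
    conn = None
    while should_run():
        if conn is None:
            try:
                conn = socket.create_connection((address, port), timeout=1.0)
            except OSError:
                time.sleep(0.1)    # could not connect; pause briefly and retry
                continue
            conn.settimeout(None)  # block indefinitely while receiving
            on_event("CONNECT")
        else:
            try:
                data = conn.recv(4096)
            except OSError:
                data = b""         # treat any socket error as a disconnect
            if data:
                on_event("DATA")
            else:
                conn.close()
                conn = None
                on_event("DISCONNECT")
                # Give the far end time to finish tearing down its listener
                # before the next connection attempt.
                time.sleep(holdoff_s)
    if conn is not None:
        conn.close()
```

The hold-off is applied only after a detected disconnect, so it does not slow down the first connection or normal message handling.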

Steve Bird

Message 3 of 19

You're right, there's too much information in that log - I can't quickly see what's going on and don't have time to dig into it. If possible I'd suggest that you not start the log until just before you abort the RT application, and perhaps that's what you've already done - although looking at the log I can see a transfer of what looks like the ni-rt.ini file as well as an XML configuration file, so I'm guessing that you started the log earlier.

Message 4 of 19

"You're right, there's too much information in that log"

What I said was there's too much data and not enough information. 😉

Yeah, don't mess with it.

I forgot to detail what I had done:

--- I reverted the code to show the problem again.
--- I started the host.
--- I started the Wireshark log.
--- I started the PXI program.
--- I aborted the PXI program.
--- I stopped the log.

 

"Data is not information, information is not knowledge, knowledge is not understanding, understanding is not wisdom." --- Clifford Stoll

Steve Bird

Message 5 of 19

Not sure, but could the issues in this LAVA thread with regard to the NaN/Path/Ref have something to do with it? My guess is no, just glancing at things here (nothing looks to be in parallel), but I wanted to bring it up just in case.

 

Edit: read things a little closer and now I see what you're saying. This, then, isn't the issue. But, I will leave the link anyways.

Message 6 of 19

Thanks for the link, GregFreeman, but I don't think that is involved.  I don't have a race condition, in the classic sense.  I have no parallel thread to look at the RefNum.

 

Although it is worth knowing that the Not-A-Refnum operator actually validates the thing (taking time), it's not just testing for 0, like I thought.

 

Thanks.

Steve Bird

Message 7 of 19

@CoastalMaineBird wrote:

Thanks for the link, GregFreeman, but I don't think that is involved.  I don't have a race condition, in the classic sense.  I have no parallel thread to look at the RefNum.

 

Although it is worth knowing that the Not-A-Refnum operator actually validates the thing (taking time), it's not just testing for 0, like I thought.

 

Thanks.


Yes, and there is also the fact that between the time you check the refnum and the time your case structure executes, the refnum could have gone invalid.

Message 8 of 19

Not a refnum just tells you if you have a valid refnum, it doesn't tell you you have a valid connection. From memory, doing a TCP read is a pretty good indicator and will give you an error on a bad connection.
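The difference between a valid handle and a live connection can be shown with a read. In this Python sketch (Python's `socket` module is a stand-in for LabVIEW's TCP Read, and `peer_closed` is a hypothetical helper), the socket object stays perfectly valid after the peer goes away; only the read reveals that the connection is dead.

```python
import socket


def peer_closed(conn: socket.socket, timeout: float = 0.5) -> bool:
    """Return True if a read shows the connection is dead.

    Returns False if data is waiting or if nothing arrives within the timeout.
    MSG_PEEK leaves any pending data in the receive buffer.
    """
    conn.settimeout(timeout)
    try:
        data = conn.recv(1, socket.MSG_PEEK)
    except socket.timeout:
        return False    # nothing to read yet, but no error either
    except OSError:
        return True     # reset or some other socket error
    return data == b""  # an empty read means the peer closed its end


if __name__ == "__main__":
    a, b = socket.socketpair()
    print(peer_closed(a, 0.2))  # peer still open
    b.close()
    print(peer_closed(a, 0.2))  # peer has closed
    a.close()
```

Note that a timeout is deliberately not treated as a failure here: an idle but healthy connection also produces no data.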

Message 9 of 19

"Not a refnum just tells you if you have a valid refnum, it doesn't tell you you have a valid connection."

Sure, I understand that.  What I didn't know was that the test for Not-a-RefNum is NOT a simple test for zero, but a search thru some table somewhere.

It doesn't pertain to my original problem, but it's interesting to know nonetheless.

Steve Bird

Message 10 of 19