04-15-2011 01:24 PM
do you see this issue crop up in instances when the CPU is not maxed out at 100%?
Well, I can't tell directly how loaded the CPU is. It's 1800 miles away from me at the moment.
The normal load on the CPU is between 2% and 5%; I've measured it consistently that way. That's the AVERAGE with a 100 Hz PID loop, collecting everything else at 10 Hz, and sending 1600-byte blocks every 100 ms. Not bad. But an AVERAGE isn't killing me; it's the PEAK I think is killing me.
And I don't have a precise picture of where the trouble starts. I have files from about 30 tests. I do know that a 20 or 40-minute test goes without a hitch, every time.
Within 70-90 seconds after the end of that test, they do a background measurement. That goes without a hitch, every time.
Between 60 and 180 seconds after THAT, I see spurious entries in the log: results from operations that weren't requested, and such, with garbage numbers. Not every time, but it failed about 10 times out of that 30, and it fails in that time frame. But that won't happen until I get a loggable message header (only about 60 OpCodes of the 255 are used, only about 20 of those would trigger a log entry). So it could have been off track for 2 minutes before the log shows it.
I suspect that they decide they're done with calibration and all, and they hit the DONE button. At that point, ALL domains are told to STOP SAMPLING, then ALL domains are told to SHUT DOWN.
My DAQ structure is such that that one command gets sent to 12-15 domain managers IN PARALLEL. That means SCXI, AO, DO, DI, COUNTER, and NI-CAN tasks are being told to shut down, UDP receivers are shut down, lots of memory is being de-allocated, and all sorts of cleanup is going on.
I suspect that THAT is the CPU pressure that causes the TCP WRITE to surpass the lame TIMEOUT value I gave it, and if it's midway through a message packet, then I am hosed.
At least, that scenario fits all the facts that I have.
Blog for (mostly LabVIEW) programmers: Tips And Tricks
04-15-2011 04:54 PM
@Ben wrote:
@jarrod S. wrote:
If a TCP Write operation times out, it is possible that some of the data did indeed get put in the buffer and will be read by the other side. This is why there is a Bytes Written output on the TCP Write function, so you can determine what actually got put in the buffer.
To account for this, you can do the following:
1. Perform another TCP Write and send only the subset of the first packet that didn't get fully transmitted. Use Bytes Written wired into Get String Subset to get the remaining data.
2. Start with bigger timeouts.
3. In case of timeout, close the connection and force a reconnect so that the partially filled buffer data doesn't get processed by the other side.
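Those three mitigations can be sketched with ordinary sockets (a Python illustration, since LabVIEW code is graphical; the function name and retry cap are my own, not anything from the LabVIEW palette):

```python
import socket

def send_all_with_retry(sock: socket.socket, payload: bytes,
                        timeout_s: float, max_retries: int = 3) -> int:
    """Resend only the unsent tail after a timeout (suggestion 1 above).

    send() may buffer just part of the data and returns the count,
    analogous to the Bytes Written output of TCP Write.
    """
    sock.settimeout(timeout_s)      # suggestion 2: a generous timeout
    sent_total = 0
    retries = 0
    while sent_total < len(payload):
        try:
            # Send only the remaining subset of the original packet.
            sent_total += sock.send(payload[sent_total:])
        except socket.timeout:
            retries += 1
            if retries > max_retries:
                # Suggestion 3: give up and close the connection so the
                # receiver never processes a half-delivered message.
                sock.close()
                raise
    return sent_total
```

The key point is that the retry sends `payload[sent_total:]`, never the whole packet again, so the receiver sees each byte exactly once.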
Hi Jarrod,
What you wrote contradicts the TCP/IP spec, which guarantees that packets are delivered intact (the receiver must ACK a message before the sender considers it sent; otherwise it retries).
For what you said to be true, the TCP/IP implementation in RT would have to be buggy.
Ben
PS: I just did a Google search on TCP/IP guarantee and I am not alone in my impression that packets are guaranteed to be delivered intact and in order.
Ben,
What you are saying is not quite right. Actually, both you and Jarrod are correct; you have to specify at what level you are talking about the data and the acknowledgements. At the TCP layer it is possible to receive a partial packet and not have a buggy RT stack, as you put it. If you try to send a large chunk of data with a 1 ms timeout, you almost guarantee that you will get a write timeout: X number of bytes will be transmitted, and chances are that at the TCP layer all of that data will be acknowledged. Acknowledgements at the TCP layer have no concept of the total data length and can only guarantee delivery of the data that got to the wire.
When you set the write timeout, you are telling the stack that it has that much time to deliver all of the data you gave it. When the timeout occurs, any data that didn't make it to the wire is tossed by the stack. At the transport layer the receiver can acknowledge receipt of all of the data that made it to the wire, and the sender will think everything is fine at that layer. Both ends of the connection will be in sync and up to date at the transport layer.
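You can see the "only X bytes got to the wire, the rest is tossed" behavior directly with a non-blocking send (a Python illustration of the same stack behavior; the loopback setup and buffer size are arbitrary choices of mine):

```python
import socket

# A loopback TCP pair with a deliberately small send buffer.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
cli = socket.socket()
cli.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4096)
cli.connect(srv.getsockname())
conn, _ = srv.accept()

cli.setblocking(False)            # behaves like a write with a ~0 ms timeout
payload = b"x" * 1_000_000        # one big application-layer "packet"
try:
    n = cli.send(payload)         # accepts only what fits in the buffers
except BlockingIOError:
    n = 0
# n is the "Bytes Written": the stack took a prefix of the message and will
# deliver (and ACK) just that prefix; the rest never existed as far as TCP
# is concerned. Recovery is purely an application-layer problem.
```

The transport layer stays perfectly consistent here; only the application's idea of a complete message is violated.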
At the application layer however the receiver will have incomplete data and the sender will also (provided it checked for an error on the write) will know that only a partial transfer has occurred. It would be up to the application layer to implement any type of recovery for the data.
A 1 ms timeout is very aggressive in terms of networking. If you are on a local subnet this may work fine. If you are going over a larger network you can easily hit this timeout, especially if your data is more than one TCP packet. This can also be an issue on a very busy network.
As for the definition of a packet, it depends again on whether you are talking about the application's idea of a packet or TCP's definition of a packet. The protocol used at the application layer defines what it thinks packets are. At the TCP layer the packet is literally what gets put on the wire. You could actually have multiple application-layer packets in one TCP packet, or an application-layer packet may span multiple TCP packets.
Since you are sending small amounts of data, you can also run into latency issues if you have Nagle's algorithm enabled. This algorithm tries to send as much data as possible in a TCP packet and will actually delay putting data on the wire for a brief time if the data it is instructed to send is smaller than a single TCP packet. The network stack will try to send complete TCP packets.
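Disabling Nagle's algorithm is a one-line socket option (Python shown; the `TCP_NODELAY` constant exists on every major stack, though LabVIEW exposes it through its own configuration mechanism):

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# TCP_NODELAY turns Nagle's algorithm off: small writes go to the wire
# immediately instead of waiting to be coalesced into a fuller packet.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
```

The trade-off is more, smaller packets on the wire in exchange for lower per-message latency, which is usually the right choice for small command/response traffic like this.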
On your receiver's side I would use your 0 timeout read for only the packet header. The following read for the x bytes should have a timeout. That way if the sender stalls your read will timeout and you can safely throw that message away. If the sender has timed out on its write the remaining data on that end will also get dropped. Your next read (waiting for the packet header) should get you back in sync.
04-15-2011 05:04 PM
On your receiver's side I would use your 0 timeout read for only the packet header. The following read for the x bytes should have a timeout. That way if the sender stalls your read will timeout and you can safely throw that message away. If the sender has timed out on its write the remaining data on that end will also get dropped.
I don't see how to make that work in my case. I don't know when the next message occurs. There are times when I send 3-4 messages in a row. If one gets PARTIALLY tossed by the xmit end, then the next one comes across, and I'm waiting on a payload, so I'll take it, and I'm hosed. I would have to implement some sort of minimum message spacing to make that work.
04-15-2011 05:13 PM
@CoastalMaineBird wrote:
On your receiver's side I would use your 0 timeout read for only the packet header. The following read for the x bytes should have a timeout. That way if the sender stalls your read will timeout and you can safely throw that message away. If the sender has timed out on its write the remaining data on that end will also get dropped.
I don't see how to make that work in my case. I don't know when the next message occurs. There are times when I send 3-4 messages in a row. If one gets PARTIALLY tossed by the xmit end, then the next one comes across, and I'm waiting on a payload, so I'll take it, and I'm hosed. I would have to implement some sort of minimum message spacing to make that work.
No, you wouldn't have to worry about this. On the sender's side, if all is well, all of your data will be sent, even for your multiple messages; you will only encounter a timeout and missing data for the last message. I would disable the Nagle algorithm on the sender's side, which would guarantee each message is sent in its own packet. If a timeout occurs, only that packet will be dropped, and the next message will be delayed, because the data on your sending side is being sent serially. You can extend your timeout on your sending side to 10 or 20 ms; when all is going well, your sending side will not take this long. You will only encounter the timeout when there are issues, such as the TCP thread being blocked because of the CPU usage. If your receiver uses the same timeout, you will only get the error on the message that was affected. Your next message will already have been delayed past this point on the sender's side.
There is still a slight chance you can get out of sync but you will greatly reduce your chances using this approach. If you wait infinitely for the packet payload you have a much greater chance of getting out of sync because you will start to merge messages.
04-15-2011 05:14 PM - edited 04-15-2011 05:19 PM
Based on my experience with RT and TCP communications, if the TCP thread gets suspended, even for a short period, then all bets are off as to what the outcome will be when it comes back online.
I have seen issues where the connection will come back and miss a couple of bytes (this is when the Tx and Rx fall out of sync).
In other cases, the connection is dropped completely (the client side goes into TIME_WAIT). Probably the most insidious failure was where the connection between the RTOS and LVRT silently failed: the client did not receive closed-connection errors and was able to continue transmitting data, while on the RT side the TCP Read functions returned error 56 (timeout) and never reported any connection errors. A NETSTAT on the client showed the socket still in the ESTABLISHED state.
When I first started doing RT/TCP systems years ago I ran into this problem and went round and round trying to find a solution. Due to the many different failure modes, what I came up with was the following:
1. On any RT system, where a TCP connection is established, there will always be a "heartbeat" signal from the client to the RT system when there is no "active" traffic going on. If no data is received within the timeout (typically 2-5 sec), I assume the connection has been FUBAR'd and the TCP socket is closed on the RT side and a new listener established that will wait for the client to reconnect.
a. Every open socket connection was required to have a heartbeat signal implemented
b. This addressed the issue of the broken RTOS/LVRT connection
2. All communications were converted to CRLF delimited and B64 encoded.
a. This addressed the issue of falling out of sync
b. I realize that I sacrifice bandwidth by doing this (~33% greater packet size)
I would think that if you were running a Gigabit network then your system should be able to handle the extra bandwidth of B64 encoding
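The CRLF-delimited Base64 scheme is easy to sketch (Python; the function names are mine). The reason it fixes the sync problem is that the Base64 alphabet contains neither CR nor LF, so the delimiter can never appear inside a message and the receiver can always resynchronize on the next CRLF:

```python
import base64

def encode_message(payload: bytes) -> bytes:
    # Base64 text plus a CRLF terminator; the ~33% size overhead buys an
    # unambiguous message boundary in the byte stream.
    return base64.b64encode(payload) + b"\r\n"

def decode_stream(buf: bytes):
    """Split a receive buffer into decoded messages plus the unparsed tail."""
    messages = []
    while b"\r\n" in buf:
        line, buf = buf.split(b"\r\n", 1)
        try:
            messages.append(base64.b64decode(line, validate=True))
        except ValueError:
            pass  # corrupt fragment (e.g. a torn message): discard and resync
    return messages, buf
```

A torn or garbled line fails Base64 validation and is simply dropped; the stream is back in sync at the very next CRLF, with no connection reset needed.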
04-15-2011 05:23 PM
@gsussman wrote:
I would think that if you were running a Gigabit network then your system should be able to handle the extra bandwidth of B64 encoding
Another thought regarding your connection is to implement a CRC on the data coming across the socket. By virtue of the way TCP works, the communication is guaranteed, so any CRC failure in the data payload should be the result of the RT side losing sync. Under that assumption, you could close and re-establish the connection and carry on from there.
The performance hit you may take for using B64 encoding may not be the network itself but the data conversion on the system.
I was also thinking of suggesting the use of a CRC for message validation. It doesn't necessarily help you get back in sync but will help to avoid processing garbage. Your suggestion of closing the connection and forcing the sender to open a new one helps resolve that issue.
You can also put in some validation when reading the header. If you know that your largest message will be X bytes, you could test your message size and throw an error (possibly close the connection as suggested) and not try to read some obscenely large amount of data.
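Combining the two suggestions, a sanity-checked header might look like this (a Python sketch; the 4-byte length + 4-byte CRC32 header layout and the size cap are my own assumptions, not the actual protocol in this system):

```python
import zlib

MAX_PAYLOAD = 64 * 1024  # assumed upper bound on any legitimate message

def frame(payload: bytes) -> bytes:
    # Header: 4-byte big-endian length, then 4-byte CRC32 of the payload.
    header = len(payload).to_bytes(4, "big") + zlib.crc32(payload).to_bytes(4, "big")
    return header + payload

def validate_header(header: bytes) -> int:
    length = int.from_bytes(header[:4], "big")
    if length > MAX_PAYLOAD:
        # Almost certainly a desynced stream: treat the length as garbage
        # rather than trying to read an obscenely large amount of data.
        raise ValueError(f"implausible length {length}; close and resync")
    return length

def validate_payload(header: bytes, payload: bytes) -> bool:
    # CRC mismatch with TCP's link-level guarantees intact means the two
    # ends have fallen out of sync, not that bits flipped on the wire.
    return zlib.crc32(payload) == int.from_bytes(header[4:8], "big")
```

On a size or CRC failure, the safest recovery is the one suggested above: close the connection and force the sender to reconnect.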
04-15-2011 05:31 PM - edited 04-15-2011 05:36 PM
.....and not try to read some obscenely large amount of data.
I normally limit the allowed size of the TCP read to preclude just this issue. When running in debug mode (LV dev system connected) I often saw the "Not enough memory to complete operation" dialog when the TCP link fell out of sync. Unfortunately if the TCP thread got suspended I normally got the "Connection to the RT system has been lost" dialog first.
I think that this is the same reason that Steve is seeing the massively large or small numbers in the received data stream.
05-20-2011 05:30 PM
For what it's worth, this issue has been diagnosed as a faulty NIC, of all things. Replacing the network adapter resulted in perfect operation, every time.
I have left the increased TIMEOUT in as a general principle, but I finally get to blame the hardware guys 😉