Failure modes in TCP WRITE?

CoastalMaineBird · ‎04-15-2011

I need help diagnosing an issue where TCP communication breaks down between my host (Windows) and a PXI (LabVIEW RT 2010).

The bottom-line questions are these:

1...Are there circumstances in which TCP WRITE, given a string of say, 10 characters, will write more than zero and fewer than 10 characters to the connection? If so, what are those circumstances?

2...Is it risky to use a timeout value of 1 mSec? Further thought seems to say that I won't get a 1000 uSec timeout if we're using a 1-mSec timebase, but I don't know if that's true in the PXI.

Background:

On the PXI, I'm running a 100-Hz PID loop, controlling an engine. I measure the speed and torque, and control the speed and throttle. Along the way, I'm measuring 200 channels of misc stuff (analog, CAN, TCP instruments) at 10 Hz and sending gobs of info to the host (200 chans * 8 = 1600 bytes every 0.1 sec)

The host sends commands, the PXI responds.

The message protocol is a fixed-header, variable payload type: a message is a fixed 3-byte header, consisting of a U8 OpCode, and a U16 PAYLOAD SIZE field. I flatten some structure to a string, measure its size, and prepend the header and send it as one TCP WRITE. I receive in two TCP READs: one for the header, then I unflatten the header, read the PAYLOAD SIZE and then another read for that many more bytes.

The payload can thus be zero bytes: a TCP READ with a byte count of zero is legal and will succeed without error.
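The framing described above can be sketched in Python (the original is LabVIEW, so this is only an illustration). It assumes the header is big-endian, which is what LabVIEW's flatten normally produces; the names `pack_message` and `unpack_message` are hypothetical.

```python
import struct

# U8 opcode followed by U16 payload size; big-endian is an assumption here.
HEADER = struct.Struct(">BH")


def pack_message(opcode: int, payload: bytes) -> bytes:
    """Prepend the 3-byte header so the message can go out in one write."""
    return HEADER.pack(opcode, len(payload)) + payload


def unpack_message(buf: bytes):
    """Return (opcode, payload, rest), or None if buf lacks a full message."""
    if len(buf) < HEADER.size:
        return None  # header not complete yet
    opcode, size = HEADER.unpack_from(buf)
    end = HEADER.size + size
    if len(buf) < end:
        return None  # payload not complete yet
    return opcode, buf[HEADER.size:end], buf[end:]
```

A zero-length payload is simply a bare 3-byte header, which matches the zero-byte read case mentioned above.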

A test starts with establishing a connection, some configuration stuff, and then sampling starts. The 10-Hz data stream is shown on the host screen at 2-Hz as numeric indicators, or maybe selected channels in a chart.

At some point the user starts RECORDING, and the 10-Hz data goes into a queue for later writing to a file. This is while the engine is being driven thru a prescribed cycle of speed/torque target points.

The recording lasts for 20 or in some cases 40 minutes (24000 samples) and then recording stops, but sampling doesn't. Data is still coming in and charted. The user can then do some special operations, related to calibration checks and leak checks, and those results are remembered. Finally, they hit the DONE button, and the whole mess gets written to a file.

All of this has worked fine for several years, but as the system is growing (more devices, more channels, more code), a problem has cropped up: the two ends are occasionally getting out of synch.

The test itself, and all the configuration stuff before, is working perfectly. The measurement immediately after the test is good. At some point after that, it goes south. The log shows the PXI sending results for operations that were not requested. The data in those results is garbage; 1.92648920e-299 and such numbers, resulting from interpreting random stuff as a DBL.

After I write the file, the connection is broken, the next test re-establishes it, and all is well again.

In chasing all this, I've triple-checked that all my SENDs are MEASURING the size of the payload before sending it. Two possibilities have come up:

1... There is a message with a payload over 64k. If my sender were presented with a string of length 65537, it would convert that to a U16 of value 1, and the receiver would expect 1 byte. The receiver would then expect another header, but this data comes instead, and we're off the rails.

I don't believe that's happening. Most of the messages are fewer than 20 bytes payload, the data block is 1600 or so, I see no evidence for such a thing to happen.

2... The PXI is failing, under certain circumstances, to send the whole message given to TCP WRITE. If it sent out a header promising 20 more bytes, but only delivered 10, then the receiver would see the header and expect 20 more. 10 would come immediately, but whatever the NEXT message was, its header would be construed as part of the payload of the first message, and we're off the rails.
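Both possibilities can be reproduced on paper with a small Python sketch. It is not the LabVIEW code: the big-endian header layout and the helper names `frame` and `parse_stream` are assumptions made for illustration.

```python
import struct

# U8 opcode + U16 payload size; big-endian byte order is an assumption.
HEADER = struct.Struct(">BH")


def frame(opcode: int, payload: bytes) -> bytes:
    """Build a message, truncating the length to 16 bits the way an
    unchecked conversion to U16 would."""
    return HEADER.pack(opcode, len(payload) & 0xFFFF) + payload


def parse_stream(buf: bytes):
    """Split a byte stream into (opcode, payload) pairs, trusting every header."""
    messages, pos = [], 0
    while pos + HEADER.size <= len(buf):
        opcode, size = HEADER.unpack_from(buf, pos)
        start = pos + HEADER.size
        if start + size > len(buf):
            break  # payload not complete yet
        messages.append((opcode, buf[start:start + size]))
        pos = start + size
    return messages


# Possibility 1: a 65537-byte payload wraps to a size field of 1.
oversized = frame(1, b"X" * 65537)
print(HEADER.unpack_from(oversized))

# Possibility 2: the header promises 20 bytes but only 10 were written,
# so the next messages get swallowed into the first payload.
truncated = frame(1, b"A" * 20)[: HEADER.size + 10]
stream = truncated + frame(2, b"B" * 4) + frame(3, b"C" * 6)
print(parse_stream(stream))
```

In the second case the receiver consumes the whole of message 2 and the header of message 3 as payload, then tries to read a header out of payload bytes, which is the kind of garbage described in the log.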

Unfortunately, I am not checking the error return from TCP write, since it never failed in my testing here (I know, twenty lashes for me).

It also occurs to me that I am giving it a 1-mSec timeout value, since I'm in a 100-Hz loop. Perhaps I should have separated the TCP stuff into a separate thread. In any case, maybe I don't get a full 1000 uSec, due to clock resolution issues.

That would mean that TCP WRITE could not get all of the data written before the TIMEOUT expired, but had already written part of it.

I suspect, but the logs don't prove, that the point of failure is when they hit the DONE button. The general CPU usage on the PXI is 2-5% but at that point there are 12-15 DAQ domain managers to be shutting down, so the instantaneous CPU load is high. If that happens to coincide with a message going out, well, maybe the problem crops up. It doesn't happen every time.

So I repeat the two questions:

1...Are there circumstances in which TCP WRITE, given a string of say, 10 characters, will write more than zero and fewer than 10 characters to the connection? If so, what are those circumstances?

2...Is it risky to use a timeout value of 1 mSec? Further thought seems to say that I won't get a 1000 uSec timeout if we're using a 1-mSec timebase, but I don't know if that's true in the PXI.

Thanks,

Steve Bird
Culverson Software - Elegant software that is a pleasure to use.
Culverson.com

Blog for (mostly LabVIEW) programmers: Tips And Tricks

Jarrod_S. · ‎04-15-2011

If a TCP Write operation times out, it is possible that some of the data did indeed get put in the buffer and will be read by the other side. This is why there is a Bytes Written output on the TCP Write function, so you can determine what actually got put in the buffer.

To account for this, you can do the following:

1. Perform another TCP Write and send only the subset of the first packet that didn't get fully transmitted. Use Bytes Written wired into Get String Subset to get the remaining data.

2. Start with bigger timeouts.

3. In case of timeout, close the connection and force a reconnect so that the partially filled buffer data doesn't get processed by the other side.
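As a rough illustration of options 1 and 3 combined, here is a Python sketch rather than LabVIEW. Python's `socket.send` returns the number of bytes accepted, which plays the role of Bytes Written here; the function name `send_all` and the `max_attempts` parameter are hypothetical.

```python
import socket


def send_all(sock, data: bytes, max_attempts: int = 5) -> None:
    """Write all of data, resending only the unsent tail after a short write."""
    view = memoryview(data)
    sent = 0
    for _ in range(max_attempts):
        try:
            # send() reports how many bytes were actually accepted.
            sent += sock.send(view[sent:])
        except socket.timeout:
            pass  # nothing more was accepted on this attempt; try again
        if sent == len(data):
            return
    # Could not finish: the peer may be holding a partial message, so drop
    # the connection rather than let it parse the fragment.
    sock.close()
    raise ConnectionError(f"only {sent} of {len(data)} bytes written")
```

The point is that the caller never treats a write as complete until the byte count says so, and never leaves a half-written message on a connection it keeps using.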

Jarrod S.
National Instruments

gsussman · ‎04-15-2011

There are a couple of issues at play here, and both are working together to cause your issue(s).

1) LV RT will suspend the TCP thread when your CPU utilization goes up to 100%. When this happens, your connection to the outside world simply goes away and your communications can get pretty screwed up. (More here)

Unless you create some form of very robust resend and timeout strategy your only other solution would be to find a way to keep your CPU from maxing out. This may be through the use of some scheduler to limit how many processes are running at a particular time or other code optimization. Any way you look at it, 100% CPU = Loss of TCP comms.

2) The standard method of TCP communication shown in all examples I have seen to date uses a similar method to transfer data where a header is sent with the data payload size to follow.

On the Rx side, the header is read, the payload size extracted then a TCP read is set with the desired size. Under normal circumstances this works very well and is a particularly efficient method of transferring data. When you suspend the TCP thread during a Rx operation, this header can get corrupted and pass the TCP Read a bad payload size due to a timeout on the previous read. As an example the header read expects 20 bytes but due to the TCP thread suspension only gets 10 before the timeout. The TCP Read returns only those 10 bytes, leaving the other 10 bytes in the Rx buffer for the next read operation. The subsequent TCP Read now gets the first 2 bytes from the remaining data payload (10 bytes) still in the buffer. This gives you a further bad payload read size and the process continues OR if you happen to get a huge number back, when you try to allocate a gigantic TCP receive buffer, you get an out of memory error.

The issue now is that your communications are out of sync. The Rx end is not interpreting the correct bytes as the header, so this timeout or bad data payload behavior can continue for quite a long time. I have found that occasionally (although very rarely) the system will fall back into sync; however, it really is a crap shoot at this point.

A more robust way of dealing with the communication issue is to change your TCP read to terminate on a CRLF as opposed to the number of bytes or timeout (the TCP Read has an enum selector for switching the mode). In this instance, whenever a CRLF is seen, the TCP Read will immediately terminate and return data. If the payload is corrupted, then it will fail to be parsed correctly or will encounter a checksum failure, and be discarded or a resend request issued. In either case, the communications link will automatically fall back into sync between the Tx and Rx side. The one other thing that you must do is to encode your data to ensure that no CRLF characters exist in the payload. Base64 encode/decode works well. You do give up some bandwidth due to the B64 strings being longer; however, the fact that the comm link is now self-syncing is normally a worthwhile sacrifice.
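A minimal Python sketch of the self-syncing idea, assuming Base64 for the encoding and CRLF as the terminator as described above. The function names are hypothetical, and the checksum mentioned above is not implemented here; without one, a damaged line that still happens to be valid Base64 would get through.

```python
import base64
import binascii


def encode_frame(payload: bytes) -> bytes:
    """Base64 output never contains CR or LF, so CRLF can mark the end."""
    return base64.b64encode(payload) + b"\r\n"


def decode_stream(buf: bytes):
    """Return (payloads, remainder). Lines that fail to decode are dropped,
    and parsing picks up again at the next CRLF."""
    *lines, rest = buf.split(b"\r\n")
    payloads = []
    for line in lines:
        try:
            payloads.append(base64.b64decode(line, validate=True))
        except binascii.Error:
            continue  # corrupted frame: discard it and carry on
    return payloads, rest
```

For example, if the first frame is cut short and runs into the second, that merged line is thrown away, but the third frame still decodes normally because the receiver realigns on the terminator.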

When running on any platform other than RT, the <header><payload> method of transmitting data works fine, as TCP guarantees transmission of the data. On RT platforms, however, this method fails miserably due to the suspension of the TCP thread on high CPU excursions.

CoastalMaineBird · ‎04-15-2011

LV RT will suspend the TCP thread when your CPU utilization goes up to 100%.

I suspect this is what's happening. At the time they click the DONE button, I have 15 or so "domain" managers told to shutdown, all at one time. The SCXI domain clears a NI-DAQ task, the CAN domain clears a NI-CAN task, various NI-DAQ tasks are stopped. I suspect they are hogging the CPU and the TCP WRITE gets partially done.

My RECEIVE side logic is different from what you suggest in #2, though. I use a TCP READ, in BUFFERED mode, with a timeout of 0. That way, I get all or nothing: I don't see a partial message. In effect, there is no timeout, so there is no issue there.

To me, it seems like a lot of sacrifice to move to B64 or CRLF terminated strings. I have lots of net traffic to consider - the host is talking to four gas analyzers, the gas analyzers are sending their data (in ASCII) at 10 Hz to the PXI, there are four other units sending data at 10 Hz in binary, it's not just one host and one PXI talking.

I'm sending binary data - the probability of a CRLF being embedded in there is quite high (in this context, a 0.1% probability is quite high). So I would HAVE to encode and decode that into B64 or something. That increases both CPU load and net traffic. While it's true that it would recover sync, I need to find a way to avoid the problem in the first place.

I have a request in for the client to increase that timeout from 1 mSec to 3. That will guarantee me a timeout of 2000 uSec, I think. If that fixes the problem, then what I'll do is stagger the shutdown tasks, such that they don't all occur in parallel. I really don't care if they shut down in 10 mSec or 10 seconds. But the nature of the system means they're all called in parallel, so I guess it hits the CPU pretty hard.

Anyway, thanks for your thoughts.

Given the response from NI that it is indeed possible for TCP WRITE to send PART OF the string I give it, this is seeming like it is the heart of the problem.

Steve Bird

Ben · ‎04-15-2011

@jarrod S. wrote:

If a TCP Write operation times out, it is possible that some of the data did indeed get put in the buffer and will be read by the other side. This is why there is a Bytes Written output on the TCP Write function, so you can determine what actually got put in the buffer.

To account for this, you can do the following:

1. Perform another TCP Write and send only the subset of the first packet that didn't get fully transmitted. Use Bytes Written wired into Get String Subset to get the remaining data.

2. Start with bigger timeouts.

3. In case of timeout, close the connection and force a reconnect so that the partially filled buffer data doesn't get processed by the other side.

Hi Jarrod,

What you wrote contradicts the TCP/IP spec, which guarantees that packets are delivered intact (the receiver must ack a message before the sender considers it sent; otherwise it should retry).

For what you said to be true, the implementation of TCP/IP in RT would have to be buggy.

Ben

PS: I just did a Google search on the TCP/IP guarantee, and I am not alone in my impression that packets are guaranteed to be delivered intact and in order.

Retired Senior Automation Systems Architect with Data Science Automation LabVIEW Champion Knight of NI and Prepper LinkedIn Profile YouTube Channel

Ben · ‎04-15-2011

See this Wiki article on TCP/IP

In the second paragraph under the section "Layers in the TCP/IP model" you will find the phrase;

" TCP provides both data integrity and delivery guarantee (by retransmitting until the receiver acknowledges the reception of the packet)."

So again, if RT is passing incomplete packets, it is buggy.

Ben

CoastalMaineBird · ‎04-15-2011

Ben - the question that raises is "What is the definition of a packet?"

I sometimes send a string of 30k-40k characters. Is that a "packet"? I neither know nor care whether the low-level stuff splits that up into smaller chunks for its own purposes. It's possible that when I ask to send a 20-char string, it sends a legal packet of 12, gets interrupted, and tells me it sent 12. The 12 is a legal packet and is received and acknowledged by the other side.

I'm not an expert on the spec, but that would seem to be legal.

If that (illegal packet) were the case, the receiver side (on Windows) should reject the whole thing, and there would be no sync problem. I would miss the message, but sync would be maintained, wouldn't it?

Steve Bird

Ben · ‎04-15-2011

Hi Steve,

I will not be doing any of my own speculating on what a packet is when defined in the context of the API NI exposes for the TCP/IP stack.

But regarding your issue:

I suspect that the packet is not being acked within the timeout, and since you are not checking for that and retrying, the receiver is falling out of sequence. Bumping up the timeout value will not make it run any slower but will give it more time to complete the work. But even with a longer timeout, checking for success will at least let you know which end of the wire requires your attention.

Ben

CoastalMaineBird · ‎04-15-2011

Thanks, Ben

I have asked the client to bump up the timeout to 3 msec. I suspect that a 1 mSec could fail because of resolution: if I start at T=10uSec before the mSec tick, and check it 20 uSec later, the timeout could be expired. It's just a resolution issue.
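The resolution argument can be put into numbers with a small Python model. It assumes the timer simply counts whole ticks of a 1-mSec clock starting from the tick in which the call began; whether LabVIEW RT on the PXI actually works this way is not established in this thread, and the function name is hypothetical.

```python
def effective_timeout_us(start_us: int, timeout_ticks: int, tick_us: int = 1000) -> int:
    """Real time available before a tick-counting timeout expires.

    Model (an assumption): the deadline is the start tick plus
    timeout_ticks, so time already elapsed within the current tick is lost.
    """
    start_tick = start_us // tick_us
    deadline_us = (start_tick + timeout_ticks) * tick_us
    return deadline_us - start_us


# Starting 10 uSec before a tick boundary with a 1-tick timeout:
print(effective_timeout_us(990, 1))

# Worst case for a 3-tick timeout over every start offset within a tick:
print(min(effective_timeout_us(s, 3) for s in range(1000)))
```

Under this model a 1-mSec timeout can shrink to almost nothing, while a 3-mSec timeout always leaves slightly more than 2000 uSec.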

If the increased timeout fixes the issue (or at least relieves it) then, I will stagger the SHUTDOWN tasks so they don't all hit the CPU at once.

Steve Bird

gsussman · ‎04-15-2011

Steve....do you see this issue crop up in instances when the CPU is not maxed out at 100%?
