
BUG: Linux-based cRIO doesn't handle TCP partial writes correctly, or at least not the same way as Windows or VxWorks targets

Setup:

* STM server running on a cRIO-9068 (LabVIEW RT)

* STM client running on Windows (LabVIEW, VB.NET)

 

If for some reason the STM client can't keep up with the streaming rate of the server, TCP packets will eventually fill the server's TCP stack send buffer/queue. When that happens, the non-blocking "TCP Write.vi" is subject to partial writes, i.e. part of the message passed to "STM Write.vi" is written before "STM Write.vi" returns with a timeout. When this happens, the STM stream between the client and server becomes desynchronized, leading to the client stopping on "STM errors" or "out of memory" errors.

 

This does not happen when the server runs on Windows, as the Windows version of the non-blocking "TCP Write.vi" apparently doesn't do partial writes: either the complete buffer passed to "TCP Write.vi" is written, or none of it at all.

 

Running the server on a cRIO-9075 doesn't show any issues, which leads me to believe this problem exists only on Linux-based cRIOs. But I don't have any other devices to verify this theory.

 

A Google search for "linux tcp partial write" returns a few links where setting the socket option "SOCK_SEQPACKET" should disable partial writes on a non-blocking socket. Unfortunately, as far as I know there is no way to set any socket options with LabVIEW.

 

Although this problem might be easier to reproduce using STM, it is not limited to STM communication over TCP sockets, as it is a problem with "TCP Write.vi" itself.

 

Calling "TCP Write.vi" in a loop until the total transmitted bytes equal the original buffer length is not an option, as this would defeat the purpose of using a non-blocking socket in the first place.


Has anyone experienced a similar issue?

 

Danny

 

0 Kudos
Message 1 of 8
(6,151 Views)

It's actually possible, but not trivial. You can use the Call Library Function Node to call system functions in a shared library. There are a few difficulties:

 

1) The LabVIEW network refnum is not the same as a system socket file descriptor. But there is a VI in LabVIEW under vi.lib/Utility/tcp.llb/TCP Get Raw Net Object.vi that returns the underlying operating system socket file descriptor/handle.

 

2) Calling setsockopt() in libc.so is also possible. However, the actual numeric values you must use differ between operating systems. Even though Winsock is based on the same BSD socket interface that is used on Linux, the numeric defines vary between them, so that is a bit of a problem.
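To make this concrete, here is a minimal C sketch of what such a call looks like at the libc level; from LabVIEW, the same arguments would be mapped onto a Call Library Function Node pointed at libc.so, with the numeric constants taken from the target's headers. TCP_NODELAY is used here purely as an example of a per-socket option, not as a fix for the partial-write issue:

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Set a per-socket option on an already-open descriptor.
 * In LabVIEW, 'fd' would come from TCP Get Raw Net Object.vi.
 * The numeric values behind IPPROTO_TCP and TCP_NODELAY come from
 * the platform headers and differ between Linux and Winsock, which
 * is why hard-coding them inside a VI is fragile. */
int set_nodelay(int fd)
{
    int one = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof one);
}
```

In a VI you would pass the raw descriptor as an integer, the option level and name as integer constants matching the target OS, and the option value by pointer.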

 

However, the SOCK_SEQPACKET that you mention is not a socket option but rather a socket type that has to be specified when creating the socket. This is not accessible from within LabVIEW, since the creation of the socket is handled inside the Open function and can't be altered.

 

I can't easily see an option that could be enabled or disabled on an existing socket, at any level, to change this specific behaviour. SOCK_SEQPACKET really isn't an option: it is simply a form of socket that has the semantics of SOCK_DGRAM (used for UDP) and the packet boundary guarantees of SOCK_STREAM (used for TCP).
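A short C sketch of why the type can't be retrofitted: the type is the second argument to socket() itself, fixed before any option call can run. (AF_UNIX is used here because it supports SOCK_SEQPACKET on Linux; plain TCP over AF_INET does not.)

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* The socket type is chosen at creation time; there is no
 * setsockopt() that turns an existing SOCK_STREAM socket into a
 * SOCK_SEQPACKET one, which is why a connection LabVIEW has already
 * opened as a TCP stream can't be switched over afterwards. */
int make_seqpacket_socket(void)
{
    return socket(AF_UNIX, SOCK_SEQPACKET, 0);
}
```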

 

The only option I could see having something to do with your behaviour might be the TCP_CONGESTION option, available since kernel 2.6.13, which allows setting different congestion algorithms. But the actually available algorithms depend on the kernel compilation, and some of them are only available to privileged users.

Rolf Kalbermatter
My Blog
Message 2 of 8
(6,125 Views)

One more thing: there is a good chance that this happens on the desktop Linux version of LabVIEW too, since this is clearly a function of the socket library software stack. And it wouldn't surprise me if the Windows and VxWorks behaviour is something that got "fixed" at some point in the Linux TCP/IP stack for certain reasons. It's definitely debatable what the right solution is when running into an overflow situation.

 

As for blocking versus non-blocking sockets: LabVIEW internally always uses non-blocking sockets for the network communication layer. The synchronous blocking mode for the Read function is implemented in the LabVIEW layer, so that LabVIEW can actively arbitrate CPU resources while waiting for data to arrive. Yes, it could spawn threads instead and let the OS handle all that, but network communication was added to LabVIEW before it supported OS-native multithreading, and that required active arbitration of the CPU in LabVIEW itself whenever a blocking function was called.

Rolf Kalbermatter
My Blog
Message 3 of 8
(6,089 Views)

Rolfk, thanks for the clarification/correction. I should have looked a bit deeper before posting.

 

NI was able to reproduce the issue and is filing a corrective action request to handle these cases more effectively in future versions of LabVIEW.

 

Regards,

Danny

0 Kudos
Message 4 of 8
(6,065 Views)

I finally managed to reproduce the issue on a VxWorks-based cRIO.

 

The real culprit seems to be STM's "Write Message (TCP).vi", which doesn't handle partial writes correctly. But without support from the lower TCP stack, I can't see how it can be fixed when specifying a timeout to STM's "Write Message (TCP).vi".

 

As for my Windows test, it is probably not valid: the client and server were running on the same host, so communication went through the loopback interface instead of an Ethernet card, which probably handles this case differently.

 

Danny

0 Kudos
Message 5 of 8
(6,017 Views)

Basically, if you specify a timeout, you tell the Write to abort after some time. There is no way to tell the socket to cancel data that it has already handed down to the stack, other than closing the connection. The Write simply holds on to the buffer, checks periodically whether the socket is able to accept more data, and then attempts to pass it whatever data you have passed in. The socket write returns a value indicating how much data was really accepted, and the LabVIEW Write then marks the buffer up to that point as sent, attempting to send the rest later on. Once the timeout occurs it returns, indicating to you how much data has actually been sent. You can check (on a timeout error, or possibly other errors) whether this number is greater than 0 but smaller than the buffer you passed in, and then close the connection and establish a new one to allow the remote end to resync to the data stream.
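The retry-until-deadline behaviour described above can be sketched in C against the plain BSD socket API. This is an illustration of the logic, not LabVIEW's actual implementation, and the timeout handling is deliberately crude:

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Keep offering the unsent remainder to a socket until either
 * everything is accepted or the wait times out.  Returns the number
 * of bytes the stack actually took; the caller compares that against
 * 'len' and, on a short count, closes the connection so the remote
 * end can resync to the data stream. */
ssize_t send_with_timeout(int fd, const char *buf, size_t len, int timeout_ms)
{
    size_t sent = 0;
    struct pollfd pfd = { .fd = fd, .events = POLLOUT };

    while (sent < len) {
        /* Crude: the full timeout budget is granted per wait. */
        int ready = poll(&pfd, 1, timeout_ms);
        if (ready <= 0)
            break;                      /* timed out, or poll error */
        ssize_t n = send(fd, buf + sent, len - sent, MSG_NOSIGNAL);
        if (n > 0)
            sent += (size_t)n;
        else if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK)
            break;                      /* hard socket error */
    }
    return (ssize_t)sent;
}
```

A caller detecting `0 < returned < len` after a timeout would close and reopen the connection, exactly as described above.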

 

Yes, reliable network communication can get nasty to implement. But that is not a LabVIEW problem; it is inherent in how the network interface abstracts the actual communication channel, and it will occur in every programming language. HTTP usually solves the problem in a different way, by creating a new connection for every command-response pair.

Rolf Kalbermatter
My Blog
0 Kudos
Message 6 of 8
(6,007 Views)

@rolfk

 

I understand. The real issue is with STM: the way STM's "Write Message (TCP).vi" is written is the culprit. It doesn't check the "bytes written" value returned by its call to "TCP Write.vi", and it doesn't pass it to the caller either.

 

If STM can't use a "SOCK_SEQPACKET" type of connection over AF_INET, then a compromise might be for STM's "Write Message (TCP).vi" to check the "bytes written" value returned by "TCP Write.vi" and either:

* If bytes written is 0 and there is a TCP timeout error, return "timeout" to the caller.

* If bytes written is > 0 and there is a TCP timeout error, call "TCP Write.vi" in a loop until the complete STM message is written, even if "Write Message (TCP).vi" then actually takes longer than "timeout" to send the message.

 

STM is message based: either the complete message is sent or nothing at all. Currently, it is possible for STM's "Write Message (TCP).vi" to send a partial message, with no way for the caller to be aware of this or to do anything about it.

 

But I have been doing more tests between two Windows nodes, one running the client and the other running the server, and I still haven't been able to replicate the issue when the server runs under Windows. Using my test code, I can get the client to crash within at most 5 minutes when the server is running on either a cRIO-9068 or a cRIO-9075, but I have now been running for over an hour with the server on Windows, with still no sign of STM stream corruption due to partial writes.

 

I can clearly see that, even under Windows, STM writes frequently time out, but for some reason the Windows server node is not producing partial writes. It makes me wonder why it seems to work correctly under Windows: is it LabVIEW's "TCP Write.vi" implementation that differs between RT and non-RT targets, or is it Windows' TCP stack that behaves differently?

 

Danny

 

0 Kudos
Message 7 of 8
(5,998 Views)

Well, the STM library was programmed and tested mostly on Windows, and your remote side being overloaded is not exactly a normal condition, nor simple to reproduce in an automated test. Also, I'm pretty sure the implementation inside the LabVIEW layer is basically the same on Windows and the other platforms. But the Windows socket library can make use of a LOT more system resources in the form of memory, so it is much harder to run into a situation where the socket library can't accept a modestly sized buffer; apparently it then becomes a case of accepting the whole buffer or nothing at all. On an embedded system you typically do not have gigabytes of memory, so the socket library can't just go and claim a 100 MB block for itself. There might be differences in the actual implementation of the write() function in the TCP/IP socket library, but just as likely it is simply the much larger amount of memory available on a desktop system that causes the socket library to either accept the entire modestly sized data buffer or nothing at all.
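One way to see this resource difference for yourself is to ask the stack how large the send buffer on a freshly created TCP socket is. A C sketch; the same getsockopt() call could in principle be made from LabVIEW via a Call Library Function Node on the raw descriptor:

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/* Query the kernel's default send-buffer size for a socket.
 * On a desktop Linux or Windows box this is typically tens to
 * hundreds of kilobytes (and may auto-tune upward); on a small
 * embedded target it can be far smaller, so a modestly sized STM
 * message overflows it much sooner.  Returns -1 on error. */
int get_send_buffer_size(int fd)
{
    int size = 0;
    socklen_t len = sizeof size;
    if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &size, &len) != 0)
        return -1;
    return size;
}
```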

 

I agree that the true fix in the STM library would be to look at the "bytes written" value and, if it is not 0 in case of a timeout error, close the connection and return a corresponding error to the caller so it can attempt to reconnect. SOCK_SEQPACKET is not a solution, since it is a Linux-specific thing not implemented on other platforms, nor is it a standard connection-oriented interface; it is rather connectionless, like the SOCK_DGRAM type.

Rolf Kalbermatter
My Blog
0 Kudos
Message 8 of 8
(5,986 Views)