Let me begin by explaining my current setup. I apologize in advance for the length of this post, but I wanted to be as thorough as possible in describing the problem.
We have a data collection system built around an NI-8106 real-time controller housed in an NI-1045 PXI chassis. Two chassis are linked together using NI-8336 MXI interface boards. The system holds a number of NI-4472 DAQ cards, and each chassis contains one NI-6652 timing card. This system is designated the "Server". The Server runs LabVIEW 2010 Real-Time with embedded software that controls the acquisition and transmission of data.
The "Client" system is a Dell desktop running Windows 7 Ultimate and a LabVIEW 2010 program that receives data from the Server, plots it in real time, and saves it to binary files.
Currently the Server-to-Client data transfer is performed over TCP/IP. The Server buffers the data in a queue until it finds a Client wishing to receive it (via a TCP Listener). Only one connection to the Server can be made at a time; all other requests are denied. After establishing a TCP connection, the Client initiates the transfer by writing a request string to the connection. Upon receiving the correct string, the Server responds with the size of the data packet in bytes, then immediately sends the packet itself (binary data flattened to a string). The Client interprets the first transmission as an integer, which it then uses to size the second TCP Read (the data packet).
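For reference, here is roughly what that exchange looks like from the Client's side, sketched in Python rather than LabVIEW just to pin down the wire protocol. The request string, server address, and big-endian size prefix are placeholders standing in for our actual values:

```python
import socket
import struct

SERVER_ADDR = ("192.168.0.10", 55555)  # placeholder Server IP/port
REQUEST = b"SEND"                      # placeholder request string

def read_exact(sock, n):
    """Read exactly n bytes, looping until the full amount arrives."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("server closed the connection")
        buf += chunk
    return buf

def fetch_packet(sock):
    # 1 TCP Write: ask the Server for the next packet
    sock.sendall(REQUEST)
    # 1st TCP Read: 4-byte big-endian packet size (LabVIEW's default byte order)
    (size,) = struct.unpack(">I", read_exact(sock, 4))
    # 2nd TCP Read: the flattened data packet itself
    return read_exact(sock, size)

# usage (not run here):
# with socket.create_connection(SERVER_ADDR, timeout=10) as s:
#     packet = fetch_packet(s)
```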
This data flow has the Client program performing 1 TCP Write and 2 TCP Reads per packet. Originally the Client used a 10-second timeout for all three TCP operations. With this setup we would hit an issue at random intervals, anywhere from 15 minutes to 4-5 hours after the Client began collecting data: the Client would throw a timeout error (Error 56) when the second TCP Read failed to receive the data packet in the allotted time.
Through the use of Wireshark it was determined that immediately prior to the failure an ARP (Address Resolution Protocol) request was issued by the 8106 controller, to map the Client computer's IP address to its MAC address. The Client would immediately respond with the correct MAC, but the 8106 would take longer than 10 seconds to update its ARP table, so the transmission of the data packet would time out. After a number of troubleshooting attempts, we determined that if the timeout on the second TCP Read was removed, data transmission resumed normally approximately 12-14 seconds after the ARP request was issued.
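In case it helps to see the workaround spelled out: rather than removing the timeout entirely, the policy we converged on amounts to treating a first timeout as the suspected ARP stall and retrying the read once with a wider window. A rough Python sketch (the 20-second budget is just our observed 12-14 s stall with some padding; the real implementation is LabVIEW TCP Read logic, not this code):

```python
import socket

def patient_recv(sock, n, normal_timeout=10.0, stall_timeout=20.0):
    """Receive exactly n bytes. Treat the first timeout as the
    suspected ~12-14 s ARP-refresh stall and retry once with a
    longer deadline, instead of failing the way Error 56 did."""
    buf = b""
    retried = False
    sock.settimeout(normal_timeout)
    while len(buf) < n:
        try:
            chunk = sock.recv(n - len(buf))
        except socket.timeout:
            if retried:
                raise  # stall outlasted our budget; give up for real
            retried = True
            sock.settimeout(stall_timeout)
            continue
        if not chunk:
            raise ConnectionError("server closed the connection")
        buf += chunk
    return buf
```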
While this does not cause data loss (the data is queued on the Server side, and once the ARP table is updated the Server can transmit much faster than we collect packets), it does create a problem with user interaction. The Server continuously polls a time server (once a second) and never seems to lose the ARP mapping to it. We have tried multiple NICs in the Client, from the onboard NIC Dell provides on the motherboard to various Intel server-grade NICs; all show the same behavior. However, if an older laptop is used to run the Client program (again with an onboard Intel NIC; the laptop runs Windows 7 Professional), the problem does not arise.
First, does anyone know what is causing the 8106 to lose the ARP mapping to the Client? We do not use a DHCP server; all devices on the LAN are statically addressed, and the IP addresses never change. Second, is there a way to hardcode the ARP table so that this issue does not arise? Finally, could the Client's NIC have any impact on this behavior?
The version of the network stack currently used in PharLap ETS has no public interface for adding permanent (static) ARP table entries. This is still not supported on our PharLap targets; it is a limitation of the actual network stack used in these RT systems.
Regarding the question of whether the Client NIC has any impact on this behavior: I don't see a reason why it would. I think this is more likely related to the application layer of your project.
You could possibly change your client-side code to perform a kind of handshake more often, in order to keep the connection between your Server and Client applications fresh.
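As a sketch of what I mean, something like the loop below could run alongside your normal reads. Two assumptions I want to flag: that the PharLap stack refreshes an ARP entry when it receives traffic from the Client (I have not verified this), and that your Server protocol could be taught to recognize and ignore a made-up no-op message like the `PING` below:

```python
import socket
import threading

PING = b"PING"  # hypothetical no-op message; the Server would have to ignore it

def keepalive_loop(sock, stop_event, period_s=30.0):
    """Periodically send a tiny write so the controller keeps seeing
    traffic from the Client between data requests, in the hope that
    its ARP entry for the Client stays fresh."""
    while not stop_event.wait(period_s):
        try:
            sock.sendall(PING)
        except OSError:  # connection went away; let the main loop handle it
            break

# usage (not run here):
# stop = threading.Event()
# threading.Thread(target=keepalive_loop, args=(sock, stop), daemon=True).start()
```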
Can you post the Wireshark trace of this issue? It sounds pretty odd. The fact that a different client system doesn't show the issue when communicating with the same server is pretty interesting. I can't say I've seen anything like this before.