cFP disconnected in MAX but responds to UDP broadcasts

Shehzaada · ‎02-23-2016

Hey guys,

I have 4 cFP controllers in the field monitoring a whole bunch of sensors. Each sensor has its own calibration, and so I built an OO hierarchy based on Dynamic Dispatch to handle it. Now there is one piece of generic software and the 4 controllers use it to do their tasks (Acquire and Log). The first three controllers work great and run for days; the fourth dies within 1 hour of running. It's the exact same architecture and code on the controllers.

I have a separate Network thread that broadcasts some Health data of the device (CPU Usage, HDD Space, Uptime) to the network over a UDP port. For the fourth controller, I'll notice after an hour that it's disconnected in MAX, I can't FTP into it, and I can't connect through the LabVIEW project, BUT this Network thread is still alive and broadcasting over the network.
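For readers unfamiliar with this kind of heartbeat, here is a minimal sketch in Python of what such a health broadcast could look like. It is not the LabVIEW code running on the controllers: the port number, the JSON layout, and every function and field name are hypothetical and chosen only for illustration.

```python
import json
import socket

HEALTH_PORT = 50000  # hypothetical port, not taken from the thread


def build_health_payload(device, cpu_pct, disk_free_mb, uptime_s):
    """Pack the health values into one UDP-sized JSON datagram."""
    record = {
        "device": device,
        "cpu_pct": cpu_pct,
        "disk_free_mb": disk_free_mb,
        "uptime_s": uptime_s,
    }
    return json.dumps(record).encode("utf-8")


def parse_health_payload(data):
    """Inverse of build_health_payload, for the receiving side."""
    return json.loads(data.decode("utf-8"))


def broadcast_health(payload, port=HEALTH_PORT, address="255.255.255.255"):
    """Send one datagram to the broadcast address (fire and forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # Broadcasting must be enabled explicitly on the socket.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(payload, (address, port))
```

Because UDP is connectionless, the sender needs no listening socket and no handshake; it only has to be able to push a datagram out each cycle.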

Can anyone answer the following two questions?

1) In regards to the broadcast signal being active - why is that exactly? I can't get into anything else through MAX or the FTP Server...
2) In regards to the crashing of the fourth controller -> The fourth controller has some complex math where it tries to determine the root of a polynomial. The NI Toolkit NI_AALPro is used for this, and is pretty heavy. It even has a DLL call. I have a suspicion that these VIs are causing the RT Controller to crash. I'll double check tomorrow by disabling these within the code, but has anyone done this type of thing before and run into issues?

Thanks!

ahillinaustin · ‎02-24-2016

Hi Shehzaada,

Thanks for posting on the NI Discussion Forums for help with your cFP crashes. For your first question, it's definitely odd that you can receive UDP broadcasts but not TCP data through MAX. Does that data change? Or is it possible the cFP isn't broadcasting anything new, and just keeps sending out the last value it had? Do the other 3 controllers still send out UDP broadcasts as well?

As far as the 4th controller crashing, is the AALPro toolkit usage (the complex math) the only difference in the code? Do you have any error handling enabled on the VIs that are deployed?

Lastly, what version of FieldPoint and LVRT are you running on the devices?

Austin H.
Applications Engineering
National Instruments

Shehzaada · ‎02-24-2016

Hi Austin!

Many thanks for responding to my questions.

1) Yes, the data changes. I have two Health parameters that include Device Team and Uptime. Both update on the supposedly disconnected controller even though I can't FTP in and I see it disconnected in MAX.

2) The four controllers are identical in terms of the software deployed on them. I have a polymorphic setup to facilitate this. The fourth controller is the only one that uses the AALPro toolkit by virtue of having some channels that need it. So to answer your question, this would really be the only 'difference'. The architecture is the same, but due to dynamic dispatch they do not execute identically (it depends on the channels that are assigned to the controller).

3) FieldPoint Version 13.1.0, LabVIEW Real Time 13.0.1

Tomorrow I'm going to disable those channels and leave the application running to see when/if it crashes. I'll also try at some point to leave it running in dev mode to see if it crashes.

Shehzaada · ‎02-25-2016

Also I have another loop that simply blinks the Status light of the cFP at 1 second intervals....that's also working yet the cFP is disconnected in MAX

ahillinaustin · ‎02-26-2016

Hi Shehzaada,

Based on the other symptoms, it looks like the reason the cFP is failing to communicate is that AAL_Pro code. However, it seems to me that the rest of the RT code is working on that cFP (the UDP broadcasts, the LED blink, etc.). It appears that the code that's 'crashing' the cFP blocks communications on the TCP side for whatever reason.

Did you ever get a chance to test the execution without the advanced math code on the 4th controller?

Austin H.
Applications Engineering
National Instruments

Shehzaada · ‎02-26-2016

Yes Austin, I ran some more tests.

1) I ran the 4th controller code in the development environment with the AAL_Pro code enabled. It crashed. This is good news since we can replicate the problem in source.

2) I ran the 4th controller code with AAL_Pro disabled. It crashed.

3) I ran a simple loop with the AAL_Pro operating on its own on another controller. It's been running for 20 hours now.

I also added another health parameter to each cFP that tells me the amount of space left on the external USB drive attached to it. So I see that the "disconnected" 4th controller is actually physically logging to an external disk. Of course there's no way to get to it since its FTP server is down.

My suspicion now moves to the controller itself. The next set of tests I'll run is:

1) Run the 4th controller's code on the test controller. If it runs fine, then the 4th controller has an issue.
2) To further the suspicion for #1, run a simple program on it and see how it reacts over 10+ hours.

I hope these tests give some conclusion. Unfortunately the 4th controller is in a remote area and I'll have to wait over the weekend to physically reset it.

iCan'tBerrPuns · ‎02-29-2016

Hello Shehzaada,

I am very interested in the results from testing the 4th controller’s code on your test controller. If you are successfully able to run your original code on the test controller for a prolonged period of time, then we can deduce that the issue lies with the controller itself.

Once you have an update on your testing, please let us know!

Gabby
National Instruments Applications Engineer

Shehzaada · ‎02-29-2016

Hi Gabby,

Just tested the code on the test controller and it exhibits the same behavior...it's logging and the status light is lit, however, it's disconnected in MAX.

This means the problem should lie in the code itself. It's polymorphic, so it's the exact same code base. I just override a Calibration VI based on the type of sensor coming in...and the ones that were specific to the 4th controller have been disabled. It still fails!

I'll just have to go into the code and dig manually now....maybe the first step will be to diagram disable the Calibration VI as it is and see what happens.

This thing hasn't seen the end of me yet.

Shehzaada · ‎03-03-2016

Hey guys,

After some extensive experiments, the 4th controller cFP mystery has been solved. It's been running on the newly deployed EXE for 26+ hours, whereas before it would crash within 2 hours. The other cFPs have been running for 6+ days now.

Here's the lowdown on what the problem was. I think this problem has been affecting us since the 4th controller was first deployed. Funnily enough, both the controller and its network connectivity are absolutely fine...

When running NI's root classification VIs, the previous developer dynamically built an array of coefficients every time the VI was run. In this case, the building of the array was done every 5 seconds until....forever.
Since the array is built on the fly (dynamically), the controller has to find memory to allocate to the array each time.
The result is that the controller keeps allocating memory faster than it can release it. This causes the controller to simply run out of memory and crash the system.
The solution is to simply pre-allocate this memory once and just replace the first element with the Data Point coming from the module. Memory allocation is no longer needed, and the controller is steady.
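Since LabVIEW block diagrams can't be pasted as text, here is a rough Python analogy of the fix, offered as a sketch only. The class name, the coefficient values, and the assumption that the data point occupies element 0 of the coefficient array are illustrative; the original VI is not shown in this thread. In LabVIEW terms, the idea corresponds roughly to allocating the array once outside the loop and using Replace Array Subset inside it, instead of Build Array on every iteration.

```python
class CoefficientBuffer:
    """Holds one coefficient array that is allocated a single time."""

    def __init__(self, calibration_coeffs):
        # Allocate once: slot 0 is reserved for the incoming data point,
        # the remaining slots hold the fixed calibration coefficients.
        self._coeffs = [0.0] + list(calibration_coeffs)

    def update(self, data_point):
        # Overwrite in place; no new array is created per acquisition.
        self._coeffs[0] = data_point
        return self._coeffs


def build_each_time(data_point, calibration_coeffs):
    # The problematic pattern: a brand-new array on every call.
    return [data_point] + list(calibration_coeffs)
```

With the buffer, every acquisition cycle reuses the same storage:

```python
buf = CoefficientBuffer([1.0, 2.0])
first = buf.update(5.0)
second = buf.update(7.0)
# first and second are the same list object; its contents are now [7.0, 1.0, 2.0]
```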

Here's the most interesting part:

When the controller runs out of memory, it actually stops all other functions aside from what the source code is doing. So it stops responding to MAX (MAX queries the devices through TCP), and stops its FTP server. This is why an external FTP Service that is used to get data out of the cFP wasn't reliable.
MAX tells us that the cFP is disconnected, which then leads us to believe that the problem lies with the controller itself OR just the connectivity points.
However, the 4th controller continues to respond to UDP Broadcast requests made by the Main Controller, which clearly means that the cFP is alive.
This means that the controller is doing what it's supposed to be doing, but has shut down all other functions.
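If anyone wants to catch this state automatically instead of trusting MAX alone, one option is to combine a TCP probe with the age of the last UDP health message. The Python sketch below is only an illustration: the function names, the state labels, and the 10 second threshold are hypothetical, and probing port 21 assumes the controller's FTP server is on the standard FTP port.

```python
import socket


def tcp_port_open(host, port=21, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def classify(tcp_ok, heartbeat_age_s, max_age_s=10.0):
    """Combine the TCP probe with the age of the last UDP health message.

    heartbeat_age_s is None when no broadcast has ever been received.
    """
    udp_ok = heartbeat_age_s is not None and heartbeat_age_s <= max_age_s
    if tcp_ok and udp_ok:
        return "healthy"
    if udp_ok:
        # The situation in this thread: the application still broadcasts,
        # but MAX and FTP (both TCP) get no answer.
        return "application running, system services down"
    if tcp_ok:
        return "services up, application silent"
    return "unreachable"
```

A result of "application running, system services down" points toward resource exhaustion on the controller rather than a dead controller or a network fault.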

Thanks everyone for their support. I'm glad this thing is done and dusted!
