Actor Framework Discussions


Actor framework slowing down to point of crash after too many messages

I have a cRIO project built on the Actor Framework; the project controls a subsystem of a larger test station. The cRIO receives commands from an external system (a PC running a LabWindows program) via RS485, parses them, and enqueues actor messages. The cRIO runs well through the first couple of iterations of the test program (about 2 hours each), but then its response time to received messages starts to lag noticeably and it slowly falls behind in handling messages until it finally crashes and has to reboot. You can tell it is starting to fall behind after the first iteration of the test program: after the test process has completed on the PC, the cRIO is still responding to enqueued messages that were sent while the program was still running. After about 4 hours (2 tests) the response time has increased from about 0.3 s to 2 s or more, and then it just craps out.

 

The cRIO receives commands to turn different things on and off and is polled about once per second to return some statuses (temperatures, flow rates, ...), and it also monitors a few PID loops on the FPGA side. We run the subsystem on a different station without this problem, though that station doesn't poll the cRIO for status nearly as often. The other station lets the subsystem monitor itself and throw interrupts if it has a problem. The problem station does its own monitoring, which is why it polls so much. Still, is polling once per second really that much? It seems like a pretty slow rate, or at least I think it is. I can duplicate the issue if I set up a dummy program on my computer to just blast the cRIO with commands: they start stacking up and the crash is reproduced. No errors.

 

Some things I have tried to resolve this:

The polling used to run much faster. We reduced the rate, but we are unable to reduce it any further.

 

I set up a queue. When messages are parsed, if a message is a return-status request, I search the queue for that status message. If it is already in the queue, I don't send the actor message to return status. If it isn't in the queue, I add it to the queue and allow the Actor Framework message to be sent; then, when the return status VI runs, the entry is dequeued.

[Images: TheWolfmansBrother_0-1662035515592.png, TheWolfmansBrother_1-1662035632357.png]
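The deduplication scheme described above can be sketched in text form. This is only an illustration in Python, not the LabVIEW code from the screenshots; the class name `PendingStatusFilter`, its methods, and the `"STATUS?"` command string are all hypothetical.

```python
class PendingStatusFilter:
    """Tracks which status requests are already waiting to be handled."""

    def __init__(self):
        # Hypothetical stand-in for the lookup queue described in the post.
        self._pending = set()

    def should_enqueue(self, command: str) -> bool:
        """Return True (and mark the command pending) only if no identical
        request is already waiting; otherwise return False so the caller
        skips sending another actor message."""
        if command in self._pending:
            return False
        self._pending.add(command)
        return True

    def mark_handled(self, command: str) -> None:
        """Call from the status handler once the reply has been sent."""
        self._pending.discard(command)


status_filter = PendingStatusFilter()
first = status_filter.should_enqueue("STATUS?")    # nothing pending yet
second = status_filter.should_enqueue("STATUS?")   # duplicate while pending
status_filter.mark_handled("STATUS?")              # handler has replied
third = status_filter.should_enqueue("STATUS?")    # allowed again
```

A set is used here instead of a searched queue because only membership matters; that is a design choice of this sketch, not something the post specifies.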

 

I set up a new command (actually repurposed an old one). When the test station completes a test run, the station sends the command to the subsystem, and the subsystem responds by calling VISA Flush I/O Buffer. In the same scheme we also clear the I/O buffer on the station computer when a test completes.

[Image: TheWolfmansBrother_2-1662035721823.png]

 

I turned off logging, thinking that recording a log entry for every command might be too much.

 

Here is what the parent class actor core for receiving the messages looks like:

 

[Image: TheWolfmansBrother_3-1662035904551.png]

 

Any ideas or helpful methods to determine what is causing the commands to stack up? 

 

Thanks

 

Message 1 of 6

Override Receive Message and log the message class name to disk.

 

Obviously the extra load will probably cause the problem sooner, but you'll have a file you can check that tells you all the messages received.

 

I'd expect that one message is appearing much more frequently than you expect, perhaps relating to the PID loops. As an illustration (not exactly the same problem, but similar): I had an FPGA state machine that sent messages to the cRIO via FIFO in each state. It worked fine until it entered the idle state. Because it was on the FPGA I hadn't added a wait, so the idle state sent an update every 25 ns.
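As a rough text-form illustration of the logging idea (the real thing would be a Receive Message override in LabVIEW), the sketch below writes each received message's class name to a log and keeps a running count. It is written in Python; `MessageTally`, the message classes, and the in-memory log are all assumptions made for the example.

```python
import io
from collections import Counter


class Message:
    """Hypothetical base class standing in for an Actor Framework message."""


class ReturnStatusMsg(Message):
    pass


class SetOutputMsg(Message):
    pass


class MessageTally:
    """Logs the class name of every received message and counts occurrences."""

    def __init__(self, stream):
        self._stream = stream      # any object with a write() method
        self.counts = Counter()

    def record(self, msg) -> str:
        name = type(msg).__name__
        self.counts[name] += 1
        self._stream.write(name + "\n")   # one line per received message
        return name


log = io.StringIO()  # stands in for a file opened on the target
tally = MessageTally(log)
for msg in [ReturnStatusMsg(), ReturnStatusMsg(), SetOutputMsg(), ReturnStatusMsg()]:
    tally.record(msg)

# The most frequent message class comes first in most_common().
most_frequent = tally.counts.most_common(1)
```

Sorting the counts this way makes an unexpectedly chatty message class stand out without reading the whole log.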


GCentral
Message 2 of 6

Which class should I set up the override in? I have the class that reads the serial port, and then its child class parses and sends messages to the other actors to do what they need to.

 

Also, I had been logging every byte that came in on the serial port, the bytes parsed into commands, the sent commands (AF messages), and finally the returned statuses sent out on the port. I didn't see any sent or received messages occurring at a higher than expected frequency. Or will this record the AF messages?

Message 3 of 6

@TheWolfmansBrother wrote:

Which class should I set up the override in? I have the class that reads the serial port, and then its child class parses and sends messages to the other actors to do what they need to.

 

Also, I had been logging every byte that came in on the serial port, the bytes parsed into commands, the sent commands (AF messages), and finally the returned statuses sent out on the port. I didn't see any sent or received messages occurring at a higher than expected frequency. Or will this record the AF messages?


Which class is receiving messages? I would start wherever I thought there were lots of messages coming in, but if that didn't show anything, you can copy-paste the contents of the override (except the class in/out) into other classes' overrides and just reconnect. The file logging is more awkward in that case; it's tempting to use an FGV that logs the string to a file and opens the file on first call. (Obviously this is terrible coupling and shouldn't be kept in general, but it is the "quick and dirty" way to find out what's going on.)

 

I see the "LOG" VI in your images - so I guess if the frequency you're seeing there is as expected (and you don't believe that that logging is causing the issue) then you could exclude those as concerning issues. 

 

I see the log error message as a possible cause, but I'm not sure what it would take for that to spam you. There's the 10 ms delay on the timeout case, and events aren't usually triggered by errors, so the likely repeat offender would be a failure to Enqueue Message. But that would prevent a message overload, because it would be failing to enqueue!


GCentral
Message 4 of 6

So if your messages are getting stacked up, you're receiving new messages faster than you can respond to them. This implies you're either sending way more messages than you think you are (cbutcher's suggestion) or you're not processing messages as fast as you think you are.
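The arithmetic behind that statement can be made concrete with a small sketch. It is written in Python purely for illustration, and the rates in the example are made-up numbers, not measurements from this system. It assumes a constant arrival rate and a constant handling time per message.

```python
def backlog_after(arrival_rate_hz: float, service_time_s: float, duration_s: float) -> float:
    """Estimate how many messages are waiting after duration_s seconds.

    arrival_rate_hz: messages enqueued per second (assumed constant)
    service_time_s:  seconds needed to handle one message (assumed constant)
    """
    service_rate_hz = 1.0 / service_time_s
    # The queue only grows when messages arrive faster than they are handled.
    growth_per_second = max(0.0, arrival_rate_hz - service_rate_hz)
    return growth_per_second * duration_s


def wait_for_new_message(backlog: float, service_time_s: float) -> float:
    """Delay seen by a message that joins behind `backlog` waiting messages."""
    return backlog * service_time_s


# Hypothetical numbers: handling keeps up at 1 msg/s with 0.5 s per message...
keeps_up = backlog_after(1.0, 0.5, 100.0)
# ...but falls behind at 2 msg/s when each message takes 1 s to handle.
falls_behind = backlog_after(2.0, 1.0, 100.0)
```

Under these assumptions the backlog is either zero or grows linearly with time, which is why the added delay keeps climbing for as long as the test runs.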

 

To detect the first, use cbutcher's method to see how many messages come in. You could also use the Real Time Execution Trace toolkit (I think).

 

The second one will be tricky. I can start by saying you should probably stop using Bytes at Port and switch to a termchar-based system instead. I don't know exactly how the root loop works on RT targets, but I do know they're generally more resource constrained than desktops, and that property node can interrupt a lot of things. You may be constantly preempting your main servicing thread by polling the serial port. Additionally, you always read if there are ANY characters, meaning you don't know if you're getting a full message or not. You also send that message to Actor Core which has to do some processing on it, even if the message isn't fully complete. Take a look at this video from crossrulz: https://www.youtube.com/watch?v=J8zw0sS6i1c
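To show what termination-character framing buys you, here is a minimal software sketch in Python. In LabVIEW the VISA read does this for you when the termination character is enabled, so this is only an analogy; the `LineFramer` class, the `\n` terminator, and the command strings are assumptions for the example.

```python
from typing import List


class LineFramer:
    """Accumulates raw serial chunks and releases only complete messages."""

    def __init__(self, terminator: bytes = b"\n"):
        self._terminator = terminator
        self._buffer = bytearray()

    def feed(self, chunk: bytes) -> List[bytes]:
        """Add newly read bytes; return every complete message found so far.

        Bytes after the last terminator stay buffered until the rest arrives,
        so the parser never sees a partial command.
        """
        self._buffer.extend(chunk)
        frames = []
        while True:
            idx = self._buffer.find(self._terminator)
            if idx < 0:
                break
            frames.append(bytes(self._buffer[:idx]))
            del self._buffer[: idx + len(self._terminator)]
        return frames


framer = LineFramer(b"\n")
partial = framer.feed(b"STAT")                # incomplete: nothing released
complete = framer.feed(b"US?\nON 3\nOF")      # two full messages, one partial
remainder = framer.feed(b"F 3\n")             # the partial one is completed
```

With this kind of framing, downstream code is invoked once per complete command instead of once per arbitrary chunk of bytes.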

 

You can also try changing the timeout from 10 ms to 20 ms and see if that changes anything. Alternatively, change it to 5 ms and see if the crash happens faster.

 

Other than that, you'd need to look at your messages and see which ones are taking a long time to execute. Maybe you're not getting much data from the serial port and that's not the problem. In that case you could add some timing benchmarks to see how long each message takes to handle, and add that to your log. I bet something will jump out as taking longer than you think (or happening more frequently than you think).
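One way to picture the timing benchmark is a wrapper that records count, total time, and worst-case time per message type. The sketch below is in Python for illustration only; `HandlerTimer`, the handler names, and the injectable clock are assumptions, and in LabVIEW the same idea would be a pair of tick-count reads around the message's Do method.

```python
import time
from collections import defaultdict


class HandlerTimer:
    """Collects per-message-type timing: [count, total seconds, max seconds]."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock   # injectable so the timer can be tested
        self.stats = defaultdict(lambda: [0, 0.0, 0.0])

    def timed(self, name, handler, *args):
        """Run handler(*args), record how long it took under `name`,
        and return the handler's result."""
        start = self._clock()
        try:
            return handler(*args)
        finally:
            elapsed = self._clock() - start
            entry = self.stats[name]
            entry[0] += 1
            entry[1] += elapsed
            entry[2] = max(entry[2], elapsed)

    def slowest(self):
        """Message types sorted by worst-case handling time, slowest first."""
        return sorted(self.stats.items(), key=lambda kv: kv[1][2], reverse=True)


timer = HandlerTimer()
result = timer.timed("ReturnStatus", lambda: "ok")   # hypothetical handler
```

Tracking the maximum as well as the total helps distinguish a handler that is always a little slow from one that occasionally stalls.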

Message 5 of 6

I'd also recommend removing the queue that deduplicates messages. It sounds like you only need it to work around the messages getting backed up. Once you fix the underlying issue, you won't need to deduplicate. And in the meantime that extra complexity might make this harder to debug.

Message 6 of 6