I think it's quite unrealistic to expect the LabVIEW implementation to beat your embedded C implementation, but I would have thought it would do better than this.
A question about the C implementation: did it also have a thread talking out on a serial port?
There are a couple of things you could try to make the LabVIEW implementation slightly more efficient (I've got no idea, though, how much difference it will make). First, in the implementation with timed loops, configure all timed loops to NOT keep their original phase. When the program starts, all loops obviously request processing time at the same moment, which means there is arbitration overhead. By configuring the loops not to keep their original phase, this arbitration happens once; subsequently the loops stick to their new phase so that no further arbitration is necessary.
In your while loop implementation, you could also try putting a timed sequence structure around each while loop. This forces the while loop into a new thread (you won't really care about the timing of the sequence structure). This is also something you could do in the timed loop implementation, wrapped around the VISA loop.
I never expected, nor expect, LabVIEW to match the efficiency of C. I'm happy to pay a "performance premium" as a trade-off for easier programming.
As detailed in https://forums.ni.com/t5/LabVIEW-Embedded/LabVIEW-Embedded-Performance-Testing-Different-Platforms/... and in this thread, there is some interesting performance degradation.
For a "typical" application, LabVIEW on a desktop PC runs at about 90% efficiency [10% loss]. The same application on a cRIO runs at about 30% efficiency [70% loss]. I'm happy with the 90% efficiency, but the 30% efficiency is a bit too much. Also, for very small loops, the efficiency drops to 1% [99% loss!].
When I did my calculations, I took out the "background" CPU load due to the serial port. That is, the serial port's loading effect has been taken into account in the calculations.
I've tried various things to get the loops to run faster. I tried putting each loop in its own VI running under a different execution system. I tried timed loops. I tried everything I could think of. In the end, the figures provided are as good as it gets.
I would also be interested in hearing if anyone with more experience understands this issue. That being said, I am not sure where you are getting your numbers. I have used embedded RTOSs before (non-LabVIEW) and I was NEVER able to get anywhere near a 3.8 MHz loop rate with an RTOS. With bare metal (no OS), I do agree that you can get up into the MHz. But my (admittedly limited) experience with commercial RTOS software is that you cannot get more than tens of kHz, and MAYBE hundreds of kHz if you really push it and swamp the CPU. Unless you use bare-metal cores or true kernel replacements (e.g. Xenomai instead of PREEMPT-RT), I don't think you can get there.
The problem I have encountered in the past is with context switches. You are accounting for one context switch per loop iteration. This could be the case because your loops are short. However, in past projects I have seen loops that execute in tens of microseconds go through hundreds of context switches per iteration, though this depends on processor interrupts and priority configurations. Also, the RT target is not just running your code; it is also running all of NI's communications and processing apps in the background, plus RTOS interrupt handlers and other functions that support RTOS functionality.
I know with the desktop timed loops you can configure the execution priority. Have you tried setting up timed loops in RT with highest priority? I'd be curious to know if that changes anything.
Again, I'd love to hear someone with more experience chime in on this issue. Performance issues can be notoriously tricky... but I see no reason that well-written LabVIEW code should run any slower than the equivalent C code.
The expectation of 3.8 MHz loop rate from a 400 MHz / 760 MIPS processor using an RTOS is for the case of a very simple loop (integer increment, timing check, compare and branch only). Not real world, but it does provide a benchmark that can be trialled on different platforms.
Keeping in mind how very simple the loop is, what sort of performance would you expect from a 400 MHz / 760 MIPS processor using an RTOS? You mentioned never being able to get anywhere near a 3.8 MHz loop rate with an RTOS. Were you executing a very simple loop as per the example, and how many MIPS was your processor?
Ultimately, I would like to run a few "reasonable" sized loops at, say, 2 kHz each. This does not appear to be achievable. (My bare-metal 200 MIPS processor can run multiple reasonable loops at MHz rates. This is not an apples-to-apples comparison, but it does show a possible route to obtaining this speed at very low cost.)
The project I was referring to used a 1 GHz processor with 2400 MIPS. Using a lightweight, stripped-down Linux RTOS (just to be clear, this was an RTOS, not vanilla Linux), we were able to sustain one ~30 kHz loop and several 5 kHz loops before causing excessive CPU usage. Again, this performance depended heavily on how many interrupts were occurring and triggering context switches during loop execution. In contrast, bare-metal testing allowed us to get a single loop running at nearly 3 MHz.
Some RTOSs allow you to configure loops to run with nearly bare-metal performance, in which case I agree you can get up into the MHz for single-loop rates. My guess is that the cRIOs are doing a good amount of background processing to handle the I/O, peripherals, NI background tasks, FPGA communications, etc., and those asynchronous events are causing context switches. That being said, your loops are very short (only one increment), so I would expect better performance than you are seeing.
Just to add my opinion... I would never even have thought to try running a loop in an RTOS faster than 1 kHz, in which case I would use a Scan Engine-locked timed loop for a control loop without bothering with the FPGA. Faster than 1 kHz, I would write custom FPGA code instead of using the Scan Engine.
What sort of things would you be using such a fast loop for, if you can't get data into and out of your IO that fast?
Yeah, that low efficiency doesn't make a lot of sense. I bet the Real-Time Execution Trace Toolkit would come in handy here to see what is happening. Hopefully the issue will get some attention from NI. I am curious whether the same behavior is displayed on the Linux RT cRIOs. That is open source, so someone in the know could figure it out.
MarkCG: For many years (before we had the ability to write floating-point FPGA control algorithms) we used an RTOS, typically with one loop running at ~20 kHz and several loops running at ~5 kHz, to perform motor control algorithms. This kind of speed is crucial because the signals in question change very rapidly. You'd be surprised how much code we were able to fit in the high-speed loop (even when written in C instead of assembly)! Thanks for pointing out the Real-Time Execution Trace Toolkit. I didn't even know something like that existed for RT debugging.
vitoi: You've definitely identified a big performance gap. Hopefully someone from NI will chime in!