LabVIEW


Sub millisecond timing revisited

14 years late and incorrect exit.

 

Apparently there is a new Wait routine on the latest timing palette that uses "Wait at Least" (precision) clock cycles rather than the "Wait up to" clock cycles that Wait+ ms.vi observes.

 

Is NI serious about trying to make EVERY OS act deterministic?  Or, were the pizza toppings NI R&D ate psychotropic?

 

Yes, I published code in a Community Nugget over a decade ago to Wait in an average sub millisecond.   YES, that code specifically duplicated the "Wait up to" behavior Wait msec has on any non RTOS.

YES,  Wait, behavior is different on REAL TIME  targets.

 

SO, I REALLY wish that I had participated in the beta... HELL, the BD comments show the specific intention to just give a finger to Wait non RT BEHAVIOR!  I think that this was so inadvisable as to rise to a BUG fix that needs a deprecation VI.  Similar to the biased Riffle.vi.


"Should be" isn't "Is" -Jay
0 Kudos
Message 1 of 34
(893 Views)

@JÞB wrote:

14 years late and incorrect exit.

 

Apparently there is a new Wait routine on the latest timing palette that uses "Wait at Least" (precision) clock cycles rather than the "Wait up to" clock cycles that Wait+ ms.vi observes.

 

Is NI serious about trying to make EVERY OS act deterministic?  Or, were the pizza toppings NI R&D ate psychotropic?


Oh yes, I fully agree. Indeed, NI should stop increasing the timer resolution from the default 64 Hz to 1 kHz in an attempt to make it more deterministic. Increasing the timer resolution improves granularity, not guaranteed timing precision, as thread scheduling latency and system load still apply.

 

Let me demystify (or re-mystify) LabVIEW timing functions a bit from the perspective of the Windows OS. This has turned into a long write-up, so please read it carefully rather than skimming it diagonally, since many details matter.

 

Disclaimer: This is not a guide on “how to achieve a 50 µs delay” in LabVIEW. Instead, it’s an explanation of how LabVIEW timing actually works internally from the operating system’s point of view.

 

Timers in LabVIEW

 

LabVIEW is, by design, a high‑level programming language. If you need access to truly low‑level system behavior, you must use a language that exposes these mechanisms directly—typically C/C++ (using NI CVI, MSVC, gcc, etc.) or assembly. (I would personally recommend the excellent EuroAssembler, although NASM or FASM are also good choices.)

 

In the examples below, I will use Rust 1.96.0, as it has become a popular modern systems programming language. However, a basic understanding of assembly is still required.

 

To understand what LabVIEW is really doing under the hood, you need to be able to build your own DLL and call it from LabVIEW, and you should be comfortable working with the WinAPI and the Windows Debugger (WinDbg or x64dbg). Without this background, LabVIEW remains something of a “black box”: you can experiment with Wait (ms) or high‑resolution timing as much as you like and analyze the results and response, but the underlying mechanisms remain hidden. That said, nobody is born with this knowledge — it’s simply part of the learning curve when moving from high‑level LabVIEW abstractions to low‑level Windows internals.

 

There are three (practically speaking, two) timing sources available in Windows, listed here from highest to lowest resolution:

 

CPU Time‑Stamp Counter (TSC) — this runs at several GHz, depending on your CPU clock (for example, 2.7 GHz (~0.4 ns) on my PC equipped with ancient i7-4800MQ CPU, or 3.1 GHz/~0.3 ns on modern Xeon w5-2445). You can read it using rdtsc() or rdtscp(), but not from LabVIEW directly.

 

High‑Resolution Performance Counter (QPC) — typically around 10 MHz/100 ns. You must query its actual frequency using QueryPerformanceFrequency(), because it can vary, then read with QueryPerformanceCounter(). Internally, this counter is usually derived from the TSC above, just normalized to a lower frequency (in my case, divided by 269...270 on i7-4800MQ). In LabVIEW, this is used by VIs High Resolution Relative Seconds.vi and High Resolution Polling Wait.vi.

 

Windows System Timer (Scheduler Ticks) — this is the low‑resolution system timer used by functions such as Sleep() and GetTickCount()/GetTickCount64(). Its default frequency is 64 Hz, but it can be increased up to 1 kHz/1 ms (or sometimes 2 kHz/0.5 ms). In LabVIEW, this timing source is used by Tick Count (ms), Wait (ms) and Wait Until Next ms Multiple primitives.
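To put the three sources side by side, here is a tiny helper that converts a counter frequency into the length of one tick. It is only a sketch for illustration; the function name tick_period_ns is my own and not part of any LabVIEW or Windows API.

```rust
/// Length of one tick, in nanoseconds, for a counter running at `freq_hz`.
/// Hypothetical helper, used only to compare the timing sources listed above.
pub fn tick_period_ns(freq_hz: f64) -> f64 {
    1.0e9 / freq_hz
}

/// The frequencies quoted in the list above, as (label, Hz) pairs.
pub const TIMING_SOURCES: [(&str, f64); 4] = [
    ("TSC on an i7-4800MQ", 2.7e9),
    ("QPC (typical)", 10.0e6),
    ("System timer, raised to 1 kHz", 1.0e3),
    ("System timer, default", 64.0),
];
```

Iterating over TIMING_SOURCES and calling tick_period_ns gives roughly 0.37 ns, 100 ns, 1 ms and 15.625 ms per tick, which are the numbers used throughout the rest of this post.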

 

It's all about these functions:

image-20260602070059484.png

Let's go.

 

System Timer (Low Res Scheduler Ticks)

 

The low-resolution timer is the “slowest” one. It is used by the WinAPI function Sleep() as well as by Wait (ms) in LabVIEW. The default timer resolution for this and related wait functions on Windows is 15.625 ms (64 clock interrupts per second).

This means that when you call Sleep(1), the actual delay will typically fall somewhere between about 1 ms and 16.625 ms, depending on where the call occurs relative to the system timer tick. This is the same mechanism used when you call Wait (ms) in LabVIEW (I will skip the assembly details — just trust me that it ultimately maps to Sleep()).

 

Before explaining how it is possible to achieve a resolution better than 15.625 ms (as LabVIEW does), it is important to understand how Sleep() works internally.

 

An operating system runs many threads simultaneously, each in a different state. A thread in the Running state is actively executing instructions on a CPU core, with the instruction pointer (RIP) advancing as the CPU processes its code and associated data, still fundamentally following the well-known von Neumann architecture. When a thread calls Sleep(<ms>) (or when Wait (ms) is used in LabVIEW), the Windows kernel transitions the thread into the Waiting state. At that point, the kernel saves the thread’s execution context (registers, state, and RIP) into kernel-managed structures, primarily the kernel thread object, alongside the TEB (Thread Environment Block), which is a user-mode structure. A timer object is then inserted into the kernel’s timer queue with the requested wake-up time, and the thread is removed from the scheduler’s runnable set. These thread objects and their states can be inspected using tools such as WinDbg (for details, see Windows Internals, Part 1, Chapter 4, Threads, pp. 193-300).

 

While the thread is sleeping, it does not consume any CPU time and is effectively suspended in a “stasis” state until its timer expires. When the timer expires, the kernel transitions the thread into the Ready state. The thread becomes eligible for execution again, but it does not run immediately! Instead, the scheduler selects it for execution based on priority, CPU availability, and overall system load.

 

An important implication is that a sleeping thread will never wake up earlier than requested, but it may wake up later due to scheduler latency. This is why Sleep() provides a minimum delay rather than a precise one.

 

This behavior is described in the CPU Analysis article, particularly in the simplified Thread State Transitions diagram:

image-20260602094744137.png

Legend:

  1. A thread in the Running state initiates a transition to the Waiting state by calling Sleep(> 0).

  2. A running thread or a kernel operation readies a thread in the Waiting state (for example, via SetEvent or timer expiration). If a processor is idle, or if the readied thread has a higher priority than a currently running thread, the readied thread can transition directly to the Running state. Otherwise, it is placed into the Ready state.

  3. A thread in the Ready state is scheduled for execution by the dispatcher when a running thread blocks, yields (via Sleep(0)), or reaches the end of its quantum.

  4. A thread in the Running state is preempted and moved to the Ready state by the dispatcher when a higher‑priority thread becomes runnable, when it yields (via Sleep(0)), or when its time quantum expires.

This also explains the difference between Sleep(1) and Sleep(0) (or passing zero to Wait (ms)). A value of zero causes the thread to relinquish the remainder of its time slice to any other thread that is ready to run. If no other threads are ready, the function returns immediately and the thread continues execution — this is why you typically do not observe any visible effect of Sleep(0) on a lightly loaded system. The concept of a “quantum” will be explained later.
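The four numbered transitions in the legend can be written down as a small state machine. This is only a model of the simplified diagram, not of the real Windows dispatcher (which has more states); the type and function names are made up for this sketch.

```rust
/// The three thread states from the simplified diagram.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ThreadState {
    Running,
    Ready,
    Waiting,
}

/// Events that move a thread between states (numbers match the legend).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    /// 1. Sleep(ms) with ms > 0.
    SleepNonZero,
    /// 2. Timer expiry or SetEvent. `can_run_now` is true when a processor is
    ///    idle or the readied thread outranks a currently running one.
    Readied { can_run_now: bool },
    /// 3. The dispatcher selects this thread for execution.
    Dispatched,
    /// 4. Preemption, end of quantum, or Sleep(0) while another thread is ready.
    PreemptedOrYielded,
}

/// Returns the next state, or None if the event does not apply in that state.
pub fn next_state(state: ThreadState, event: Event) -> Option<ThreadState> {
    use ThreadState::*;
    match (state, event) {
        (Running, Event::SleepNonZero) => Some(Waiting),
        (Waiting, Event::Readied { can_run_now: true }) => Some(Running),
        (Waiting, Event::Readied { can_run_now: false }) => Some(Ready),
        (Ready, Event::Dispatched) => Some(Running),
        (Running, Event::PreemptedOrYielded) => Some(Ready),
        _ => None,
    }
}
```

Note that a timer expiry on a busy machine leads to Ready, not Running, which is exactly where the extra wake-up latency comes from.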

 

Also, wake‑up behavior is not tied strictly to a fixed interval such as exactly 15.625 ms. In practice, it depends more on system load and scheduler behavior than on a precise tick boundary.

 

Now we are ready for a trivial experiment:

use std::time::Instant;
use windows::Win32::System::Threading::Sleep;

fn main() {
    for i in 0..10 {
        let start = Instant::now();
        unsafe { Sleep(1) }; // Wait next Tick 
        let elapsed = start.elapsed();

        println!(
            "Iteration {i}: Slept for ~{:.2} ms",
            elapsed.as_micros() as f64 / 1000.0
        );
    }
}

If you lost interest and stepped out when you saw Rust code, then everything below is probably not for you.

 

And the output is:

Iteration 0: Slept for ~6.09 ms
Iteration 1: Slept for ~14.54 ms
Iteration 2: Slept for ~14.36 ms
Iteration 3: Slept for ~15.29 ms
Iteration 4: Slept for ~14.81 ms
Iteration 5: Slept for ~15.21 ms
Iteration 6: Slept for ~14.84 ms
Iteration 7: Slept for ~14.63 ms
...

As you can see, the first call to Sleep(1) takes about 6 ms (it could be anything between roughly 1 and 16 ms), while each subsequent call is around 15 ms. This happens because we become more or less “synchronized” with the system timer ticks and end up consistently waiting for the next one:

|----------X=====>|-X==============>|-X==============>|
     start ^        ^ next start    ^ tick 
 (9.5 ms)+(6.1 ms)|     15.6 ms     |     15.6 ms     |
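The arithmetic behind this diagram can be reproduced with a deliberately simplified model: the thread wakes on the first timer tick at or after "now + requested time". Scheduler latency is ignored, and the function name is hypothetical.

```rust
/// Modelled duration of Sleep(requested_ms), in milliseconds.
///
/// `offset_ms` is how long ago the last timer tick occurred when Sleep() is
/// called, and `tick_ms` is the timer period (15.625 ms by default).
/// Assumption: the wake-up happens exactly on a tick boundary.
pub fn modelled_sleep_ms(offset_ms: f64, requested_ms: f64, tick_ms: f64) -> f64 {
    let earliest_wake = offset_ms + requested_ms;
    // Round up to the next tick boundary, counted from the last tick.
    let wake_tick = (earliest_wake / tick_ms).ceil() * tick_ms;
    wake_tick - offset_ms
}
```

With offset_ms = 9.5, requested_ms = 1.0 and tick_ms = 15.625 the model returns 6.125 ms, which is the first interval in the diagram; a call made right after a tick gives close to the full 15.625 ms.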

On each iteration, the thread is moved into the Waiting state, and a timer is added to the kernel’s timer queue. As described above, internally, the OS kernel maintains this timer queue, and its contents can be inspected during kernel debugging with WinDbg using the !timer command:

image-20260530093907999.png

So, for example, if I ask LabVIEW to sleep for 7 days:

image-20260530094928373.png

we will see a corresponding entry in the kernel timer list scheduled for execution next week, on June 6:

Wakeable timers:KTIMER2s:
Address,                Due time,            Exp. Type Attributes, 
663cb747 000005c4 [ 6/ 6/2026 09:46:26.992]  thread ffffa48f68049080 

When the wake‑up time arrives, the kernel selects the most suitable CPU core on which to resume execution. A thread may have a preferred processor, but if that core is busy, the scheduler may try the last core on which the thread executed; if that is also unavailable, it will choose another idle core. For details, refer to the book mentioned earlier (Windows Internals), where this behavior is described in depth.

 

Now, back to the question: “Is NI really trying to make a general‑purpose OS behave deterministically?” In other words, how is it possible to achieve timing resolution better than 15.625 ms?

 

The answer is that the system timer resolution can be explicitly increased up to 1 kHz (i.e., 1 ms) using the timeGetDevCaps, timeBeginPeriod and timeEndPeriod API calls. This is exactly what LabVIEW (or the LabVIEW Run-Time Engine, in the case of built applications) does during initialization at startup.

 

Take note: Microsoft fundamentally changed how the Windows scheduler handles timer resolution requests starting with Windows 10, version 2004. Before that, increasing the timer resolution affected the entire system globally. Since version 2004, the behavior has become effectively per-process, meaning that one application’s timer resolution request does not necessarily impact others in the same way. This also explains why, in the earlier Rust example, you still observed ~15 ms delays: that process did not request a higher timer resolution itself and is not affected by a running LabVIEW.
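To see the per-process effect yourself, the Sleep(1) experiment can be repeated with and without a timeBeginPeriod request. The sketch below declares the three WinAPI functions directly instead of using the windows crate; the guard type and helper names are my own, and on non-Windows targets the resolution request is a no-op and the sleep falls back to std::thread::sleep so the code still builds.

```rust
use std::time::{Duration, Instant};

#[cfg(windows)]
#[allow(non_snake_case)]
mod sys {
    #[link(name = "winmm")]
    unsafe extern "system" {
        pub fn timeBeginPeriod(period_ms: u32) -> u32;
        pub fn timeEndPeriod(period_ms: u32) -> u32;
    }
    #[link(name = "kernel32")]
    unsafe extern "system" {
        pub fn Sleep(milliseconds: u32);
    }
}

#[cfg(windows)]
fn begin_period(ms: u32) { unsafe { sys::timeBeginPeriod(ms); } }
#[cfg(windows)]
fn end_period(ms: u32) { unsafe { sys::timeEndPeriod(ms); } }
#[cfg(windows)]
fn sleep_ms(ms: u32) { unsafe { sys::Sleep(ms) } }

// Fallbacks so the sketch compiles elsewhere; they do not change any timer.
#[cfg(not(windows))]
fn begin_period(_ms: u32) {}
#[cfg(not(windows))]
fn end_period(_ms: u32) {}
#[cfg(not(windows))]
fn sleep_ms(ms: u32) { std::thread::sleep(Duration::from_millis(ms as u64)) }

/// Holds a timer resolution request and releases it when dropped.
pub struct TimerResolution { period_ms: u32 }

impl TimerResolution {
    pub fn request(period_ms: u32) -> Self {
        begin_period(period_ms);
        TimerResolution { period_ms }
    }
}

impl Drop for TimerResolution {
    fn drop(&mut self) {
        // Every timeBeginPeriod call must be matched by timeEndPeriod.
        end_period(self.period_ms);
    }
}

/// Measures `iterations` sleeps, optionally with a raised timer resolution.
pub fn measure_sleeps(iterations: usize, sleep_for_ms: u32, resolution_ms: Option<u32>) -> Vec<Duration> {
    let _guard = resolution_ms.map(TimerResolution::request);
    (0..iterations)
        .map(|_| {
            let start = Instant::now();
            sleep_ms(sleep_for_ms);
            start.elapsed()
        })
        .collect()
}
```

Calling measure_sleeps(10, 1, None) should reproduce the ~15 ms pattern on Windows, while measure_sleeps(10, 1, Some(1)) should land in the 1 to 2 ms range, assuming nothing else in the same process has already raised the resolution.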

 

You can determine which application has changed the default timer resolution using the powercfg command from an elevated prompt. Simply start LabVIEW and keep it running, then open a Command Prompt as Administrator and execute:

powercfg -energy -duration 1 -output "C:\energy-report.html"

You will see something like this (apologies for the German screenshot):

image-20260531083035155.png

Then open the generated energy-report.html, where you will find the relevant information (sorry for the German screenshot again):

image-20260523071946806.png

You may notice that the value “10000” is shown for a 1 ms resolution. This is because Windows internally represents timer resolution in units of 100 ns. Therefore, 1 ms corresponds to 10,000 intervals of 100 ns. In practice, 1 ms is the finest resolution you can request via standard APIs.
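If you need to translate between the report values and milliseconds, the conversion is a plain division by 10,000. The helper names below are hypothetical; only the 100 ns unit comes from Windows.

```rust
/// Converts a timer resolution given in 100 ns units into milliseconds.
pub fn units_100ns_to_ms(units: u64) -> f64 {
    units as f64 / 10_000.0
}

/// Converts milliseconds into 100 ns units, rounded to the nearest unit.
pub fn ms_to_units_100ns(ms: f64) -> u64 {
    (ms * 10_000.0).round() as u64
}
```

So the 10000 from the report is 1 ms, and the default 15.625 ms tick corresponds to 156250 units.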

 

By the way, LabVIEW.exe is not the only NI application that increases the system timer resolution. Two others are tagsrv.exe (the Shared Variable Engine service) and lktsrv.exe (likely related to NI Lookout Time Sync). In these cases, a resolution of 10 ms is requested, which corresponds to a value of 100,000 (again in 100 ns units):

image-20260531093843286.png

And all three yellow warnings shown above originate from NI applications and services. They are marked as warnings because increasing the timer resolution means the system is no longer operating in its most power‑efficient state and may consume more power — which it indeed does. This raises a valid concern about how aggressively NI attempts to make a general‑purpose operating system behave more “deterministically.” Ideally, this behavior should be optional. Most other applications on the system work perfectly fine with the default ~15 ms resolution, but LabVIEW explicitly requests a higher one.

If you scroll further down in the report, you will see an explanation (now translated into English):

Information  
Platform Timer Resolution  
The default platform timer resolution is 15.625 ms (15,625,000 ns) and should always be used when the system is idle. When the timer resolution is increased, processor power‑management technologies may no longer be effective. The increased timer resolution may be caused by multimedia playback or graphics animations.

Current timer resolution (in 100‑ns units): 156250

Now, with the known 1 ms timer resolution and the logic behind it, the experiment that almost every LabVIEW developer has performed can be explained. When sleep time is measured using High Resolution Relative Seconds, this is essentially equivalent to the Rust example shown earlier:

1ms.png

You observe values ranging from 1 to 2 ms. This is a fully expected result. On each iteration, execution is suspended and scheduled for the next available timer tick, which is 1 ms. Therefore, the delay can extend up to two milliseconds, and occasionally even slightly longer because of other threads. This behavior is by design. One millisecond is actually a very long time for a modern CPU — it corresponds to over three million cycles on a 3.1 GHz processor.

 

Hopefully, this clarifies how Wait (ms) works.

 

However, there are two additional points worth explaining for a better understanding.

 

The first is what happens when the application is minimized on Windows 11. This is a common question — why does the behavior change when the application window is minimized? As shown in the example below, you begin to see delays around 16 ms:

sleep-min.gif

This occurs because, by default, Windows 11 reduces the timer resolution for minimized or background applications back to the default ~15.6 ms, and this behavior can be directly observed in LabVIEW. Windows applies this limitation, also known as "throttling", to conserve system resources and improve power efficiency.

 

It is also possible to keep the timer resolution at 1 ms even when the application is minimized by disabling throttling. There are a couple of ways to do this on a per‑application basis.

 

One option is to use the following command:

powercfg /powerthrottling disable /path "C:\Program Files\National Instruments\LabVIEW 2026\LabVIEW.exe"

(Adjust the path to match your LabVIEW installation or built application.)

You can verify whether throttling is disabled by powercfg with the following command:

powercfg /powerthrottling LIST

This will produce output similar to:

Battery Usage Settings By App
=============================

Application: C:\Program Files\National Instruments\LabVIEW 2026\LabVIEW.exe
        Never On

To revert the setting back, you can run:

powercfg /powerthrottling reset /path "C:\Program Files\National Instruments\LabVIEW 2026\LabVIEW.exe"

Alternatively, you can use the WinAPI function SetProcessInformation() to disable throttling programmatically from the app:

no_throttle.png

With throttling disabled, a minimized application will continue to use the 1 ms timer resolution, and you will no longer observe the characteristic 15–16 ms spikes.

If you do not observe any 15–16 ms behavior even without these changes, it is possible that the setting has been applied system-wide. This can be configured through power plan settings or, maybe, via the Windows registry (I don't remember exactly where).

 

The second point is that there is a “half‑official” way to increase the timer resolution to about 500 µs and make the OS behave somewhat more deterministically than offered by NI.

This can be done using the NtSetTimerResolution() API function. The supported resolution range can be queried with NtQueryTimerResolution(), which returns values in units of 100 ns:

0,5.png

You can observe a clear improvement: the upper bound decreases from around 2 ms to approximately 1.5 ms. This is essentially the best accuracy you can achieve using Wait (ms). You still cannot request a 0.5 ms delay directly, but the accuracy of a 1 ms delay is improved.

 

The same applies to Wait Until Next ms Multiple — it is effectively the same as Wait (ms), but with an additional step where the current tick count (via GetTickCount()) is read and the sleep duration is dynamically adjusted to align with the next millisecond boundary. There is no fundamentally different mechanism behind it.
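The adjustment step boils down to one modulo operation. The sketch below is not NI's code; in particular, what happens when the tick count already sits exactly on a multiple is an assumption on my side (here: wait a full period).

```rust
/// Milliseconds to sleep so that the wake-up lands on the next multiple of
/// `multiple_ms`, given the current tick count in milliseconds.
///
/// Assumptions: a tick count exactly on a multiple waits one full period,
/// and a multiple of 0 means "do not wait". Tick counter wrap-around at
/// 2^32 ms is not handled.
pub fn ms_until_next_multiple(now_tick_ms: u32, multiple_ms: u32) -> u32 {
    if multiple_ms == 0 {
        return 0;
    }
    multiple_ms - (now_tick_ms % multiple_ms)
}
```

For example, at tick 1234 with a 100 ms multiple the function returns 66, so the thread would be asked to sleep until tick 1300. The returned value then goes through the same Sleep() path, with the same tick granularity as before.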

 

Technically, Wait (ms) in LabVIEW follows a relatively simple internal call chain. It starts with the RealWait() function inside the LabVIEW Run-Time Engine, which then calls an internal ThSleep() function, which in turn directly invokes the Windows Sleep() API. If you run your application under a debugger, you can observe this call sequence.

 

High Resolution Counter

How does High Resolution Polling Wait.vi work? There is another timing source available via the WinAPI function QueryPerformanceCounter().

 

This counter provides significantly higher resolution. Its frequency can be obtained using QueryPerformanceFrequency(), and on most modern systems it is typically around 10 MHz. It is important to always query this value programmatically, as it may vary between systems, although in practice I have not yet seen values other than 10 MHz on current machines (it was lower in the past on very old PCs).

 

The key difference is that this is not a timer, but merely a simple counter. No kernel timer object is created. As a result, in High Resolution Polling Wait.vi, the final portion of the delay (the last two milliseconds) is implemented as active polling — a tight spinning loop that repeatedly checks the counter until the requested time has elapsed (you can check this yourself, the Block Diagram is not password-protected). This is essentially the only way to achieve sub‑millisecond precision on a general‑purpose Windows OS:

image-20260531095011189.png

This VI has already been discussed in another topic, Re: Wait for less than 1 ms, so there is no need to repeat all the details here. However, it is important to understand that behind this seemingly simple loop there is significant overhead. The loop itself, the repeated DLL calls, and each invocation of QueryPerformanceCounter() all contribute to CPU load.
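The sleep-then-poll strategy described above can be sketched in a few lines of Rust. This is not the VI's block diagram translated one to one, just the same idea: sleep for the bulk of the delay, then spin on a high-resolution clock for the tail. The function name and the spin_tail parameter are my own choices.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Waits for `total`, sleeping for all but the last `spin_tail` and then
/// busy-polling the clock. Returns the time that actually elapsed.
pub fn hybrid_wait(total: Duration, spin_tail: Duration) -> Duration {
    let start = Instant::now();

    // Coarse phase: hand the thread to the scheduler, no CPU load.
    if let Some(coarse) = total.checked_sub(spin_tail) {
        if !coarse.is_zero() {
            thread::sleep(coarse);
        }
    }

    // Fine phase: tight polling loop, one core is fully busy here.
    while start.elapsed() < total {
        std::hint::spin_loop();
    }
    start.elapsed()
}
```

A call such as hybrid_wait(Duration::from_micros(500), Duration::from_millis(2)) never sleeps at all and spins for the whole 500 µs, which is the price for sub-millisecond delays. If the coarse sleep overshoots, the spin phase is simply skipped and the wait comes out late.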

 

In the attached PDF, I collected the full call chain. In total, each iteration executes approximately 384 CPU instructions and involves around 15 internal function calls, most of which are introduced by the wrapper layer used for DLL invocation in LabVIEW. Internally, each external call is routed through ExtFuncWrapper()—you can verify this by setting a breakpoint on that function in the debugger.

 

The most important fact, however, is what lies behind QueryPerformanceCounter(). On Windows, it is effectively backed by the CPU’s Time Stamp Counter and implemented using the rdtscp instruction (this time, I’ll bother you with a bit of assembly — sorry):

image-20260531112735741.png

As you can see, Windows applies a scaling factor to convert the raw TSC value to a normalized frequency (e.g., ~10 MHz). Interestingly, this conversion does not necessarily use simple division; it is implemented as a 128-bit multiplication by a precomputed reciprocal (for performance reasons, I guess), discarding the 64 least significant bits.
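For readers who have not seen this trick before, here is the same idea in Rust: multiply by a 64-bit fixed-point reciprocal, keep the upper 64 bits of the 128-bit product. The constants Windows really uses are not shown here; the 2,693,761,000 Hz TSC rate in the example is only an illustrative value matching the ~269.376 ratio, and both function names are made up.

```rust
/// Precomputes floor(2^64 * target_hz / tsc_hz) as a 64-bit fixed-point factor.
/// Requires target_hz < tsc_hz so that the factor fits into 64 bits.
pub fn reciprocal_q64(target_hz: u64, tsc_hz: u64) -> u64 {
    assert!(target_hz < tsc_hz, "target frequency must be below the TSC frequency");
    (((target_hz as u128) << 64) / tsc_hz as u128) as u64
}

/// Scales a raw TSC value: 64 x 64 -> 128 bit multiply, then drop the low 64 bits.
pub fn scale_tsc(tsc: u64, reciprocal: u64) -> u64 {
    ((tsc as u128 * reciprocal as u128) >> 64) as u64
}
```

With reciprocal_q64(10_000_000, 2_693_761_000), one second worth of TSC ticks scales to 10,000,000 counts, give or take one count because the reciprocal is truncated. The division happens once, up front; each subsequent read costs only a multiply and a shift.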

 

As an aside, testing across several system restarts shows that this scaling factor may vary slightly (in my case, around 269.3761–269.3762). This suggests that Windows measures and calibrates the conversion dynamically rather than relying on a fixed CPU specification. Possible reasons include slight variations in CPU base frequency or bus clock adjustments (for example, due to spread‑spectrum clocking used to reduce EMI, which cannot be disabled on my Dell laptop).

 

In practice, this means that QueryPerformanceCounter() can be viewed as a normalized version of the CPU Time‑Stamp Counter, scaled by a calibration factor and simply abstracted by the operating system.

 

CPU Time Stamp Counter (TSC)

 

The third option, which is not available in LabVIEW directly, is the ultra-high-resolution Time Stamp Counter (TSC). This is a counter running at the CPU’s base (invariant) frequency. Do not confuse this with the actual core frequency, which can change dynamically (for example, due to Intel SpeedStep). The invariant TSC runs at a constant rate (on most CPUs), which may match the nominal CPU frequency, but not necessarily the instantaneous execution speed of the cores.

 

For example, the CPU on which I am finishing this article runs at approximately 3.1 GHz. Again, this is not the frequency at which instructions are currently executed, but rather the constant frequency at which the TSC increments.

 

This can be measured by counting how many ticks occur within one second:

use std::arch::x86_64::_rdtsc;
use std::time::{Duration, Instant};

/// Counts how many TSC ticks elapse during roughly one second of wall-clock time.
#[unsafe(no_mangle)]
pub extern "C" fn measure_tsc() -> u64 {
    let start_time = Instant::now();
    let start_tsc = unsafe { _rdtsc() }; // Read TSC at the start

    while start_time.elapsed() < Duration::from_secs(1) {
        core::hint::spin_loop(); // Busy-wait for about 1 second
    }

    let end_tsc = unsafe { _rdtsc() }; // Read TSC at the end

    end_tsc - start_tsc // Return the difference (ticks per second)
}

The result is shown below (note that Rust’s time measurement functions are themselves based on the performance counter discussed above):

image-20260601152052974.png

We can now extend the library and implement a delay function sitting directly on the TSC:

/// Busy‑wait delay using RDTSC
#[unsafe(no_mangle)]
pub extern "C" fn delay_tsc(ticks_to_wait: u64) {
    unsafe {
        let start = _rdtsc();
        loop {
            let now = _rdtsc();
            if now - start >= ticks_to_wait {
                break;
            }
        }
    }
}

Using these self-made timing functions, we can achieve slightly better short-delay accuracy. Of course, the actual results still depend on overall system load, since Windows is not a real-time operating system!

 

In theory, to achieve a 50 µs delay on a 3.1 GHz CPU, we need to wait for roughly 155,000 TSC cycles:

image-20260601151914497.png

Quite stable. But the major drawback of wrapping such a tight loop in a DLL is that it cannot be interrupted easily. To mitigate this a little bit, I added a 1 ms delay before performing the delay and timing measurement.
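The tick count quoted above is simply delay times TSC frequency. A small helper (hypothetical name) makes the conversion explicit and avoids overflow by doing the multiplication in 128 bits:

```rust
/// Number of TSC ticks that correspond to `delay_ns` nanoseconds on a
/// counter running at `tsc_hz` ticks per second (result is rounded down).
pub fn tsc_ticks_for_delay(delay_ns: u64, tsc_hz: u64) -> u64 {
    ((delay_ns as u128 * tsc_hz as u128) / 1_000_000_000) as u64
}
```

tsc_ticks_for_delay(50_000, 3_100_000_000) gives 155,000, the value passed to delay_tsc() in the test above. The TSC frequency itself should come from a measurement such as measure_tsc(), not from the CPU's marketing name.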

 

For reference, here is the corresponding assembly code for delay_tsc(). The hot spinning loop consists of only six CPU instructions (I'll bother you with assembly again):

                public delay_tsc
delay_tsc       proc near

                rdtsc
                mov     r8, rdx
                shl     r8, 20h
                or      r8, rax ; R8: Saved Start RDTSC
                nop     dword ptr [rax+00h] ; Alignment
L1: ; --- Hot Spinning Loop ---
                rdtsc
                shl     rdx, 20h
                or      rdx, rax
                sub     rdx, r8 ; Difference
                cmp     rdx, rcx ; RCX is delay
                jb      short L1 ; ---
                retn
delay_tsc       endp

So there is no longer a “heavy” loop with hundreds of instructions, and this is the most efficient way to implement very short delays. Whether such precision is actually useful in a general-purpose operating system is, of course, a separate question.

 

This simple code provides many opportunities for experimentation. For example, you can extend the delay loop to return the number of iterations executed during the wait to analyze overall "stability" (I would recommend this as a good self-education exercise).
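As a starting point for that exercise, here is a portable variant that counts loop iterations. Note the swap: it polls std::time::Instant instead of RDTSC so it builds on any target, which makes each iteration more expensive than in the six-instruction loop above. The function name is hypothetical.

```rust
use std::time::{Duration, Instant};

/// Busy-waits for `delay` and reports how many polling iterations fit into
/// it, together with the time that actually elapsed.
pub fn spin_and_count(delay: Duration) -> (u64, Duration) {
    let start = Instant::now();
    let mut iterations: u64 = 0;
    loop {
        iterations += 1;
        let elapsed = start.elapsed();
        if elapsed >= delay {
            return (iterations, elapsed);
        }
    }
}
```

Calling it many times with the same delay and comparing the iteration counts gives a rough picture of stability: a run with a noticeably lower count was most likely interrupted or preempted during the wait.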

 

Sidenote: as you have already seen, the QueryPerformanceCounter() implementation uses rdtscp rather than rdtsc. The difference between RDTSC and RDTSCP lies mainly in ordering (serialization) and the additional information provided: RDTSCP also returns the ID of the core on which the code is executing, which can be useful to see how a thread jumps from one CPU core to another. The classical and recommended way to perform timing with both rdtsc/rdtscp is to surround the measurement with LFENCE instructions to prevent out-of-order execution, but further details are really off topic here; you can find them in Agner Fog's Software Optimization Resources.

 

CPU Threads Quantum

 

It is also useful and important to understand the concept of a quantum in Windows scheduling. From the CPU Analysis Article:

"Context switches are expensive operations. Windows generally allows each thread to run for a period of time that is called a quantum before it switches to another thread. Quantum duration is designed to preserve apparent system responsiveness. It maximizes throughput by minimizing the overhead of context switching. Quantum durations can vary between clients and servers. Quantum durations are typically longer on a server to maximize throughput at the expense of apparent responsiveness. On client computers, Windows assigns shorter quantums overall, but provides a longer quantum to the thread associated with the current foreground window."

This is also described in Windows Internals Part 1 pp. 237-238.

 

In other words, a quantum is the amount of time a thread is allowed to run before the scheduler may switch to another thread. On Windows 11, the default quantum is typically around two timer ticks. With a 1 ms timer resolution, this corresponds to roughly 2 ms. However, this does not guarantee uninterrupted execution—interrupts and higher‑priority threads can still preempt the running thread, although you effectively get a very small slice of near real‑time behavior within a short time window.

 

Just for information: on Windows Server, the quantum is usually much longer than on Windows 11 and can also be configured (not sure about Windows 11). I've read somewhere that it is typically on the order of about 12 ticks or more, which, with a 15.625 ms base tick, can result in time slices of roughly 190 ms (12 × 15.625 ms = 187.5 ms) before a context switch occurs. On the other hand, these are fairly fine details that are rarely relevant in everyday programming.

 

Finally, it is worth emphasizing once again that this entire discussion is not intended as a guide on how to achieve an exact few µs delay. Its purpose is to help explain how timing and scheduling actually work under the hood.

 

I am not providing a ready‑to‑use implementation, as misuse of these techniques can easily lead to more problems than benefits. However, with a solid understanding of the concepts and the snippets above, anyone can recreate what is needed for their specific use case.

 

Enjoy!

 

Message 3 of 34
(813 Views)

Unfortunately, I cannot add more information to the comment above because of this:

Screenshot 2026-06-02 11.01.33.png

However, one more important point: High Resolution Relative Seconds.vi uses the same underlying counter, but returns the difference between the value captured at the start and the current reading, scaled to seconds as a double‑precision value.

 

relative.png

 

In that sense, there is a kind of “double scaling” involved. Keep in mind that even though the result is a DBL number, the underlying resolution is still limited to 100 ns.
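The conversion itself is trivial, and writing it out shows where the 100 ns floor comes from. The sketch below assumes the usual 10 MHz counter; the function name is mine, not the VI's internals.

```rust
/// Converts two raw performance counter readings into elapsed seconds.
/// `freq_hz` is the value reported by QueryPerformanceFrequency().
pub fn relative_seconds(start_ticks: i64, now_ticks: i64, freq_hz: i64) -> f64 {
    (now_ticks - start_ticks) as f64 / freq_hz as f64
}
```

Two consecutive counter values at 10 MHz differ by exactly 1e-7 s, so no matter how many digits the DBL shows, the result can only move in 100 ns steps.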

 

Also, I forgot to attach the PDF containing the call graph of the LabVIEW while loop, here it is:

 

Message 4 of 34
(795 Views)

Very extensive and good writeup.

 

Just one nitpick. Make all those Call Library Nodes reentrant. That can make a huge difference. Almost all modern Windows APIs are explicitly safe to be called from multiple threads. When executed in the UI thread, the LabVIEW root loop and the competition for the UI thread play a huge role and can make your call go from a few microseconds or less to many hundreds of microseconds or even milliseconds of execution time, depending on what LabVIEW is doing elsewhere, such as updating a graph or interfacing to the Windows message queue, which is supposed to always be called from the main thread, aka the LabVIEW UI thread.

 

Also you can decrease the overhead of a DLL call by disabling error checking in the corresponding Error Checking tab. This makes the mentioned ExtFuncWrapper go pretty directly to the actual function call, without setting up all kinds of exception catching, memory trampolines and whatnot. That definitely should shave off a bit of the Call Library Node overhead.

Rolf Kalbermatter  My Blog
DEMO, Electronic and Mechanical Support department, room 36.LB00.390
Message 5 of 34
(785 Views)

@rolfk wrote:

Very extensive and good writeup.

 

Just one nitpick. Make all those Call Library Nodes reentrant. 

 

Also you can decrease the overhead of a DLL call by disabling error checking in the corresponding Error Checking tab.


Thank you very much!


Yes, these are absolutely valid points. However, I must admit that I'm sometimes too lazy to switch to Thread Safe for simple one‑off single calls. It becomes really important when we have many repeated calls, or — more importantly — when calling DLL functions that may take a significant amount of time (for example, hardware initialization). In such cases, the entire UI will be “frozen” while the call is executed inside the DLL.


For the four calls above, there will be no noticeable difference — they are very fast and only set some internal options, such as process information or timer resolution. Moreover, I have not yet explored in detail the internal differences between Thread Safe and UI thread calls. It is quite possible that invoking a DLL in a thread‑safe manner introduces a small amount of additional overhead due to thread management.
This can be verified by tracing the behavior in a debugger.

 

In general, the rule is absolutely and perfectly correct: if it is not strictly necessary to execute a DLL in the UI thread, the Thread Safe option should be used, and unnecessary error checking should be disabled — this is especially important when working with microsecond‑level delays.

 

However, threading is a completely different beast, especially when calling a DLL that is not thread‑safe but still needs to be executed outside the UI thread (which is indeed possible).

Message 6 of 34

@Andrey_Dmitriev wrote:

Unfortunately, I cannot add more information to the comment above because of this:

[Image: Screenshot 2026-06-02 11.01.33.png]

However, one more important point: High Resolution Relative Seconds.vi uses the same underlying counter, but returns the difference between the value captured at the start and the current reading, scaled to seconds as a double‑precision value.

 

[Image: relative.png]

 

In that sense, there is a kind of “double scaling” involved. Keep in mind that even though the result is a DBL number, the underlying resolution is still limited to 100 ns.

 

 


OK. Wow! First, let me say thank you for your effort and time (no pun intended!!)  

 

I almost never self-promote any post that I author as a Community Nugget.

 

In this case, with MY OP in this thread, I wanted to challenge the NI R&D TEAM to justify what I consider a poor effort. Of note, even the 100 ns resolution that Andrey defended with admirable scholarship was addressed by the code I provided to these forums. Exactly how long ago I personally presented a better solution to determine elapsed time is an invitation to circular arguments 😀

 

So (with a lot of respect for the thoughts of Andrey and Rolf), let me bring MY thread back to the original post's intent.

 

@NI R&D. What were you high on when you exposed that wrong VI on the Timing palette? Can I legally obtain the same? What bug fix will address the problem you released?

 

Yes, I know it will be as ugly and controversial as a 5th trimester abortion.  You should have searched the forums!


"Should be" isn't "Is" -Jay
Message 7 of 34

@JÞB wrote:  Wow! First, let me say thank you for your effort and time (no pun intended!!)  

 

@NI R&D. What were you high on when you exposed that wrong VI on the Timing palette? Can I legally obtain the same? What bug fix will address the problem you released?

 


You’re welcome! I’m not in NI R&D, but to be honest, I don’t quite see where exactly the problem is. In principle, it works as designed, and QueryPerformanceCounter() is the approach recommended by Microsoft. The only parts that may not be very well explained in the documentation are the increased timer resolution and perhaps some of the timing details. That said, I don’t see any fundamental issues here. Could you explain what you expected to see and what you actually observed that didn’t meet your expectations? Also, I’d be happy to help you get RDTSC working in LabVIEW (if that was not already addressed by the code you provided to these forums, which I was unfortunately unable to find; please post a link), but it’s mainly a technical question of integrating a DLL into LabVIEW, which is fairly well documented.
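For readers without LabVIEW at hand, the "capture a start value, read again, scale the difference to seconds" pattern can be sketched in Python. This is only an illustration under assumptions: `relative_seconds` is a hypothetical helper name, and it is not how the LabVIEW VI is implemented. On Windows, CPython reports `QueryPerformanceCounter()` as the implementation of `perf_counter`; on other systems a different monotonic clock is used.

```python
import time

# Report which OS primitive backs the counter and its tick size in seconds.
info = time.get_clock_info("perf_counter")
print(info.implementation, info.resolution)

def relative_seconds(start_ns):
    """Elapsed time since start_ns, scaled to a float number of seconds.

    Hypothetical helper for illustration: the integer tick difference is
    taken first, and only the result is converted to a double.
    """
    return (time.perf_counter_ns() - start_ns) / 1e9

start_ns = time.perf_counter_ns()   # value captured at the start
time.sleep(0.001)                   # stand-in for the work being timed
elapsed = relative_seconds(start_ns)
print(f"{elapsed * 1e6:.1f} us elapsed")
```

The returned float has many digits, but it cannot be finer than the tick size printed on the first line.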

Message 8 of 34

@Andrey_Dmitriev wrote:

Yes, these are absolutely valid points. However, I must admit that I'm sometimes too lazy to switch to Thread Safe for simple one‑off single calls. It becomes really important when we have many repeated calls, or — more importantly — when calling DLL functions that may take a significant amount of time (for example, hardware initialization). In such cases, the entire UI will be “frozen” while the call is executed inside the DLL.


For the four calls above, there will be no noticeable difference — they are very fast and only set some internal options, such as process information or timer resolution. Moreover, I have not yet explored in detail the internal differences between Thread Safe and UI thread calls. It is quite possible that invoking a DLL in a thread‑safe manner introduces a small amount of additional overhead due to thread management.
This can be verified by tracing the behavior in a debugger.

 

In general, the rule is absolutely and perfectly correct: if it is not strictly necessary to execute a DLL in the UI thread, the Thread Safe option should be used, and unnecessary error checking should be disabled — this is especially important when working with microsecond‑level delays.


Actually, if the Call Library Node is set to reentrant, it always has less overhead. A Call Library Node set to execute reentrant simply runs in whatever thread the current diagram is running in. It will fully block that thread for the duration of the call, but if the VI is not configured to run in the UI thread itself, that will leave LabVIEW at least about 7 other threads it can use to schedule other parts of the diagram, as long as they have no dataflow dependency.

 

If the Call Library Node is set to execute in the UI thread, and the VI is not set to execute in that thread, LabVIEW has to arbitrate for that thread and wait until it is available, competing with other functions that want to use it as well as with the message handling loop (root loop) that talks to the OS. Then it must do a thread switch, which costs quite a bit of performance.
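The difference between the two modes can be pictured with a small Python analogy. This is a sketch under assumptions, not LabVIEW's actual scheduler: a single-worker executor stands in for the UI thread, and `dll_call`, `call_reentrant` and `call_in_ui_thread` are hypothetical names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the single UI thread: every request queues up behind
# whatever else that one worker is currently doing.
ui_thread = ThreadPoolExecutor(max_workers=1, thread_name_prefix="ui")

def dll_call():
    """Placeholder for the external function; reports where it ran."""
    return threading.current_thread().name

def call_reentrant():
    # Runs directly in the caller's thread: no queueing, no thread switch.
    return dll_call()

def call_in_ui_thread():
    # Hands the call to the UI worker and blocks until it has been served:
    # one wait for the worker plus a switch there and back.
    return ui_thread.submit(dll_call).result()

print("reentrant ran in:", call_reentrant())
print("UI-thread call ran in:", call_in_ui_thread())
```

The second call only returns after the single worker has picked it up, which is where the arbitration cost described above comes from.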

 

The error checking is also significant, but pretty much independent of the threading. The lowest error checking level basically disables any additional protection. The intermediate level installs exception handling around the call, which can catch OS and CPU exceptions, while the highest level also creates so-called trampolines around the memory buffers passed in. These trampolines are filled with a specific bit pattern and, after the call, are verified to still contain that same bit pattern. If they don't, a buffer overflow happened. Any of these errors and exceptions are returned as error 1097 from the Call Library Node.
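The guard-pattern check at the highest level can be illustrated with a short Python sketch. The pattern value, its length and the function names are assumptions made for the example; only the idea (fill, call, verify, report 1097) comes from the description above.

```python
GUARD = bytes([0xA5]) * 16  # hypothetical fill pattern and guard length

def call_with_guards(func, size):
    """Call func on a buffer wrapped in guard zones, then verify the guards.

    Returns (error_code, data); 1097 signals that a guard zone was modified.
    """
    block = bytearray(GUARD + bytes(size) + GUARD)
    func(block, len(GUARD), size)            # callee writes into the middle
    head_ok = block[:len(GUARD)] == GUARD
    tail_ok = block[-len(GUARD):] == GUARD
    if not (head_ok and tail_ok):
        return 1097, None                    # buffer overflow detected
    return 0, bytes(block[len(GUARD):len(GUARD) + size])

def well_behaved(block, offset, size):
    block[offset:offset + size] = b"\x01" * size

def overflowing(block, offset, size):
    # Writes 4 bytes past the end of the buffer, into the trailing guard.
    block[offset:offset + size + 4] = b"\x01" * (size + 4)

print(call_with_guards(well_behaved, 8))
print(call_with_guards(overflowing, 8))
```

Filling and re-checking the guard zones on every call is extra work, which is why this level costs the most.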

Rolf Kalbermatter  My Blog
DEMO, Electronic and Mechanical Support department, room 36.LB00.390
Message 9 of 34

Awesome write-up! Thank you.

This may be slightly off topic, but I was looking into something similar last year to get more reliable scheduler loop timing on Windows 11 without adding too much CPU load. (Running High Resolution Polling Wait.vi in a loop is pretty CPU intensive.) 

 

I was hoping to use "CreateWaitableTimerExW" with the CREATE_WAITABLE_TIMER_HIGH_RESOLUTION flag set. Not just to get <1ms timing, but to also make use of the periodic ability in SetWaitableTimerEx.

 

I thought I could offload the scheduler timing to the kernel and get a reliable 5 ms schedule.

Unfortunately I could never get it to work. It seems there might be some undocumented quirks.

 

Do you think these new waitable timer functions could offer a path to reliable sub-millisecond timing in LabVIEW (on newer versions of Windows)?
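For anyone who wants to experiment, here is a minimal sketch of a one-shot high-resolution waitable timer called from Python via ctypes. It is not a confirmed fix for the quirks mentioned above and makes no claim about the achievable jitter. The helper names are hypothetical, and on non-Windows systems the function falls back to `time.sleep` so that the sketch still runs. Note that the `lPeriod` argument of `SetWaitableTimer` is given in milliseconds, so this sketch re-arms a relative due time (in 100 ns units) instead of relying on the periodic mode.

```python
import ctypes
import sys
import time

CREATE_WAITABLE_TIMER_HIGH_RESOLUTION = 0x00000002  # Windows 10 1803 and later
TIMER_ALL_ACCESS = 0x1F0003
INFINITE = 0xFFFFFFFF

def relative_due_time(seconds):
    """Seconds -> negative count of 100 ns units (negative means relative)."""
    return -int(round(seconds * 10_000_000))

def high_res_wait(seconds):
    """Block for roughly `seconds` using a high-resolution waitable timer."""
    if sys.platform != "win32":
        time.sleep(seconds)  # fallback so the sketch runs on other systems
        return
    from ctypes import wintypes
    k32 = ctypes.WinDLL("kernel32", use_last_error=True)
    k32.CreateWaitableTimerExW.restype = wintypes.HANDLE
    k32.CreateWaitableTimerExW.argtypes = [
        wintypes.LPVOID, wintypes.LPCWSTR, wintypes.DWORD, wintypes.DWORD]
    k32.SetWaitableTimer.restype = wintypes.BOOL
    k32.SetWaitableTimer.argtypes = [
        wintypes.HANDLE, ctypes.POINTER(ctypes.c_longlong), wintypes.LONG,
        wintypes.LPVOID, wintypes.LPVOID, wintypes.BOOL]
    k32.WaitForSingleObject.restype = wintypes.DWORD
    k32.WaitForSingleObject.argtypes = [wintypes.HANDLE, wintypes.DWORD]
    k32.CloseHandle.argtypes = [wintypes.HANDLE]

    handle = k32.CreateWaitableTimerExW(
        None, None, CREATE_WAITABLE_TIMER_HIGH_RESOLUTION, TIMER_ALL_ACCESS)
    if not handle:
        raise ctypes.WinError(ctypes.get_last_error())
    try:
        due = ctypes.c_longlong(relative_due_time(seconds))
        # lPeriod = 0: one-shot timer; the caller re-arms it on each iteration.
        if not k32.SetWaitableTimer(handle, ctypes.byref(due), 0,
                                    None, None, False):
            raise ctypes.WinError(ctypes.get_last_error())
        k32.WaitForSingleObject(handle, INFINITE)
    finally:
        k32.CloseHandle(handle)

start = time.perf_counter()
high_res_wait(0.0005)  # request a 500 us wait
print(f"waited {(time.perf_counter() - start) * 1e6:.0f} us")
```

In a real scheduler loop the timer handle would be created once and reused; it is created per call here only to keep the sketch short.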

Troy - CLD "If a hammer is the only tool you have, everything starts to look like a nail." ~ Maslow/Kaplan - Law of the instrument
Message 10 of 34