LabWindows/CVI


Tricky multi-threaded challenge

I am working on an interesting translation project.  A fellow engineer developed a very rigorous algorithm using Excel.  Input data is two arrays (Excel columns) of 1024 cells each.  Here's a glimpse into the spreadsheet.  The images below show the formula in one cell to illustrate the averaging window, and the successive columns that iterate on this mechanism (he's smoothing 10 times).

 

[Screenshots: Spectrum to Voltage (Paul Z).xlsx — averaging formula in one cell, and the successive smoothing columns]

I've been able to translate the above into C and verify.  Not terribly hard.
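For readers following along, a C translation of this kind of repeated windowed smoothing could look like the sketch below. The window size, edge handling, and equal weights are placeholders, not the actual spreadsheet formula (which isn't reproduced in the post):

```c
#include <stddef.h>

#define N 1024

/* One smoothing pass: centered moving average over a small window.
   Window half-width and edge handling are placeholders, not the
   actual spreadsheet formula. */
static void smooth_pass(const double *in, double *out, size_t n, int half)
{
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        int count = 0;
        for (int k = -half; k <= half; k++) {
            long j = (long)i + k;
            if (j >= 0 && j < (long)n) { sum += in[j]; count++; }
        }
        out[i] = sum / count;
    }
}

/* Apply the pass repeatedly (the colleague smooths 10 times),
   using a scratch buffer and copying back between passes. */
static void smooth_n_times(double *data, double *scratch, size_t n,
                           int half, int passes)
{
    for (int p = 0; p < passes; p++) {
        smooth_pass(data, scratch, n, half);
        for (size_t i = 0; i < n; i++)
            data[i] = scratch[i];
    }
}
```

Note that each pass reads only the previous pass's buffer, which matters later in the thread when parallelization comes up.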

 

My colleague then does an elaborate indexed slope calculation to find an asymptotic decay of the waveform. 

 

[Screenshot: Spectrum to Voltage (Paul Z).xlsx — indexed slope calculation]

 

Again, this wasn't hard to translate.   All in all, the given worksheet takes my machine about 6.8 seconds to crunch, one loop at a time.
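The exact "indexed slope" formula isn't shown, but a generic stand-in for a slope over a sliding window of the waveform might look like this least-squares sketch (the window width and the use of raw indexes as x values are assumptions):

```c
#include <stddef.h>

/* Least-squares slope of y over a window of w consecutive points
   starting at index 'start'. The x values are simply the indexes
   0..w-1. A generic stand-in, not the colleague's actual formula. */
static double window_slope(const double *y, size_t start, size_t w)
{
    double sx = 0, sy = 0, sxy = 0, sxx = 0;
    for (size_t k = 0; k < w; k++) {
        double x = (double)k;
        sx  += x;
        sy  += y[start + k];
        sxy += x * y[start + k];
        sxx += x * x;
    }
    double n = (double)w;
    return (n * sxy - sx * sy) / (n * sxx - sx * sx);
}
```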

 

Here's the question. As I implement these very custom array calculations, I'm basically doing a lot of For loops, each with 1024 indexes. Due to the structure of the algorithms, each successive index depends on previous array indexes such that simply making these all multithreaded wouldn't provide much benefit.  In other words, for Array Function 2 to work, Array Function 1 would need to get at least N indexes completed.  And for Array Function 3, and so on.

 

Am I missing some elegant optimization here?

Message 1 of 6

Hello Electro,

Tight loop optimization is not the domain of multithreading, but good news: it is the specialty of MPI, which is embedded in CVI!

 

The good news is that it requires no modification to your code. But it IS difficult to implement, and above all to verify (often your carefully written pragmas have no effect at all). On modern processors with deep pipelines and wide hyperthreading, though, the gains can be huge. Look it up!

 

Don't forget to look at the generated assembly code and to run benchmarks. gcc also has an option to ask the optimizer to explain what it is doing (or not), but that's beside the point here.

Message 2 of 6

gdargaud, you've introduced me to a brave new world!  I'd never heard of this technique before now.  But yes, that seems to be exactly what I need.  However, I couldn't find any mention of MPI in the CVI Help search (either "MPI" or "Message Passing Interface").  Could you point me in a direction within the CVI world?  I also checked the example projects.

 

To add more definition for other readers, here is a chatbot's breakdown:

 

MPI (Message Passing Interface) is a widely used standard for writing parallel programs that run across multiple processors or machines. To utilize MPI, you would need to divide your problem into smaller tasks that can be executed independently. Each task can be assigned to a separate process, which can then perform the calculations in parallel. The processes can communicate with each other using message passing to exchange necessary data.

 

Here's a general outline of how you could approach parallelizing your code using MPI:

  1. Set up MPI: Initialize MPI and determine the number of processes available.

  2. Distribute the data: Divide your spreadsheet data among the processes. Each process should receive a portion of the data to work on. You can use MPI functions like MPI_Scatter or MPI_Scatterv to distribute the data.

  3. Perform calculations: Each process should independently perform the calculations on its assigned portion of the data. Since the calculations rely on indexes before and after the current index, each process may need to exchange boundary data with neighboring processes using MPI communication functions like MPI_Send and MPI_Recv.

  4. Gather results: After the calculations are complete, you can use MPI functions like MPI_Gather or MPI_Gatherv to collect the results from all processes onto a single process.

  5. Finalize MPI: Terminate MPI and clean up any resources that were used.

By parallelizing your code using MPI, you can potentially reduce the execution time by utilizing multiple processors or machines to perform the calculations concurrently.

Please note that implementing parallel computing techniques can be complex, and it may require modifying your existing code. It's recommended to consult MPI documentation and examples or seek guidance from an experienced parallel computing expert for a more detailed implementation specific to your code and requirements.

Message 3 of 6

Sorry, I had a brain fart. I meant OpenMP !

Both are for parallelizations but are very different. I've used both but a long time ago, so I can't really go into details.

 

In CVI, you enable it in [Build Options][Enable OpenMP support] and then you 'just' add the appropriate pragmas before your *for* loops. It's an open standard, so it's easy to find examples and support.
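A minimal sketch of the pragma style (the loop bodies are placeholders; the reduction clause matters whenever a loop combines elements, otherwise the threads race on the accumulator). If the compiler lacks OpenMP support, the pragmas are silently ignored and the loops run serially — which is also why a pragma can appear to "have no effect":

```c
#include <stddef.h>

/* Independent per-element work parallelizes directly. */
static void scale_all(double *a, size_t n, double k)
{
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++)
        a[i] *= k;
}

/* A loop that combines elements needs a reduction clause,
   otherwise every thread races on 'sum'. */
static double sum_all(const double *a, size_t n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)n; i++)
        sum += a[i];
    return sum;
}
```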

 

And I think the first technique you want to look into is loop unrolling.
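For reference, a manual 4x unroll of an accumulation loop might look like the sketch below. Modern compilers usually do this themselves at -O2/-O3, so benchmark before and after:

```c
#include <stddef.h>

/* Manual 4x unroll of a dot product: four independent accumulators
   break the serial dependency on a single sum and let the pipeline
   overlap the multiplies. */
static double dot(const double *a, const double *b, size_t n)
{
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)            /* leftover elements */
        s0 += a[i] * b[i];
    return s0 + s1 + s2 + s3;
}
```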

Message 4 of 6

Well I've been reading and experimenting with OpenMP since you recommended it.  And it is indeed fascinating!  It's also very easy to implement in code, much easier than going the CVI MT route, which I have experience with.

 

After some work, I made a test project (see attached) that demonstrates both single- and multi-threaded loop algorithms.  In it, there are four cascading loops.  It's not identical to my original post, but it's similar in principle.  Each loop grinds through a hashing function, affecting each data point in a 1024-element array.  This array is then passed to the next loop, where the data is crunched again, and again.  So they are all interdependent.
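The attached project isn't reproduced here, but the shape described could be sketched like this (the hash function is a placeholder). The key observation: the passes must run in order, but if each element only reads the previous pass's buffer, the elements *within* a pass are independent, and that inner loop is the one worth a pragma:

```c
#define N 1024
#define PASSES 4

/* Placeholder per-element "hash"; the real function in the attached
   project isn't shown. */
static unsigned mix(unsigned x)
{
    x ^= x >> 16;
    x *= 0x45d9f3b;
    x ^= x >> 16;
    return x;
}

/* Passes are sequential (each needs the previous pass's output),
   but within a pass every element is independent, so only the
   inner loop is parallelized. */
static void crunch(unsigned *data, unsigned *scratch, int n)
{
    for (int p = 0; p < PASSES; p++) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            scratch[i] = mix(data[i]);
        for (int i = 0; i < n; i++)   /* copy back for the next pass */
            data[i] = scratch[i];
    }
}
```

With only 1024 elements per pass, the per-loop work may also be too small for the thread start/join overhead to pay off, which would be consistent with seeing little improvement.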

 

I don't see much improvement with the OpenMP usage.

 

[Screenshots: 2DFFT test project timing results, single-threaded vs. multi-threaded]

 

 

I suspect there are some pragmas or clauses that I'm not using, or not using correctly.  Thoughts?

 

Message 5 of 6

Here's a pro-tip that I've discovered in this OpenMP journey: run the final executable outside the debugger!  That layer slows down execution so much -- upwards of 6x in my experience.  
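A crude harness for that kind of comparison might look like this. CVI has its own Timer() function; plain C clock() is used here so the sketch stands alone. Build the release executable, run it from the command line, then run the same binary under the debugger and compare:

```c
#include <time.h>

/* Time 'reps' calls of fn() and return elapsed seconds. clock()
   resolution is coarse, so use enough reps to get a stable number. */
static double time_it(void (*fn)(void), int reps)
{
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        fn();
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

/* volatile sink keeps the optimizer from deleting the work. */
static volatile double sink;
static void work(void)
{
    double s = 0;
    for (int i = 0; i < 100000; i++)
        s += (double)i * 0.5;
    sink = s;
}
```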

Message 6 of 6