LabWindows/CVI


Multithreading and partitioned shared memory

Solved!

Hi All,

 

I'm having no success with this (simple?) multithreading problem on my Core i7 processor, using CVI 9.0 (32-bit compiler).

 

In the code snippets below, I have a node-level structure of 5 integers, and I use 32 calls to calloc() to allocate space for 32 blocks of 128*128 (16K) nodes, storing the returned pointers in an array held as a global variable.

Node size = 20 bytes, block size = approx. 328 KB, total allocated size = approx. 10.5 MB.

 

I then spawn 32 threads, each of which is passed a unique index into the "node_space" pointer array (see code below), so each thread manipulates (reads/writes) a separate 16K-node block.

 

It should be thread safe and should scale with the number of threads, because each thread addresses a different memory block (with no overlap), but multithreading runs no faster (or only slightly faster) than a single thread.

 

I've tried various thread pool sizes and padding the nodes to 16- and 64-byte boundaries, all to no avail.

 

Is this a memory-bandwidth problem due to the size of the arrays? Does each thread somehow load all 32 blocks? Any help appreciated.

 

struct  Nodes

   {
   unsigned int a;  
   unsigned int b;  
   unsigned int c;
   unsigned int d;  
   unsigned int e;

   } ;                                           
typedef struct Nodes  Nodes;
typedef  Nodes   *Node_Ptr;

 

Node_Ptr            node_space[32];          /* pointer array into 32 separate blocks ( loaded via individual calloc calls for each block) */
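
For completeness, here is roughly how the blocks are allocated (a sketch only; NUM_BLOCKS and NODES_PER_BLOCK are illustrative names, not from my actual code):

#define NUM_BLOCKS        32
#define NODES_PER_BLOCK   (128 * 128)     /* 16K nodes per block */

.... Block Allocation (sketch) ....

        int index;

        for (index = 0; index < NUM_BLOCKS; ++index)
            {
            node_space[index] = calloc (NODES_PER_BLOCK, sizeof (Nodes));   /* one block per calloc call */
            if (node_space[index] == NULL)
                return -1;                                                  /* allocation failed */
            }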

 

.... Thread Spawning  ....

         for (index = 0; index < 32; ++index)

             CmtScheduleThreadPoolFunction(my_thread_pool_handle, My_Thread_Function, &index, NULL);

 

Message 1 of 7

Hi again,

 

Sorry, the block-indexing scheme in the original post was not right.

I actually use an indexed list of integers to give each thread a unique value via the pointer-to-data parameter of CmtScheduleThreadPoolFunction.

It should be:

 

struct  Nodes

   {
   unsigned int a;  
   unsigned int b;  
   unsigned int c;
   unsigned int d;  
   unsigned int e;

   } ;                                           
typedef struct Nodes  Nodes;
typedef  Nodes   *Node_Ptr;

 

Node_Ptr            node_space[32];          /* pointer array into 32 separate blocks ( loaded via individual calloc calls for each block) */

int  index_list[32];

/* <<<<< all above are globals >>>>> */

 

.... Thread Spawning ....

        int index;

        for (index = 0; index < 32; ++index)

            {

            index_list[index] = index;

            CmtScheduleThreadPoolFunction(my_thread_pool_handle, My_Thread_Function, (index_list + index), NULL);
            }
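
On the receiving side, the thread function picks up its block roughly like this (a sketch only; the node-loop body is omitted, and I'm assuming the standard CVI thread-function prototype):

int CVICALLBACK My_Thread_Function (void *functionData)
{
    int      my_index = *(int *)functionData;      /* unique value from index_list */
    Node_Ptr my_nodes = node_space[my_index];      /* this thread's private block  */
    int      i;

    for (i = 0; i < 128 * 128; ++i)
        {
        /* read/write my_nodes[i].a ... my_nodes[i].e only */
        }

    return 0;
}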

 

Message 2 of 7

Hello CVI_Rules!

 

Have you considered the following options to improve the performance of your application?

  1. Using OpenMP to parallelize the operations on your data. The CVI 2013 compiler added support for OpenMP. You can read more about OpenMP in CVI here: https://decibel.ni.com/content/docs/DOC-29830
  2. Setting a preferred thread affinity to ensure your threads run on separate cores, e.g. with SetThreadIdealProcessor (a minimal sketch follows this list): http://www.ni.com/white-paper/3663/en/
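
For option 2, a minimal sketch of setting an ideal processor from inside the scheduled thread function (the index parameter and the modulo mapping onto cores are just illustrative):

#include <windows.h>

int CVICALLBACK My_Thread_Function (void *functionData)
{
    int         index = *(int *)functionData;
    SYSTEM_INFO sysInfo;

    GetSystemInfo (&sysInfo);

    /* Hint to the OS scheduler which core this thread should prefer. */
    SetThreadIdealProcessor (GetCurrentThread (),
                             index % sysInfo.dwNumberOfProcessors);

    /* ... work on this thread's node block ... */
    return 0;
}
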
Message 3 of 7
Solution
Accepted by topic author CVI_Rules

Hello CVI_Rules,

 

It's hard to answer your question because it depends on what you are doing in your thread function. Since you are not seeing any speed-up in your program when you change the number of threads in your thread pool, you are either doing too much (or all) of the work in each thread, serializing your threads with locks, or somehow slowing down execution in each thread.

 

Your basic setup looks fine. You can simplify it slightly by passing the nodes directly to your thread function:

 

    for (index = 0; index < 32; ++index)
    {
        CmtScheduleThreadPoolFunction(pool, My_Thread_Function, node_space[index], NULL);
    }

...

static int CVICALLBACK My_Thread_Function (void *functionData)
{
Node_Ptr nodes = functionData;
...

 But that's not going to affect performance.

 

Things to look into:

 

  1. Verify that you're really working on only one subset of the node space in each thread: that you're passing and receiving the correct block in each thread and touching only that one.
  2. Verify that you don't have any locks or other synchronization in your program. It sounds like you don't, because you designed your program so that it wouldn't need any. But check anyway.
  3. Verify that you're not doing anything unnecessary in your thread function. Sometimes people call ProcessSystemEvents or ProcessDrawEvents because they feel it makes the UI more responsive. These functions are expensive (around 20 ms per call, I think). So if you call them in a loop, with a fixed total number of iterations across all threads, and the actual computations are relatively fast, they can easily dominate the execution time of your program (see the sketch after this list). It need not be these functions; it might be others. These are just examples.
  4. Show and explain your code to a colleague. Sometimes you don't see the obvious until you show it to someone, or they might notice something.
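
To illustrate point 3: if you really do need to keep the UI responsive from a worker loop, throttle the call instead of making it on every iteration (a sketch only; the interval and loop bounds are just examples):

double lastEventCheck = Timer ();
int    i;

for (i = 0; i < 128 * 128; ++i)
    {
    /* ... per-node computation ... */

    if (Timer () - lastEventCheck > 0.1)     /* at most ~10 event checks per second */
        {
        ProcessSystemEvents ();
        lastEventCheck = Timer ();
        }
    }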

Apart from that, can you explain what you are doing in your thread function so that we can have a better understanding of your program and what might inhibit parallelism?

Message 4 of 7
Hi Peter,

Many thanks for your timely response. The thread code is the working part of a 3D state machine (cellular automaton), and each node only needs to see its immediate neighbours, so it has high locality.

Your focus on what was happening inside the threads was spot on. I found that inside each thread, within the tight loop visiting each node, I was referring to values in another (global) structure, and even though that structure was only about 1200 bytes, it must have been flushing the cache or something. Anyway, I made those values local variables, and now I get a speed-up from multithreading: it runs 2.5 times faster than a single thread. The best-performing thread pool size for this problem seems to be 16, and padding the nodes to 64 bytes makes no difference.

I'm seeing 30% total CPU usage in Task Manager, so I think it's now limited by dynamic memory accesses. The code should vectorize well, and as a uni student I now have access to the Intel optimising compiler, to use as an external compiler for CVI. But I wanted to get the multithreading sorted before using that tool, otherwise you don't know what is causing the speed-up. And besides, I now have a reason to upgrade to the new 6-core Intel CPUs (with 12 and 15 MB caches). The larger L3 cache should fit the whole set of node blocks. :-)

Cheers, Geoff.
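
PS: For anyone else who hits this, the change was essentially the following (a sketch only; Params, params and the rule_* fields are made-up stand-ins for my global structure):

/* Stand-in for my ~1200-byte global parameter structure (illustrative only). */
struct Params { unsigned int rule_a; unsigned int rule_b; };
static struct Params params;

int CVICALLBACK My_Thread_Function (void *functionData)
{
    int      my_index = *(int *)functionData;
    Node_Ptr nodes    = node_space[my_index];

    /* Copy the values the loop needs into locals ONCE, instead of touching
       the shared global structure on every node visit.                     */
    unsigned int rule_a = params.rule_a;
    unsigned int rule_b = params.rule_b;
    int i;

    for (i = 0; i < 128 * 128; ++i)
        {
        /* update nodes[i] using rule_a / rule_b, not params.rule_a / params.rule_b */
        }

    return 0;
}
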
Message 5 of 7
Hi J,

Thanks for your quick response.

  1. Sadly, I'm a student at a uni, and my supervisor isn't going to pay for any upgrades to CVI 2013. He reckons I should go to Visual Studio :-) It's a shame LabWindows and LabVIEW aren't used more in unis, especially on the engineering and science side of things. CVI is one of the most intuitive development environments I've come across, and much easier to use than VS.
  2. Thread affinity didn't make any difference. But I did find that the main problem was referring to two separate structures within a tight loop inside the threads (see my reply to Peter).

Thanks again.
Message 6 of 7

Hello again! We're very glad that it eventually worked out well for you and that you found a solution to your problem. We wish you good luck with your university projects!

Message 7 of 7