We need to see how you are testing.
The front panel in your picture is completely meaningless (you don't even show all dimensions). And no, a hyper-threaded core (= 2 virtual cores) does not give a 2x speedup over a non-hyper-threaded core. You have four hyper-threaded cores, so a 4x speedup over a single core is about what you can expect.
What functions are you testing? The MASM functions operate on matrices directly, so why do you even need a loop? What are you doing with it? How is the parallel FOR loop configured? What are the execution properties of the testing VI? How do you measure the execution time?
We can compare the following:
- A multiplication of two large matrices using the regular matrix multiplication of the linear algebra palette
- A multiplication of two large matrices using the matrix multiplication of the MASM toolkit
- Maybe a homemade explicit matrix multiplication built from simple primitives, run once with a parallel FOR loop and once with a regular FOR loop.
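Since a LabVIEW diagram cannot be pasted as text, here is a rough Python sketch of the shape such a comparison could take. It is only an illustration under assumptions: the function names (`matmul_naive`, `matmul_rows`, `best_time`, `run_benchmark`) are made up for this post, the two implementations merely stand in for "whatever variants you are comparing", and nothing here uses the linear algebra palette or the MASM toolkit. The points that carry over are: same inputs for every variant, several repetitions with the best time kept, and a check that all variants return the same result.

```python
import random
import time


def matmul_naive(a, b):
    """Explicit triple-loop multiplication using only scalar operations."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += a[i][p] * b[p][j]
            c[i][j] = s
    return c


def matmul_rows(a, b):
    """Same product, computed as row-by-column dot products."""
    cols = list(zip(*b))  # transpose b once so columns are contiguous tuples
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def best_time(func, a, b, repeats=3):
    """Run func(a, b) several times; return (best seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(a, b)
        best = min(best, time.perf_counter() - start)
    return best, result


def max_abs_diff(x, y):
    """Largest element-wise difference between two equally sized matrices."""
    return max(abs(p - q) for rx, ry in zip(x, y) for p, q in zip(rx, ry))


def run_benchmark(n=60, seed=0):
    """Time both variants on the same random n x n inputs."""
    rng = random.Random(seed)
    a = [[rng.random() for _ in range(n)] for _ in range(n)]
    b = [[rng.random() for _ in range(n)] for _ in range(n)]
    t_naive, c_naive = best_time(matmul_naive, a, b)
    t_rows, c_rows = best_time(matmul_rows, a, b)
    return {
        "naive_s": t_naive,
        "rows_s": t_rows,
        "max_abs_diff": max_abs_diff(c_naive, c_rows),
    }


if __name__ == "__main__":
    print(run_benchmark())
```

Only the code being measured sits between the two timer reads; generating the inputs and comparing the outputs happen outside the timed region. The LabVIEW equivalent would be a sequence structure with a tick count before and after the operation under test, and nothing else inside.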
Please show us your entire benchmarking code. Something is clearly wrong here.