01-06-2016 03:53 AM
Hi muh,
You can see the testing circuit.
There is no circuit. All I see is an image from a block diagram…
Let me reiterate; it is simple enough: does a memory resize with a change of 0 slow down execution or not? Yes/no.
Your whole example is flawed: When I time your "circuit" I get the very same timing values regardless of the loop iteration count! This tells me the compiler recognized your routine as "constant code" and precomputed the result of your add function! Your question makes no sense, as it doesn't apply to that example!
01-06-2016 05:30 AM
"When I time your "circuit" I get the very same timing values regardless of the loop iteration count!"
When I do the same, I get results proportional to the count. LabVIEW 2013, debugging is off.
Once again, the question is not how this particular diagram works. The question is: does a resize with a change of 0 take a lot of time? It seems to me that you do not know the answer. Thank you for trying, though.
01-06-2016 05:33 AM
01-06-2016 05:35 AM
Thank you.
01-06-2016 11:34 AM
@muh1 wrote:
"Once again, the question is not how this particular diagram works. The question is: does a resize with a change of 0 take a lot of time? It seems to me that you do not know the answer. Thank you for trying, though."
No need to get snippy. If you show us seriously flawed Rube Goldberg code with a vague question, we assume that you want more than just a yes/no answer. Quoting your own words: "it depends". 😄
Why don't you grab paper and pencil and calculate the number of clock cycles per multiplication? Since this is simple math and independent of any programming language, you should have no problem with that. Have you done that?
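As an aside, that paper-and-pencil estimate can be sketched in a few lines of Python. All the numbers below (clock rate, floating-point adds per cycle, memory bandwidth) are assumed illustrative values, not measurements of any particular machine from this thread:

```python
# Back-of-envelope estimate for adding two 1e6-element complex-double
# arrays. Every constant here is an assumption for illustration only.
CLOCK_HZ = 2.1e9          # assumed 2.1 GHz CPU clock
N = 1_000_000             # array length discussed in the thread
ADDS_PER_ELEMENT = 2      # one complex add = 2 double-precision adds
FLOPS_PER_CYCLE = 4       # assumed SIMD throughput: 4 double adds/cycle

total_adds = N * ADDS_PER_ELEMENT
compute_seconds = total_adds / (CLOCK_HZ * FLOPS_PER_CYCLE)

# Memory traffic usually dominates: two inputs read plus one output
# written, 16 bytes per complex double, at an assumed 10 GB/s sustained.
BYTES_PER_ELEMENT = 16
BANDWIDTH_BYTES_PER_S = 10e9
memory_seconds = 3 * N * BYTES_PER_ELEMENT / BANDWIDTH_BYTES_PER_S

print(f"compute-bound lower bound: {compute_seconds * 1e3:.3f} ms")
print(f"memory-bound estimate:     {memory_seconds * 1e3:.3f} ms")
```

Under these assumptions the memory-bound estimate lands in the few-millisecond range, which suggests that a time on the order of the 4 ms mentioned earlier is dominated by memory traffic, not arithmetic.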
Here is a good article about the LabVIEW compiler. Learn about folding and inplaceness. You'll also learn that LabVIEW is very good at reusing buffers if the size does not change. Benchmarking code is more an art than a science, and 95% of all performance measurements posted in this forum are highly flawed, including yours. So if your question is based on flawed assumptions, we feel it is important to fix the benchmark first.
Next time you have a similar question, have the decency to include a small VI that is ready to run, contains reasonable default data in all controls, and also includes the timing code. We cannot debug a picture. Also avoid as many diagram constants as you can (Size, N, etc.), because false answers due to folding are a real possibility if you don't.
01-06-2016 11:51 AM
muh1 wrote:
Let me ask you a simple question: do you have an answer to my original one or not? Let me reiterate; it is simple enough: does a memory resize with a change of 0 slow down execution or not? Yes/no. Secondary questions: why is there a memory buffer reallocation at the input of the Add of two arrays of the same size? Is 4 ms a reasonable time for adding two arrays of 1e6 complex DBL numbers? I prefer not to pointlessly discuss the diagram, which I introduced just to illustrate the questions. The questions are self-contained.
Any time LabVIEW needs to make a copy of an array or other large data structure, there's a performance penalty. It doesn't matter whether the array is being resized, only whether that operation forces a copy. Your code here forces LabVIEW to make a copy of the input array, even though the size doesn't change.
The reason you see a memory buffer allocated at the input of the add is that the add node reuses the upper input buffer as the output buffer. In your example code, the original buffer cannot be modified, because it's needed for later loop iterations, so LabVIEW must make a copy of that original buffer to be used for the output. That new buffer can be reused on each loop iteration, so it's only allocated once, but the copy must occur on every iteration. Allocating memory is fast, copying data is slow.
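The copy-versus-reuse distinction described above can be illustrated with a NumPy analogy (this is Python, not LabVIEW; the buffer semantics are analogous, not identical): producing the sum in a fresh output buffer must write every element into new memory, while an in-place add reuses the existing buffer.

```python
import numpy as np

# Analogy for LabVIEW's buffer reuse: out-of-place add allocates and
# fills a new buffer; in-place add reuses the input buffer.
a = np.arange(1_000_000, dtype=np.complex128)
b = np.ones(1_000_000, dtype=np.complex128)

out = a + b                        # new buffer allocated and filled
assert not np.shares_memory(out, a)

a_before = a                       # keep a handle on the original buffer
a += b                             # in-place: same buffer, no copy made
assert np.shares_memory(a_before, a)
```

When the original array must survive (as it does across the loop iterations described above), the in-place form is not available and the copy cost is unavoidable.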
01-06-2016 12:15 PM
As far as I understand, the number of clock cycles per floating-point addition can be anywhere from several hundred for early 8086 processors to less than one for pipelined modern processors. I do not know exactly what fraction of a cycle. For example, if the vector instructions of the processor are used, the time drops by a factor of several. I can find out, of course, but I've tried to see whether collective wisdom can help.
Actually, the time is probably about right. I was a bit confused by the fact that the profiler did not report any significant time for the "vector outer product" VI, which involves the same number of multiplications and should take at least the same time as the addition of the resulting matrices. Maybe a profiler glitch.
I am sorry, but the diagram I've included actually runs and takes 4 seconds to complete on my laptop, scaling roughly linearly with the number of loop iterations. I am not trying to fully benchmark anything; I was trying to see whether this is entirely unreasonable. I've seen examples of compiler optimization and made reasonably sure it does not occur here. Finally, I had no intention of burdening the participants with debugging my code, just a simple question.
Thank you for the reference to the article; I'll look into it.
01-06-2016 12:23 PM
Aha. Thank you. That must be it. I should have thought of this myself.
01-06-2016 02:12 PM - edited 01-06-2016 02:37 PM
@muh1 wrote:
As far as I understand, the number of clock cycles per floating-point addition can be anywhere from several hundred for early 8086 processors to less than one for pipelined modern processors. I do not know exactly what fraction of a cycle. For example, if the vector instructions of the processor are used, the time drops by a factor of several. I can find out, of course, but I've tried to see whether collective wisdom can help.
The outer product is quite different from adding two 1D arrays. You'll end up with a matrix that has as many elements as the product of the input array sizes. Two things can speed up the code: (1) it will take advantage of SSE instructions operating on several adjacent elements at once. Most likely the outer product uses the Intel Math Kernel Library, so it should be quite optimized out of the gate. (2) In addition, you could take advantage of multiple CPU cores. The outer product is inlined (at least in LabVIEW 2015), so there is little calling overhead, but it does not take advantage of multiple cores. Most likely it reuses its internal data structures as long as the input sizes remain the same, so there is no need for constant allocations.
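The size relationship mentioned above can be shown in a tiny NumPy sketch (an analogy, not LabVIEW): the outer product of two length-n vectors performs n×n multiplications and yields an n×n matrix, so summing two such matrices costs exactly as many additions.

```python
import numpy as np

# Outer product of two length-3 vectors: 9 multiplications -> 3x3 matrix.
x = np.array([1.0, 2.0, 3.0])
y = np.array([10.0, 20.0, 30.0])

outer = np.outer(x, y)       # element [i, j] = x[i] * y[j]
summed = outer + outer       # 9 additions, the same element count

assert outer.size == x.size * y.size   # n*n elements
assert outer[2, 2] == 90.0             # 3 * 30
```

This is the operation-count argument from the thread in miniature: the multiplication count of the outer product equals the addition count of summing the resulting matrices.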
Here are a few possibilities to calculate the outer product:
Also note that for small problems the parallelization overhead can cancel the parallelization advantage, so it is important to do a valid benchmark. Your i5 is most likely a non-hyperthreaded quad core. What is the exact model number? 2.1 GHz seems low. Is this a mobile processor?
@muh1 wrote:
I am sorry, but the diagram I've included actually runs and takes 4 seconds to complete on my laptop, scaling roughly linearly with the number of loop iterations.
It is quite possible that LabVIEW 2013 behaves differently and does not recognize the loop-invariant code (hard to believe, though). Did you enable the display of folding in the LabVIEW options? You said that you disabled debugging on the VI; can you double-check that? Do you really have a diagram constant wired to N, or is it now a control? If all inputs are constants, everything will be calculated at compile time and replaced with the resulting constant under ideal conditions.
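The folding-resistant benchmark structure being recommended here can be sketched in Python (as an analogy; LabVIEW's folding works differently, but the defensive pattern is the same): feed the inputs in from outside the timed region and actually consume the result, so no compiler or runtime can replace the timed work with a precomputed constant.

```python
import timeit

import numpy as np

# Inputs come from a RNG outside the timed code, so they cannot be
# folded into constants at compile time.
rng = np.random.default_rng(0)
a = rng.random(1_000_000) + 1j * rng.random(1_000_000)
b = rng.random(1_000_000) + 1j * rng.random(1_000_000)

def add_arrays(x, y):
    # The work under test: one elementwise add of two complex arrays.
    return x + y

n_iter = 10
# timeit consumes the return value, so the add cannot be eliminated.
elapsed = timeit.timeit(lambda: add_arrays(a, b), number=n_iter)
print(f"{elapsed / n_iter * 1e3:.3f} ms per add of 1e6 complex doubles")
```

The LabVIEW equivalent is wiring controls (not diagram constants) to Size and N and wiring the loop's result to an indicator, for the same reason.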
01-07-2016 03:55 AM
Of course it is different. However, the outer product of two 1000-element arrays involves the same number of multiplications as summing the resulting matrices requires additions, so I expected it to take at least the same time. Maybe it does, but the profiler somehow misses it.
As for the processor, the i5 I was using is indeed a mobile one. Mostly I was playing on it while a desktop i7 was doing the real work.
Your link to the parallelized version of the outer product leads to the VIs I was actually using. It does not say they are parallelized. Are they? That may be the reason why the profiler does not catch their execution time.
Thank you for your help; I think I am pretty much done with this problem for now.