LabVIEW Idea Exchange

WG-

Loop unrolling

Status: Declined

Any idea that has received less than 4 kudos within 4 years after posting will be automatically declined.

LabVIEW should support loop unrolling...

 

For those who do not know what loop unrolling is (from Wikipedia: http://en.wikipedia.org/wiki/Loop_unwinding):

Loop unwinding, also known as loop unrolling, is a loop transformation technique that attempts to optimize a program's execution speed at the expense of its binary size (space-time tradeoff). The transformation can be undertaken manually by the programmer or by an optimizing compiler.

 

The goal of loop unwinding is to increase a program's speed by reducing (or eliminating) instructions that control the loop, such as pointer arithmetic and "end of loop" tests on each iteration;[1] reducing branch penalties; as well as "hiding latencies, in particular, the delay in reading data from memory".[2] Loops can be re-written instead as a repeated sequence of similar independent statements eliminating this overhead.[3]

 

Example with textual programming language:

 

int x;
for (x = 0; x < 100; x++)
{
   delete(x);   /* delete(x) is a placeholder for the per-iteration work */
}

 

Becomes:

 

int x;
for (x = 0; x < 100; x+=5)
{
   delete(x);
   delete(x+1);
   delete(x+2);
   delete(x+3);
   delete(x+4);
}

 

Who is with me 😉

14 Comments
RavensFan
Knight of NI

I don't understand what is supposed to be the benefit of the second loop compared to the first.

 

That said, what you are showing in the 2nd loop looks an awful lot like the parallelism in For Loops that has been present in LabVIEW for a few versions now.

WG-
Member

I didn't provide the wiki link for nothing, you know 😉 No offence, but if you don't know what loop unrolling is and don't read the documentation I provided, then don't bother to comment either.

 

Further, to brighten your mind: this is totally different from data parallelization. See the example; it can't, well, I can't, explain it more simply than that.

PhillipBrooks
Active Participant

I would have to agree with Ravens Fan.

 

Improving Performance with Parallel For Loops

 


Now is the right time to use %^<%Y-%m-%dT%H:%M:%S%3uZ>T
If you don't hate time zones, you're not a real programmer.

"You are what you don't automate"
Inplaceness is synonymous with insidiousness

WG-
Member

How much clearer can I make it? Parallelization != loop unrolling. *sigh* Get yourself educated, please.

Mr.Mike
NI Employee (retired)

We currently do loop unrolling in the compiler.  Were you suggesting we allow loop unrolling at the source level (i.e. on the block diagram)?  Or if you wanted to suggest that we do it in the compiler, maybe the suggestion should be to visualize that a loop will be unrolled?

 

Some compiler optimizations are already visualized (e.g. constant folding; apparently we somehow visualize dead code elimination too, but I've never seen it). Others are hard to visualize without adding too much clutter.

 

(Incidentally, you can't use a parallel for loop for deleting elements in an array if you're deleting by index: the index may change out from under you. Imagine doing this with a sequential for loop: you'd need a shift register, which isn't allowed in a parallel for loop. If you were to do it with a data value reference, you'd see that the items deleted vary with each run.)

-- Mike
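
For readers following along in C, here is a minimal sketch of the index-shift problem described above. delete_at() is a hypothetical helper (not anything from the original post) that removes one element and shifts the rest of the array down; because the survivors shift, later indices no longer point at the elements they pointed at before, and running the deletions out of order (as a parallel loop would) removes yet different elements.

/* Minimal C illustration of deleting by index while the indices shift. */
#include <stdio.h>
#include <string.h>

static void delete_at(int *arr, int *len, int i)
{
    /* Remove element i and shift everything after it one slot left. */
    memmove(&arr[i], &arr[i + 1], (size_t)(*len - i - 1) * sizeof arr[0]);
    (*len)--;
}

int main(void)
{
    int arr[] = {10, 11, 12, 13, 14, 15};
    int len = 6;

    /* Intent: delete the elements originally at indices 0, 1 and 2. */
    for (int i = 0; i < 3; i++)
        delete_at(arr, &len, i);   /* each call shifts the survivors left */

    for (int i = 0; i < len; i++)
        printf("%d ", arr[i]);     /* prints "11 13 15", not the intended "13 14 15" */
    printf("\n");
    return 0;
}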
Mr.Mike
NI Employee (retired)

In general, the benefit of loop unrolling is that it decreases jump and compare instructions.  A jump is an instruction that tells the processor to move to an instruction other than the next one.  A compare is an instruction that tells the processor to compare two values, and depending on the result execute one instruction or continue.  So a for loop (very loosely) becomes:

 

  1. init i to 0
  2. compare i < end: if i >= end jump to 6
  3.   delete arr[i]
  4.   i += 1
  5. jump to 2
  6. clean up

 

(There are other ways to express this, such as putting a compare at the end of the loop)

 

If you set end to 100, you get 101 compares and 100 jumps.  If you unroll the loop as WouterG showed, you get:

 

  1. init i to 0
  2. compare i < end: if i >= end jump to 10
  3.   delete arr[i]
  4.   delete arr[i+1]
  5.   delete arr[i+2]
  6.   delete arr[i+3]
  7.   delete arr[i+4]
  8.   i += 5
  9. jump to 2
  10. clean up

(Note: I'm not addressing the issue of dealing with ends that are not a multiple of the unrolling factor.)


This has only 21 compares and 20 jumps.  Compares and jumps tend to slow the processor down because it can't always predict in advance which direction you'll go in, so it can't start "pipelining" the instructions to make them execute faster.  There are additional benefits to loop unrolling like limiting the number of times you have to run the loop initialization code inside of the loop (not shown in my instructions above).

-- Mike
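
Here is a rough C sketch of the unrolled loop outlined above, including the cleanup ("epilogue") loop for trip counts that are not a multiple of 5, the case the note above sets aside. process() is a stand-in for the per-iteration work (delete(x) in the original example).

#include <stdio.h>

static void process(int x) { printf("%d\n", x); }

void run(int count)
{
    int x = 0;

    /* Main unrolled loop: one compare and one jump per 5 iterations. */
    for (; x + 4 < count; x += 5) {
        process(x);
        process(x + 1);
        process(x + 2);
        process(x + 3);
        process(x + 4);
    }

    /* Epilogue: handles the 0-4 iterations left over when count % 5 != 0. */
    for (; x < count; x++)
        process(x);
}

int main(void)
{
    run(12);   /* 12 = 2*5 + 2: two unrolled passes plus a 2-iteration epilogue */
    return 0;
}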
PhillipBrooks
Active Participant

NI LabVIEW Compiler: Under the Hood

 

 



WG-
Member

@MrMike I want to suggest that users be able to decide for themselves how much a loop should be unrolled. So, for example, on a for-loop, next to specifying how many times the loop should execute and how many iterations should run in parallel, you should also be able to specify how many times the loop must be unrolled. That the compiler currently does this is nice, but it does not mean it chooses the best unroll factor; a programmer should decide that. The best unroll factor depends on the system you are running on, so it is also a more iterative optimization: you first try unroll factors of 2, 4, 5, 8, 10, 15, 16 (usually a multiple of 2 or 5), and then you can see which one executes the fastest.
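
A rough C sketch of the manual experiment described above: hand-unroll a loop by a chosen factor and time it against the rolled version. The summation kernel, the unroll factor of 4, and the trip count are arbitrary choices for illustration; whether any speed-up appears depends heavily on the compiler, optimization level, and CPU.

#include <stdio.h>
#include <time.h>

#define N 100000000L   /* trip count; a multiple of the unroll factor below */

static double rolled(void)
{
    double s = 0.0;
    for (long i = 0; i < N; i++)            /* one compare/jump per iteration */
        s += 1.0 / (double)(i + 1);
    return s;
}

static double unrolled4(void)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (long i = 0; i < N; i += 4) {        /* one compare/jump per 4 iterations */
        s0 += 1.0 / (double)(i + 1);
        s1 += 1.0 / (double)(i + 2);
        s2 += 1.0 / (double)(i + 3);
        s3 += 1.0 / (double)(i + 4);
    }
    return s0 + s1 + s2 + s3;
}

int main(void)
{
    clock_t t0 = clock();
    double r1 = rolled();
    clock_t t1 = clock();
    double r2 = unrolled4();
    clock_t t2 = clock();

    printf("rolled:    %.3f s  (sum %.6f)\n", (double)(t1 - t0) / CLOCKS_PER_SEC, r1);
    printf("unrolled4: %.3f s  (sum %.6f)\n", (double)(t2 - t1) / CLOCKS_PER_SEC, r2);
    return 0;
}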

Daklu
Active Participant

I'm neither for nor against the idea, but I'm curious about your use case. You must be working with huge arrays or under very tight timing constraints if the difference from a relatively small number of jump/compare instructions is critical to your code.

 

Have you tried manually unrolling your loops to compare the difference in execution time?  Granted, the compiler might add its own unrolling algorithms on top of yours, but I'm curious about the results.

Andrey_Dmitriev
Trusted Enthusiast

@WouterG wrote:

 

Who is with me 😉


I'm with you. In some cases user-controlled unrolling may be useful (even though the LabVIEW compiler already does this internally).

 

The Intel compiler has an unroll pragma:

 

UInroll_Intel.png

 

(refer to unroll[n] pragma)
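
For readers without the screenshot, this is roughly what the unroll pragma looks like in textual C. The Intel compiler spells it #pragma unroll(n); other compilers have their own forms (e.g. #pragma clang loop unroll_count(8) in Clang), and a compiler that does not recognize the pragma will typically just warn and ignore it.

double sum_unrolled(const double *data, int n)
{
    double sum = 0.0;
#pragma unroll(8)              /* request an unroll factor of 8 (Intel syntax) */
    for (int i = 0; i < n; i++)
        sum += data[i];
    return sum;
}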

 

Suggestion for LabVIEW:

For example, manually unrolled 8 times:

 

UInroll_Suggest.png

As you can see, the code above cannot be parallelized because a dependence exists between loop iterations.
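
For readers without the screenshot, here is a small C sketch of the kind of loop being described: each iteration depends on the result of the previous one (a loop-carried dependence), so the iterations cannot run in parallel, yet the loop can still be unrolled to cut loop overhead. The particular recurrence and the factor of 2 are arbitrary illustrative choices.

double recurrence_unrolled2(const double *b, int n)   /* n assumed even here */
{
    double s = 0.0;
    for (int i = 0; i < n; i += 2) {
        s = 0.5 * s + b[i];        /* iteration i needs the s from i-1 ...   */
        s = 0.5 * s + b[i + 1];    /* ... so the work stays sequential,      */
    }                              /* but there is one compare/jump per 2.   */
    return s;
}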

 

In LabVIEW, manual unrolling may give up to a 30% performance improvement:

 

Unroll-Screenshot.png

 

The example is here (created with LabVIEW 2012 x64, tested on a PC with 12 GB RAM).

 

Andrey.