For Loop “Iteration Parallelism”

August 27, 2013

Let’s say we are developing LabVIEW code to monitor the number of LabVIEW instances running on a LAN at any given time. Our code snippet would look like this.

In this code snippet, the “TCP Open Connection” primitive times out after 1 second, and the loop runs 256 iterations, so the worst-case execution time of the above snippet would be around 256 seconds. The actual measurement shows the execution time to be around 225 seconds on a quad-core machine running Windows 7.
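For readers who do not have the block diagram in front of them, here is a rough text-language analogue of that sequential loop in Python. It is only a sketch, not the original VI: the port number and the injectable `probe` parameter are assumptions made for illustration, since the post does not say which port is probed.

```python
import socket

TIMEOUT_S = 1.0    # 1-second connection timeout, as in the post
HOST_COUNT = 256   # number of loop iterations, as in the post
PORT = 3363        # assumption: the post does not name the port it probes


def tcp_probe(host, port=PORT, timeout=TIMEOUT_S):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat as "no instance here".
        return False


def count_instances_sequential(hosts, probe=tcp_probe):
    """Probe the hosts one after another and count those that answer."""
    count = 0
    for host in hosts:
        if probe(host):
            count += 1
    return count


def worst_case_seconds(iterations=HOST_COUNT, timeout=TIMEOUT_S):
    """If every probe times out, the waits simply add up."""
    return iterations * timeout
```

With the defaults, `worst_case_seconds()` gives 256 × 1 s = 256 s, which is the figure quoted above. The measured 225 seconds is lower, presumably because some connection attempts return before the full timeout.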

Common questions that arise in our minds are:

Is there a way to reduce the execution time?
Can we improve the performance by making use of all the available CPU-cores?

The answer to both these questions is yes. Let’s see how we can achieve this.

From LabVIEW 2009 onward, the For Loop has a new feature called “Iteration Parallelism”. We enabled this feature in our code and experimented with its settings to find the optimal values. We were able to reduce the execution time of our code from 225 seconds to just 4 seconds.

Steps to enable “Iteration Parallelism” option:

1. Check whether the For Loop can be parallelized by using the tool under the menu “Tools >> Profile >> Find Parallelizable Loops”.

2. If the loop is not parallelizable, remove any data dependencies between the iterations. Shift registers and feedback nodes are the most common causes of data dependencies between loop iterations, so try to avoid them.

3. In our case we modified the code to separate the “iteration independent code” from the “iteration dependent code”. After the modification our code looks like this.

The first For Loop is parallelizable. The second one is not, because it contains the iteration-dependent code.
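The same split can be sketched in Python, as an analogy rather than a translation of the VI. The function names and the use of `ThreadPoolExecutor` are assumptions for illustration: the first function holds the work that needs nothing from other iterations, the second holds the running total that is carried from one iteration to the next (the role a shift register plays in LabVIEW).

```python
from concurrent.futures import ThreadPoolExecutor


def probe_all(hosts, probe, workers):
    """Iteration-independent part: each probe needs nothing from the others."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the inputs.
        return list(pool.map(probe, hosts))


def count_hits(results):
    """Iteration-dependent part: a running total carried between iterations."""
    count = 0
    for ok in results:
        if ok:
            count += 1
    return count
```

Only `probe_all` benefits from more workers. `count_hits` stays sequential, but it is cheap because it no longer waits on the network.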

4. Right-click on the border of the first For Loop and select “Configure Iteration Parallelism”.

The following dialog box will pop up.

5. In the above window we will refer to the “number of generated parallel loop instances” as “T”. The maximum value that can be entered for T is 64.

6. Leave “Allow debugging” unchecked and “Automatically partition iterations” selected.

7. Check “Enable Loop iteration Parallelism” and click OK. The P terminal will appear on the For Loop border.

8. Create a control for this terminal and label it as “P”.

9. The following table shows the results of the execution time of our code with various combinations of P and T.

Execution time for each P terminal input (rows) and each value of “# of generated parallel loop instances”, T (columns)
P \ T	2	4	8	16	32	64
2	113.019s	124.996s	122.009s	124.005s	119.005s	110.019s
Not wired (on Quad core)	100.029s	61.005s	58.005s	61.017s	61.005s	53.781s
4	115.020s	60.003s	60.002s	54.016s	49.004s	54.013s
8	114.012s	61.993s	28.010s	28.019s	31.004s	28.071s
16	102.030s	58.005s	26.011s	15.009s	14.008s	14.024s
32	100.007s	57.008s	30.004s	15.011s	6.035s	7.009s
64	123.003s	62.992s	29.013s	15.008s	7.009s	3.072s
128	122.002s	58.003s	29.007s	15.003s	7.010s	4.020s
256	118.008s	59.005s	30.007s	14.013s	7.034s	4.084s

If you look carefully, for any value of P > T the execution time is almost the same as when P = T.

The execution times in the 2nd and 3rd rows are almost the same. This indicates that if you don’t wire any input to P, LabVIEW will use 2 on a dual-core machine, 4 on a quad-core machine, and so on. So we removed the second row when plotting the variation of execution time with respect to P and T.

Variation along each column

Variation along each Row

It is clear that increasing the number of threads (both T and P together) reduces the execution time. We reduced the measured execution time from 225 seconds (256 seconds in the worst case) to about 3 to 4 seconds with the help of iteration parallelism (T=64 and P>=64).
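The pattern in the table can be captured by a small back-of-the-envelope model, sketched here in Python. It assumes that the number of workers actually used is P capped by T, that an unwired P falls back to the core count, and that every iteration blocks for the full 1-second timeout. The function names are made up for this sketch, and it is a rough estimate, not a description of LabVIEW's scheduler.

```python
import math
import os


def effective_workers(p, t, cores=None):
    """Workers used at run time: P capped by T; p=None means P is not wired."""
    if cores is None:
        cores = os.cpu_count() or 1
    requested = cores if p is None else p
    return max(1, min(requested, t))


def estimated_seconds(iterations, workers, timeout_s=1.0):
    """Rough estimate when every iteration waits for the full timeout."""
    return math.ceil(iterations / workers) * timeout_s
```

For example, T=64 with P=256 gives 64 workers and an estimate of 4 s (measured: 4.084 s), while T=64 with P unwired on a quad-core machine gives 4 workers and 64 s (measured: 53.781 s). Most measured values land a little below the estimate, presumably because hosts that answer return before the timeout expires.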

Conclusions:

1. The parameter “T”, that is, “Number of generated parallel loop instances”, is the number of threads generated at compile time.

2. The “P” terminal controls how many of the available threads (T) are used at run time.

3. If you don’t wire any input to P, LabVIEW will take 2 for dual core, 4 for a Quad core machine and so on.

4. Wiring any value above 64 to P has no further impact on performance, because T cannot exceed 64.

5. Since our example code is very small, creating 64 copies of it does not consume much memory. If your code is big, however, you may run out of memory. In that case try reducing T.

6. Our example code spends most of its time waiting (on the TCP Open Connection timeout), so all 64 threads could wait in parallel and we saw a significant performance improvement. Most algorithms, however, have no such delay and keep the CPU busy in each of the 64 threads. In such cases you will not observe a performance improvement greater than about 2x on a dual-core or 4x on a quad-core machine, and performance may even degrade because of thread-handling overhead.

7. In our example, using all 64 available threads gave us the best performance (3 seconds), but this may not always be the case. The main limiting factors are the available memory, the absence of delays in the code, and the number of CPUs.

8. If there is no delay inside the For Loop, it is better to leave the P terminal unwired, so that LabVIEW can decide how many threads to use based on the number of CPU cores in the target machine.
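The point about delays in conclusion 6 can be tried out with a small Python experiment. This is an illustration under stated assumptions, not a LabVIEW benchmark: `time.sleep` stands in for the connection timeout, Python threads stand in for parallel loop instances, and the task count and delay are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def wait_task(delay_s):
    """Stand-in for an iteration that mostly waits, like a connection timeout."""
    time.sleep(delay_s)
    return delay_s


def timed_run(workers, tasks=8, delay_s=0.05):
    """Run `tasks` waiting tasks on `workers` threads; return elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(wait_task, [delay_s] * tasks))
    return time.perf_counter() - start
```

Comparing `timed_run(1)` with `timed_run(8)` shows the waits overlapping when there are more workers, regardless of the number of cores. The sketch says nothing about CPU-bound loops, where the core count becomes the limit.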
