For Loop “Iteration Parallelism”

Suppose we are developing LabVIEW code to monitor how many LabVIEW instances are running on a LAN at any given time. Our code snippet looks like this.




In this code snippet, the “TCP Open Connection” primitive times out after 1 second, and the loop runs 256 iterations, so the worst-case execution time of the above snippet is around 256 seconds. The actual measurement shows an execution time of around 225 seconds on a quad-core machine running Windows 7.
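For readers who prefer text to a block diagram, the sequential loop described above can be sketched roughly in Python. This is only an analogy, not the LabVIEW code itself: the function names, the host list and the port number are all assumptions made for illustration.

```python
import socket


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and unreachable hosts.
        return False


def count_instances(hosts, port, probe=is_listening):
    """Probe each host one after another and count those that accept a connection."""
    count = 0
    for host in hosts:
        if probe(host, port):
            count += 1
    return count


def worst_case_seconds(iterations, timeout_s):
    """Every iteration waits for the full timeout when nothing answers."""
    return iterations * timeout_s


if __name__ == "__main__":
    # 256 iterations with a 1 second timeout each.
    print(worst_case_seconds(256, 1.0))
```

Because each probe waits for the previous one to finish, the timeouts add up, which is where the 256 second worst case comes from.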

Common questions that arise in our minds are:
  1.  Is there a way to reduce the execution time?
  2.  Can we improve the performance by making use of all the available CPU-cores?
The answer to both these questions is yes. Let’s see how we can achieve this.

From LabVIEW 2009 onward, the For Loop has a new feature called “Iteration Parallelism”. We enabled this feature in our code and experimented with its settings to find the optimal values. We were able to reduce the execution time of our code from 225 seconds to just 4 seconds.

Steps to enable “Iteration Parallelism” option:
1.      Check whether the For Loop can be parallelized by using the tool under the menu “Tools >> Profile >> Find Parallelizable Loops”.
2.       If the loop is not parallelizable, remove any data dependencies between the iterations. Shift registers and feedback nodes are the most common causes of data dependencies between loop iterations, so try to avoid them.
3.      In our case we modified the code to separate the “iteration independent code” from the “iteration dependent code”. After the modification our code looks like this.



  The first For Loop is parallelizable. The second one is not, because it contains the iteration-dependent code.

4.      Right-click on the border of the first For Loop and select “Configure Iteration Parallelism”.
The following dialog box will appear.


5.       In the above window we will refer to the “number of generated parallel loop instances” as “T”. The maximum value that can be entered for T is 64.
6.       Leave the “Allow debugging” unchecked and “Automatically partition iterations” selected.
7.      Check “Enable Loop Iteration Parallelism” and click OK. The P terminal will appear on the For Loop border.
8.      Create a control for this terminal and label it as “P”.
9.      The following table shows the results of the execution time of our code with various combinations of P and T.

Execution time for each P terminal input and each value of “# of generated parallel loop instances” (T)

P terminal input            T=2        T=4        T=8        T=16       T=32       T=64
2                           113.019s   124.996s   122.009s   124.005s   119.005s   110.019s
Not wired (on quad core)    100.029s   61.005s    58.005s    61.017s    61.005s    53.781s
4                           115.020s   60.003s    60.002s    54.016s    49.004s    54.013s
8                           114.012s   61.993s    28.010s    28.019s    31.004s    28.071s
16                          102.030s   58.005s    26.011s    15.009s    14.008s    14.024s
32                          100.007s   57.008s    30.004s    15.011s    6.035s     7.009s
64                          123.003s   62.992s    29.013s    15.008s    7.009s     3.072s
128                         122.002s   58.003s    29.007s    15.003s    7.010s     4.020s
256                         118.008s   59.005s    30.007s    14.013s    7.034s     4.084s

If you look carefully, for any value of P > T the execution time is almost the same as when P = T.
The second and third rows (P not wired and P = 4) show almost the same execution times. This indicates that if you don’t wire any input to P, LabVIEW uses 2 on a dual-core machine, 4 on a quad-core machine, and so on. We therefore left out the second row when plotting the variation of execution time with respect to P and T.


Variation along each column


Variation along each Row                                            
It is clear that increasing the number of threads (both T and P) reduces the execution time. We reduced the execution time from about 225 seconds (256 seconds in the worst case) to about 4 seconds with the help of iteration parallelism (T = 64 and P >= 64).
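The pattern in the measurements can be approximated with a simple back-of-the-envelope model. The sketch below is an assumption drawn from the table, not a documented LabVIEW formula: it supposes that min(P, T) iterations run at the same time and that each iteration waits for the full 1 second timeout.

```python
import math


def estimated_seconds(iterations, p, t, timeout_s=1.0):
    """Rough upper estimate: the iterations run in batches of min(p, t)."""
    workers = min(p, t)
    return math.ceil(iterations / workers) * timeout_s


# Estimates for P = 64 and each T used in the measurements above.
for t in (2, 4, 8, 16, 32, 64):
    print(t, estimated_seconds(256, 64, t))
```

For P = 64 the model gives 128, 64, 32, 16, 8 and 4 seconds for T = 2 to 64. The measured times in that row (123.003s down to 3.072s) sit slightly below these figures, which is what one would expect if some connections complete or fail before the timeout expires.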

Conclusions:
1.      The parameter T (“Number of generated parallel loop instances”) sets the number of threads to be generated at compile time.
2.      The P terminal controls how many of the T available threads are used at run time.
3.      If you don’t wire any input to P, LabVIEW uses 2 on a dual-core machine, 4 on a quad-core machine, and so on.
4.      Wiring a value above 64 to P has no further impact on performance, because T cannot exceed 64.
5.      Since our example code is very small, creating 64 copies of it does not consume much memory. If your code is large, you may run out of memory; in that case, try reducing T.
6.      Our example code contains a delay (the timeout of the TCP Open Connection), so all 64 threads could run in parallel and we saw a significant performance improvement. Most algorithms, however, have no such delay, and each of the 64 threads needs the CPU. In such cases you will not see an improvement greater than about 2x on a dual-core or 4x on a quad-core machine, and performance may even degrade because of thread-handling overhead.
7.      In our example, using all 64 available threads gave the best performance (about 3 seconds), but this may not always be the case. The main limiting factors are the available memory, the absence of delays in the code, and the number of CPU cores.
8.      If there is no delay inside the For Loop, it is better to leave the P terminal unwired so that LabVIEW can decide how many threads to use based on the number of CPU cores in the target machine.
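The same idea can be tried outside LabVIEW. The following Python sketch is only an analogy under stated assumptions: `probe_all` and `slow_probe` are hypothetical names, the sleep stands in for a connection attempt that times out, and `max_workers` plays a role loosely similar to P. It keeps the split between the iteration-independent part (the probes) and the iteration-dependent part (the running count).

```python
import time
from concurrent.futures import ThreadPoolExecutor


def probe_all(hosts, probe, workers):
    """Run probe(host) for every host using at most `workers` threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Iteration-independent part: no probe needs a result from another.
        results = list(pool.map(probe, hosts))
    # Iteration-dependent part: the count is accumulated sequentially.
    count = 0
    for ok in results:
        if ok:
            count += 1
    return count


def slow_probe(host):
    """Hypothetical stand-in for a connection attempt that mostly waits."""
    time.sleep(0.01)
    return host % 2 == 0


if __name__ == "__main__":
    hosts = range(64)
    for workers in (1, 8, 64):
        start = time.perf_counter()
        found = probe_all(hosts, slow_probe, workers)
        elapsed = time.perf_counter() - start
        print(workers, found, round(elapsed, 2))
```

Because the probe spends its time waiting rather than computing, the elapsed time should drop as the worker count grows. A probe that kept the CPU busy would not be expected to scale this way, in line with conclusion 6.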

