For Loop “Iteration Parallelism”
Let’s say we are developing LabVIEW code to monitor the number of LabVIEW
instances running over a LAN at any given point in time. Then our code snippet
will look like this.
In this code snippet, the “TCP Open Connection” primitive times out after
1 second, and the loop runs 256 iterations, so the worst-case execution time of
the above snippet would be around 256 seconds. The actual measurement shows the
execution time to be around 225 seconds on a quad-core machine running
Windows 7.
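For readers more used to text-based code, the loop described above can be approximated with the following Python sketch. It is an analogy, not the LabVIEW block diagram: the port number and the names `probe`, `count_instances` and `worst_case_seconds` are assumptions made for illustration. Only the 1 second timeout and the 256 iterations come from the text.

```python
import socket

TIMEOUT_S = 1.0    # the 1 second timeout mentioned in the text
HOST_COUNT = 256   # the 256 loop iterations mentioned in the text
PORT = 3363        # assumed port, replace with the one your instances listen on


def probe(host, port=PORT, timeout=TIMEOUT_S):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts as well as refused or unreachable hosts.
        return False


def count_instances(hosts, probe_fn=probe):
    """Probe the hosts one after another and count the ones that answer."""
    count = 0
    for host in hosts:
        if probe_fn(host):
            count += 1
    return count


def worst_case_seconds(host_count=HOST_COUNT, timeout=TIMEOUT_S):
    """Upper bound when every attempt times out: iterations x timeout."""
    return host_count * timeout
```

Because every attempt waits for the previous one to finish, `worst_case_seconds()` evaluates to 256 × 1 s = 256 s, which is the bound quoted above.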
Common questions that arise in our minds are:
- Is there a way to reduce the execution time?
- Can we improve the performance by making use of all the available CPU cores?
The answer to both these questions is yes. Let’s see how we can achieve this.
From LabVIEW 2009 onward, the For Loop has a new feature called “Iteration
Parallelism”. We enabled this feature in our code and experimented with its
settings to find the optimal values. We were able to reduce the execution time
of our code from 225 seconds to just 4 seconds.
Steps to enable the “Iteration Parallelism” option:
1. Check whether the For Loop can be parallelized by using the tool under the
menu “Tools >> Profile >> Find Parallelizable Loops”.
2. If the loop is not parallelizable, remove any data dependencies between the
iterations. Shift registers and feedback nodes are the most common cause of data
dependencies between loop iterations, so try to avoid them.
3. In our case we modified the code to separate the “iteration independent
code” from the “iteration dependent code”. After the modification our code looks
like this. The first For Loop is parallelizable; the second one is not, because
of its iteration dependent code.
4. Right-click on the border of the first For Loop and select “Configure
Iteration Parallelism”. The following dialog box will pop up.
5. In the above window we will refer to the “number of generated parallel loop
instances” as “T”. The maximum value that can be entered for T is 64.
6. Leave “Allow debugging” unchecked and “Automatically partition iterations”
selected.
7. Check “Enable loop iteration parallelism” and click OK. The P terminal will
appear on the For Loop border.
8. Create a control for this terminal and label it “P”.
9. The following table shows the execution time of our code with various
combinations of P and T.
Execution time with different values of “# of generated parallel loop instances” (T)

| P terminal input | T = 2 | T = 4 | T = 8 | T = 16 | T = 32 | T = 64 |
|---|---|---|---|---|---|---|
| 2 | 113.019s | 124.996s | 122.009s | 124.005s | 119.005s | 110.019s |
| Not wired (on Quad core) | 100.029s | 61.005s | 58.005s | 61.017s | 61.005s | 53.781s |
| 4 | 115.020s | 60.003s | 60.002s | 54.016s | 49.004s | 54.013s |
| 8 | 114.012s | 61.993s | 28.010s | 28.019s | 31.004s | 28.071s |
| 16 | 102.030s | 58.005s | 26.011s | 15.009s | 14.008s | 14.024s |
| 32 | 100.007s | 57.008s | 30.004s | 15.011s | 6.035s | 7.009s |
| 64 | 123.003s | 62.992s | 29.013s | 15.008s | 7.009s | 3.072s |
| 128 | 122.002s | 58.003s | 29.007s | 15.003s | 7.010s | 4.020s |
| 256 | 118.008s | 59.005s | 30.007s | 14.013s | 7.034s | 4.084s |
If you observe carefully, for any value of P > T the execution time is almost
the same as when P = T.
The 2nd and 3rd rows also show almost the same execution times. This indicates
that if you don’t wire any input to P, LabVIEW will use 2 on a dual-core
machine, 4 on a quad-core machine and so on. So we removed the second row when
plotting the variation of execution time with respect to P and T.
Figure: Variation of execution time along each column
Figure: Variation of execution time along each row
It is clear that increasing the number of threads (both T and P) reduces the
execution time. With the help of iteration parallelism (T = 64 and P >= 64) we
reduced the execution time from the measured 225 seconds (256 seconds in the
worst case) to about 4 seconds.
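The pattern in the table can be approximated with a small back-of-the-envelope model, sketched here in Python. It rests on two observations from the measurements (P above T behaves like P = T, and an unwired P behaves like the core count) plus one simplifying assumption of ours: every iteration takes the full 1 second timeout and the iterations are split evenly. The function names are hypothetical, and this is not a description of how LabVIEW actually schedules the loop instances.

```python
import math
import os

ITERATIONS = 256   # loop iterations in the article
TIMEOUT_S = 1.0    # per-iteration timeout in the article
MAX_T = 64         # maximum value the dialog accepts for T


def effective_workers(t, p=None, cores=None):
    """Instances that run at once: min(P, T); an unwired P uses the core count."""
    if not 1 <= t <= MAX_T:
        raise ValueError("T must be between 1 and 64")
    if p is None:
        p = cores if cores is not None else (os.cpu_count() or 1)
    return max(1, min(p, t))


def estimated_seconds(t, p=None, cores=None,
                      iterations=ITERATIONS, timeout=TIMEOUT_S):
    """Estimate: each worker sits through ceil(N / workers) timeouts in a row."""
    workers = effective_workers(t, p, cores)
    return math.ceil(iterations / workers) * timeout
```

For example, `estimated_seconds(64, 64)` and `estimated_seconds(64, 256)` both give 4.0, next to measured values between 3.072 s and 4.084 s, and `estimated_seconds(8, 32)` gives 32.0 next to a measured 30.004 s. The measured times are often a little lower, presumably because some connection attempts return before the timeout expires.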
Conclusions:
1. The parameter “T”, that is the “number of generated parallel loop
instances”, is the number of threads generated at compile time.
2. The “P” terminal controls the number of threads to use at run time out of
the available threads (T).
3. If you don’t wire any input to P, LabVIEW will use 2 on a dual-core machine,
4 on a quad-core machine and so on.
4. Wiring any value above 64 to P has no impact on the performance, because T
cannot exceed 64.
5. Since our example code is very small, creating 64 copies of the code does
not consume much memory, but if your code is big there is a chance of running
out of memory. In that case try reducing T.
6. Our example code has a delay (as part of the TCP Open Connection call), so
all 64 threads were able to run in parallel and we saw a significant performance
improvement. But most algorithms will not have any delay and will require the
CPU for each of the 64 threads. In such cases, you will not observe a
performance improvement greater than 2x or 4x, and there is even a chance of
performance degradation due to the overhead of thread handling.
7. In our example, using all 64 available threads gave us the best performance
(3 seconds), but this may not always be the case. The main limiting factors are
the available memory, the absence of delays in the code and the number of CPUs.
8. If there is no delay inside the For Loop, it is better to leave the P
terminal unwired, so that LabVIEW can decide how many threads to use based on
the number of CPU cores in the target machine.
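As a rough text-code analogy for the ideas above, the following Python sketch separates the iteration-independent work (probing each host) from the iteration-dependent work (keeping a running count), and picks a worker count along the lines of conclusions 6 and 8. The names `pick_workers` and `count_responders` and the `io_bound` flag are hypothetical; a Python thread pool is a stand-in for the parallel loop instances, not a description of LabVIEW internals.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def pick_workers(io_bound, requested=None, limit=64):
    """Choose a worker count: follow the core count unless the work mostly waits."""
    cores = os.cpu_count() or 1
    if not io_bound or requested is None:
        # No delay in the loop body (or nothing requested): use the core count,
        # which mirrors leaving the P terminal unwired.
        return min(cores, limit)
    # Waiting-dominated work can use more workers, capped like T at 64.
    return max(1, min(requested, limit))


def count_responders(hosts, probe_fn, workers):
    """Run the independent probes in parallel, then reduce sequentially."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Phase 1: iteration-independent work, safe to run concurrently.
        results = list(pool.map(probe_fn, hosts))
    # Phase 2: iteration-dependent work (a running count), kept sequential.
    count = 0
    for ok in results:
        if ok:
            count += 1
    return count
```

A call such as `count_responders(hosts, probe_fn, pick_workers(io_bound=True, requested=64))` would correspond to the T = 64, P = 64 configuration, where `probe_fn` is whatever function performs the connection attempt.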