What is the Effect of Simultaneous OpenMP Loops?

OpenMP simplifies code parallelization, but can one overdo the use of this valuable tool? In this post, Slashdot Media Contributing Editor Rick Leinecker writes some gnarly code to find out whether running OpenMP loops simultaneously causes a performance hit.

I have spent a lot of time here at Go Parallel talking about OpenMP loops. The OpenMP standard provides simple compiler directives that allow you to easily and effectively parallelize loops. Many of the programs I write not only parallelize loops, but split tasks into separate threads. Sometimes my programs have two controlling threads, each of which contains parallelized loops. There are times when the parallelized loops occur at the same time, and I have often wondered what happens to performance. My concern is that the loops are each using several threads, and now those threads split the system cores between two OpenMP loops. This article explores this question, and arrives at some conclusions.
Creating Gnarly Functions
To get started in this exploration, I created two functions. Each function performs some gnarly math in a loop. Each loop iterates 10,000,000 times. The two functions can be seen below.
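The original listings aren't reproduced here, so the following is a minimal sketch of what the two functions might look like. The names DoGnarlyMath1() and DoGnarlyMath2() and the 10,000,000-iteration count come from the article; the arithmetic inside each loop is a stand-in (the real functions presumably do more work per iteration to reach the 18-second timings).

#include <math.h>

const int ITERATIONS = 10000000;

// Shared sinks so the compiler doesn't optimize the loops away in release mode.
volatile double gResult1 = 0.0;
volatile double gResult2 = 0.0;

void DoGnarlyMath1( void )
{
    double total = 0.0;
    for( int i = 1; i <= ITERATIONS; i++ )
    {
        // Stand-in for the original gnarly math.
        total += sqrt( (double) i ) * sin( (double) i );
    }
    gResult1 = total;
}

void DoGnarlyMath2( void )
{
    double total = 0.0;
    for( int i = 1; i <= ITERATIONS; i++ )
    {
        total += sqrt( (double) i ) * cos( (double) i );
    }
    gResult2 = total;
}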

Compiled in release mode, the DoGnarlyMath1() function executed in 18,578 milliseconds, and the DoGnarlyMath2() function executed in 18,562 milliseconds. You can see this in the following screen capture.

[Screen capture: execution times of the serial DoGnarlyMath1() and DoGnarlyMath2() functions]
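The article doesn't show the timing harness, but wall-clock milliseconds like these can be captured with something as simple as GetTickCount64() around each call. A sketch, not the original code:

// Requires <windows.h> and <stdio.h>
ULONGLONG start = GetTickCount64();
DoGnarlyMath1();
printf( "DoGnarlyMath1: %llu ms\n", GetTickCount64() - start );

start = GetTickCount64();
DoGnarlyMath2();
printf( "DoGnarlyMath2: %llu ms\n", GetTickCount64() - start );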

Parallelizing the Loops
We can go from pedestrian to elegant by adding OpenMP pragma directives. These let the compiler know that we want to parallelize the loops. The parallelized code follows.
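Here is a sketch of the same functions with the OpenMP directive added. The reduction clause keeps each thread's partial sum separate and combines them at the end; how the original code handled the accumulation is an assumption on my part.

void DoGnarlyMath1( void )
{
    double total = 0.0;

    // Split the iterations across the available cores.
    #pragma omp parallel for reduction(+ : total)
    for( int i = 1; i <= ITERATIONS; i++ )
    {
        total += sqrt( (double) i ) * sin( (double) i );
    }
    gResult1 = total;
}

void DoGnarlyMath2( void )
{
    double total = 0.0;

    #pragma omp parallel for reduction(+ : total)
    for( int i = 1; i <= ITERATIONS; i++ )
    {
        total += sqrt( (double) i ) * cos( (double) i );
    }
    gResult2 = total;
}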

The execution time was far less after parallelizing the loops, as you can see in the next screen capture. Also, please note that these functions were executed one after the other, which is different from the thread approach that we will take shortly.
// Calling the functions sequentially
DoGnarlyMath1();
DoGnarlyMath2();

[Screen capture: execution times of the parallelized functions, called one after the other]
Separate Task Loops
Now the rubber hits the road. I called the DoGnarlyMath1() function from one thread, and the DoGnarlyMath2() function from another. The code can be seen below. Please note that I had to alter the functions slightly so that they conformed to the Windows thread-procedure signature.
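Since the original listing isn't shown here, this is a sketch of how the two functions can be wrapped in Windows thread procedures and launched at the same time; the wrapper names are mine.

#include <windows.h>

// Wrappers so the functions conform to the DWORD WINAPI ThreadProc( LPVOID )
// signature that CreateThread() expects.
DWORD WINAPI GnarlyMathThread1( LPVOID lpParam )
{
    DoGnarlyMath1();
    return 0;
}

DWORD WINAPI GnarlyMathThread2( LPVOID lpParam )
{
    DoGnarlyMath2();
    return 0;
}

// In the calling function: start both loops simultaneously and wait for both.
HANDLE hThreads[2];
hThreads[0] = CreateThread( NULL, 0, GnarlyMathThread1, NULL, 0, NULL );
hThreads[1] = CreateThread( NULL, 0, GnarlyMathThread2, NULL, 0, NULL );
WaitForMultipleObjects( 2, hThreads, TRUE, INFINITE );
CloseHandle( hThreads[0] );
CloseHandle( hThreads[1] );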

Since the two threads shared the cores of my system, the results showed that the functions took twice as long as when each was called separately. You can see the results in the screen capture below.

[Screen capture: execution times when the two parallelized functions run in separate threads at the same time]
Conclusion
The OpenMP standard provides an incredible mechanism for parallelizing loops. But the magic can be diluted when OpenMP loops are executed simultaneously.

Posted on January 3, 2017 by Rick Leinecker, Slashdot Media Contributing Editor