Unintended Consequences of Parallel Computing: How to Avoid Pitfalls

In this blog, we’ll take an in-depth look at parallel computing and examine some of the pitfalls that can keep otherwise great code from delivering the performance you expect.

Parallel Is Not Always Faster Than Sequential

There are times when changing code from sequential to parallel will slow it down. While this may seem surprising, if you think about it for a few minutes, it will make sense. Let’s say, for instance, that you have a loop that iterates 10 times. Now let’s say that you use OpenMP to parallelize the loop. The overhead OpenMP incurs creating the multiple threads will usually be far larger than the benefit of running the loop in parallel. Some examples follow.
When the sequential TestMe() function is called 1,000,000 times on a 3.4 GHz i7, it takes 3.07 seconds.

// Called 1,000,000 times takes 3.07 seconds
void TestMe( void )
{
    for( int i=0; i<10; i++ )
    {
        InitialValues[i] = cos( (double)i );   // InitialValues is a global buffer declared elsewhere
    }
}

If we want to parallelize the loop with OpenMP, we simply add #pragma omp parallel for as follows. When the refactored function is called 1,000,000 times, it takes 32.937 seconds. The obvious reason is that the overhead of creating the threads each time that the function is called costs far more than the savings of running the loop in parallel.

// Called 1,000,000 times takes 32.937 seconds
void TestMe( void )
{
    #pragma omp parallel for
    for( int i=0; i<10; i++ )
    {
        InitialValues[i] = cos( (double)i );
    }
}

Note that the TestMe() function is called 1,000,000 times and the loop iterates 10 times, for a total of 10,000,000 units of work. If we restructure this so that the loop iterates 1,000,000 times and the TestMe() function is called 10 times, the result is completely different. Calling the TestMe() function 10 times, with the loop now iterating 1,000,000 times, executes in 0.156 seconds.

// Called 10 times takes 0.156 seconds
void TestMe( void )
{
    #pragma omp parallel for
    for( int i=0; i<1000000; i++ )
    {
        InitialValues[i] = cos( (double)i );
    }
}
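
For reference, here is a minimal sketch of the kind of driver that could produce timings like these. The original post does not show its test harness, so the use of omp_get_wtime() and the size of InitialValues are assumptions:

#include <stdio.h>
#include <math.h>
#include <omp.h>

double InitialValues[1000000];              // assumed global buffer written by TestMe()

void TestMe( void );                        // one of the versions shown above

int main( void )
{
    double start = omp_get_wtime();         // OpenMP wall-clock timer

    for( int call = 0; call < 10; call++ )  // e.g., 10 calls of the 1,000,000-iteration version
    {
        TestMe();
    }

    printf( "Elapsed: %.3f seconds\n", omp_get_wtime() - start );
    return 0;
}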

This leads us to the conclusion that loops with few iterations are not worth the cost of parallelizing, while loops with many iterations are. The maxim: beware the short parallelized loop!

Avoid Writing to Shared Memory, or at Least Mitigate the Effect

When multiple threads write to the same memory location, a race condition can occur: it is not certain which thread wrote to the location last, so the results are unpredictable. For instance, if thread number one writes the value 17 to a shared variable and thread number two writes the value 24 to the same variable, the result is indeterminate. Without a synchronization mechanism, the final value can be either 17 or 24, depending on which thread wrote last.
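
As a minimal sketch of that situation (the OpenMP sections construct and the variable name are my own illustration, not from the original post):

int sharedValue = 0;

#pragma omp parallel sections
{
    #pragma omp section
    sharedValue = 17;           // "thread number one"

    #pragma omp section
    sharedValue = 24;           // "thread number two"
}

// sharedValue is now either 17 or 24; without synchronization,
// the program cannot rely on which write landed last.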

There are synchronization mechanisms to deal with race conditions, but they slow performance down, so other ways of mitigating race conditions should be found where possible. The first, and safest, is to restructure the code so that only one thread writes to a given memory location, or at least so that each thread writes to it at a completely different time than all the others.
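
To make that cost concrete, here is a hedged sketch (not from the original post) of a parallel sum protected with #pragma omp atomic. It is correct, but every thread serializes on the update, which is exactly the overhead the reduction shown below avoids.

double dSum = 0;
#pragma omp parallel for
for( int i=0; i<1000000; i++ )
{
    #pragma omp atomic
    dSum += cos( (double)i );   // correct, but threads take turns updating dSum
}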

Another way around code that causes race conditions is to give each thread a temporary variable in thread-local storage. Once all of the threads have completed their updates, the temporary values can be combined and the shared memory location updated. OpenMP has just such a mechanism for parallel loops: the reduction clause tells OpenMP to use a private copy of the variable for each thread and then combine the copies once all of the threads are complete. The following code uses it to parallelize the loop.

double dSum = 0;
#pragma omp parallel for reduction(+:dSum)
for( int i=0; i<1000000; i++ )
{
    dSum += cos( (double)i );
}

The maxim: tread with caution when accessing shared memory from multiple threads.

Do Not Over-Parallelize

There is always a temptation to parallelize nested loops. It stands to reason that if parallelizing the outer loop adds performance, then parallelizing both the outer and the inner loops will provide further gains. But this is not necessarily true. Let’s say, for instance, that the outer loop splits into four threads on a quad-core machine, and that the inner loop then splits into another four threads. You now have four threads per outer-loop thread, for a total of 16 spawned threads. On a quad-core machine, that means each core has four of your threads assigned to it. There is no way this will improve performance, since each core still executes the same number of CPU cycles regardless of the number of threads assigned to it.
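
To illustrate, here is a hedged sketch of what that over-subscription looks like (the loop bounds and DoWork() are hypothetical; note that OpenMP only honors the inner pragma when nested parallelism is enabled, for example with omp_set_nested()):

omp_set_nested( 1 );                    // allow nested parallel regions

#pragma omp parallel for                // outer loop: 4 threads on a quad-core CPU
for( int i=0; i<4; i++ )
{
    #pragma omp parallel for            // inner loop: 4 more threads per outer thread
    for( int j=0; j<1000; j++ )
    {
        DoWork( i, j );                 // placeholder for the real computation
    }
}
// Result: 16 threads contending for 4 cores, with no extra cycles to spend.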

There are exceptions to this warning. The first is when the inner loop requires a long time to execute; when this is true, the inner loop will split into threads, each of which has a core affinity. The second is when the computations are expensive; here again, the inner loop threads will execute with core affinities. And the third reason to throw caution to the wind is when you are targeting a system that has enough cores to service both the inner and outer loops. This leads us to the maxim for this section: carefully weigh the effect of parallelizing nested loops, and of over-parallelizing in general.

Beware of Compiler Optimizations

The compiler is almost always your best friend. I was once chastised by a Microsoft engineer for writing inline assembly to try to optimize a function. He told me categorically that I would not be able to outrun the compiler, and in almost all cases he was right.

But there is one case I would like to point out where a compiler optimization may not be safe for parallel code. The compiler may at times optimize a shared variable by keeping its contents in a register for faster access, writing the register back to memory only every so often. When this happens, the value in memory may not reflect the true value, since the true value might be sitting in a register, and another thread that reads the variable can get a stale result. This is one distinct case where a compiler optimization can introduce a serious bug in parallelized code.

There is a way to tell the compiler that a variable must always be read from and written to memory. If you declare it with the volatile keyword, the compiler will not cache the variable in a register, and every change to its value is visible in memory. The maxim: consider using the volatile keyword for variables that are accessed from multiple threads.
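
As an illustration, here is a hedged sketch (the names are my own, not from the original post) of the scenario described above, with one thread polling a flag that another thread sets:

volatile bool done = false;     // volatile: always re-read from memory

void Worker( void )
{
    while( !done )              // without volatile, the optimizer could cache
    {                           // 'done' in a register and never see the update
        // ... do a chunk of work ...
    }
}

// Elsewhere, another thread eventually signals completion:
// done = true;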

Conclusion

Writing parallel code can introduce a number of unintended effects. This blog should help you gain an understanding of the most common issues, including race conditions and over-parallelization, and how to approach fixing them.

Posted on May 5, 2015 by Rick Leinecker