Data Races: What They Are, How to Fix Them

I have talked a lot about parallelizing loops with OpenMP. It is an easy way to improve performance in your applications, especially if you can apply the technique to loops that run often or loops with many iterations. In many cases, OpenMP improves performance with no downside. But in other cases, parallelizing with OpenMP can introduce inconsistencies in calculations. Let me explain.

Consider the following loop. It sums the values of an array into a single variable. The loop runs 100,000,000 times, and each iteration adds one array element to the sum. The findSum() function that contains this loop returned a sum of -55,500,317.642823 and, in release mode, executed in 1092 milliseconds. I ran my test program several times and got exactly the same sum each time, with small variations in the elapsed time. Those variations can be explained by the coarse granularity of the GetTickCount() Windows API function that I used to time findSum().
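The listing itself is not reproduced here, so the following is only a minimal sketch of what such a serial function might look like. The parameter names `values` and `count` and the exact signature are assumptions for illustration, not the original code.

```c
/* Serial baseline (sketch): add every element of `values` into one
   accumulator. `values` and `count` are hypothetical parameter names. */
double findSum(const double *values, int count)
{
    double sum = 0.0;
    for (int i = 0; i < count; i++) {
        sum += values[i];   /* one thread, so updates happen in a fixed order */
    }
    return sum;
}
```

Because a single thread performs every addition in the same order, repeated runs over the same data give the same sum.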

Parallelizing the Loop
Now to our parallelization trick: we will use an OpenMP compiler directive to parallelize the loop and improve performance. To do this, all we need to do is decorate the for loop with the directive #pragma omp parallel for. The following code shows the updated version of the findSum() function.
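As a hedged sketch of that change (the signature and the names `values` and `count` are assumptions, not the original listing), the decorated loop might look like this. It is deliberately incorrect: it shows the version that goes wrong.

```c
/* WARNING: deliberately incorrect sketch. `sum` is shared by all threads,
   so the updates inside the loop race with each other.
   A signed loop index is used because older OpenMP implementations
   (OpenMP 2.0) require one. */
double findSum(const double *values, int count)
{
    double sum = 0.0;
    #pragma omp parallel for
    for (int i = 0; i < count; i++) {
        sum += values[i];   /* unsynchronized read-modify-write on shared sum */
    }
    return sum;
}
```

The pragma only takes effect when OpenMP is enabled at compile time (for example `-fopenmp` with GCC or Clang, or `/openmp` with MSVC); without it the loop runs serially and the problem does not show up.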

I started to pat myself on the back when I saw that the execution time was down to 284 milliseconds. This is 26 percent of the original execution time, a very significant improvement. Then my bubble burst when I saw that the sum was now 27,364,787.927133, not the sum I originally got. I ran the program again and got yet another answer: 1,985,173.656403. This anomaly negates the performance gain, since a fast calculation is of no use if its result is neither correct nor reproducible.

Data Race as the Explanation
The phenomenon that I experienced is known as a data race. It occurs when more than one thread accesses the same memory location at the same time without synchronization and at least one of them writes to it. Here, every thread updates sum. Each update reads the current value, adds to it, and writes the result back, so a thread can overwrite a value that another thread has just written, and that thread's addition is lost. Which updates survive depends on the exact timing of the threads, which varies from run to run. A tiny variation in timing could have thread A writing last, or thread B writing last. For this reason, data races must be eliminated in order to get accurate results.
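One way to picture how an update gets lost is to write out, by hand, the steps that two threads take when each executes an addition to the shared sum. The sketch below uses no real threads; the function name `lostUpdateDemo` and the variables `regA` and `regB` are hypothetical stand-ins for each thread's private copy of the value, and the statement order is just one unlucky schedule.

```c
/* Hand-written interleaving of two threads that each add to a shared sum.
   No real threads are involved: each step is written in the order that
   one possible bad schedule could produce. */
double lostUpdateDemo(void)
{
    double sum = 10.0;      /* the shared memory location */

    double regA = sum;      /* thread A loads 10.0 */
    double regB = sum;      /* thread B loads 10.0 before A has stored */
    regA += 1.0;            /* A computes 11.0 */
    regB += 2.0;            /* B computes 12.0 */
    sum = regA;             /* A stores 11.0 */
    sum = regB;             /* B stores 12.0, overwriting A's update */

    return sum;             /* 12.0 rather than the correct 13.0 */
}
```

With a different schedule, thread A's store could land last instead and the result would be 11.0, which is why the parallel sum changes from run to run.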

The way to solve a data race is to ensure that only a single thread accesses a given memory location at any one time. We can do this in a fairly intuitive way. First, we specify the number of threads that the loop will use by adding a clause to the compiler directive: #pragma omp parallel for num_threads(4). This tells us how large an array of separate sums we need, one sum value for each thread. Next, we use a sum array with four elements instead of the single sum variable. Finally, each thread indexes into the sum array with the value returned by the OpenMP omp_get_thread_num() function, so no two threads update the same element. Once the loop is complete, the four elements in the sum array must be added together to get the overall sum. The following code shows the updated findSum() function. The updated function returned the result we hoped for, -55,500,317.642823.
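The steps above can be sketched roughly as follows. This is an illustration under assumptions rather than the original listing: the names `values`, `count`, `sums` and `NUM_THREADS` are hypothetical, and the `#ifdef` fallback is added only so the sketch still builds when OpenMP is not enabled.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Fallback so the sketch builds without OpenMP: one thread with id 0. */
static int omp_get_thread_num(void) { return 0; }
#endif

#define NUM_THREADS 4   /* matches the num_threads(4) clause in the text */

double findSum(const double *values, int count)
{
    double sums[NUM_THREADS] = {0.0};   /* one partial sum per thread */

    #pragma omp parallel for num_threads(NUM_THREADS)
    for (int i = 0; i < count; i++) {
        /* Each thread writes only to its own element, so no element
           is updated by two threads. */
        sums[omp_get_thread_num()] += values[i];
    }

    /* Combine the partial sums serially after the parallel loop. */
    double total = 0.0;
    for (int t = 0; t < NUM_THREADS; t++) {
        total += sums[t];
    }
    return total;
}
```

If the runtime provides fewer than four threads, the unused elements simply stay at zero. One design caveat: the four partial sums sit next to each other in memory, so they may share a cache line, which can cost some performance even though the result is correct.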

Reducers to the Rescue
The code we used to solve the data race issue worked. The function consistently returned -55,500,317.642823, the same sum that findSum() returned before we parallelized the loop.
This solution takes some work, though, and makes the code harder to read. For this reason, OpenMP provides a mechanism for exactly this kind of data race: the reduction clause, often called a reducer. Instead of creating the array of sum values and then adding them together after the loop, you add a reduction clause to the compiler directive to tell OpenMP which variable the threads are combining and with which operator. OpenMP then gives each thread its own private copy of the variable and combines the copies when the loop ends. The compiler directive becomes #pragma omp parallel for reduction(+:sum). This solves the data race issue in a much simpler way, as the following code shows.
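A minimal sketch of the reduction version, again assuming the hypothetical parameter names `values` and `count` rather than reproducing the original listing:

```c
/* Sketch: reduction(+:sum) gives each thread a private copy of sum,
   initialized to 0, and adds the copies together when the loop ends. */
double findSum(const double *values, int count)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < count; i++) {
        sum += values[i];   /* updates this thread's private copy */
    }
    return sum;
}
```

The loop body is identical to the serial version; only the directive changes. Since floating-point addition is not associative, the order in which the partial sums are combined may make the last few digits differ from a serial sum on some data sets.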

Conclusion
Data races can render parallelized code useless. You can solve the problem yourself with per-thread partial results, or you can let OpenMP handle it with a reduction clause. Either way, the problem can be solved.

Posted on January 6, 2017 by Rick Leinecker, Slashdot Media Contributing Editor