Chunk Power: Using Threading Building Blocks for Parallel Reduction in Mathematical Algorithms

Over the past several weeks, I’ve covered reduction extensively using Threading Building Blocks (TBB). Part of that discussion involves breaking up an array into blocked ranges. This week I’m going to show you an actual application of all this using mathematics. We’ll look at how these concepts can be used to calculate a definite integral numerically. 

One way to do so is to note that the value of the definite integral is the area under the function’s curve within the bounds of the integral. If you imagine dividing that shape into hundreds of small rectangles, you can approximate the area simply by adding up the areas of the rectangles. To make this work, we’ll define the height of each rectangle as the value of the function at the horizontal center of the rectangle. Divide the area into hundreds of such rectangles and you’ll get a close approximation; divide it into thousands and you’ll get closer; millions, and you’ll get closer still.

But before we get started, I should mention that I’ll be using double-precision floating point in our C++ code. It’s not perfect, but it’s workable for the precision needed here. If there’s interest from readers, perhaps later we can explore using higher-precision libraries with parallel programming. For today we’ll stick to double, which means the numbers we end up with might not be quite as good as we’d like.

For example, at first I was going to demonstrate the calculation of pi using the definite integral of 4 / (1 + x^2) from 0 to 1. But I quickly hit some precision problems. The smaller you make the rectangles, the larger the magnitude difference between the width of the rectangle (which might be, for example, .0000000001) and the function value (which, in this case, might be around 4.0). When you combine numbers of such different magnitudes across hundreds of millions of operations, round-off accumulates, and you can end up with a number that bears little resemblance to the exact answer. That’s what happened with the pi function. There are certainly ways around this, including using a higher-precision library, but for now let’s keep it simple. So I decided to switch to a different function that behaves a bit better.

Let’s use (sin x) / x, taking the integral from 0 to 20. You can inspect this function using Wolfram Alpha. If you create an account there, you can also find the value of the integral to many digits, which we’ll use as a comparison for our calculations:



Here’s the serial version of the code that I put into Parallel Composer. I’ll create my rectangles of width 0.0000001.

#include <cmath>
#include <iostream>

double mathfunc(double x) {
    return sin(x) / x;
}

double mathfuncArea(double x, double width) {
    return width * mathfunc(x + width / 2);
}

void integrate_serial(double width) {
    double area = 0;
    double end = 20;
    for (double i = 0; i < end; i += width)
        area += mathfuncArea(i, width);
    std::cout << area << std::endl;
}

The entry point is integrate_serial, which calls the other functions. When I ran this on my computer, it took 3458 milliseconds, and the value I got was 1.5482417032606.

Now let's create a parallel implementation. Remember, ranges don't need to be arrays. Indeed, the blocked_range template just keeps track of a starting number and an ending number, and those can be of any type that supports the necessary arithmetic and comparison operations. In our case, it will be a starting double and an ending double. Here's our reduction body class; it will call the same mathfunc and mathfuncArea functions as before:

#include "tbb/tbb.h"
using namespace tbb;

class Integral {
public:
    double value;
    Integral() : value(0) {}
    Integral(Integral &i, split) : value(0) {}
    void operator()(const blocked_range<double> &r) {
        const double width = 0.0000001;  // same slice width as the serial version
        double temp = value;
        for (double i = r.begin(); i < r.end(); i += width)
            temp += mathfuncArea(i, width);
        value = temp;
    }
    void join(Integral &rhs) { value += rhs.value; }
};

This works almost the same as the parallel_for I've covered in the past. But look at the loop. Instead of simply adding 1, I add the tiny double value. And I call the mathfuncArea function just like before.

Here's code to try it out:

void integrate_parallel() {
    Integral total;
    parallel_reduce(blocked_range<double>(0.0, 20.0, 1.0), total);
    std::cout << total.value << std::endl;
}

I create a new instance of my Integral object and call the parallel_reduce function template. I specify the value type used throughout (double), the start and end points, and the grain size. Notice the grain size; it isn't an element count like 10,000 as before. Instead, it spans the range in the same units as the doubles themselves. I chose 1.0, which yields on the order of 20 blocks (TBB splits the range recursively, so the exact count can differ); feel free to try some other values yourself to see where you get the best thread performance.

The first time I ran this it went much faster than the serial version, at 843 milliseconds, about a four times improvement. Pretty good. As for the final value, it was very close but not exactly the same as the serial version:


The likely reason for this is rounding and precision errors. We're not adding the tiny slices in the same order as before, and when round-off is involved, floating-point addition isn't truly associative: different round-off errors get introduced at different points depending on the order of the operations. So again, we'd be better off with an external high-precision floating-point library.

Bottom Line: parallel_reduce Provides Big Improvement, Less Precision

That's how you use parallel_reduce on a mathematical formula instead of an array. We saw a significant improvement. But the number we got wasn't exactly the same; that's likely due to rounding and precision errors. If there's interest, in the future we can work in a high-precision library.

What do you think? Have you had any luck using high-precision libraries and Threading Building Blocks? I'd love to hear from you in the comments below.

Posted on October 29, 2012 by Jeff Cogswell, Geeknet Contributing Editor
  • Jos van Eijndhoven

    Why would you move to high-precision for the parallel case? The parallel version is just as precise as the original sequential version. It is NOT less precise as you state in Bottom Line: The original version was also inaccurate, maybe you merely didn’t notice.

  • Pawel

    Hi, how many cores have you used on your machine – 4? Have you tried parallelization in Java? I was heavily surprised when the performance improvement of using two threads over one was only around 50%, 3rd improved by 20% and 4th did nothing. I have tested it on a simple counting loops, even with disjoint data (they were counting on their own). That might suggest the task scheduling is not very effective or more likely I don’t know how to implement it efficiently :). Any suggestions would be helpful