Dividing Data with Cilk Plus

If you’ve been following my articles, you know you can use Threading Building Blocks (TBB) with blocked ranges to divide work into parallel chunks. That works well with TBB. But what about Cilk Plus?

Cilk Plus gives you less control over how your work is divided up. The tradeoff is that you get to use a simple syntax that’s very similar to regular loops in C++. But under the hood, Cilk Plus is busy breaking up the work for you.

Armed with a bit of knowledge about how the work gets divided, let's investigate what exactly Cilk Plus is doing. We won't do full performance profiling, but we will call into the Win32 API to get some timer results.

First, we need a simple program. Go ahead and create a new Win32 console program in Visual Studio with Parallel Studio installed. Switch to the Intel C++ Compiler by clicking Project -> Intel Parallel Composer 2011 -> Use Intel C++. Then insert this code into your main:

// Requires <windows.h>, <cilk/cilk.h>, <cilk/cilk_api.h>, and <iostream>.
LARGE_INTEGER timercount;
::QueryPerformanceCounter(&timercount);
cout << timercount.QuadPart << endl << endl;

unsigned long long start = timercount.QuadPart;
unsigned long long buffer[4096 * 2];

cilk_for (long i=0; i<4096; i++) {
    LARGE_INTEGER timercount2;
    ::QueryPerformanceCounter(&timercount2);
    buffer[i * 2] = (unsigned long)__cilkrts_get_worker_number();  // worker thread number
    buffer[i * 2 + 1] = timercount2.QuadPart - start;              // elapsed timer ticks
}

for (int i=0; i<4096; i++) {
    cout << i << ":" << buffer[i*2] << ":" << buffer[i*2+1] << endl;
}

This code uses a somewhat unusual Windows API function, QueryPerformanceCounter, which reads an accurate high-resolution timer. The goal here isn't to get the exact time of each loop iteration, but to get a feel for the order of the iterations.

I want to save the results, but I don't want to dump them to the console with each iteration, because console output is notoriously slow. Also, the console seems to be somewhat thread-safe, in that a single insertion appears to happen without interruption. But two insertions in a single statement like this:

cout << a << b;

can cause a yield to other threads. Either way, we don't want the console to make the threads run in a non-optimal manner. The goal is to watch what happens, and we don't want a computer equivalent of the Heisenberg uncertainty principle, whereby inspecting the system causes it to change! For the same reason, we also don't want to bog down the system with thread-safe containers (like those found in Threading Building Blocks). I want to run as close to the metal as possible, without outside influence. The call to QueryPerformanceCounter seems to have little impact, and I need at least some way of getting the time. So I'm willing to stuff the results into an array. For each iteration, the array stores two items: the current worker thread number and the elapsed time.

Then I dump the entire thing to the console. I called my program CilkProject1, and I ran it in the console like this:

CilkProject1.exe > cilk.txt

I opened the resulting file in a text editor. I’m not going to list it all here, of course, but I saw lines like this:


and later on lines like this:


The first number is the index of the iteration. The second number is the thread number, and the final number is the time. But look! If you try this yourself, you'll see the times are not all in order. I'm dumping the array out in index order, and for the most part the times increase, which means those iterations were done in order. But look what I see right in the exact middle:


This tells me that iteration 2048 ran before iteration 2047. In fact, if I scroll back in the list, I see that it ran around the same time as iterations 857 through 867.

In other words, the iterations are definitely not running in order. In fact, Cilk Plus uses a recursive divide-and-conquer algorithm to divide up the range.

In my results I see thread 8 is used for the first 1024 iterations. Apparently thread 8 was the first one spawned, and it was used for the first quarter of the elements, while the system was deciding how to allocate the rest of the data. Thread 0 was used starting with element 1024 (where 0 is the first element) up to and including element 1535, a total of 512 elements. That’s a power of 2 (as is 4096, which is why I picked that for the size). You can see how the system is dividing by two over and over as needed to break up the data. It continues that way.

Sorting by Time

A next step would be to sort the array by time. Instead of writing code to do this, I replaced the : character with a tab in my text file, pasted it into Microsoft Excel, and sorted by the third column. That showed me the order in which the iterations actually ran, and there it became very clear that the loop iterations weren't happening sequentially. Here's a screenshot:

Conclusion: No Telling How Iterations Divide

This proved to be an interesting test for several reasons. First, as we already knew, we can't count on iterations happening in order. But it also showed that the way the data was broken up wasn't as obvious as expected. While putting the test together, I tried different variations of the code and saw different chunk sizes when re-running the program without changing the code. Sometimes the data got broken into five chunks, other times six. The behavior seems to be nondeterministic, and that's the final take-home message: you can't count on how your iterations will be divided.

Posted on January 23, 2013 by Jeff Cogswell, Geeknet Contributing Editor