C++ Atomic Template: Inner workings in TBB Share your comment!


Threading Building Blocks includes a template class called atomic that helps you accomplish some atomic-level tasks. Under the hood these tasks are actually implemented at the processor level using assembly language. But you don’t need to do any assembly programming yourself. Let’s take a look at this interesting class. 

Buried down inside the Threading Building Blocks documentation is a template class called atomic. At first glance it appears simple  and nothing special, providing “read, write, fetch-and-add, fetch-and-store, and compare-and-swap.” Not much there, but in fact, it’s these are useful and highly optimized.

In my article about performing lazy initialization, I briefly mentioned this little template. Let’s take a closer look at it and see why it’s optimized for parallel programming.

Fencing Memory is Key

At the heart of the atomic type is a technique called memory fencing. There are a lot of highly technical explanations across the web of what exactly memory fencing is. But for a quick explanation, just think of it as a forced ordering. Two operations run in separate cores, with one putting up a fence to block other operations from accessing memory until the first thread is finished. Typically this means doing a couple of operations that must happen together as a single, atomic operation. For example, a thread might need to read the contents of some memory, add one to it, and write it back into memory, with no other threads making the change. This can actually be accomplished at the assembler-level, essentially implementing a very low-level mutex.

Threading Building Blocks uses a minimal amount of assembly language to accomplish parallel tasks. The atomic template is one such place, making use of fencing features of the Intel processor. Down in the source code are a few directories with processor-specific assembly code for accomplishing fences at the assembler level. The header file atomic.h includes several macros that make calls to this assembly code. What this means is that the code used in the atomic template is about as optimized as it can get. So I encourage you to make use of this deceptively simple template. 

The atomic template includes a member function called compare_and_swap. I briefly mentioned this function in the article about lazy initialization. Here’s more of the story:


When you look at the lazy initialization algorithms, specifically the one that doesn’t use mutexes, you might get a bit concerned that by not using mutexes, you could still end up with a race condition. In fact, you don’t, because template’s compare_and_swap function makes use of the assembly code that implements the fencing. (Check out the other blog for more info on what exactly compare_and_swap does.)

Intel processors provide built-in support for fencing through the use of a LOCK opcode that precedes certain other opcodes, including one called CMPCHG. This CMPCHG opcode does exactly what the compare_and_swap function does, but at the assembly level. TBB uses this opcode and adds in the LOCK instruction so that the compare_and_swap member function runs at the assembly level with a single instruction. And the LOCK opcode provides the hardware-level mutex we need. I’d say that’s pretty optimized! And it has minimal—if any—performance bottlenecks.


Another function the atomic template provides is fetch_and_add. This is also implemented at the assembly level with a LOCK instruction, and with a single instruction with two opcodes that does the work.

The idea behind fetch_and_add is to retrieve the value of a variable, add something to it, and store it back in. This is important in parallel processing because you want the whole thing to happen without another thread intervening before the task is complete.

For example, if two threads are using the same variable, and both threads need to add to the variable, you want the final value to be the original value plus both added-on values. But if one thread grabs the value, and the other thread grabs it too, before the first thread can write the new value back, then the final value will be wrong.

A common example of this is in reference counting. When a thread needs to access a variable, and the variable needs to keep a reference count by adding one to an internal integer, you don’t want a race condition to cause the increment to be wrong. The great thing here is that the processor can handle this at the assembler level, meaning it’s fast and easy to do. That’s how the fetch_and_add function works, and why it’s a key part of the atomic template. Again, this one also uses the LOCK opcode to implement a mutex of sort. 

Assembly Level Locking = Great Thread Performance

Next time we’ll look at some actual code examples of the atomic template, and how you can implement various techniques such as a reference counter. Since the atomic template makes uses of assembly level locking, it’s efficient and has excellent thread performance. And since it’s a template, you can use it to build your own types, resulting in code that’s both easy to use and highly optimized—all within a parallel environment. 

Posted on April 5, 2013 by Jeff Cogswell, Slashdot Media Contributing Editor