C++ Atomic Template: Inner workings in TBB Share your comment!


Threading Building Blocks includes a template class called atomic that helps you accomplish some atomic-level tasks. Under the hood these tasks are actually implemented at the processor level using assembly language. But you don’t need to do any assembly programming yourself. Let’s take a look at this interesting class. 

Buried down inside the Threading Building Blocks documentation is a template class called atomic. At first glance it appears simple  and nothing special, providing “read, write, fetch-and-add, fetch-and-store, and compare-and-swap.” Not much there, but in fact, it’s these are useful and highly optimized.

In my article about performing lazy initialization, I briefly mentioned this little template. Let’s take a closer look at it and see why it’s optimized for parallel programming.

Fencing Memory is Key

At the heart of the atomic type is a technique called memory fencing. There are a lot of highly technical explanations across the web of what exactly memory fencing is. But for a quick explanation, just think of it as a forced ordering. Two operations run in separate cores, with one putting up a fence to block other operations from accessing memory until the first thread is finished. Typically this means doing a couple of operations that must happen together as a single, atomic operation. For example, a thread might need to read the contents of some memory, add one to it, and write it back into memory, with no other threads making the change. This can actually be accomplished at the assembler-level, essentially implementing a very low-level mutex.

Threading Building Blocks uses a minimal amount of assembly language to accomplish parallel tasks. The atomic template is one such place, making use of fencing features of the Intel processor. Down in the source code are a few directories with processor-specific assembly code for accomplishing fences at the assembler level. The header file atomic.h includes several macros that make calls to this assembly code. What this means is that the code used in the atomic template is about as optimized as it can get. So I encourage you to make use of this deceptively simple template. 

The atomic template includes a member function called compare_and_swap. I briefly mentioned this function in the article about lazy initialization. Here’s more of the story:


When you look at the lazy initialization algorithms, specifically the one that doesn’t use mutexes, you might get a bit concerned that by not using mutexes, you could still end up with a race condition. In fact, you don’t, because template’s compare_and_swap function makes use of the assembly code that implements the fencing. (Check out the other blog for more info on what exactly compare_and_swap does.)

Intel processors provide built-in support for fencing through the use of a LOCK opcode that precedes certain other opcodes, including one called CMPCHG. This CMPCHG opcode does exactly what the compare_and_swap function does, but at the assembly level. TBB uses this opcode and adds in the LOCK instruction so that the compare_and_swap member function runs at the assembly level with a single instruction. And the LOCK opcode provides the hardware-level mutex we need. I’d say that’s pretty optimized! And it has minimal—if any—performance bottlenecks.


Another function the atomic template provides is fetch_and_add. This is also implemented at the assembly level with a LOCK instruction, and with a single instruction with two opcodes that does the work.

The idea behind fetch_and_add is to retrieve the value of a variable, add something to it, and store it back in. This is important in parallel processing because you want the whole thing to happen without another thread intervening before the task is complete.

For example, if two threads are using the same variable, and both threads need to add to the variable, you want the final value to be the original value plus both added-on values. But if one thread grabs the value, and the other thread grabs it too, before the first thread can write the new value back, then the final value will be wrong.

A common example of this is in reference counting. When a thread needs to access a variable, and the variable needs to keep a reference count by adding one to an internal integer, you don’t want a race condition to cause the increment to be wrong. The great thing here is that the processor can handle this at the assembler level, meaning it’s fast and easy to do. That’s how the fetch_and_add function works, and why it’s a key part of the atomic template. Again, this one also uses the LOCK opcode to implement a mutex of sort. 

Assembly Level Locking = Great Thread Performance

Next time we’ll look at some actual code examples of the atomic template, and how you can implement various techniques such as a reference counter. Since the atomic template makes uses of assembly level locking, it’s efficient and has excellent thread performance. And since it’s a template, you can use it to build your own types, resulting in code that’s both easy to use and highly optimized—all within a parallel environment. 

Posted on April 5, 2013 by Jeff Cogswell, Slashdot Media Contributing Editor
josvehv like.author.displayName 1 Like

Indeed, C++ Atomics can provide a very efficient implementation for some basic multi-threading issues. It is worth noting so, as many seem to be unaware of that. However:

a) For using atomics, you typically don't fall back to assembly. You preferably use for instance the available GCC intrinsics, or use the standard library calls as provided with the C11 standard. Atomics can provide efficient implementations for safe multi-threaded implementations of reduction expressions. Recognizing and handling reductions is an important concurrent design pattern.

b) Memory fences are in principle decoupled from atomic operations: you can usefully apply them with or without Atomics. For instance, they are also useful in combination with volatile variables. Furthermore, a memory fence operation enforces mutual ordering for the commitment to memory of load/stores before the fence with load/stores after the fence WITHIN one thread. Also for memory fence operations, the new C11 standardizes several types of calls and their semantics. Typically, memory fences and atomics (or volatiles) are used in combination to protect and enforce ordered access to other data. In particular, they are the elementary building blocks to implement semaphores.

Jos van Eijndhoven, Vector Fabrics.

LeslieSatenstein like.author.displayName 1 Like

Jeff, your article is much appreciated. I think though that the xor instruction may be useful in a non-assembler mode. Is not the xor a one unbroken cpu instruction?  xor of a character will change the bit pattern of that character as an atom operation.  

I do think that putting one or two xors into a operation as a "test" and or a "test and set" can be done to work in atomic mode. That is if we are using a processor that is not amd/intel architecture.


@LeslieSatenstein , thanks for that great thought. I'm going to investigate and see what I come up with. What you say sounds like it would probably work. (And perhaps it could be written in assembler, and packaged inside a C or C++ function.?) This might be a good topic for a future article as well. Thanks! --Jeff