Much parallel programming is dependent on locks: you need to know a thread has completed work before another thread is permitted to run, so you implement a locking mechanism that the active thread can flip, and the waiting thread can check repeatedly until a desired state change is seen.
High-level parallel libraries, depending on the architectures they address and the problems they’re designed to solve, offer a whole hierarchy of locking abstractions, which are mapped to hardware features like the lock instructions on x86 and subsequent Intel processors.
These come in two main flavors selected by secondary opcode. The main and most-used flavor implements Compare-and-Swap (CAS) – an atomic (i.e. non-interruptible) way of getting a value from memory, computing a new value, and inserting it back in memory at the same location, if and only if the first value hasn’t meanwhile changed.
CAS has been mathematically proven (by Maurice Herlihy in 1991) to be effective for implementing all the common synchronization primitives (given enough memory) and in any case, more of them than other, similar atomic instructions such as Fetch-and-Add. An example of the latter is x86’s XADD, which tests and then increments a memory location only if the location hasn’t changed.
Normally CAS is the preferred instrument for crafting mutexes and similar idioms on Intel and most other hardware. But a recent blog by Dave Dice, who blogs for Oracle (https://blogs.oracle.com/dave/entry/atomic_fetch_and_add_vs), identifies several contexts in which fetch and add offers better performance, or avoids a performance hit.
To my mind, the most interesting of these occurs in conditions where a LOAD:CAS loop busy waits under significant contention, sufficient to convince the CPU to assume that the failure branch off the test condition is most likely to execute – when in fact, under normal circumstances, it would never be reached.
This, in turn, can cause the speculative execution machinery to bring in a large number of instructions, only to discard them as a mispredict when the lock opens. Dice makes the point that an XADD formulation of the lock and check would avoid looping, so it sidesteps this potential issue.
A related point was made by Intel performance tuning expert Martin Thompson at his presentation on lock free programming at the March QCON conference in London. Thompson notes that when both the main and exit branches of a conditional within a loop contain a computation, it may oblige the speculative execution mechanism to mispredict and end up executing and then discarding code the algorithm itself would never have reached. He explains in detail both how to access Intel CPU execution metrics in C, and how to use VTune and other tools, plus eyeballing code, to expose such flaws.
________________________________________________________________
John Jainschigg is a Geeknet contributing editor, and is CEO of World2Worlds, Inc., a digital agency focused on immersive technology and gaming. John’s initial intro to concurrency was via interrupt and re-entrancy programming at the assembler level on Z80 and 68000-based systems. He wrote concurrent, time-critical packet-switching applications on HP-UX RISC machines in the late 1980s, and since then has worked up and down the client-server stack in Java, C++, PHP, and other conventional and scripting languages, and more recently, in task-specific, state-based, radically concurrent languages like LSL.






