The first time you write, compile and execute a two-process ‘hello world/hi world’ program on a concurrent operating system, it looks like a miracle. But a lot of that miraculous quality depends on the accidental fact that the simplest way of writing that program involves passing literals or using two different variables to hold the output strings, thus preventing contention. The very next program you write — usually some competing-operations thing that has two threads changing the same variable — makes you go ‘whoa,’ because you can see the car-crash happening when threads blindly do their thing without taking turns nicely. And that’s when they introduce the idea of locks, waits and spins.
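The car-crash is easy to reproduce, and so is the fix. The sketch below (standard C++ with `std::atomic`; the function name is my own, not from any tool mentioned here) has two threads bump a shared counter. Because each `++` on a `std::atomic<int>` is a single atomic read-modify-write, no increments are lost; run the same loop over a plain `int` and you have a data race that routinely drops updates.

```cpp
#include <atomic>
#include <thread>

// Run two threads that each increment a shared counter n times.
// std::atomic<int>::operator++ is an atomic read-modify-write, so no
// increments are lost. With a plain int this would be a data race
// (undefined behavior) and the result would usually be short of 2*n.
int count_with_two_threads(int n) {
    std::atomic<int> counter{0};
    auto work = [&counter, n] {
        for (int i = 0; i < n; ++i) ++counter;
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    return counter.load();
}
```

With the atomic counter the result is exactly `2 * n` every time; that determinism is the whole point.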
Locks are at once a great enabler of concurrent programming and a great hurdle to it. Understanding the different species of locks (waits, spinlocks, etc. – there are arguably at least seven) and how to implement and use them is a whole, vast science. And there’s another whole science about optimizing around them, because, as you quickly learn, making concurrent effectors wait on one another is a bad thing: the enemy of performance, efficiency and scale.
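To make one of those species concrete, here is about the smallest spinlock you can write in standard C++ (a minimal sketch of my own, not code from any tool named here): `test_and_set` is an atomic read-modify-write that returns the previous value, so exactly one thread at a time sees the flag as clear and enters the critical section, while everyone else busy-waits.

```cpp
#include <atomic>
#include <thread>

// A minimal spinlock. test_and_set atomically sets the flag and returns
// its previous value, so only the thread that observed `false` proceeds;
// the rest spin until clear() releases the lock.
class Spinlock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
public:
    void lock()   { while (flag_.test_and_set(std::memory_order_acquire)) {} }
    void unlock() { flag_.clear(std::memory_order_release); }
};

// A plain (non-atomic) counter is safe here only because the spinlock
// serializes every access to it.
int spin_guarded_count(int n) {
    Spinlock lock;
    int counter = 0;
    auto work = [&] {
        for (int i = 0; i < n; ++i) {
            lock.lock();
            ++counter;
            lock.unlock();
        }
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    return counter;
}
```

Note the cost model this implies: every failed `test_and_set` burns a core doing nothing, which is exactly the kind of waiting the rest of this piece is about avoiding.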
For this reason, programming without locks, where practical, can be a very big deal, and the discipline is now getting more attention, though it’s still not widely understood. In a practical sense, that may not matter much to many programmers adapting existing software, or producing new software, for multicore environments. That’s because compilers and tools (e.g., Intel Parallel Studio and Cilk Plus) are getting much more expressive. They let developers disambiguate their requirements and intentions, and they provide serial-friendly parallel semantics, specialized lock-free data structures and function templates, runtime libraries, and other aids that perform parallel operations in ways that exploit processor and OS features for lock-free arbitration, while accurately minimizing and demarcating the critical code segments where contention can occur.
In productivity terms, simple ‘bang for buck’ math suggests that it makes more sense, in the short term, to master Cilk Plus and its hyperobjects, and let their lock-free runtimes and data-parallel architectures (developed over years by people like Michael McCool) do your lock-free programming for you.
But … learning something about lock-free programming is both interesting and useful. Immediately useful, in that understanding more about what compilers, tools, template reducer functions, runtimes and other aids are doing will help you use them better; and speculatively useful, in that despite their sophistication, tools have limits, and you sometimes need to be able to go beyond them.
A good, quick place to start is a 2009 podcast interview by Intel’s Aaron Tersteeg with Dr. Clay Breshears, many-titled Intel threading guru and author of “The Art of Concurrency,” in which Dr. Breshears offers a simple but thought-provoking answer to a listener question about lock-free algorithms. Breshears focuses attention on the way processor-level CAS (Compare and Swap) atomic instructions can arbitrate access to shared resources in a more granular and efficient way than classic operating-system-level mutual exclusion, with all its overheads, around critical code segments. It’s a good introduction to how processor-level ‘lock free’ arbitration facilities work, and how they can save resources.
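Breshears’ point can be shown in a few lines of standard C++ (my own sketch, not code from the interview): a CAS loop reads the current value, computes an update, and commits it only if no other thread changed the value in between. On failure it simply retries with the freshly observed value; there is no OS-level lock, no context switch, and no thread ever blocks.

```cpp
#include <atomic>

// Lock-free "atomic max" via a classic CAS loop. compare_exchange_weak
// succeeds only if `target` still holds `old`; on failure it reloads
// `old` with the current value and we re-check the condition and retry.
// Returns the value observed just before our update (or the final read).
int atomic_fetch_max(std::atomic<int>& target, int value) {
    int old = target.load(std::memory_order_relaxed);
    while (old < value &&
           !target.compare_exchange_weak(old, value,
                                         std::memory_order_acq_rel,
                                         std::memory_order_relaxed)) {
        // `old` was refreshed by the failed CAS; loop re-evaluates it.
    }
    return old;
}
```

The same pattern generalizes: any read-compute-commit update can be arbitrated this way, which is the granular, mutex-free arbitration Breshears is describing.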
Another resource I recently stumbled on provides a technically sound, empirical and demonstration-rich introduction to lock-free programming from a developer who’s made lock-free parallelism his mission. Dmitry Vyukov lives in Moscow, has an MS with Honors in HPC from Bauman Moscow State Technical University, is an Intel Black Belt developer in parallel programming, and won the Grand Prize at the Master Level in the Intel Threading Challenge in 2010. For the past several years he has intermittently added to a blog resource on lock-free development called 1024cores, which is well worth a deep read: it now comprises most of a fairly contemporary practical textbook on the wide topic area of “lock-free, wait-free, obstruction-free synchronization algorithms and data structures, … (and) … scalability-oriented architecture.”
Vyukov writes clear, witty prose with just enough flavor to make his works interesting. And he has an attitude to match: this is not a guy who minces words and fails to make opinions clear. Example: “The most stupid thing one can do is try to implement everything in a non-blocking style (assuming) of course (that) you’re not writing infantile research paper, and not betting a money.”
And he backs it up by being a truly inspired programmer and a very good teacher. A thousand or so words into the site, he has already introduced the conceptual tools of Compare and Swap, Fetch and Add, Exchange, and atomic loads and stores, and has opened the door to thinking about the latter, rather than the read-modify-write operations, as the focus of interest in lock-free synchronization and synchronization-free algorithm design. He moves forward at this same no-nonsense pace thereafter, covering each topic in brutally essential terms while linking out to Intel Architecture manuals and other online resources for the details (there’s a clear Intel emphasis in the material, but SPARC and other architectures are referenced, along with, where appropriate, different language models and operating systems). The writing gets you smart, deep and complete fast enough that a working coder feels their time and pre-existing knowledge are both being respected. It’s the kind of site you really want to bookmark.
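That shift of focus from read-modify-write operations to plain atomic loads and stores can be sketched in standard C++ too (this example is mine, not from 1024cores): a release store ‘publishes’ ordinary, non-atomic data, and the matching acquire load guarantees the consumer sees it, with no CAS anywhere.

```cpp
#include <atomic>
#include <thread>

// Message passing with nothing but an atomic store and an atomic load.
// The release store on `ready` happens after the write to `payload`,
// and the acquire load that observes `ready == true` guarantees the
// consumer also sees the payload write. No read-modify-write needed.
int publish_and_consume(int value) {
    int payload = 0;                  // ordinary data, deliberately not atomic
    std::atomic<bool> ready{false};
    std::thread producer([&] {
        payload = value;                               // write the data first
        ready.store(true, std::memory_order_release);  // then publish it
    });
    while (!ready.load(std::memory_order_acquire)) {}  // wait for publication
    int seen = payload;                                // guaranteed visible
    producer.join();
    return seen;
}
```

Single-producer queues and similar structures on 1024cores are built from exactly this store/load pairing, which is why Vyukov treats it, rather than CAS, as the interesting primitive.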
In a near-future blog, I’ll be asking Dmitry Vyukov some questions about lock-free development on Intel platforms.