Parallelizing applications for scalable performance can be a daunting task, not least because multi- and many-core processors make you think in two separate directions. One skill set is required to address threading – analyzing the workload, spawning threads to do the work while preventing races and other forms of contention, retrieving and ordering results, and so on. Another skill set is required to address vectorization: how to exploit powerful built-in instructions that process arrays of data elements stored in vector registers.
The latter can be done with inline assembler, of course. But that’s difficult, and won’t scale forward easily as vector registers grow wider over successive chip generations (in today’s multicore CPUs, 256-bit-wide vector registers are the norm; on Xeon Phi, with its far greater number of cores, vectors are now 512 bits wide). Auto-vectorization – code analysis and vectorization by the compiler – offers a potential stopgap solution for some parts of some codebases. But native serial code semantics are often too ambiguous – containing too many implicit dependencies – for today’s compilers to vectorize directly.
A better answer, as more and more C/C++ developers are discovering, is Intel Cilk Plus: an extended syntax with compiler enhancements and a runtime engine (plus associated tools) for building parallel applications that exploit vectorization and multithreading in limited, but extremely powerful ways.
C/C++ developers’ best bet
We’ve written about Cilk Plus quite a bit on this site, shown some examples of its use, and offered links to written resources on the product. On June 26, however, Intel presented a webinar entitled “Improve Performance With Intel Cilk Plus” (Windows only, requires GoToMeeting codec; webinar slides available here) on the topic, led by Senior Technical Consulting Engineer Brandon Hewitt, a 12-year veteran of Intel’s C++ compiler program. The webinar stands as a comprehensive single-source backgrounder on what Cilk Plus is, how it works, and why it’s probably C/C++ developers’ best bet – both for putting existing serial applications on the parallel train, and for developing certain classes of new application for the long term.
Hewitt begins with a short explanation – liberally cribbed above – for why a product like Cilk Plus is needed. While his detailed remarks don’t bear easy summary, the gist is that Cilk Plus provides an easy-to-understand set of semantic conventions whereby programmers can take serially expressed procedures and functions, disambiguate, and make assertions about them – making them deterministic – so that the compiler and the runtime engine can parallelize and vectorize them safely. Doing so, meanwhile, doesn’t disrupt the serial semantics of the program. Indeed, Cilk Plus actually generates a serial version of deterministic source that’s useful in debugging, and that, in theory, should perform virtually identically to the Cilk Plus-parallelized version, limited to execution on a single worker thread.
Huge win for programmers
From the programmer’s viewpoint, this is clearly a huge win – with very little and fairly simple actual coding, you can collaborate with your compiler to quickly produce optimally performing parallel applications, rather than watch it kick back complaint after complaint as it tries to interpret your intentions in failed attempts to auto-vectorize things you already know should work.
For developers, Cilk Plus presents very minimalistically. For managing threads, only three keywords are required – cilk_spawn and cilk_sync, plus cilk_for, used for parallelizing for loops – to parcel work out onto available cores: a conceptually simple fork/join process. Hewitt notes that Cilk actually imposes implicit syncs in several other places (e.g. at the end of functions, and before and after try/catch blocks and other code blocks that use cilk_spawn) in order to enforce composability. Part of the huge benefit of Cilk Plus is that when you use it to create a parallel function, the calling program doesn’t need to know that function is parallel or be concerned about interactions with other functions.
Dispatching work – in effect, architecting the compiler’s solution over available hardware in real time – is the job of the Cilk Plus runtime. While Hewitt didn’t go into under-the-hood details about the runtime, he made several important points about how it works. He noted that the runtime spawns worker threads in advance of Cilk Plus statements being executed, doing so in a way that supports smooth scaling across highly variable numbers of cores and hyperthreading facilities. He noted several times that the runtime normally coordinates threads more efficiently than conventional mutex approaches permit, and so offers the potential for higher performance and more linear scaling. He also noted that the Cilk Plus runtime – in order to ensure composability and compatibility of Cilk Plus with Intel Threading Building Blocks – needs to be deterministically in control of spawning and dispatching work to threads. In fact, he clarified that cilk_spawn is not actually a spawn: it indicates that the developer has given the runtime permission to thread-parallelize a code block – an interesting insight.
According to Hewitt, assisted by the deterministic semantics of Cilk Plus, the compiler and runtime can collaborate to make intelligent decisions about how to parallelize certain constructs. For example, pragmas are available to help the compiler judge whether the workload of each iteration within a cilk_for loop is sufficient to warrant parallelizing the loop.
Array notation for vectorization
Hewitt offered a complete explanation, with examples, of Cilk Plus’ array notation for vectorization. Of particular interest was his explanation of the stride and rank arguments: stride lets an array section address non-contiguous elements, while rank extends the notation to multidimensional arrays. Indexed sections map to the CPU’s indexed-vector load and store instructions, supporting gather/scatter and similar approaches to sparse linear algebra problems in image processing and other fields.
One of the most interesting parts of Hewitt’s talk was on Cilk Plus HyperObjects: task-specific parallelization frameworks for reducers. Hewitt offered a fascinating example of how naïve attempts to parallelize a simple reduce function with cilk_for can result in data races. He also showed why conventional, mutex-based parallelization techniques are inefficient for solving these problems: the granularity of the payload is too small to justify the lock overhead, so you can end up with performance that’s worse than serial. But when you apply one of Cilk Plus’s HyperObject reducers (in this case, a summing reducer over type int) at the beginning of the function, without changing any of the ‘naïve’ code, it wraps a reducer framework around the function that enables parallelization without the race contention, but also without the lock overhead, giving you optimal performance. Cilk Plus supports a wide range of HyperObjects for common reducer-type operations, and you can write your own based on supplied prototypes.
There’s even – as one clever questioner brought out – a ‘holder’ HyperObject that, as Hewitt explained, is basically a reducer that doesn’t reduce anything, but merely stores and enables threads to operate on a unique view of a dataset – in effect, a form of thread-local storage.
Hewitt is straightforward about Cilk Plus’ limitations. It’s not a complete threading solution (Intel Threading Building Blocks is, however – and it’s fully compatible with Cilk Plus). It’s not a cluster solution (for that, we look to Intel Cluster Studio). But for developers confronted with the need to parallelize an existing codebase, or to create new code rapidly that will scale forward and upward nicely onto ever-more-powerful generations of many-core processors, it’s a very clean and well-designed solution.
John Jainschigg is a Geeknet contributing editor, and is CEO of World2Worlds, Inc., a digital agency focused on immersive technology and gaming. John’s initial intro to concurrency was via interrupt and re-entrancy programming at the assembler level on Z80 and 68000-based systems. He wrote concurrent, time-critical packet-switching applications on HP-UX RISC machines in the late 1980s, and since then has worked up and down the client-server stack in Java, C++, PHP, and other conventional and scripting languages, and more recently, in task-specific, state-based, radically concurrent languages like LSL.