Intel Cilk Plus – Easiest Path to Parallel Prowess?

Intel’s June 18 announcement of Xeon Phi finalizes go-to-market branding for the first desktop-supercomputing HPC coprocessor board to feature Intel’s MIC (Many Integrated Cores) technology. Xeon Phi’s 50 abbreviated Xeon cores, communicating with a multicore host CPU across PCIe 3.0, promise a teraFLOP or more of double-precision floating-point compute capability in two slots. And that capacity can scale much bigger in clusters, as supercomputer maker Cray will demonstrate with its new Cascade architecture before year’s end.

But how do you exploit all this new power? The processors in the above-described supercomputer each offer several modes of parallelization. At the simplest level, they offer vector registers (512-bit registers in the case of MIC) whose contents — up to eight double-precision floating-point numbers, or 16 single-precision — can be crunched all at once with SIMD (Single Instruction, Multiple Data) instructions. One level higher, each core can run multiple hardware threads (two per hyperthreaded Xeon core, four per MIC core), which can interoperate via shared memory or message passing.
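To picture the data-parallel mode concretely, consider a generic multiply-add loop in C (purely illustrative). On hardware with 512-bit vector registers, a vectorizing compiler can turn each trip through this loop body into SIMD instructions that update eight doubles at a time:

    /* A daxpy-style loop: pure data parallelism. Each iteration is
       independent, so the compiler can process a full vector register's
       worth of elements (eight doubles at 512 bits) per instruction. */
    void daxpy(int n, double a, const double *x, double *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }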

Clearly, if one is forced to do even a significant part of this housekeeping by hand, it becomes non-trivial — costly and technically difficult — to exploit either of these forms of parallelism individually, much less both together. Luckily, you don’t have to. Even without significant training in parallel programming, Intel compilers and software tools now make it possible to exploit SIMD vectorization and threadwise parallelism – and to do so on single-chip/multicore architectures, on coprocessor-enhanced desktop supercomputers, and on larger compute-cluster aggregations that pair conventional multicore and many-core CPUs.

Not only that, but you can usually do so without severely altering a well-designed serial codebase. So you can design and substantially debug your base application without worrying about parallelization at all, then parallelize by adding a small number of straightforward directives and letting the compiler do the heavy lifting.

The very simplest way to harness parallel power on Intel platforms is to work ‘cookbook style’ with products like the Intel Math Kernel Library, a collection of highly abstracted functions architected for optimal parallel performance. But an almost equally simple approach is to use Cilk Plus – a semantically economical toolkit that puts the power of SIMD vectorization and basic threadwise parallelism at the fingertips of C and C++ developers, offering high performance and significant freedom to optimize while hiding complexity and keeping code maintainable.
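For a sense of the cookbook style, here is a minimal sketch (the tiny matrices are invented for illustration) in which a single call to MKL’s standard BLAS routine cblas_dgemm performs a threaded, vectorized matrix multiply tuned to the host architecture:

    #include <stdio.h>
    #include <mkl.h>

    /* C = A * B for 2x2 row-major matrices; one library call does a
       threaded, vectorized multiply with no explicit parallel code. */
    int main(void)
    {
        double A[4] = {1, 2, 3, 4};
        double B[4] = {5, 6, 7, 8};
        double C[4];

        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    2, 2, 2,     /* m, n, k       */
                    1.0, A, 2,   /* alpha, A, lda */
                    B, 2,        /* B, ldb        */
                    0.0, C, 2);  /* beta, C, ldc  */

        printf("C[0][0] = %f\n", C[0]);
        return 0;
    }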

Cilk Plus is one of Intel’s Parallel Building Blocks, available under Intel Parallel Composer and Intel C++ Composer. It amounts to a very small set of language extensions: a few keywords, plus an array notation for describing parallel operands. Threadwise parallelization is handily enabled by the keywords _Cilk_spawn and _Cilk_sync: the first marks a function call that may execute in a parallel thread, while the second waits for all spawned calls to complete before execution continues. A _Cilk_for construct, meanwhile, enables parallelization of for loops that conform to particular conditions.
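A minimal sketch of those keywords in action, using the classic recursive Fibonacci plus a trivial parallel loop (illustrative only; the <cilk/cilk.h> header also provides the friendlier cilk_spawn/cilk_sync/cilk_for spellings):

    #include <stdio.h>

    /* Recursive Fibonacci with Cilk Plus keywords. (Illustrative only;
       production code would fall back to serial below some cutoff.) */
    long fib(int n)
    {
        if (n < 2)
            return n;
        long x = _Cilk_spawn fib(n - 1);  /* may run as a parallel strand */
        long y = fib(n - 2);              /* continues in the caller      */
        _Cilk_sync;                       /* wait for spawned work        */
        return x + y;
    }

    int main(void)
    {
        /* _Cilk_for splits a conforming loop across worker threads. */
        double a[1000];
        _Cilk_for (int i = 0; i < 1000; ++i)
            a[i] = i * 0.5;

        printf("fib(30) = %ld, a[999] = %f\n", fib(30), a[999]);
        return 0;
    }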

The array notation, similar in some respects to for-loop syntax, enables direct, non-iterative expression of data-parallel operations, making it possible for the compiler to generate efficient vector code, using SIMD instructions, without additional complex specification. Cilk Plus also provides a set of simple intrinsic parallel functions (e.g., add all the elements of an array section), plus the ability to designate custom elemental functions — standard C scalar functions that the compiler can apply to many array elements in parallel.
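A sketch of the notation (array sections are written base[start:length]; __sec_reduce_add and the vector declaration are standard Cilk Plus, while the function and data here are invented for illustration):

    #include <stdio.h>

    /* An elemental function: ordinary scalar C that the compiler may
       map across many array elements with SIMD instructions.
       (__attribute__((vector)) is the Linux spelling.) */
    __declspec(vector)
    double axpy_elem(double x, double y, double k)
    {
        return k * x + y;
    }

    int main(void)
    {
        double a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        double b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
        double c[8];

        c[0:8] = a[0:8] + b[0:8];                /* whole-array add      */
        double total = __sec_reduce_add(c[0:8]); /* intrinsic reduction  */
        c[0:8] = axpy_elem(a[0:8], b[0:8], 2.0); /* vectorized elemental */

        printf("total = %f, c[0] = %f\n", total, c[0]);
        return 0;
    }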

The essentials of Cilk Plus can be learned in about 30 minutes, though getting optimal performance improvement takes a little practice, and profits from creating baseline serial code and running performance comparisons between it and the same code enhanced with Cilk Plus constructs. One excellent article detailing this technique, Getting Started with Intel Cilk Plus SIMD Vectorization and Elemental Functions, was recently posted by Intel’s Mark Sabahi and Xinmin Tian.
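A minimal harness along those lines might look like the following (a sketch assuming a POSIX timer; the loop body is invented for illustration, and a memory-bound loop like this one will show only modest gains compared with compute-heavy work):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 50000000L

    static double now(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void)
    {
        double *a = malloc(N * sizeof *a);

        double t0 = now();
        for (long i = 0; i < N; ++i)        /* serial baseline      */
            a[i] = i * 0.5;

        double t1 = now();
        _Cilk_for (long i = 0; i < N; ++i)  /* same loop, Cilk Plus */
            a[i] = i * 0.5;

        double t2 = now();
        printf("serial %.3fs, parallel %.3fs, speedup %.2fx\n",
               t1 - t0, t2 - t1, (t1 - t0) / (t2 - t1));
        free(a);
        return 0;
    }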

That article should provide good grounding for an upcoming webinar, slated for June 26, 2012, from 9:00 AM – 10:00 AM Pacific / Noon – 1:00 PM Eastern, during which Intel’s Brandon Hewitt will offer a basic introduction aimed at enabling C++ programmers with no prerequisite training to become productive immediately, improving the performance of serial applications by adding Cilk Plus constructs for vector and threadwise parallelism. Hewitt — a Technical Consulting Engineer with Intel — is a frequent contributor to Intel’s parallel development forums.

Posted by John Jainschigg, Geeknet Contributing Editor
2 comments
John Jainschigg

You raise an excellent point: One of the key ideas behind Xeon Phi, and Intel's MIC initiatives more generally, is to provide a standards-based compute cluster or cloud in silico. We raised this point ourselves in a recent blog ( http://goparallel.sourceforge.net/intel-mic-heads-market-1-tflop-xeon-phi-hpc-coprocessor/ ).

That's _why_ it's news. While the MIC architecture can indeed be used to churn lots of FLOPS -- one early prototype of MIC was actually referred to, internally, as "Flop Monster" -- FLOPS are only part of the point. What's more important is that Xeon Phi (plus sophisticated tools) lets developers quite painlessly port the current generation of multicore apps up to much larger numbers of cores, without rewriting for a cluster and taking a potential overhead hit for whatever architectural changes that requires; and port apps created for cluster environments down onto a single die, with all the reductions in message-passing overhead that monolithic silicon and a built-in switch matrix can provide.

This changes the economics of certain kinds of software development significantly, and thus may change the performance that individual users of these applications experience. It may, for example, motivate rapid improvement of market-leading graphics software (e.g., non-linear video editors) -- now adapted to multicore but taking less-than-full advantage of GPU-based coprocessor boards -- because the effort-to-benefit ratio in doing so for Phi is relatively small. So, instead of requiring a render farm, a video editor may be able to edit and post HD video on one tower PC, or, more likely, collapse what's now a multi-machine rendering setup onto one box. On the high side, it will certainly bring a class of scientific applications of the "embarrassingly parallel" kind -- e.g., in astrophysics, fluid dynamics, bioinformatics, medicine, etc. -- down to where they can be run on a standalone box in limited cases.

More generally, what Phi is doing will change the assumptions we make about supercomputing in the commercial sphere. Cray's upcoming Cascade machines, designed for conventional Xeons but shortly to incorporate Phi, will prove an interesting test case for determining how much Phi will affect supercomputing TCO: not only by improving application performance, but by improving programmer productivity manyfold as well.

So ... what's news about Xeon Phi is, in a sense, the fact that it's _not_ news. In this case, being not-new (in the sense that Phi can be programmed in a continuum with conventional multicore chips and clusters/clouds) may be a game-changing advantage.

admin

This type of thinking has been around since, I think, the '70s, with MPI and PVM...

Really based around clusters of monolithic machines, and supported from Windows 2000 onward as well...

It's just the hardware that changed a little bit...

Why is it news?