Intel’s June 18 announcement of Xeon Phi finalizes go-to-market branding for the first desktop-supercomputing HPC coprocessor board to feature Intel’s MIC (Many Integrated Cores) technology. Xeon Phi’s 50-plus simplified x86 cores, communicating with a multicore host CPU across PCIe 3.0, promise a teraFLOP or more of double-precision floating-point compute capability in two slots. And that capacity can scale much further in clusters, as supercomputer maker Cray will demonstrate with its new Cascade architecture before year’s end.
But how do you exploit all this new power? The processors in such a system each offer several modes of parallelization. At the simplest level, they provide vector registers (512 bits wide in the case of MIC) whose contents (up to eight double-precision or 16 single-precision floating-point numbers) can be crunched all at once with SIMD (Single Instruction, Multiple Data) instructions. One level higher, each hyperthreaded core can run multiple hardware threads, which can interoperate via shared memory or message passing.
Clearly, if one is forced to do even a significant part of this housekeeping by hand, it becomes non-trivial (costly and technically difficult) to exploit either of these forms of parallelism individually, much less both together. Luckily, you don’t have to. Even without significant training in parallel programming, Intel compilers and software tools now make it possible to exploit SIMD vectorization and threadwise parallelism, and to do so on single-chip multicore architectures, on coprocessor-enhanced desktop supercomputers, and on larger compute clusters that pair conventional multicore CPUs with many-core coprocessors.
Not only that, but you can usually do so without severely altering a well-designed serial codebase. You can design and substantially debug your base application without worrying about parallelization at all, then parallelize by adding a small number of straightforward directives and letting the compiler do the heavy lifting.
The very simplest way to harness parallel power on Intel platforms is to work cookbook-style with products like the Intel Math Kernel Library: highly abstracted functions architected for optimal parallel performance. But an almost equally simple approach is to use Cilk Plus, a semantically economical toolkit that puts the power of SIMD vectorization and basic threadwise parallelism at the fingertips of C and C++ developers, offering high performance and significant freedom to optimize while hiding complexity and keeping code maintainable.
Cilk Plus is one of Intel’s Parallel Building Blocks, available with Intel Parallel Composer and Intel C++ Composer. It amounts to a very small set of language extensions plus an array notation for describing parallel operands. Threadwise parallelization is handily enabled by the keywords _Cilk_spawn and _Cilk_sync, used respectively to identify a function call that can execute in a parallel thread and to synchronize before using its return value. A _Cilk_for construct, meanwhile, enables parallelization of for loops that meet particular conditions.
The array notation, similar in some respects to for-loop syntax, enables direct, non-iterative expression of data-parallel operations, making it possible for the compiler to generate efficient vector code using SIMD instructions, without additional complex specification. Cilk Plus also provides a set of simple built-in reduction functions (e.g., add all the elements of an array), plus the ability to designate custom elemental functions: standard C scalar functions that the compiler can apply to many array elements at once.
The essentials of Cilk Plus can be learned in about 30 minutes, though getting optimal performance improvement takes a little practice, and profits from creating baseline serial code and running performance comparisons between it and the same code enhanced with Cilk Plus constructs. One excellent article detailing this technique, Getting Started with Intel Cilk Plus SIMD Vectorization and Elemental Functions, was recently posted by Intel’s Mark Sabahi and Xinmin Tian.
That article should provide good grounding for an upcoming webinar, slated for June 26, 2012, from 9:00–10:00 AM Pacific (noon–1:00 PM Eastern), during which Intel’s Brandon Hewitt will offer a basic introduction, enabling C++ programmers with no prior parallel-programming training to become productive immediately, improving the performance of serial applications by adding Cilk Plus constructs for vector and threadwise parallelism. Hewitt, a Technical Consulting Engineer with Intel, is a frequent contributor to Intel’s parallel development forums.