Tutorial: Understanding Intel’s New Processors and Tools

Intel Evangelist James Reinders explains how software development tools support features of current and emerging Intel hardware, including the 60-core Xeon Phi Coprocessor, now shipping, and products based on the code-named Haswell microarchitecture, scheduled to emerge next year. This high-level tutorial summarizes key points, including new programming features of the 3rd generation Intel Core™ Processors and Intel® HD Graphics, new acronyms such as AVX and TSX, and support via C, C++, Fortran, OpenCL and OpenMP.

Context: Developer tools and hardware are co-evolving in several interesting ways. The most obvious is that hardware now supports more cores and wider vector registers. This, in turn, supports more kinds of parallelism, e.g., data parallelism (vectorization), task-based parallelism, and distributed parallelism. The first goal of this evolutionary cadence is to let programmers think about scaling performance efficiently, converting serial software into new forms that exploit all these parallel capabilities. For this to happen efficiently, tools need to lead the hardware.

Intel’s new suite of tools, including Intel Parallel Studio XE 2013 and Cluster Studio XE 2013, contains compilers for C/C++ and Fortran, plus a host of toolkits for abstracting parallelism, such as Intel Cilk Plus and Threading Building Blocks for C/C++, the Intel Math Kernel Library, Fortran-specific coarrays and other tools. It also includes MPI tools, aids for evaluating and planning parallel programming projects, and tools for parallel debugging and for memory, thread and other kinds of performance analysis.

The latest versions of these tools already support all existing and expected Intel target architectures through Haswell and Xeon Phi. Developers can use them immediately to code and tune for legacy chips, for popular contemporary Intel platforms such as those based on the Ivy Bridge microarchitecture, and for more powerful but not yet widely available hardware: manycore coprocessors (Xeon Phi) and upcoming Haswell products. They also support programming language and coding-model standards like C++11, Fortran 2008 and MPI 2.2, plus added features and functionality inspired by the needs of the software developer community.

This rigorous focus on supporting contemporary, emerging and future platforms and standards is becoming more critical, partly because hardware is evolving to provide more and more parallelism with forward code portability. Products like Xeon Phi use x86-compatible cores, so they can, in principle, run legacy software directly, a strong factor distinguishing them from competing GPU-based products.

In principle, the performance boosts enabled by products like Xeon Phi are epic. A Xeon Phi-based system was ranked #150 among the Top500 performing computers in the world (early this summer, with beta hardware and software), using coprocessors with 60 cores, 240 threads and 512-bit-wide vectors to kick out 118 TFlops. Xeon Phi has also proven one of the most power-efficient processors in its class: more efficient than any other x86-based or GPU-based device, and edged out only by certain IBM Blue Gene systems.

But the potential for forward portability raises the stakes. It suggests that legacy codes will be able to exploit at least some of the power of these new devices, and that developing new codes will require limited new learning, effort and expense. This promise can only be fulfilled if tools can overcome the intrinsic problems of code complexity and overhead that adding many more cores brings to the table.

The question is not solely one of code compatibility, but also of the forward-looking cultivation and preservation of useful skills for programmers. Intel’s philosophy is that once you’re familiar with using a tool like Intel VTune Amplifier XE to tune code on four- or eight-core CPUs, you should be able to use that tool, and those same skills, to tune code on manycore systems. But there are also changes and improvements as the hardware and software evolve.

Second-generation Advanced Vector Extensions (AVX2) are offered in the Haswell microarchitecture. These are designed to facilitate fast, bit-level operations on wide vectors of data for tasks like encryption and graphical analysis: operations like shift-based multiplies, insert/extracts, masked load/store operations, pack/unpack, shuffle, permute, etc. Intel parallel programming models and tools, including Cilk Plus, the Intel Math Kernel Library, Fortran vectorization and Integrated Performance Primitives, all access the new AVX2 functionality.

The point is to free programmers to make intelligent choices. If you only want your program to run on Haswell-type supercomputers, you can compile down to the latest AVX2, and possibly get faults or errors if you try to deploy on systems without AVX2 capability. But the tools also enable you to link in a version of the Math Kernel Library that checks at runtime whether AVX2 is available and uses it if so. Tools that work this way help eliminate the absolute need to support SSE or AVX separately, with the resulting complexity and code bloat.
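
The runtime check described above can be sketched in a few lines of C++. This is only an illustration of the dispatch idea, not how the Math Kernel Library is implemented. It assumes GCC or Clang on x86, whose __builtin_cpu_supports builtin and target function attribute it relies on, and it falls back to a plain scalar loop everywhere else. The names add_arrays, add_avx2, add_scalar, cpu_has_avx2 and SKETCH_X86_DISPATCH are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: only GCC/Clang on x86 get the AVX2 path; other builds use the fallback.
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define SKETCH_X86_DISPATCH 1
#else
#define SKETCH_X86_DISPATCH 0
#endif

// Portable fallback: one element per iteration.
void add_scalar(const std::int32_t* a, const std::int32_t* b,
                std::int32_t* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

#if SKETCH_X86_DISPATCH
// AVX2 path: eight 32-bit integers per iteration. The target attribute lets
// this one function use AVX2 while the rest of the file is built for a
// baseline CPU.
__attribute__((target("avx2")))
void add_avx2(const std::int32_t* a, const std::int32_t* b,
              std::int32_t* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i));
        __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b + i));
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(out + i),
                            _mm256_add_epi32(va, vb));
    }
    for (; i < n; ++i) out[i] = a[i] + b[i];  // leftover tail elements
}
#endif

// True only when the CPU the program is running on reports AVX2 support.
bool cpu_has_avx2() {
#if SKETCH_X86_DISPATCH
    return __builtin_cpu_supports("avx2") != 0;
#else
    return false;
#endif
}

// Single entry point: picks the widest path the current machine can execute.
std::vector<std::int32_t> add_arrays(const std::vector<std::int32_t>& a,
                                     const std::vector<std::int32_t>& b) {
    assert(a.size() == b.size());
    std::vector<std::int32_t> out(a.size());
    if (cpu_has_avx2()) {
#if SKETCH_X86_DISPATCH
        add_avx2(a.data(), b.data(), out.data(), out.size());
#endif
    } else {
        add_scalar(a.data(), b.data(), out.data(), out.size());
    }
    return out;
}
```

Callers only ever see add_arrays, so the same binary can be deployed on machines with and without AVX2; the cost is one feature check per call, which a real library would likely cache.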

Intel Transactional Synchronization Extensions (TSX) also make a first appearance in the 2013 Haswell microarchitecture. One of the classic problems in parallel programming is the need to synchronize: sharing data between multiple threads means that you need locks and mutexes to prevent data from being updated out of sequence. But as your code grows and data sections become larger, the tendency is to put locks around bigger and bigger sections of code and data, creating situations where threads are locked out of work for long periods of time.

Historically, the burden has been on programmers to make their locks fine-grained. So if you have a large data structure like a hash table, current best practice is to lock just portions of the table to reduce the likelihood of collisions. Unfortunately, that kind of code gets more and more complex, and the problem is compounded when empirical evaluation of code behavior shows, as it often does, that collisions are infrequent.
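
To make the fine-grained approach concrete, here is a minimal sketch of a hash-based counter table that locks only a portion ("stripe") of the table per operation, so threads touching different stripes do not block one another. It uses only the C++ standard library; the class name StripedCounter and the stripe count of 16 are hypothetical choices for illustration, not taken from the talk.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>

// A counter table split into independently locked stripes.
class StripedCounter {
public:
    // Adds one to the count for key, locking only the stripe that owns it.
    void increment(const std::string& key) {
        Stripe& s = stripe_for(key);
        std::lock_guard<std::mutex> guard(s.lock);
        ++s.counts[key];
    }

    // Returns the current count for key, or 0 if it was never incremented.
    long get(const std::string& key) const {
        const Stripe& s = stripe_for(key);
        std::lock_guard<std::mutex> guard(s.lock);
        auto it = s.counts.find(key);
        return it == s.counts.end() ? 0 : it->second;
    }

private:
    static constexpr std::size_t kStripes = 16;  // assumed; tune per workload

    struct Stripe {
        mutable std::mutex lock;                       // guards counts only
        std::unordered_map<std::string, long> counts;
    };

    // A key always hashes to the same stripe, so one lock covers all its updates.
    Stripe& stripe_for(const std::string& key) {
        return stripes_[std::hash<std::string>{}(key) % kStripes];
    }
    const Stripe& stripe_for(const std::string& key) const {
        return stripes_[std::hash<std::string>{}(key) % kStripes];
    }

    std::array<Stripe, kStripes> stripes_;
};
```

Even this small example carries extra machinery (stripe selection, per-stripe state) that a single coarse lock would not need, and any operation spanning several stripes, such as resizing or summing all counts, would require further care. That added complexity is the cost the text refers to.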

TSX introduces two solutions to this syndrome, both based on hardware that can monitor collisions and, in the rare situations where they occur, back up and ‘do the right thing.’ The first solution, called Hardware Lock Elision (HLE), is said to be a simple, completely backward-compatible extension accessed via instruction-prefix hints that older processors simply ignore. The second approach, called Restricted Transactional Memory (RTM), is less backward compatible and obliges the programmer to provide an alternative code path for use when collisions occur. The upshot? These advances let you write a coarse-grained lock and get fine-grained performance when collisions aren’t common. (More on this at http://tinyurl.com/haswell2013)

Tuning parallel code for performance on virtual machines running on Intel hardware is another important change, and one that is becoming increasingly practical. VMware Fusion 5, for example, can now expose performance-monitoring counters, so you can use VTune to tune performance of applications running on VMware virtualized hardware environments. Several years ago, VTune was upgraded to recognize when it was running on a VM, and skip installing the driver to access performance counter data. Now this has turned around so that VTune can be used to look at things like cache misses and other critical performance analytics. (More on the subject at http://tinyurl.com/vmwarevtune.)

Intel’s VTune Amplifier XE offers new analysis features, including the ability to analyze CPU power consumption and optimize code for efficient power use in situations ranging from mobile to server applications. The new features let you identify reasons and rates for CPU wake-ups caused by timers and interrupts, look at source code for events that wake the processor, and figure out how much processor sleep is happening over time and at different sleep levels.

Vectorization – something either very well or very poorly supported by tools today – is central to all this. At the simplest level, all Intel compilers can auto-vectorize code up to a point: recognizing situations where conflict-free execution of loops is possible, moving data appropriately and converting code to SIMD instructions. But auto-vectorization is far from a complete solution, because the compiler will frequently encounter code it cannot prove is safely vectorizable.

Advanced vectorization programming models like Cilk Plus offer developers fundamentally simple means for helping the compiler vectorize better, in effect letting the programmer tell the compiler that code can be vectorized safely. The Intel Math Kernel Library and Intel Integrated Performance Primitives let developers exploit vectorization indirectly. But these tools, too, don’t solve the whole problem, since they can’t, by themselves, cause data to be laid out in ways compatible with available vector registers. Intel’s goal is to put even better capabilities in the hands of developers. This may eventually mean some language changes to help disambiguate when vectorization can be used safely, and when not.
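
As a rough illustration of "telling the compiler" that a loop is safe to vectorize, the sketch below uses the OpenMP simd pragma rather than Cilk Plus syntax, which is a substitution made here so the code builds with common compilers. In the function (hypothetically named saxpy), the compiler generally cannot prove that the two pointers never overlap, so on its own it may have to vectorize conservatively or not at all; the pragma is the programmer's promise that iterations are independent. Compilers that are not asked to honor OpenMP SIMD directives (for example via -fopenmp-simd on GCC or Clang) simply ignore the pragma, and the loop still produces the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes y[i] = a * x[i] + y[i] for i in [0, n).
// Caller's promise: x and y do not overlap in a way that would make one
// iteration depend on another. The pragma passes that promise to the compiler.
void saxpy(float a, const float* x, float* y, std::size_t n) {
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The loop runs over contiguous arrays of plain floats on purpose: a hint like this helps most when the data is already laid out so that neighboring iterations read neighboring memory, which is the data-layout problem the paragraph above says the tools cannot solve by themselves.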

Intel Threading Building Blocks (TBB) was introduced in 2006, and is now the most widely used abstraction for task-based parallelism. TBB fits well in C++ and provides solutions for parallel tasks, algorithms, concurrent containers, synchronization and memory management. Cilk Plus and TBB were the tools most used in creating examples for Reinders’ current book, Structured Parallel Programming, written with Intel’s Mike McCool and Arch Robison. (http://www.parallelbook.com)

Will Intel extend Threading Building Blocks capabilities to Fortran? On one hand, TBB was designed very much with C++ in mind. But on the other, the fundamental capability of TBB is task stealing: letting you write tasks and have those tasks be dynamically moved around and spread across the hardware, something any language should be able to exploit. Intel has no current plans to take TBB itself over to Fortran, and open source initiatives to do this have, disappointingly, failed to gel.

But several other tools and initiatives give Fortran significant parallel development flexibility. One example is DO CONCURRENT, which provides concurrency at the loop level, done right in the language, as part of the Fortran 2008 standard. Also worth a look are coarrays, a Partitioned Global Address Space (PGAS) extension for Fortran that is now implemented in Intel’s Fortran compilers and undergoing R&D to improve performance, as well as Fortran support in OpenMP.

Evaluation Tools help programmers evaluate existing software, and determine how and whether it can be parallelized effectively. Intel Advisor XE, for example, lets you pull code into the tool in order to identify areas where parallelism can be applied to advantage. Advisor lets you add statements to mock up proposed parallelization strategies, and can then pre-evaluate both what kind of performance improvement you’ll get from the advised actions and what kinds of risks (e.g., races, deadlocks) you’re likely to incur.  An example:  A piece of code had proven resistant to performance improvement. Advisor XE suggested that it was written to allocate too-small tasks to threads, and proposed an easily implemented way of allocating larger tasks, thus obtaining a projected 5x – 8x performance improvement.

Conditional Numerical Reproducibility is a feature Intel added to current versions of Parallel Studio in response to demand from the developer community. Performing mathematical operations like summing floating-point numbers of widely varying magnitudes can create varying results, due to round-off errors that become significant when operations are performed in parallel, in unpredictable sequence. These problems slow debugging and frustrate software validation. So Intel has added new features to its tools that can exploit parallelism while forcing operations to run in the same sequence, time after time (at least when run on the same number of cores), enabling programmers who are using the Math Kernel Library, OpenMP or other tools to get reproducible answers.
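
The order sensitivity behind this feature is easy to demonstrate. The sketch below is not Intel's implementation; it only shows, under the assumption of standard IEEE 754 double arithmetic, that the same numbers summed in a different order can give a different answer, and that fixing both the chunk boundaries and the order in which partial sums are combined makes the result repeatable. The function names sum_in_order and sum_fixed_chunks are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Plain left-to-right sum: the result depends on the order of the inputs.
double sum_in_order(const std::vector<double>& values) {
    double total = 0.0;
    for (double v : values) total += v;
    return total;
}

// Sums fixed-size chunks, then combines the partial sums in index order.
// The chunks could be handed to any number of workers; because the chunk
// boundaries and the combine order never change, neither does the result.
double sum_fixed_chunks(const std::vector<double>& values, std::size_t chunk_size) {
    assert(chunk_size > 0);
    std::vector<double> partial;
    for (std::size_t begin = 0; begin < values.size(); begin += chunk_size) {
        const std::size_t end = std::min(begin + chunk_size, values.size());
        double s = 0.0;
        for (std::size_t i = begin; i < end; ++i) s += values[i];
        partial.push_back(s);
    }
    double total = 0.0;
    for (double s : partial) total += s;  // always combined first to last
    return total;
}
```

For instance, summing 1e17, 1.0 and -1e17 from left to right loses the 1.0 entirely, because 1e17 + 1.0 rounds back to 1e17, whereas summing 1e17, -1e17 and 1.0 keeps it. A parallel reduction whose combine order varies from run to run can land on either answer, which is the debugging and validation problem described above.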

Users of this feature typically incur only about 10-20% slowdown – so they don’t have to give up the benefits of parallelism entirely in exchange for numerical reproducibility. This is one of several cases where the nature of tools has evolved to compensate for a downside incurred by improved parallel capabilities in hardware. Here, the wider vector capabilities exploited by AVX cause math to get done faster, and thus less predictably.

Summarized by John Jainschigg, Geeknet Contributing Editor. 

Posted on December 14, 2012 by John Jainschigg, Geeknet Contributing Editor