The Texas Advanced Computing Center (TACC), at the University of Texas at Austin, held a two-day symposium on Intel Highly Parallel Computing on April 10 and 11, 2012.
Keynoted by Intel’s James Reinders, the symposium (papers are posted online) offered developers a highly valuable behind-the-scenes look at the experience of Intel Many Integrated Core (MIC) programming in different contexts, with different goals, and with different toolkits.
Parallel Issues and Advantages
In his keynote, Reinders reiterated the troubling point that no historically or currently popular programming language was designed with parallelism in mind.
He then provided a compelling introduction to Intel Cilk Plus and Threading Building Blocks, the Message Passing Interface (MPI), and specialized constructs such as Fortran co-arrays. He pointed out their advantages (composability, compatibility, portability), the changes in programming style they potentially enable, and some of their present inadequacies in enabling fully task-specialized exploitation of very large-scale many-core coprocessors.
He also described some of the current confusion around extending OpenMP to offload tasks to coprocessors, some of which may be resolved in OpenMP 4.0, due in November of this year.
Think Tasks vs. Threads
Reinders concluded with advice on how to approach MIC projects – thinking about “tasks” instead of “threads” – i.e. favoring the MIC native mode of execution, rather than the thread-based “offload” mode.
Tim Mattson of Intel Labs then offered a presentation describing the history of and lessons learned from Intel’s first research programs in designing, prototyping, and building many-core chips – specifically, its 80-core “FLOP monster” chip and the 48-core SCC, or “Single-Chip Cloud Computer.”
Electrical engineers will find this presentation rich with scientific insight, but software folks – especially potential MIC developers – will also gain perspective on the performance of on-chip networks, the power requirements of cores, fine-grained software-based power management, and other topics.

On the theme of power management in many-core architectures, one fascinating notion is that software can be dynamically adapted to run at low power by automatically making local, minimal reductions in floating-point precision while still producing acceptably accurate results.
The remainder of Day 1 was devoted to discussions of early MIC programming experiences by research teams in large-scale physics and climate simulation, and more-theoretical explorations of MIC programming topics by computer scientists.
Porting ENZO-R to MIC
Robert Harkness of the National Institute for Computational Sciences at Oak Ridge National Laboratory presented a detailed paper and presentation on porting ENZO-R – a demanding astrophysics simulator – to MIC. The motivation, as he explained, is to overcome an approaching weak-scaling limit that ENZO users face when running the software on petascale systems (like Oak Ridge’s Cray Jaguar) by using MICs in a more horizontal, strong-scaling approach.
Harkness offered details of porting ENZO-R to MIC in native mode – evidently a fairly easy process, in part because the ENZO code is already appropriately vectorized – and offered some pointed critiques to the Knights Ferry group on how to package MICs and upgrade Intel programming tools for many-core scientific applications.
The primary warning, which will surely be heeded, is that MIC and follow-on products need huge local memory – perhaps on the order of 100TB – so that scientific applications can store and operate on an entire working set (the amount of data required for a single calculation pass) on the coprocessor. By contrast, the Knights Ferry cards used in these tests were equipped with only 2GB of GDDR5 memory – presumably just a prototype configuration.
Nonetheless, on the restricted models and working sets assembled to evaluate the hardware, the Oak Ridge group was very pleased with its preliminary results on KNF: the (unoptimized) software, running in native mode, scaled just shy of linearly up to 32 cores/tasks (32 cores per card). Not so (with these models) in KNF offload mode, where the additional performance potential was consumed by MPI message and disk-access overheads. This benchmark resonated with Reinders’ initial caveat to favor native mode for performance.
Improving Intel SCC Performance
In the afternoon, the focus shifted to performance, with a presentation by Randolf Rotta and colleagues from the Institute of Computer Science at Brandenburg University of Technology in Cottbus, Germany, exploring methods of improving Intel SCC performance using specialized message-passing protocols optimized for handling small messages with low latency (as opposed to MPI).
Xiu Liu and colleagues from Rice University and the Pacific Northwest National Laboratory reported on porting HPCToolkit to Intel SCC, and on using it to analyze and improve the performance of two sample benchmarks. Somewhat comfortingly, the largest performance improvement was gained by identifying a routine in one benchmark that was effectively serializing execution by forcing processes to wait in sequence on sends and receives. A very similar analysis and rectification exercise is presented early in the Intel VTune manual. Correcting the problem yielded similarly dramatic results on MIC – a 71% performance uptick on the routine in question, and a 33.7% improvement in overall benchmark execution speed.
Second-day presentations continued in much the same vein – academic teams experimenting with KNF, and (generally) devising specialized methods to port and optimize scientific/numeric software on the many-core architecture.
There’s a great deal of rich detail in the papers and slides from the event, some of it echoing Reinders’ keynote warning: while low-hanging fruit exists in the form of aggressive vectorization and task-based rather than thread-centric coding, the real over-the-horizon promise of MIC and successor many-core chips will be realized only as programming tools are gradually improved to enable transparent, automated implementation of a host of essentially task-specific methods for computation and communication.
John Jainschigg is a Geeknet contributing editor, and is CEO of World2Worlds, Inc., a digital agency focused on immersive technology and gaming. John’s initial intro to concurrency was via interrupt and re-entrancy programming at the assembler level on Z80 and 68000-based systems. He wrote concurrent, time-critical packet-switching applications on HP-UX RISC machines in the late 1980s, and since then has worked up and down the client-server stack in Java, C++, PHP, and other conventional and scripting languages, and more recently, in task-specific, state-based, radically concurrent languages like LSL.