Usage models for Intel's many integrated core (MIC) architecture will determine which style of multiprocessing programmers choose, according to the National Institute for Computational Sciences (NICS). NICS has been testing the Xeon Phi and its predecessors, Knights Ferry and Knights Corner, and recently described its results at the National Science Foundation (NSF) Extreme Science and Engineering Discovery Environment (XSEDE) Xtreme Scaling Workshop (Chicago, Ill.). NICS, a Joint Institute for Computational Sciences sponsored by the University of Tennessee and Oak Ridge National Laboratory (ORNL), established its Application Acceleration Center of Excellence (AACE) last year under Director Glenn Brook, who co-authored the recent XSEDE presentation, "Hybrid Message Passing and Threading for Heterogeneous Use of CPUs and Intel Xeon Phi Coprocessor," with computational scientist Vincent Betro and intern Ryan Hulguin.
The scientists' main result is that Intel's Xeon Phi offers unique opportunities for parallel processing by allowing programmers to choose from among three primary methods of partitioning their parallel code for maximum speed and minimum execution time: native, offload, and heterogeneous. General-purpose graphics processing units (GPGPUs), by comparison, operate only in offload mode.
“The Intel Xeon Phi differentiates itself from GPGPUs by using x86-compatible cores, which allow the coprocessor to support the same programming models and tools as the Intel Xeon processor,” said Betro. “This allows scientific application programmers to accelerate highly parallel programs with little to no code modification, thus allowing them to focus on simply tuning the algorithms for performance.”
Their comparisons of the native, offload, and heterogeneous modes are based on extensive testing regimes derived from NICS' deep and wide experience in petaFLOP-caliber parallel processing and from AACE's mission of developing multiprocessing methodologies, distributing the knowledge gained throughout the HPC community, and providing expert feedback to supercomputer vendors to guide the development of future architectures and programming models.
For testing Intel's MIC, NICS used three Xeon-based multiprocessor systems named after chess pieces, Rook, Pawn and Bishop, as well as a fourth system, Beacon, based on an Appro cluster. Rook, NICS' first MIC system, originally used two early prototypes of Intel's Westmere-era Knights Ferry coprocessors, which were later upgraded to a single Knights Corner coprocessor; it was used to explore porting and optimization techniques for MIC. Pawn, NICS' second prototype MIC system, was based on two Sandy Bridge processors, enabling testing of parallel algorithms that access multiple Knights Corner boards. Bishop, NICS' first MIC cluster, was based on a Cray CX1 and was featured in ORNL's booth at the most recent International Conference for High Performance Computing, Networking, Storage and Analysis (SC11). Beacon is NICS' most advanced cluster, pairing two Xeon E5-2670 host processors with 16 cores total and two Knights Ferry coprocessors featuring 60-plus cores, which are currently being upgraded to Xeon Phi.
According to the NICS results, Intel's Xeon Phi is unique among parallel processors in its ability to access three usage models: native mode, offload mode and heterogeneous mode.
"In native mode, parallel code runs exclusively on the coprocessors, whereas in offload mode the primary code runs on the host processor while small sections of code are offloaded to the coprocessor," said Betro. "The heterogeneous model employs Xeon Phi coprocessors interacting with Xeon host processors as networked peers. This usage model enables a standard approach of multi-threading with message-passing to be applied across both the host processors and the Xeon Phi coprocessors, which are treated as independent nodes. This paradigm allows the user to take advantage of the host processors for largely serial parts and the Xeon Phi coprocessors for highly parallel sections of the same code."
The NICS testing regime also differentiated between two types of heterogeneous mode: symmetric, in which all peers have the same number of message-passing interface (MPI) ranks; and asymmetric, in which different processors can have different workloads.
“Heterogeneous symmetric and asymmetric modes allow the Xeon processors and Xeon Phi coprocessors to work together as networked peers in much the same manner employed in the majority of MPI-based codes today, providing a migration path for many codes and offering complete control of computation and communication to the programmer,” said Betro. “Through seamless MPI execution on the host, a variety of uses of heterogeneous mode showed much promise in fitting to normal application workflows.”
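In practice, treating the coprocessor as a networked peer means launching one MPI job whose ranks span both the host and the card. The launch fragment below is a sketch under assumed names (the hostnames, rank counts, and binary names are illustrative, not from the NICS paper), using the multiple-program colon syntax common to MPI launchers:

```shell
# Heterogeneous (asymmetric) launch sketch: 16 ranks on the Xeon host
# and 60 ranks on the Xeon Phi, all peers in a single MPI job. The
# coprocessor runs its own cross-compiled binary (solver.mic).
mpirun -n 16 -host node0      ./solver.host \
     : -n 60 -host node0-mic0 ./solver.mic
```

Giving both program halves the same MPI communicator is what provides the migration path Betro describes: existing MPI codes keep their structure and simply gain ranks on the coprocessor.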
In their presentation, the NICS scientists offered numerous test examples for the various modes, concluding that programmers will likely use all three, depending on the application. For instance, applications composed of massively parallel operations, each requiring relatively little memory, will likely use native mode, which is compatible with OpenMP algorithms. Applications with large sections of serial code, but with occasional use of vectorized code that is easily threaded, will likely use offload mode with hybrid MPI/OpenMP codes that amortize the movement of data to and from the Xeon Phi over the PCIe bus. Heterogeneous mode, on the other hand, serves as a kind of catchall methodology, allowing complete control of computation and communication among networked peers and providing a migration path for the majority of today's MPI-based codes.