Xeon Phi lead architect George Chrysos compared Xeon Phi coprocessors with graphics processing units (GPUs) at the recent Hot Chips conference. According to the Top500 Supercomputer Sites ranking, Intel’s Many Integrated Core (MIC) architecture not only outperformed the two top GPU-based supercomputers on the most recent Top500 list, but was also “greener” by virtue of providing more performance per watt.
“We optimized Intel’s Xeon Phi coprocessor to deliver leading performance per watt for highly parallel technical computing workloads,” Chrysos explained. “Power efficiency or performance-per-watt on the target workloads was our key metric of goodness.”
The strategy worked, too, according to the most recent Top500 list of parallel processing supercomputers. The list compared the performance and energy consumption of Intel’s Xeon Phi-based “Discovery” supercomputer with those of two GPU-based supercomputers, one using Nvidia GPUs and one using ATI GPUs.
Intel’s Discovery Cluster was ranked 150 on the Top500 and used Xeon E5-2670 8C 2.6-GHz main processors communicating with Xeon Phi coprocessors over a fourteen-data-rate (FDR) Infiniband interconnect. It was eight percent more power efficient than the Nvidia Tesla GPU-based supercomputer at the Barcelona Supercomputing Center, ranked 177 on the Top500. That system, a Bull SA B505, used Xeon E5649 6C 2.5-GHz main processors with Nvidia 2090 GPUs connected by quad-data-rate (QDR) Infiniband. The Discovery Cluster also edged out the Nagasaki Degima Cluster, ranked 456 on the Top500, in power efficiency. The Degima is based on Intel i5 main processors and ATI Radeon GPUs communicating over QDR Infiniband (see figure).
The main reason for the better performance-per-watt profile of Xeon Phi coprocessors over GPUs was the extension of Intel’s sophisticated power management technology to the 50+ cores on a Xeon Phi die. As a consequence, only the cores that are actually running parallel threads consume significant amounts of power.
“We put Intel’s world-class power management technology into the Xeon Phi,” Chrysos concludes. “When individual cores are idle, or the Xeon Phi is not processing anything, we reduce its power consumption proportionately.”
Chrysos notes that Intel made three major architectural improvements to the Xeon Phi in order to achieve its higher performance-per-watt rating over GPUs.
First, Intel optimized execution to achieve 80 percent better per-core performance, as measured by the CPU-intensive SPEC CPU2006 floating-point benchmark. The speedup was accomplished by switching among four parallel threads per core, rather than potentially wasting time on speculative instruction processing, as is common on pipelined architectures. In addition, Intel added a hardware instruction prefetcher to the Xeon Phi, a 512-bit wide L1 cache, a large 512-Kbyte L2 cache and a large translation look-aside buffer (TLB).
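The thread-switching idea can be illustrated with a minimal sketch. The article does not describe the actual hardware selection policy, so the round-robin rule, the function name `next_ready_thread` and the `ready` flags below are all assumptions made for illustration: each cycle, the core picks the next thread that is not stalled, so a stall in one thread does not leave the pipeline idle.

```c
#include <assert.h>

#define THREADS_PER_CORE 4 /* four threads per core, as stated in the article */

/*
 * Hypothetical round-robin selector (not Intel's documented policy).
 * `last` is the thread that issued most recently; ready[t] is nonzero
 * when thread t has an instruction it can issue this cycle.
 * Returns the next ready thread after `last`, or -1 if all are stalled.
 */
int next_ready_thread(int last, const int ready[THREADS_PER_CORE])
{
    assert(last >= 0 && last < THREADS_PER_CORE);

    for (int step = 1; step <= THREADS_PER_CORE; ++step) {
        int candidate = (last + step) % THREADS_PER_CORE;
        if (ready[candidate]) {
            return candidate;
        }
    }
    return -1; /* every thread is waiting, for example on memory */
}
```

With thread 1 stalled and the others ready, a call with `last` set to 0 skips thread 1 and returns 2, which is the sense in which switching threads hides a stall instead of speculating past it.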
The second major improvement to performance per watt was achieved by widening the single-instruction-multiple-data (SIMD) instructions to 512 bits. New SIMD instruction set features were also added, including register masking, which allows conditional branches to be vectorized for better pipelining, and gather/scatter functions for faster loads and stores from irregular addresses. Extended math-unit operations were added as well, allowing vectorization of many common transcendental, square-root, reciprocal, logarithm and power functions.
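To make masking and gather/scatter concrete, here is a scalar C emulation of what a single 512-bit instruction would do across its lanes. It is a sketch under stated assumptions, not Intel's instruction set: the 16-lane width assumes 32-bit floats (512 / 32 = 16), and the function names `compare_gt`, `masked_add`, `gather` and `scatter` are hypothetical.

```c
#include <assert.h>

#define LANES 16 /* assumption: 512-bit register holding 32-bit floats */

/* Build a lane mask: bit i is set where x[i] > threshold. */
unsigned compare_gt(const float *x, float threshold)
{
    unsigned mask = 0;
    for (int lane = 0; lane < LANES; ++lane) {
        if (x[lane] > threshold) {
            mask |= 1u << lane;
        }
    }
    return mask;
}

/* Masked add: only lanes whose mask bit is set are written;
 * the other lanes of dst keep their old values. This replaces
 * an if/else branch inside the loop body. */
void masked_add(float *dst, const float *a, const float *b, unsigned mask)
{
    for (int lane = 0; lane < LANES; ++lane) {
        if (mask & (1u << lane)) {
            dst[lane] = a[lane] + b[lane];
        }
    }
}

/* Gather: load one element per lane from irregular addresses. */
void gather(float *dst, const float *base, const int *idx)
{
    for (int lane = 0; lane < LANES; ++lane) {
        assert(idx[lane] >= 0);
        dst[lane] = base[idx[lane]];
    }
}

/* Scatter: store one element per lane to irregular addresses. */
void scatter(float *base, const int *idx, const float *src)
{
    for (int lane = 0; lane < LANES; ++lane) {
        assert(idx[lane] >= 0);
        base[idx[lane]] = src[lane];
    }
}
```

A loop such as `if (x[i] > t) dst[i] = a[i] + b[i];` maps onto one `compare_gt` followed by one `masked_add` per 16 elements, which is why masking lets conditional code stay vectorized.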
The third major power efficiency improvement to the Xeon Phi was accomplished by adding a 512-bit wide bi-directional ring to connect cores to each other and to memory, along with a new streaming vector store instruction that conserves bandwidth when writing output-only arrays.
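The bandwidth saving from a streaming store can be approximated with simple arithmetic. The sketch below assumes a conventional write-allocate cache, where an ordinary store to an uncached line first reads the line from memory and later writes it back, while a streaming store that overwrites a whole line skips the read. The 64-byte line size and both function names are assumptions for illustration; the article gives no such figures.

```c
#include <stddef.h>

#define CACHE_LINE_BYTES 64 /* assumption: 64-byte (512-bit) cache lines */

/* Number of cache lines touched by an output array of `bytes` bytes,
 * assuming the array starts on a line boundary. */
static size_t lines_touched(size_t bytes)
{
    return (bytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES;
}

/* Ordinary stores under write-allocate: each line is read once
 * (to fill the cache) and written once (on eviction). */
size_t traffic_write_allocate(size_t bytes)
{
    return 2 * lines_touched(bytes) * CACHE_LINE_BYTES;
}

/* Streaming stores: each line is only written, never read first. */
size_t traffic_streaming(size_t bytes)
{
    return lines_touched(bytes) * CACHE_LINE_BYTES;
}
```

Under these assumptions, writing an output-only array moves roughly half as many bytes over the memory interface with streaming stores, since the data that would have been read in is about to be overwritten anyway.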
Colin Johnson is a Geeknet contributing editor and veteran electronics journalist, writing for publications from McGraw-Hill’s Electronics to UBM’s EETimes. Colin has written thousands of technology articles covered by a diverse range of major media outlets, from the ultra-liberal National Public Radio (NPR) to the ultra-conservative Rush Limbaugh Show. A graduate of the University of Michigan’s Computer, Control and Information Engineering (CICE) program, his master’s project was to “solve” the parallel processing problem 20 years ago when engineers thought it would only take a few years. Since then, he has written extensively about the challenges of parallel processors, including emulating those in the human brain in his John Wiley & Sons book Cognizers – Neural Networks and Machines that Think.