Independent Test: Xeon Phi Shocks Tesla GPU

Intel’s Xeon Phi coprocessor outperforms Nvidia’s Tesla graphics-processing unit (GPU) on the operations used by “solver” applications in science and engineering, according to independent tests at Ohio State University.

When comparing Intel’s Xeon Phi to Nvidia’s Tesla, most reviewers dwell on how much easier it is to port parallel programs to the Intel coprocessor, since it runs the same x86 instruction set as a 64-bit Pentium.

Nvidia’s “Cuda” cores on its Tesla coprocessor, on the other hand, do not even try to emulate the x86 instruction set, opting instead for more economical instructions that allow Nvidia to cram many more cores onto a chip.

As a result, Nvidia’s Tesla has roughly 40 times as many cores (2,496) as Intel’s Xeon Phi (61). The question then becomes: “is it worth it” to rewrite x86 parallel software for Nvidia’s Cuda in order to gain access to the thousands of additional cores Tesla offers over the Xeon Phi?
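
To make the porting question concrete, consider a minimal sketch (illustrative only, not code from the Ohio State study): a loop parallelized with OpenMP compiles unchanged for a conventional Xeon or, with a Phi-targeted build such as the Intel compiler's -mmic option, for the Xeon Phi's x86 cores, whereas a Cuda port would require restructuring the loop body into a GPU kernel with explicit thread and block indexing.

// Hypothetical OpenMP dot product: the same source runs on multi-core
// Xeons and on the Xeon Phi's x86 cores; a Cuda version would instead
// be rewritten as a __global__ kernel launched over thousands of threads.
#include <cstddef>
#include <vector>

double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(a.size()); ++i)
        sum += a[i] * b[i];
    return sum;
}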

Intel’s Xeon Phi SE10P (red) beat Nvidia’s Tesla C2050 and K20 GPUs (light and dark green, respectively) in 18 out of 22 tests. The Xeon Phi also beat dual Xeon X5680s (each with six cores for 12 cores total, light blue) and dual Xeon E5-2670s (each with eight cores for 16 total, dark blue) in 15 out of 22 tests. Source: Ohio State

To find the answer, Ohio State decided to narrow the question down to the types of parallel programs scientific researchers run regularly. For the test, researchers chose the parallel processing operations routinely performed on large sparse matrices. Variously called eigensolvers, linear solvers and graph-mining algorithms, these applications encode vast parallelism into wide, dense vectors multiplied by large sparse matrices.
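
The kernel at the heart of these solvers is sparse matrix-vector multiplication (SpMV). As a rough illustration (the names and layout here are assumptions, not the benchmark code), the standard compressed sparse row (CSR) form of y = A * x looks like this:

// Minimal CSR sparse matrix-vector multiply, y = A * x.
// Each row's nonzeros are stored contiguously; row_ptr marks where
// each row begins in the col/val arrays.
#include <vector>

struct CsrMatrix {
    int nrows;
    std::vector<int>    row_ptr;  // size nrows + 1
    std::vector<int>    col;      // column index of each nonzero
    std::vector<double> val;      // value of each nonzero
};

void spmv(const CsrMatrix& A, const std::vector<double>& x,
          std::vector<double>& y) {
    #pragma omp parallel for
    for (int i = 0; i < A.nrows; ++i) {
        double sum = 0.0;
        for (int j = A.row_ptr[i]; j < A.row_ptr[i + 1]; ++j)
            sum += A.val[j] * x[A.col[j]];
        y[i] = sum;
    }
}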

The results? The Xeon Phi outperformed even the fastest Tesla coprocessor, the K20 with 2,496 cores each running at 0.7 GHz, while using only 61 cores each running at 1.1 GHz.

The coprocessors were tested on two batteries of 22 matrix operations (44 in total), with the Tesla achieving speeds ranging from 4.9 to 13.2 GFLOPS on the first battery.

The Xeon Phi, on the other hand, achieved up to 15 GFLOPS on the first battery, beating the Tesla on 12 of the first 22 tests.

For the second battery, the Xeon Phi outperformed the Tesla on 18 of the 22 tests, achieving a peak of 120 GFLOPS and exceeding 60 GFLOPS on eight of the 22, whereas the Tesla never quite reached 60 GFLOPS on any of the 44 tests.

The Ohio State researchers also compared the Xeon Phi to several other configurations, including a different Tesla model (the C2050) as well as conventional multi-core “Westmere” and “Sandy Bridge” Xeon processors.

The analysis also includes some interesting findings regarding the bandwidth and latency of the Xeon Phi memory space, compared to both Tesla GPUs and conventional multi-core Xeon processors. Read details in “Performance Evaluation of Sparse Matrix Multiplication Kernels on Intel Xeon Phi.”

Posted on February 20, 2013 by R. Colin Johnson, Geeknet Contributing Editor
KirkAugustin

@GennaGrennadievich, Whether code is going to be accelerated by a GPU or by the general-purpose Xeon Phi coprocessor, it will have to be rewritten to some degree. But not only is most existing code already written for the x86 architecture, it is also much easier to write for. The whole point of threads is that they share local memory, so there is no special message passing, nor does the x86 have any different time delays for message passing or waiting on semaphores. All computers have this same problem. Cache coherency can be a real problem, but the Xeon Phi has solved that. It is true that some problems can be solved more quickly by the matrix routines on a GPU, but most cannot, and general-purpose processors are always easier to program for. The x86 architecture is the most common, so it is the easiest to program for and has the most existing code. Semaphore fighting is caused by bad algorithms and has nothing to do with the processor type or hardware.

rd1rd2

"the K20 with 2,496 cores each running at .7 GHz–while using only 61 cores each running at 1.1 GHz"  

This is an apples-to-oranges comparison. The concept of a GPU core is basically a marketing invention rather than something that can run its own separate thread of instructions like a CPU core; comparing the two directly is highly misleading, so don't be fooled by the hype.

Put another way, each Xeon Phi CPU core has a vector processing unit that can do 16 floating-point operations per clock cycle. So, if we count "cores" the way a GPU does, each CPU core should be counted as 16 GPU cores, and in total the Xeon Phi should be considered roughly equivalent to 61 * 16 = 976 GPU cores. Then 976 * 1.1 GHz gives you Intel's claimed peak of about 1 TFLOPS.

R Colin Johnson

@rd1rd2 You are right: the reason there are fewer Xeon Phi cores is, as you say, that each is a complete CPU. The larger number of GPU cores is because they are actually shaders, meant to calculate graphics effects in parallel, each dedicated to a small area of the screen. Writing Cuda software that uses them for other purposes has turned them into general-purpose accelerators, but you can't change the fact that they are not CPUs and were never meant to be. Thanks for your insight about each Xeon Phi core counting as 16 "GPU cores"; I might use that metric in the future!

atomicenxo

Not everywhere, all the time. No chance you work there, is there...? 

strongarmed

Interesting article. But none of this matters; after all, ARM is the future.