Independent Test: Xeon Phi Shocks Tesla GPU

Intel’s Xeon Phi coprocessor outperforms Nvidia’s Tesla graphics-processing unit (GPU) on the operations used by “solver” applications in science and engineering, according to independent tests at Ohio State University.

When comparing Intel’s Xeon Phi to Nvidia’s Tesla, most reviewers dwell on how much easier it is to rewrite parallel programs for the Intel coprocessor, since it runs the same x86 instruction set as a 64-bit Pentium. 

Nvidia’s “Cuda” cores on its Tesla coprocessor, on the other hand, do not even try to emulate the x86 instruction set, opting instead for more economical instructions that allow it to cram many more cores on a chip.

As a result, Nvidia’s Tesla has roughly 40 times as many cores (2,496) as Intel’s Xeon Phi (60). The question then becomes: is it worth rewriting x86 parallel software for Nvidia’s Cuda in order to gain access to the thousands of extra cores Tesla offers over Xeon Phi?

Intel’s Xeon Phi SE10P (red) beat Nvidia’s Tesla C2050 and K20 GPUs (light and dark green, respectively) in 18 out of 22 tests. The Xeon Phi also beat dual Xeon X5680s (each with six cores for 12 cores total, light blue) and dual Xeon E5-2670s (each with eight cores for 16 total, dark blue) in 15 out of 22 tests. Source: Ohio State

To find the answer, Ohio State decided to narrow the question down to the types of parallel programs scientific researchers run regularly. For the test, researchers chose the parallel processing operations routinely performed on large sparse matrices. Variously called eigensolvers, linear solvers and graph-mining algorithms, these applications encode vast parallelism as wide, dense vectors multiplied by large sparse matrices.
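The core kernel in all of these solvers is a sparse matrix-vector product. Here is a minimal sketch (illustrative only, not the researchers' actual code) using the common compressed sparse row (CSR) storage layout:

```python
# Sparse matrix-vector product y = A*x with A stored in CSR
# (compressed sparse row) form -- the kernel class benchmarked here.
def spmv_csr(values, col_idx, row_ptr, x):
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Nonzeros of row i live in values[row_ptr[i]:row_ptr[i+1]].
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y

# A 3x3 example with 4 nonzeros:
# [[2, 0, 1],
#  [0, 3, 0],
#  [0, 0, 4]]
values  = [2.0, 1.0, 3.0, 4.0]
col_idx = [0, 2, 1, 2]
row_ptr = [0, 2, 3, 4]
print(spmv_csr(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 4.0]
```

Each row's partial sum is independent of every other row's, which is what makes this class of kernel so amenable to wide parallel hardware.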

The results? The Xeon Phi outperformed even the fastest Tesla coprocessor, the K20 with 2,496 cores each running at 0.7 GHz, while using only 61 cores each running at 1.1 GHz.

The coprocessors were tested on two batteries of 22 matrix operations each (44 total), with the Tesla achieving speeds ranging from 4.9 to 13.2 GFLOPS on the first battery.

The Xeon Phi, on the other hand, achieved up to 15 GFLOPS on the first battery, beating the Tesla on 12 of the first 22 tests.

For the second battery, the Xeon Phi outperformed the Tesla on 18 of the 22 tests, achieving a peak of 120 GFLOPS and topping 60 GFLOPS on eight of the 22, whereas the Tesla never quite reached 60 GFLOPS on any of the 44 tests.

The Ohio State researchers also compared the Xeon Phi to several other configurations, including a different Tesla model (the C2050) as well as conventional multi-core “Westmere” and “Sandy Bridge” Xeon processors.

The analysis also includes some interesting findings regarding the bandwidth and latency of the Xeon Phi memory space, compared to both Tesla GPUs and conventional multi-core Xeon processors. Read details in Performance Evaluation of Sparse Matrix Multiplication Kernels on Intel Xeon Phi.

Posted by R. Colin Johnson, Geeknet Contributing Editor
23 comments
KirkAugustin

@GenaGennadievich, Whether code is going to be accelerated by a GPU or by the general-purpose Xeon Phi coprocessors, it will have to be rewritten to some degree. But not only is most existing code already written for the x86 architecture, it is also much easier to write for. The whole point of threads is that they share local memory, so there is no special message passing, nor does the x86 have any different time delays for message passing or waiting on semaphores. All computers have these same problems. Cache coherency can be a real problem, but the Xeon Phi has solved that. It is true that some problems can be solved more quickly by the matrix routines on a GPU, but most cannot, and general-purpose processors are always easier to program for. The x86 architecture is the most common, so it is the easiest to program for and has the most existing code. Semaphore fighting is caused by bad algorithms, and has nothing to do with the processor type or hardware.

rd1rd2

"the K20 with 2,496 cores each running at .7 GHz–while using only 61 cores each running at 1.1 GHz"  

This is a bad apples-to-oranges comparison. The concept of a GPU core is basically a marketing invention rather than something that can run its own separate thread of instructions like a CPU core; comparing the two directly is highly misleading. Don't be fooled by the hype.

Put another way, each Xeon Phi CPU core has a vector processing unit that can do 16 floating-point operations per clock cycle. So, if we count "cores" the way a GPU does, each CPU core should be counted as 16 GPU cores; in total the Xeon Phi should be considered roughly similar to 61 * 16 = 976 GPU cores. Then 976 * 1.1 GHz gives you Intel's claimed peak of about 1 TFLOP.
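That back-of-envelope arithmetic can be checked directly (a sketch; the 16 FLOPs/cycle figure is this comment's assumption, not an official spec):

```python
# Peak throughput estimate, per the reasoning above.
# Assumes one 16-wide vector FP operation per core per cycle.
cores = 61
flops_per_cycle = 16   # assumed width of the vector unit, as in the comment
clock_ghz = 1.1

peak_gflops = cores * flops_per_cycle * clock_ghz
print(round(peak_gflops, 1))  # 1073.6 -> roughly 1 TFLOP
```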

atomicenxo

Not everywhere, all the time. No chance you work there, is there...? 

strongarmed

Interesting article.  But none of this matters- after all, ARM is the future.

R Colin Johnson

@rd1rd2 You are right--the reason there are fewer Xeon Phi cores is, as you say, that they are complete CPUs. The larger number of GPU cores is because they are actually shaders--meant to calculate graphics effects in parallel, each dedicated to a small area of the screen. Writing Cuda software that uses them for other purposes has turned them into general-purpose accelerators, but you can't change the fact that they are not CPUs and were never meant to be. Thanks for your insight about each Xeon Phi core counting as 16 "GPU cores"--I might use that metric in the future!

KirkAugustin

@strongarmed ARM is only the future if IBM fails to do multi-processor grids well. The main things are tools, compilers, optimizers, etc., and x86 has the advantage there. So I don't think it is clear yet what the future is.

japanpersonals

@KirkAugustin

I can agree, the cache is wasted. So there's actually not much difference between the Xeon and the i7 except the i7's disadvantage in core count and its advantage in clocks (50/50, I guess).

But do you think aggregating data in an optimal way on a disk with unknown structure is an easier job than creating optimized mutex-based code, for example? BTW, NAND SSDs are easily killed by matrix data. Too many nonsequential rewrites are necessary, while huge amounts of stored data don't allow effective wear leveling.


KirkAugustin

@japanpersonals: With large data you can create the structure you want and sequentially aggregate data into it. Then retrieve a whole chunk and move it back to local memory for computations. This is how virtual memory has historically always worked, and it never had a significant impact if done right, because it is not done that often.

It is not that semaphores and mutexes always have to be bad, but in most cases I have seen, they were overused to create long pipelines that prevented any multi-processor parallel advantage.

KirkAugustin

@japanpersonals: I agree solid-state drives are the slowest of all the memory models, cache levels, etc., but if one groups the data correctly, it does not have to be swapped into faster memory very often, so it does not increase total time by a significant amount. And with data that large, you are just wiping out and wasting all your cache anyway. Better to limit the chunk of data you focus on at a time, and keep it all in cache.

japanpersonals

@KirkAugustin @japanpersonals @GenaGennadievich @R Colin Johnson

Yep, exactly. The mutex/semaphore model should be used very carefully, but that doesn't mean it is very slow by default. But anyway, the problem with the Xeon is not continuous RAM. It is RAM itself. There's not enough RAM for large real-world problems. The only way you can use this buffer is for small TD problems, where half of the RAM would be used simply for prefetch to work around bus delays.

"GPUs are built for graphics" No. If I am not mistaken, the video DAC and framebuffer were created for graphics, and GPUs were created for matrix operations. In general, matrix, yes, but the methods allow a lot of work to be done. E.g., FD(!) 2D/3D noise filtering can be perfectly offloaded with pixel shaders alone. All the size limits in modern multipurpose cards are related to RAM. The same as in the Xeon Phi. If I am not mistaken, new CUDA-type cards avoid the problem of the previous generation of OGL/shader-oriented cards you are mentioning. AFAIK, modern GPUs are just like multicore superscalar CPUs with a lot of FP units.

"Data that size is better stored on solid state drives or broadcast."

The most stupid thing that could ever be proposed. I can feel the impact of 2-channel operation on a 4-channel Xeon when half of the banks are absent (I tried not populating the banks). And swapping the data to an HDD... that is stupid... 40-80 times longer execution. It would be simpler to solve the problem on Intel's Atom than to compute on a Xeon while swapping the data to an HDD.

KirkAugustin

@japanpersonals @KirkAugustin @GenaGennadievich @R Colin Johnson: Sorry, but I do not understand you. The problem with synchronization devices like semaphores and mutexes is that they often greatly slow down computations. Unless carefully used, they cause huge delays and greatly degrade performance. Used wrongly, they prevent parallelism and cause sequential operation instead. There are much better ways to organize and synchronize computations, such as pipelines, asynchronous callbacks, and event handlers. Just as it is wrong to try to create 32-90 GB of continuous, coherent, shared RAM. It simply causes slower execution. Data that size is better stored on solid-state drives or broadcast. I created software to solve problems like Finite Element Analysis and Modeling decades ago. GPUs are built for graphics, so they have size limits and don't fit the requirements you just demanded. GPUs only solve the fixed pipeline of operations needed for graphics, and while there are some matrix problems that can use the same pipeline, most real-world problems cannot. The sequence of operations needs to be different.
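A minimal sketch of that point about lock contention (hypothetical illustration, not anyone's production code): rather than having every thread fight over one lock-guarded accumulator, give each thread its own slot and merge the partial results once at the end:

```python
# Contrast with a single shared counter guarded by one lock, where
# threads serialize on the lock. Here each thread owns one slot of
# `partials`, so no synchronization is needed during the hot loop.
# (Python's GIL means no real speedup here; this only shows the structure.)
import threading

data = list(range(100_000))

def sum_with_partials(n_threads=4):
    chunks = [data[i::n_threads] for i in range(n_threads)]
    partials = [0] * n_threads

    def worker(tid):
        # Each thread writes only to its own slot: no lock required.
        partials[tid] = sum(chunks[tid])

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # The only "synchronization" is the joins, then one cheap merge.
    return sum(partials)

print(sum_with_partials() == sum(data))  # True
```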

japanpersonals

@KirkAugustin @GenaGennadievich @R Colin Johnson

Kirk, I am sorry. It seems you are far from reality, in both your ideas about threads and about necessary RAM amounts. The mutex model and workers with message passing are both threading models; only the method of data passing is different. Naturally, semaphores are less flexible and optimized for single-CPU systems; workers originate from supercomputer tech, with distributed machines that can't always share memory (your Xeon Phi is the same, btw)... But on a PC it just doesn't matter; you can do it one way or the other. And no, protected RAM does not automatically provide data integrity, as long as you don't rely entirely on one-tick bitwise operations. I don't know where you program, but if you used plain C(++) or asm, you would hit a lot of bugs when allowing concurrent access to data without preparing first... Amounts of RAM: you can't imagine this because you just don't use the software, or haven't tried to solve large problems yourself. By the way, have you heard of FEM? Do you know why it is preferred over compact TD solvers? I can tell you that simple antenna problems usually require about 32-90 GB of RAM just to see an approximate field distribution in an idealized design. I have heard that people solving material stress problems in shipbuilding and earthquake/sea-shelf research require much more.

And the GPU... I can't be quite sure, but it seems you miss the point. The GPU is exactly optimized for solving matrix problems. That's why they offer GPU power for general programming these days. But as you know, there were always limitations in memory, technically in texture size... That is the same as a limit on problem size.

And flexibility... I don't know what kind of flexibility a standalone machine with an I/O port attached to your general-purpose local bus can provide. Performance, yes. Flexibility? No more than any standalone graphics accelerator.

KirkAugustin

@GenaGennadievich @KirkAugustin @R Colin Johnson: Protected-mode memory is a feature of x86, and is multithreaded by default. Although I doubt the Xeon Phi allows virtual memory, because that would require its own drive. The point of threads is that they share memory and don't need message passing, using shared-memory semaphores instead. It is hard to imagine any problem that could possibly need more than 4 gig of data space. It is true most matrix engines are optimized for 4-6 threads, because that is all you need for 3D graphics. But that does not mean you can't use more, just that you have to write the algorithm yourself. And that is the advantage of Xeon Phi flexibility: it is GPUs that are rigidly built only for 3D graphics, and are very hard to use for other things.

GenaGennadievich

@KirkAugustin @R Colin Johnson Most modern numeric solvers are from the cheap-memory, protected-mode era. Code from the expensive-memory x86 era is not multithreaded by default.

So you are not right by default.

There's no difference: whether you use the Xeon Phi or a GPU, the data has to be uploaded to the daughterboard.

And yes, most code in the most precise FEM-class methods can't be multithreaded well. So developers multithread the matrix computation operations. And the equation matrices are really big.

And there's another bottleneck: messages between threads. AFAIK, parallel supercomputers have a special high-performance message interface to back up threads waiting for cross-thread data when computing matrices. The x86/64 platform doesn't. The matrix computation engines are easily saturated on x86 with just 4-6 threads. The system becomes overloaded with message passing or semaphore fighting instead of real computation.

I suppose the Xeon Phi should inherit the same problem of the x86 architecture. Correct me if I am wrong.


KirkAugustin

@R Colin Johnson, Most x86 code is from before memory became so inexpensive, so it is not memory-intensive. The x86 environment uses virtual memory anyway, so the size of physical memory is irrelevant to x86 code. And the Xeon Phi is an independent Linux-based coprocessor array, so it requires some code rewriting anyway. But the point is that with the Xeon Phi you have general-purpose processors, while with a GPU you only have matrix math operations available. It is the GPU that is extremely limited as to algorithms and requires massive code rewrites.

KirkAugustin

I agree more contiguous memory is nice.  It is easier and faster.

GenaGennadievich

@KirkAugustin You can subdivide wave problems as much as you like, but hybrid solutions will never converge as well as a pure wave solution. Nor will they fit reality. The same goes for hierarchic interactions. Yes... we know what swap is. But do you know the penalty? Still, the approximations at domain junctions increase calculation errors at unacceptable rates.

KirkAugustin

I have been doing this since 1968, and never has there been a problem that couldn't be subdivided. The amount of RAM is irrelevant. What you are claiming is that all the RAM has to be contiguous, and my contention is that it does not. You have heard of virtual memory, right? In 1968, the million-dollar Univac I was working on had exactly 131K of RAM. Yet there was no numerical method we could not do. Wanting is one thing; claiming to need is something entirely different.

KirkAugustin

Good point.  8 gig is not much these days.

GenaGennadievich

@KirkAugustin "I have never seen a need for a large amount of RAM". That's natural. Because you are doing tests and I am doing simulations for real engineering tasks.

"All operations can always be broken down into smaller distributed operations,"

!!!? WHAT!!!??? What school did you graduate from, to say such nonsense? Many physical problems can't even be effectively multithreaded. Domain breaking is available for only a few of the least precise numeric approaches. Most problems modeling real-size complex objects need a lot of RAM. In RF, the amount of RAM grows with frequency; at THz, 1 TB and 32 Xeon cores might not be enough. And talking about particles, multiparticle interactions increase the necessary amounts in geometric progression.


R Colin Johnson

@KirkAugustin You are right that most tasks can be rewritten to use less RAM, but I think the programmers who feel constrained by the 8 Gbytes on board the Xeon Phi don't want to have to rewrite code--that's why they went with an x86 coprocessor in the first place :)

KirkAugustin

I have never seen a need for a large amount of RAM.  All operations can always be broken down into smaller distributed operations, in my experience.  Of course my experience goes back to when 64K was considered a lot of RAM, but it is not hard to distribute problems and data, and compute in parallel.  Imagine you wanted to render a complex 3D image.  Sure that could use lots of RAM, but it would be fastest if each CPU only rendered a small portion of the image, and only used a small range of RAM.