Massive Multi-Core Xeon Phi Inherits Proven Ring Topology

Intel’s Xeon Phi – its first commercial Many Integrated Core (MIC) processor officially due out this fall – promises to bring massive multiprocessors down from the lofty heights of world-class supercomputers to the domain of enterprise servers and workstations.

 

The Knights Ferry MIC architecture board housed this 32-core Aubrey Isle processor, the forerunner of the 50-core Xeon Phi, to be available on Knights Corner boards this fall. Source: Intel

With 50-core Xeon Phi processors installed on Knights Corner PCIe 3.0 boards, any Xeon-based server or workstation will be able to access teraFLOPS performance levels previously available only to government labs and well-endowed corporate researchers.

Proven technologies

If we look inside the Xeon Phi, however, we do not find exotic, untested technologies like those that have drained the R&D budgets of rival multiprocessor startups, but rather leading-edge semiconductor processes and architectural features that have already been proven out in existing Intel multi-core processors.

Intel’s latest 22-nanometer CMOS process – the Ivy Bridge die shrink of its proven Sandy Bridge microarchitecture – uses its pioneering 3-D FinFET transistors that already have put Intel years ahead of its semiconductor rivals worldwide.

High-speed ring topology

But just as important to the Xeon Phi's performance is its use of a high-speed ring architecture, which was perfected for Intel's second-generation Core processors and now serves as the backbone of its latest multi-core Xeon processors.

Ring topologies are considered ideal for on-chip communications in processors with up to about 10 cores, but they have traditionally been judged too prone to congestion for linking more than a dozen or so. For the 50-core Xeon Phi, however, analysts say that widened, bi-directional rings remain viable.
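To see why running the ring in both directions helps, consider the average number of hops a message takes between cores. The short C program below is a back-of-envelope illustration for this article (the core counts are arbitrary examples, not Intel figures):

/* Average hop count between cores on a unidirectional ring versus a
 * bidirectional ring. Illustrative sketch only; not Intel code. */
#include <stdio.h>

static double avg_hops(int cores, int bidirectional)
{
    long total = 0;
    for (int src = 0; src < cores; src++) {
        for (int dst = 0; dst < cores; dst++) {
            if (src == dst) continue;
            int fwd = (dst - src + cores) % cores;  /* hops going one way   */
            int rev = cores - fwd;                  /* hops going the other */
            total += bidirectional ? (fwd < rev ? fwd : rev) : fwd;
        }
    }
    return (double)total / ((double)cores * (cores - 1));
}

int main(void)
{
    int counts[] = { 8, 10, 32, 50 };
    for (int i = 0; i < 4; i++)
        printf("%2d cores: one-way ring %.1f hops on average, bi-directional ring %.1f\n",
               counts[i], avg_hops(counts[i], 0), avg_hops(counts[i], 1));
    return 0;
}

Roughly speaking, a one-way ring averages about half the core count in hops, while a bi-directional ring averages about a quarter, which is part of why the approach stretches beyond the dozen-core mark before congestion bites.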

“In a regular processor, the system is at the mercy of whatever code the user wants to run,” says Gartner Inc. Vice President Martin Reynolds. “But the Xeon Phi will generally be handling carefully structured workloads, where the paths around the ring can all be managed, and the code can be set up to optimize the use of the ring.”

Intel's use of high-speed ring topologies for interprocessor communications has been proven out in the latest incarnation of its popular Xeon E5 family, which uses twin 256-bit-wide rings encircling eight cores for bi-directional interprocessor communications. For the 32-core prototype chips on its Knights Ferry board – the predecessor to the 50-core Knights Corner boards due out this fall – Intel boosted its ring topology to 1024 bits wide in total, offering bi-directional 512-bit-wide rings and matching SIMD units.
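For software, the practical upshot of 512-bit rings and SIMD units is that data wants to be laid out in 64-byte (512-bit) chunks so a vectorizing compiler can fill each SIMD operation. The C fragment below is a generic illustration of that layout discipline, not Intel sample code; the SAXPY loop, sizes and alignment helper are arbitrary choices:

/* Sketch of a loop laid out for 512-bit (16 x float) SIMD units.
 * aligned_alloc is standard C11; nothing here is Xeon Phi specific. */
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 20)   /* a multiple of 16 floats, i.e. a multiple of 512 bits */

int main(void)
{
    float *x = aligned_alloc(64, N * sizeof *x);  /* 64-byte = 512-bit aligned */
    float *y = aligned_alloc(64, N * sizeof *y);
    if (!x || !y) return 1;

    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* With aligned data and a trip count that is a multiple of 16, a
     * vectorizing compiler can map each group of 16 iterations onto one
     * 512-bit SIMD multiply-add. */
    const float a = 3.0f;
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];

    printf("y[0] = %.1f\n", y[0]);
    free(x);
    free(y);
    return 0;
}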

Real-world use cases

The tactic worked, according to the Leibniz Supercomputing Centre (Germany) and the Korea Institute of Science and Technology Information (KISTI), both of which were beta sites. CERN (Switzerland), which announced evidence of the Higgs boson earlier this month, was also a very active tester of Knights Ferry boards. MIC servers and workstations using Knights Ferry boards have also been demonstrated by Colfax, Dell, Hewlett-Packard, IBM, SGI, and Supermicro.

Interprocessor communications on-chip are well handled by rings, but communications between Xeon E5 supervisor processors and the forthcoming Knights Corner coprocessor boards will run over the PCIe bus. Its multiple-gigabit-per-second serial lanes will also handle interprocessor communications among Xeon Phi chips on separate Knights Corner boards, a topology that the Texas Advanced Computing Center (TACC) promises to assemble into a multi-petaFLOPS supercomputer configuration called Stampede. Likewise, Cray has announced it will offer Xeon Phi-based coprocessors for its next-generation Cascade supercomputers.
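In practice, work is shipped across that PCIe link with an offload model. The C sketch below uses the offload-pragma style Intel has documented for its MIC compilers, but it is only an outline under that assumption: a compiler without MIC support simply ignores the pragma and runs the loop on the host Xeon instead.

/* Hedged sketch of host-to-coprocessor offload over PCIe. The offload
 * pragma follows Intel's documented MIC style; compilers that do not
 * recognize it will ignore it and execute the loop on the host. */
#include <stdio.h>

#define N 1000000
static float in_buf[N], out_buf[N];

int main(void)
{
    for (int i = 0; i < N; i++) in_buf[i] = (float)i;

    /* Copy in_buf across the PCIe bus, run the loop on the coprocessor's
     * cores, then copy out_buf back when the offload region completes. */
    #pragma offload target(mic) in(in_buf) out(out_buf)
    {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            out_buf[i] = in_buf[i] * in_buf[i];
    }

    printf("out_buf[10] = %.1f\n", out_buf[10]);
    return 0;
}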

Boosts for the future

For the future, Intel is aiming to boost the petaFLOPS performance of supercomputers based on its MIC architecture into the exaFLOPS range, which may necessitate a move to high-speed mesh interconnection topologies. Intel has already demonstrated experimental on-chip mesh interconnects for its Single-chip Cloud Computer (SCC), as well as experimental silicon-chip-based lasers for implementing high-speed optical chip-to-chip links. For now, however, Intel's massively wide on-chip rings still have plenty of headroom for putting tera- to petaFLOPS of processing power inside MIC-enabled Xeon Phi based supercomputers, servers, and workstations.
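The attraction of a mesh at those scales is easy to estimate: the average distance on a bi-directional ring grows linearly with the core count, while on a two-dimensional mesh it grows only with its square root. The comparison below is a back-of-envelope illustration for this article, not Intel data, and assumes a square mesh:

/* Rough scaling comparison of bi-directional ring vs. 2-D mesh hop counts.
 * Uses the standard approximations N/4 (ring) and 2*sqrt(N)/3 (square mesh). */
#include <stdio.h>
#include <math.h>

int main(void)
{
    int counts[] = { 50, 256, 1024, 4096 };
    for (int i = 0; i < 4; i++) {
        double n = counts[i];
        double ring = n / 4.0;               /* bi-directional ring average   */
        double mesh = 2.0 * sqrt(n) / 3.0;   /* square sqrt(N) x sqrt(N) mesh */
        printf("%5d cores: ring ~%6.1f hops, mesh ~%5.1f hops\n",
               counts[i], ring, mesh);
    }
    return 0;
}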

________________________________________________________________

Colin Johnson is a Geeknet contributing editor and veteran electronics journalist, writing for publications from McGraw-Hill’s Electronics to UBM’s EETimes. Colin has written thousands of technology articles covered by a diverse range of major media outlets, from the ultra-liberal National Public Radio (NPR) to the ultra-conservative Rush Limbaugh Show. A graduate of the University of Michigan’s Computer, Control and Information Engineering (CICE) program, his master’s project was to “solve” the parallel processing problem 20 years ago when engineers thought it would only take a few years. Since then, he has written extensively about the challenges of parallel processors, including emulating those in the human brain in his John Wiley & Sons book Cognizers – Neural Networks and Machines that Think.

Posted by R. Colin Johnson, Geeknet Contributing Editor
10 comments
R. Colin Johnson

Joe makes a good point. If you boost the performance of a server by adding blades, those processors can't be tightly coupled by parallel processing algorithms as is typically required for supercomputer applications.

R. Colin Johnson

Thanks for the feedback from Go-Parallel. You are right about cloud servers and reduced power budgets. In fact, I recently suggested to CEO Warren East that ARM's low-power processors would be perfect for future cloud servers, given all the emphasis being put on cutting their energy budget (Dell has announced ARM-based servers, in fact), but East said that ARM did not want to increase its core sizes by adding the memory management units, caches and coherence circuitry needed for servers, and was content to leave that market to Intel, AMD and IBM. For its part, Intel claims its Atom is much more energy efficient than Xeon, making it a good candidate for green servers, and HP has such an Atom-based server, called Gemini. AMD recently acquired SeaMicro, which is making Atom-based servers too. IBM, on the other hand, has chosen to reduce the energy consumption, and heat budget, of its servers by running Power 7 CPUs at a lower clock frequency than Power 6. Everybody is addressing your issue, but in different ways. It will be interesting to watch.

Thanks again,

Colin

Commenter

Hi!

Apart from the purchase price, the cost of running 24/7, with associated cooling costs and overheads (switching, routing, redundancy), will be 50% of cloud data centre costs, they say.

That share may rise above 50% as other data centre costs fall, because there is no "easy" or portable way to reduce energy cost.

So we don't need phenomenally hot CPUs in 95% to 99% of future Cloud.

Joe Stoetzel

Cloud servers and Super Computers are two very different products.

Need more data processing capacity at a (cloud or normal) server? Just add many more blade processors, and more internet communications bandwidth. This is well known existing technology.

Supercomputers need to do very high numbers of floating-point operations per second (FLOPS) on extremely massive data structures. This is something a supercomputer is very good at; it is something a server is not capable of doing (cloud or otherwise). All supercomputers have an operating system and hardware structure that is unique. The hardware design topology and the operating systems for supercomputers are always custom made (and very pricey).

R. Colin Johnson

Thanks for the comment. Hopefully mass production of the Xeon Phi and other MICs will bring down their price. High-end supercomputer-style users will be the first ones on board, but cloud servers could eventually benefit too; as you say, though, the price will have to be right.

Colin

Commenter

Hi!

They need to produce low-cost, low-energy MP cloud servers quickly, or shareholders could lose nearly every penny they have invested in Intel.

The Quanta Tilera MIPS-based 100-CPU device is one example already in production and use in the East.

The astronomical price of Intel CPUs must be designed to relieve USA Govt and State coffers primarily.

They are rapidly becoming a luxury product.

Similarly, AMD are only slightly better off.

Whatever happened to good management and direction of USA industries?

The whole disaster reminds me of GM.

R. Colin Johnson

Intel hasn't revealed all the details about how its board implementation will handle memory issues, nor has the date of availability been announced. Intel has stated, however, that the Xeon Phi will adhere to the basic x86 instruction set and programming model, although it does not claim binary compatibility--so code written for it will likely require some minor tweaking and a recompile.

Manuel Antolín

Independent of width, the minimal interconnection between individual CPUs would be something like a hypertetrahedron, or multilayer tetrahedral structure, which would give a minimal distance for distributed intercommunication, always leaving 4 channels for control, user interface or external comms. That would give numbers of 4, 16, 64 units, always leaving room for expansion while conserving the same structure, with minimal distances.

Richard Rankin

OK, I've been doing heterogeneous parallel processing for a while with Nvidia 2070s. It's hard, but parallel programming is hard. I also build special-purpose machines. How is this card going to perform compared to the 2880-core Nvidia K20s being released later this year? How is memory handled? PCIe bus to main memory, memory on board, cache...? When can I get one to try out? Talking about the future is great, but the competition is delivering now. And if the instructions available on the primary cores are not identical to the instruction set I need to use on the PCIe board, then it's still heterogeneous processing.

R. Colin Johnson

Many startups have crafted massive multi-core processors as a solution to the inability to crank chip clock speeds much past 3GHz without overheating, such as Tilera, which has a 64-core processor that uses a proprietary instruction set. Graphics coprocessor makers, notably Nvidia, are likewise putting hundreds of tiny cores on their graphics processing units (GPUs), which also use proprietary instruction sets. The difference with Intel's Xeon Phi is that the cores are all x86 compatible, allowing the multiprocessing software development suites already in place to harness a massively parallel processor. Intel is packing every trick it knows about accelerating parallel processing into its Many Integrated Core (MIC) architecture for the Xeon Phi, but all of them use technologies already proven out in Intel's existing processor families, making Intel's solution a single-chip supercomputer that can be harnessed by any server or workstation already using Xeons.
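As a concrete, deliberately generic illustration of what that x86 compatibility buys, a standard OpenMP loop like the one below needs no proprietary rewrite; the snippet itself is ordinary OpenMP and nothing about it is Xeon Phi specific, the point being only that such code can be recompiled for the MIC cores:

/* Ordinary OpenMP reduction in C; the same source can be compiled for a
 * host Xeon or, per Intel's claims, recompiled for the x86-compatible
 * Xeon Phi cores without rewriting it against a proprietary ISA. */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    const long n = 100000000;
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (long i = 1; i <= n; i++)
        sum += 1.0 / (double)i;   /* partial sum of the harmonic series */

    printf("threads available: %d, sum = %f\n", omp_get_max_threads(), sum);
    return 0;
}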