Ask James Reinders: Multicore vs. Manycore

Leading edge insight and explanation from James Reinders, Director, Software Evangelist, Intel Corporation. Conducted by Geeknet Contributing Editors Jeff Cogswell and John Jainschigg.

Go Parallel: You talk about Multicore vs. Manycore.  Are those separate technologies?

James Reinders: Yeah, they're not necessarily the most perfectly defined terms. I would define them by saying Multicore really started in earnest around 2005, and it's been an incremental approach: putting on a single chip designs that were already in small computers. We used to have computers with two or four processors in them; now we have them on a single chip. Multicore seems rather incremental.

Manycore represents a little different concept. If you're willing to put a lot more cores on a single device, what changes? Two things change. One, you have this revelation that you're going to be highly parallel, and so the way you design the hardware also changes, because you start to optimize assuming only parallel programs. The other thing that changes is that the software has to be parallel. I sometimes call these highly-parallel devices. We have the Intel MIC architecture, which realizes this concept, and the Intel Xeon Phi coprocessor.
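As a rough illustration of what "the software has to be parallel" means in practice, here is a minimal sketch of a data-parallel loop that can spread across however many cores a device offers. The array size and the use of an OpenMP-aware C++ compiler are assumptions made for the example, not anything specific to the MIC architecture.

    #include <omp.h>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main() {
        const std::size_t n = 1 << 24;            // problem size, arbitrary for the example
        std::vector<double> a(n, 1.0), b(n, 2.0), c(n);

        // Each iteration is independent, so the runtime can spread the work
        // across however many cores the device offers, be it two or two hundred.
        #pragma omp parallel for
        for (std::size_t i = 0; i < n; ++i)
            c[i] = a[i] + 3.0 * b[i];

        std::printf("ran with up to %d threads, c[0] = %f\n",
                    omp_get_max_threads(), c[0]);
        return 0;
    }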

It's a long-running argument in computer architecture; there's no right answer. Do you want a small number of really powerful processors or a large number of less-powerful ones?

There's great research that's gone on in this area for decades, going back to one of the earliest papers, a thesis by Danny Hillis, who eventually founded Thinking Machines Corporation and built the Connection Machine parallel supercomputer. With that particular machine, I would say one of the lessons was that they went too far being simple. Too many things were simple, and they had to evolve their architecture. They definitely went the direction of adding more capabilities until eventually, like many startups, they failed as a business, but they are largely looked at as having created a lot of brilliant people and technology.

In any case, it's an exploration, and to this day we're still exploring the problem. And there isn't a right answer. It depends so much on what you're trying to do, and having that breadth is very valuable for the industry: different capabilities to match different needs.

Interviews are edited lightly for brevity and clarity.


Comments
Rufina Russer

Excellent write-up, I am going to bookmark this.

Richard Rankin

Most of the stuff I'm interested in is non-deterministic, such as pattern matching, Markov models, Monte-Carlo simulations, genetic algorithms and genetic programming (very different things), neural nets, machine learning methods, etc. Gathering, storing, classifying (sort of like indexing) and analyzing massive amounts of data, often repeatedly, using different analysis tools and/or parameters.
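As a rough sketch of why workloads like Monte-Carlo simulation spread so naturally across many cores, here is a toy estimate of pi in which every trial is independent and the threads only meet at the final tally. The trial count and the seeding scheme are made up for the example.

    #include <omp.h>
    #include <cstdio>
    #include <random>

    int main() {
        const long trials = 10000000;
        long hits = 0;

        #pragma omp parallel reduction(+:hits)
        {
            // Per-thread generator, seeded differently so the streams don't collide.
            std::mt19937_64 rng(12345u + omp_get_thread_num());
            std::uniform_real_distribution<double> coord(0.0, 1.0);

            #pragma omp for
            for (long i = 0; i < trials; ++i) {
                double x = coord(rng), y = coord(rng);
                if (x * x + y * y <= 1.0)
                    ++hits;              // the only shared state, combined at the end
            }
        }
        std::printf("pi is roughly %f\n", 4.0 * (double)hits / trials);
        return 0;
    }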

Richard Rankin

I went to my boss once and said, "We've got a big problem." He said, "We don't have problems, we have opportunities." I said, "We've got a big opportunity." This is an algorithmic shift. You mention tightly-coupled and loosely-coupled algorithms and their amenability to parallel processing. Writing parallel code as opposed to serial code is creating new algorithms. There is no code conversion like C to C++ or something. It's seeing the problem in a whole new way. Parallel programming is hard. Heterogeneous parallel programming is really hard. At first glance, it can be scary. But once you start doing it, and if you're good at it, you get creative and see things differently. As opposed to mathematics, in computer science determinism refers to behavior, not input-output. Writing non-deterministic code can be quite thrilling.
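A tiny sketch of that behavior-versus-result point: the threads below interleave in an unpredictable order from run to run, yet the integer sum they produce is always the same. The loop bound is arbitrary for illustration.

    #include <cstdio>

    int main() {
        long long sum = 0;

        // The scheduling of iterations across threads is non-deterministic,
        // but integer addition is associative, so the answer never changes.
        #pragma omp parallel for reduction(+:sum)
        for (long long i = 1; i <= 1000000; ++i)
            sum += i;

        std::printf("sum = %lld (always 500000500000)\n", sum);
        return 0;
    }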

Martin Farrow

Multi-core vs Many-core - which is best?

Having worked with Sequent Computer Systems on their range of parallel-processing, SMP-architecture Unix machines during the early 90s, with Sun Microsystems on the Starfire technologies during the late 90s, and then later with commodity GRID technologies leveraging InfiniBand and Myrinet connection technologies, I think this question is right up my street.

All of the following assumes that your compilers work correctly.

The key issue comes down to the workload with which the processors will be presented; the secondary issue is the cost to deliver that workload, both in terms of deployment cost and of ongoing operational cost.

Firstly, you need to decide whether your workload will be throughput bound or peak-performance bound; this will give you a key indication of the likely final solution. Peak-performance-bound workloads are workloads that, by their algorithmic nature, can't be broken down into parallel computation operations. In this case, assuming a fixed budget is in place, one has no choice but to reduce the overall number of processors and go for higher-performance units.

If your workload is throughput bound, you then potentially have a choice between a large number of lower-performance processors or a smaller number of higher-performance processors. This is where things get interesting. The next question is: is your workload parallel or transaction processing? If it's parallel, then you need to look at the algorithm used and ask whether it is tightly or loosely coupled. Tightly coupled parallel algorithms can be divided up into smaller discrete tasks, but one finds that these tasks require regular interaction. Loosely coupled algorithms don't require regular interaction, typically only rendezvousing at the end of the calculation to assemble the result. An example of a tightly coupled algorithm is matrix inversion, used in finite element analysis; a loosely coupled algorithm is light ray tracing to create computer-generated images.
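A rough sketch of the two shapes, with toy loops standing in for real kernels: the loosely coupled case synchronizes only once when the loop finishes, while the tightly coupled case exchanges neighbouring values and hits a barrier on every sweep. Both functions are illustrative stand-ins, not real ray-tracing or solver code.

    #include <cstddef>
    #include <vector>

    // Loosely coupled (ray-tracing-like): each work item is independent;
    // the only rendezvous is the implicit barrier when the loop finishes.
    void loosely_coupled(std::vector<float>& image) {
        #pragma omp parallel for
        for (std::size_t i = 0; i < image.size(); ++i)
            image[i] = static_cast<float>(i) * 0.5f;   // stand-in for shading one pixel
    }

    // Tightly coupled (solver/stencil-like): every sweep needs neighbouring
    // values from the previous sweep, so the threads synchronize each iteration.
    void tightly_coupled(std::vector<float>& grid, int sweeps) {
        std::vector<float> next(grid);
        for (int s = 0; s < sweeps; ++s) {
            #pragma omp parallel for
            for (std::size_t i = 1; i + 1 < grid.size(); ++i)
                next[i] = 0.5f * (grid[i - 1] + grid[i + 1]);
            grid.swap(next);   // barrier plus data exchange on every sweep
        }
    }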

If you have a loosely coupled algorithm then you know that many core is going to work for you, as the algorithmic nature of the workload means that the ratio of inter-process communication to actual work done is going to be lower than in the tightly coupled case. This means that the inter-processor bus architecture won't be as stressed and can be cheaper - thereby increasing your processor $ budget relative to your bus-architecture $ budget.

If you have a tightly coupled algorithm then you know that, in order to keep your processors running at full capacity, you need a higher-throughput processor-to-processor bus architecture. In this case it makes sense to have fewer, higher-performance multi-cored processors, as the more cores you have per processor, the less likely, statistically, an inter-processor transfer is to have to cross the system bus.

If you don't have a parallel algorithm but you have a transaction processing environment, characterized by many 'smaller' discrete process then you will probably get more performance with a many core architecture, as typically these workloads present like a loosely coupled algorithm. They do however tend to be highly dependent on I/O capabilities of the architecture, as typically you are performing some kind of data processing.

Finally, calculating how you will spend your money gets very interesting, as you have to take into consideration the physical footprint cost, the operational costs (power and cooling) and the hardware costs. You also need to consider your time-based $ spend profile and compare this with how you expect your workload to grow. $ spend in the future always gets you more compute power, so it's important not to over-spec your hardware at the start. Traditionally, high-performance multi-core solutions tend to force you into scale-up style hardware purchasing, whereby you discard the current hardware in favour of the next bigger machine when you run out of computing power. By contrast, many-core solutions can typically be scaled out, leading to more efficient use of the IT budget, as you don't need to throw away what you already have to increase your computing power.
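A back-of-the-envelope sketch of the kind of spend comparison described above; every dollar figure is a placeholder chosen only to show the shape of the calculation, not real pricing.

    #include <cstdio>

    int main() {
        const double years = 3.0;

        // Scale-up: buy one big box now and replace it outright when it is outgrown.
        const double scaleup_hw    = 250000.0;            // placeholder price of a large SMP box
        const double scaleup_ops   = 12000.0 * years;     // placeholder power + cooling per year
        const double scaleup_total = scaleup_hw + scaleup_ops;

        // Scale-out: buy part of the capacity now, add nodes as the workload grows,
        // and keep the earlier purchase in service.
        const double scaleout_hw    = 120000.0 + 90000.0; // initial nodes plus later expansion
        const double scaleout_ops   = 15000.0 * years;    // more boxes, somewhat higher running cost
        const double scaleout_total = scaleout_hw + scaleout_ops;

        std::printf("scale-up total over %.0f years:  $%.0f\n", years, scaleup_total);
        std::printf("scale-out total over %.0f years: $%.0f\n", years, scaleout_total);
        return 0;
    }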

Richard Rankin

That reminds me of the late Seymour Cray saying, "If you want to plow a field, do you want a plow pulled by 2 bulls or 1024 chickens?" They (there have been a lot of Cray companies, though) have sort of gone down the chicken route these days, but as with most of the supers in the top ten, they're doing heterogeneous massively parallel processing with fewer x86-64 master CPUs and then lots of GPU cores. In the long run, isn't the real issue for supercomputers going to be watts/flop? When they start a big job in Los Alamos, are the lights going to go dim in Santa Fe?

But what I see is that your real market is not going to be a few big-buck supers but the smaller, high-performance, teraflop-level machines and clusters like I build for quants, actuaries, marketing analysts, DMA traders, etc. Right now I use one or two 6-core Intels and a couple of Nvidia Tesla 2070s or 2075s. When I can get Xeon Phis to try, I'll see how they work out.

I was just at the Int'l Conference on Very Large Databases in Istanbul and saw a paper using GPUs to do SIMD. Whether people are going to use the vector extensions extensively is a big question. What do the vector units of the Xeon Phi co-processors look like in this area?
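For readers wondering what "using the vector extensions" looks like in code, here is a small sketch of a loop that a vectorizing compiler (or an explicit SIMD pragma) can map onto wide vector units. The OpenMP 4.0 simd pragma is used here as one illustrative option, not as a statement about the Phi's own toolchain.

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // A vectorizable kernel: a compiler can map this loop onto wide SIMD units
    // (SSE, AVX, or the 512-bit vector units on MIC-class parts).
    void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
        #pragma omp simd                      // OpenMP 4.0: explicitly request SIMD code
        for (std::size_t i = 0; i < x.size(); ++i)
            y[i] = a * x[i] + y[i];
    }

    int main() {
        std::vector<float> x(1024, 1.0f), y(1024, 2.0f);
        saxpy(3.0f, x, y);
        std::printf("y[0] = %f\n", y[0]);     // 3*1 + 2 = 5
        return 0;
    }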