Programming multicore processors like the Xeon E5 for optimal performance isn’t easy, but thanks to the shared-memory model and hardware support for cache coherence, a little effort can go a long way toward harnessing multiple cores to accelerate applications. And with the help of the growing library of OpenMP algorithms already optimized for multicore acceleration, programmers are starting to see light at the end of the multiprocessing tunnel.
ScaleMP virtualizes dual Xeon E5-2600 processors with 128 Gbytes and a 50-core Xeon Phi coprocessor board with 8 Gbytes of memory, making it appear to programmers as a virtual SMP with a 66-core Xeon processor and 136 Gbytes of memory. Source: ScaleMP
At first glance, however, the new Intel Many Integrated Core (MIC) architecture behind the forthcoming massively parallel Xeon Phi family seems to require a different style of programming. The same techniques cannot be used directly with the 50-core Xeon Phi, because its extra cores sit on a PCIe coprocessor card. The Xeon E5 host processor shares its memory among its eight on-chip cores, but not with the Xeon Phi cores, which have their own memory space. Programmers can use the host Xeon E5 just for housekeeping and load their parallelized algorithms onto the Xeon Phi, but then those algorithms are restricted to the memory available on the coprocessor card. Fortunately, there is a solution that lets programmers ease into the MIC architecture: Intel is working with ScaleMP Inc. (Cupertino, Calif.) to make the transition (nearly) painless.
ScaleMP already offers its Versatile Symmetric Multi-Processing (vSMP) to users who want to turn multiple x86 servers into a single virtual machine (VM), and has pledged that it will have adapted vSMP to Intel’s massively parallel Xeon Phi when it is introduced this fall. By installing ScaleMP’s vSMP Foundation below the operating system on the Xeon E5 host system, into which a Xeon Phi card is plugged, programmers can treat the whole system as if it were a single symmetric multi-processor.
“We give programmers transparent access to all of the cores and memory in an Intel MIC regardless of how many Xeon E5s and Xeon Phis are installed in it,” says ScaleMP chief executive officer Shai Fultheim. “And all the code they have already written for multicore Xeons will run on the new coprocessor-based MIC architecture without alteration.”
For example, consider a host with dual eight-core Xeon E5-2600 processors and 128 Gbytes of memory, plus a 50-core Xeon Phi coprocessor board with 8 Gbytes of memory. Installing vSMP Foundation makes the system appear to be a single VM: a 66-core Xeon processor with 136 Gbytes of memory. OpenMP and custom programs that already run on the Xeon E5-2600 will also run on this system, accelerated by the Xeon Phi coprocessor, without any alterations.
Parallel code that requires more than the 8 Gbytes of memory on the Xeon Phi card will also run without alteration, making use of the entire memory space of the Xeon E5 host. ScaleMP says its “intelligent” caching techniques over PCIe hide the added memory latency, so there is almost no performance degradation. And even code that uses MMX, SSE, or AVX instructions, which the Xeon Phi does not support, will run without alteration, because vSMP traps those instructions and emulates them.
“We trap and emulate any instructions not supported by the Xeon Phi, so that parallel code that works now on a single Xeon processor will also run the first time with a Xeon Phi coprocessor installed,” Fultheim explains. “Emulation, of course, will slow the algorithm down somewhat, but our system shows the programmer which lines of code are affected so that they can rewrite them at any time to regain the lost speed.”
Colin Johnson is a Geeknet contributing editor and veteran electronics journalist, writing for publications from McGraw-Hill’s Electronics to UBM’s EETimes. Colin has written thousands of technology articles covered by a diverse range of major media outlets, from the ultra-liberal National Public Radio (NPR) to the ultra-conservative Rush Limbaugh Show. A graduate of the University of Michigan’s Computer, Control and Information Engineering (CICE) program, his master’s project was to “solve” the parallel processing problem 20 years ago when engineers thought it would only take a few years. Since then, he has written extensively about the challenges of parallel processors, including emulating those in the human brain in his John Wiley & Sons book Cognizers – Neural Networks and Machines that Think.