For massively parallel systems built from many-integrated-core (MIC) processors, such as multiple 60-core Xeon Phi coprocessors each with 240 threads, applications today can use the Message Passing Interface (MPI) for internode communication and shared memory, via OpenMP, Pthreads or OpenCL, for coordinating tasks on a single node.
Unfortunately, all these techniques become less effective as more cores are added to a system. For next-generation exascale systems with thousands of nodes, each with hundreds of cores, task scheduling overhead must be cut for high-performance computers (HPCs).
One promising solution: data-flow management techniques. These increase efficiency by standardizing tasks into codelets–small fragments of programs with known dependencies and constraints–then performing dynamic scheduling that maps tasks (whose dependencies have been satisfied) to processor resources in real time.
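The scheduling idea described above can be sketched in a few lines of Python. This is a single-threaded illustration only, not Swarm's API: the function `run_codelets`, the codelet names and the dependency table are all hypothetical. It shows the core rule of data-flow scheduling, in which a codelet becomes runnable the moment its last dependency completes.

```python
from collections import deque


def run_codelets(codelets, deps):
    """Run codelets as soon as their dependencies are satisfied.

    codelets: dict mapping a name to a zero-argument callable
              (a hypothetical stand-in for a codelet body).
    deps:     dict mapping a name to the set of names it waits on.
    Returns the list of names in the order they ran.
    """
    # Number of unsatisfied dependencies per codelet.
    pending = {name: len(deps.get(name, ())) for name in codelets}
    # Reverse map: which codelets are waiting on each codelet.
    dependents = {name: [] for name in codelets}
    for name, needs in deps.items():
        for needed in needs:
            dependents[needed].append(name)

    # Codelets with no dependencies are ready immediately.
    ready = deque(sorted(n for n, count in pending.items() if count == 0))
    order = []
    while ready:
        name = ready.popleft()
        codelets[name]()  # a real runtime would hand this to a free core
        order.append(name)
        for waiter in dependents[name]:
            pending[waiter] -= 1
            if pending[waiter] == 0:  # last dependency satisfied
                ready.append(waiter)
    if len(order) != len(codelets):
        raise ValueError("cyclic or unsatisfiable dependencies")
    return order


results = {}
codelets = {
    "load_a": lambda: results.update(a=2),
    "load_b": lambda: results.update(b=3),
    "multiply": lambda: results.update(product=results["a"] * results["b"]),
    "report": lambda: results.update(text=f"product={results['product']}"),
}
deps = {"multiply": {"load_a", "load_b"}, "report": {"multiply"}}
print(run_codelets(codelets, deps))
print(results["text"])
```

In a real many-core runtime the ready queue would feed many hardware threads at once; the sketch keeps a single loop so the dependency bookkeeping stays visible.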
Data-flow Architectures Speed Task Scheduling
Data-flow architectures manage data structures, task execution and global control across massive MICs without the usual message-passing and cache-coherence penalties of MPI, OpenMP, Pthreads and OpenCL, according to ET International (ETI, Newark, Del.), which recently announced it had ported its data-flow multiprocessor management suite–Swarm–to the Xeon Phi. Now massive MICs can avoid the overhead of synchronous task scheduling, opting instead for an asynchronous model that dynamically manages task allocation in real time.
“The problem is that synchronous tasks all reach a barrier, some taking a lot longer than others to get there due to cache coherence, memory- and other-contentions that make it very difficult to load-balance hundreds of threads,” explains Rishi Khan, vice president of research and development at ETI. “For massive arrays of Xeon Phi coprocessors, for instance, Swarm eliminates the overhead of having to synchronize all those many-integrated-cores, increasing execution efficiency and solving the scaling problem.”
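The cost of the barrier Khan describes can be illustrated with a small back-of-the-envelope model. The durations below are invented for illustration, and the model assumes each task depends only on the same thread's task from the previous phase, so a global barrier is stricter than the program needs. Neither function reflects how Swarm is implemented.

```python
def barrier_makespan(durations):
    """Total time when every phase ends at a global barrier.

    durations[p][w] is the (made-up) run time of thread w in phase p.
    Each phase lasts as long as its slowest thread.
    """
    return sum(max(phase) for phase in durations)


def dataflow_makespan(durations):
    """Total time when a task waits only for its own predecessor.

    Assumes thread w's task in phase p depends solely on thread w's
    task in phase p-1, so each thread simply runs its chain back to back.
    """
    return max(sum(chain) for chain in zip(*durations))


# Three phases on four threads; a different thread is slow in each phase.
durations = [
    [4, 1, 1, 1],
    [1, 4, 1, 1],
    [1, 1, 4, 1],
]
print(barrier_makespan(durations))   # 12
print(dataflow_makespan(durations))  # 6
```

With barriers, every phase costs four time units because three threads idle while the slow one finishes. Without them, no thread's chain exceeds six units.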
ETI’s SWift Adaptive Runtime Machine–Swarm–is aptly named for the next generation of parallel processors, which seek to harness massive computing arrays to divide and conquer task execution. Using asynchronous load-balancing techniques based on a data-flow memory model, Swarm optimizes core utilization in massively parallel MIC architectures, managing fine-grain data flow among compact codelets rather than relying on traditional synchronous techniques built on shared memory or message passing.
<em>MPI, OpenMP and OpenCL all use the communicating sequential processes (CSP) execution model (left), which treats each thread as an independent machine that runs for an arbitrary length of time, makes use of arbitrary memory locations and is oblivious to other threads. Swarm's data-flow execution model (right) uses uniform-sized codelets with known control and data dependencies, which allow faster execution because codelets run without the usual latency and blocking operations. SOURCE: ETI</em>
Intel has chosen to partner with ETI in its bid to develop an exascale high-performance parallel processor for the Defense Advanced Research Projects Agency’s (DARPA’s) Ubiquitous High Performance Computing (UHPC) program, whose goal is to build extremely efficient extreme-scale many-core processors. DARPA’s UHPC program runs through 2018 and includes ETI’s Khan among its principal investigators.
“The problem with both shared-memory and message-passing is barriers–tasks always end up waiting for other tasks, since some parts of programs just run faster than other parts. And the problem just gets increasingly worse as you add cores,” says Khan. “But with Swarm’s asynchronous task dispatching, these load balancing issues are addressed with fine-grain multi-threading rooted in data-flow technologies.”
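To see why dispatching tasks dynamically helps with load balancing, consider a minimal simulation. It assumes independent tasks with known run times and compares a fixed round-robin assignment against handing each task to whichever worker frees up first. The task times and function names are hypothetical, and this greedy policy is a generic stand-in rather than Swarm's actual dispatcher.

```python
import heapq


def static_makespan(task_times, workers):
    """Finish time when task i is pre-assigned to worker i % workers."""
    loads = [0] * workers
    for i, t in enumerate(task_times):
        loads[i % workers] += t
    return max(loads)


def dynamic_makespan(task_times, workers):
    """Finish time when each task goes to the earliest-free worker."""
    free_at = [0] * workers  # min-heap of times at which workers go idle
    heapq.heapify(free_at)
    for t in task_times:
        start = heapq.heappop(free_at)       # earliest-available worker
        heapq.heappush(free_at, start + t)   # busy until start + t
    return max(free_at)


# Two long tasks hidden among short ones (illustrative numbers).
task_times = [8, 1, 1, 1, 8, 1, 1, 1]
print(static_makespan(task_times, 4))   # 16
print(dynamic_makespan(task_times, 4))  # 9
```

Round-robin happens to put both long tasks on the same worker, which then runs for 16 units while the others sit idle. Dynamic dispatch places the second long task on a worker that has already finished a short one.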
Swarm’s roots in fine-grain multi-threading trace to ETI’s founder and president, Guang Gao, an engineering professor at the University of Delaware, where he pioneered multi-threaded data-flow techniques that expose implicit parallelism in algorithms, thus mitigating the latency issues that are becoming increasingly important for massive MICs.