Data-Flow for MIC Cuts Multi-Thread Overhead


For massively parallel systems built from many-integrated-core (MIC) processors, such as multiple 60-core Xeon Phi coprocessors each running 240 threads, applications today can use the Message Passing Interface (MPI) for internode communications and shared memory, via OpenMP, Pthreads or OpenCL, for coordinating tasks on a single node.

Unfortunately, all of these techniques become less effective as more cores are added to a system. For next-generation exascale machines with thousands of nodes, each containing hundreds of cores, task-scheduling overhead must be reduced if high-performance computers (HPCs) are to keep scaling.

One promising solution: data-flow management techniques. These increase efficiency by standardizing tasks into codelets, small fragments of programs with known dependencies and constraints, and then performing dynamic scheduling that maps tasks whose dependencies have been satisfied onto processor resources in real time.
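To make the codelet idea concrete, here is a minimal dependence-counting sketch in plain C. The names and the single-threaded run loop are hypothetical illustrations of the general technique, not Swarm's actual API: each codelet carries a count of unmet inputs and becomes runnable the moment that count reaches zero.

/* Hypothetical sketch of dependence-counted scheduling (not the Swarm API). */
#include <stdio.h>

typedef struct codelet {
    void (*fire)(struct codelet *self);   /* work to run when ready          */
    int unmet_deps;                       /* inputs still outstanding        */
    struct codelet *successor;            /* codelet to signal on completion */
} codelet;

#define MAX_READY 64
static codelet *ready[MAX_READY];
static int n_ready = 0;

/* Called when one input of codelet 'c' becomes available. */
static void signal_dep(codelet *c)
{
    if (c && --c->unmet_deps == 0)
        ready[n_ready++] = c;             /* all inputs ready: schedule it */
}

static void run_ready(void)
{
    while (n_ready > 0) {
        codelet *c = ready[--n_ready];
        c->fire(c);
        signal_dep(c->successor);         /* completion satisfies the next codelet */
    }
}

static void say(struct codelet *self) { (void)self; puts("codelet fired"); }

int main(void)
{
    codelet b = { say, 1, NULL };         /* waits on one input from a */
    codelet a = { say, 0, &b  };          /* no inputs: ready at once  */
    ready[n_ready++] = &a;
    run_ready();                          /* prints "codelet fired" twice */
    return 0;
}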

Data-flow Architectures Speed Task Scheduling

Data-flow architectures allow data structures, task execution and global control to scale across massive MIC arrays without the usual message-passing and cache-coherence penalties of MPI, OpenMP, Pthreads and OpenCL, according to ET International (ETI, Newark, Del.), which recently announced it had ported its data-flow multiprocessor management suite, Swarm, to the Xeon Phi. Massive MIC systems can now avoid the overhead of synchronous task scheduling, opting instead for an asynchronous model that dynamically manages task allocation in real time.

“The problem is that synchronous tasks all reach a barrier, some taking a lot longer than others to get there due to cache coherence, memory contention and other contention that makes it very difficult to load-balance hundreds of threads,” explains Rishi Khan, vice president of research and development at ETI. “For massive arrays of Xeon Phi coprocessors, for instance, Swarm eliminates the overhead of having to synchronize all those many integrated cores, increasing execution efficiency and solving the scaling problem.”

ETI’s SWift Adaptive Runtime Machine (Swarm) is aptly named for a next generation of parallel processors that seeks to harness massive computing arrays to divide and conquer task execution. Using asynchronous load-balancing techniques based on a data-flow memory model, Swarm optimizes core utilization in massively parallel MIC architectures with asynchronous, fine-grain data-flow management among compact codelets, rather than traditional synchronous techniques using shared memory or message passing.


MPI, OpenMP and OpenCL all use the communicating sequential processes (CSP) execution model (left), which treats each thread as an independent machine that runs for an arbitrary length of time, uses arbitrary memory locations and is oblivious to other threads. Swarm's data-flow execution model (right) uses uniform-sized codelets with known control and data dependencies, allowing faster execution because codelets run without the usual latency and blocking operations. SOURCE: ETI

Intel has chosen to partner with ETI in its bid to develop an exascale high-performance parallel processor for the Defense Advanced Research Projects Agency’s (DARPA’s) Ubiquitous High Performance Computing (UHPC) program, whose goal is to build extremely efficient extreme-scale many-core processors. DARPA’s UHPC program runs through 2018 and includes ETI’s Khan among its principal investigators.

“The problem with both shared memory and message passing is barriers; tasks always end up waiting for other tasks, since some parts of programs just run faster than other parts. And the problem just gets worse as you add cores,” says Khan. “But with SWARM’s asynchronous task dispatching, these load-balancing issues are addressed with fine-grain multi-threading rooted in data-flow technologies.”

Swarm’s roots in fine-grain multi-threading derive from ETI’s founder and president, Guang Gao, an engineering professor at the University of Delaware, where he pioneered multi-threaded data-flow techniques that expose implicit parallelism in algorithms, thus mitigating the latency issues that are becoming increasingly important for massive MICs.


Posted by R. Colin Johnson, Geeknet Contributing Editor
christopher1

I see where you're coming from, but it would seem to me that a compiler for a massively parallel architecture would be able to look at the tradeoffs between carrying out the same step multiple times, so that interim results would be available simultaneously, vs. using fewer CPU cycles but having to wait.

Rishi Khan

@christopher1 Salient point.

We do not address this in SWARM to date except in the following way: if you schedule a codelet without any constraints, it is scheduled to your local core and will only be moved if another core is not busy and can steal it (like Cilk).
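(For illustration only, a toy C sketch of that local-deque stealing; the names are hypothetical and this is not the SWARM scheduler. The owning core pushes and pops codelets at the bottom of its own deque, and an idle core steals from the top of a busy one.)

/* Toy, sequential simulation of Cilk-style work stealing. */
#include <stdio.h>

#define QCAP 8

typedef struct { int items[QCAP]; int top, bottom; } deque;

static void push_bottom(deque *d, int task) { d->items[d->bottom++ % QCAP] = task; }
static int  pop_bottom (deque *d) { return d->bottom > d->top ? d->items[--d->bottom % QCAP] : -1; }
static int  steal_top  (deque *d) { return d->bottom > d->top ? d->items[d->top++ % QCAP]    : -1; }

int main(void)
{
    deque core0 = {0}, core1 = {0};

    /* codelets scheduled without constraints land on the local core (core 0) */
    for (int t = 1; t <= 4; t++)
        push_bottom(&core0, t);

    int mine  = pop_bottom(&core0);      /* core 0 works on its newest task  */
    int taken = steal_top(&core1);       /* core 1's own deque is empty...   */
    if (taken < 0)
        taken = steal_top(&core0);       /* ...so it steals core 0's oldest  */

    printf("core0 runs task %d, core1 stole task %d\n", mine, taken);
    return 0;
}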

In development for the future, we are attacking this problem on two fronts:

1. We are working (through the DOE XStack program) on compiler techniques to intelligently place data/tasks and schedule tasks statically.

2. We are working on using information about the locality of the input variables to inform where to schedule the codelet, trading off concurrency against data locality.

Rishi Khan

@bjchip 

Sure. SWARM is just a source-to-source compiler, scheduler, load balancer, macros and network layer, all on top of Pthreads, Linux, and TCP/IP or InfiniBand.

There is no magic. It's a framework to make your life easier and your programming more productive for parallel algorithm development.

bjchip

@Rishi Khan - Thanks for that. I think where I was going with my comment is that asynchronous schedulers/load-balancers are (if I am understanding your usage correctly) basic to an RTOS but not to the more common major OS implementations. I come out of the RTOS-embedded world and it took me a while to remember that these features are not a standard part of other people's environments. (Growing old is not for sissies.)

Getting an RTOS builder (not me, I just use 'em) involved might be useful with those aspects?

Making the parallelism "convenient" is quite important, as most implementations of threading tend to be annoying enough that normal people would not create hundreds of separate threads for things the size of a codelet... (not even sure I would do it, though I'd probably build some macros to help). So it will DEFINITELY make life easier.

Well it would if I could afford a Phi to play with ;-)    

Rishi Khan

@christopher1 Exposing implicit parallelism has been going on for a long time. However, standard programming models (i.e., MPI, OpenMP, OpenCL) all use fork/join/barrier-style synchronization to create artificial serialization points. By 'artificial', I mean that there is no need to serialize other than that the programming model doesn't allow you to express the problem otherwise. For example, say you had an algorithm that used all the threads on a node to produce some vector that needed to be used by other machines, and at the same time it would accept vectors from other machines, and when it got 4, it would do a further computation. This is exactly a 2D Jacobi. In standard programming models, you would have to implement this as:

while (!done) {
    compute();
    communicate();
    BARRIER();
}

Sounds okay until the compute is not load-balanceable (think computational fluid dynamics of a moving helicopter blade).

This is where SWARM can help: it allows you to do lots of these tiles, spread across the machine, and asynchronously work on the computation and communication as data becomes available.
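(Again purely illustrative, with hypothetical names rather than the SWARM API: a sequential C toy of that per-tile dependence counting. Each tile is enqueued for its next step as soon as both neighbor halos for that step have arrived, so a fast tile never waits at a global barrier for a slow one.)

/* Toy per-tile dependence counting on a 1-D ring of tiles; no real math. */
#include <stdio.h>

#define NTILES 4
#define NSTEPS 2

typedef struct {
    int id;
    int next_step;              /* step this tile will compute next (1-based) */
    int halos[NSTEPS + 2];      /* neighbor halos received, indexed by step   */
} tile;

static tile tiles[NTILES];
static tile *ready[NTILES * (NSTEPS + 1)];
static int n_ready = 0;

/* A neighbor's halo needed for 'for_step' has arrived at tile t. */
static void send_halo(tile *t, int for_step)
{
    if (for_step > NSTEPS)
        return;                                   /* past the last step    */
    if (++t->halos[for_step] == 2 && t->next_step == for_step)
        ready[n_ready++] = t;                     /* runnable: no barrier  */
}

static void run_tile(tile *t)
{
    printf("tile %d computes step %d\n", t->id, t->next_step);
    /* finishing this step ships the halos both neighbors need next */
    send_halo(&tiles[(t->id + 1) % NTILES], t->next_step + 1);
    send_halo(&tiles[(t->id + NTILES - 1) % NTILES], t->next_step + 1);
    t->next_step++;
    if (t->next_step <= NSTEPS && t->halos[t->next_step] == 2)
        ready[n_ready++] = t;                     /* its halos arrived early */
}

int main(void)
{
    for (int i = 0; i < NTILES; i++) {
        tiles[i] = (tile){ .id = i, .next_step = 1 };
        ready[n_ready++] = &tiles[i];             /* step 1 needs no halos here */
    }
    while (n_ready > 0)
        run_tile(ready[--n_ready]);
    return 0;
}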

Now throw in some GPUs and a Xeon Phi, and the load balancing and heterogeneity become overwhelming using MPI/OpenMP/OpenCL.

R Colin Johnson

The way Khan described it to me, it will be beyond 256 threads where the real advantages of asynchronous scheduling will become obvious.

bjchip

As near as I can tell, this is using a parallelizing compiler and some fairly nice macro wrappers to encapsulate pthreads that aren't all that well protected inside an RTOS environment.    

It is perhaps useful to make access to the parallelism quicker and more convenient. The interface to the GPU is interesting for its massive parallelism but I am still "not quite getting it". I don't see a breakthrough, just a combination of some fairly well recognized technologies. Nice but... if I had a Phi to play with, and a parallelizing compiler, and the time to do it, I'd probably make some macros myself... maybe not much different from this, because it's sort of obvious to me to do it.

OTOH, I might, in that position, pay for this, as there are few things as futile in this business as RE-doing and testing something that someone else has already done and tested. :-)

christopher1

OK, some of this is old hat.

Trying to expose implicit parallelism has been going on for a long time.

WHY SHOULD IT WORK NOW JUST BECAUSE THE PROCESSOR COST IS COMING DOWN?

R Colin Johnson

@christopher1 Khan said that the implicit parallelism was revealed by changing the dataflow into equal-sized codelets--with known dependencies and constraints.

christopher1

@R Colin Johnson @christopher1 You said of existing methods of parallelization: "I mean that there is no need to serialize other than the programming model doesn't allow you to express the problem otherwise."

How has the programming model changed so that no unnecessary serialization occurs?

It sounds to me that you are just taking advantage of the additional horsepower provided by inexpensive cores.

Rishi Khan

@christopher1 Good question. The programming model is changed in that, instead of producing a serial program with parallel regions (a la OpenMP), tasks and their dependencies are defined (some statically, and most dynamically). This allows for asynchronous execution of all tasks, synchronizing only where necessary.
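(Not SWARM, just a small illustration of that contrast using OpenMP 4.0 task dependences: each task names the data it reads and writes and runs once those dependences are satisfied, instead of everything meeting at the end of a parallel region.)

/* Build with -fopenmp; tasks fire as their named dependences are met. */
#include <stdio.h>

int main(void)
{
    int a = 0, b = 0, c = 0;

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: a)
        a = 1;                                  /* producer                 */

        #pragma omp task depend(out: b)
        b = 2;                                  /* independent producer     */

        #pragma omp task depend(in: a, b) depend(out: c)
        c = a + b;                              /* fires when a and b ready */
    }                                           /* tasks complete here      */

    printf("c = %d\n", c);                      /* prints c = 3 */
    return 0;
}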

You said: It sounds to me that you are just taking advantage of the additional horsepower provided by inexpensive cores.

This is also the case. However, note that these 'inexpensive cores' are connected by a bandwidth-constrained ring cache bus, and contention between caches can cause arbitrary stalls. Therefore, a fork/join model is plagued not only by Amdahl's law but also by computational load imbalance (or, more appropriately, poorly balanced bulk-synchronous tasks due to resource contention).

atomicenxo

Colin - Interesting. How "now" is this?