SC12, supercomputing.org’s U.S. conference, held recently in Salt Lake City, UT, drew some of the best and brightest in scientific, engineering and enterprise supercomputing and HPC together to share ideas, view new tech and solutions, and (gently) compete for the benefit of science and the industry.
On Nov. 15, Intel blogged that Ping Tak Peter Tang, Jongsoo Park, Daehyun Kim and Vladimir Petrov of Intel’s Software and Services Group and Intel Labs had won “best paper” for their research into ways of speeding up Fourier Transforms performed in parallel on systems where communication costs contribute non-negligibly to overall execution time. This mostly refers to clusters, where lots of communication happens over a network fabric. But it’s worth noting that even on multi-core dies, or among threads running on the same core, communication always costs in terms of time, power, and code complexity.
While intercore communication on new manycore boards like the Xeon Phi 5110P may happen fast over the coprocessor’s bidirectional ring, layers of software abstraction – needed to adapt software to the platform and speed development and porting – add their costs too. And as these manycore coprocessors evolve to resemble – and eventually share code with – actual clusters or virtualized datacenters, communication costs will remain a sensitive budget issue for parallel software performance.
In fact, incorporating ever more communication and distribution is one important path by which supercomputers are becoming radically cheaper, more ubiquitous, and more programmer- and enterprise-friendly. So in an increasingly cluster-oriented world, the cost of communication to parallel application performance is likely, if anything, to loom larger and larger – especially for folks who want to run the most intense algorithms on the biggest datasets.
Indeed, as the winning Intel paper, titled “A Framework for Low-Communication 1-D FFT,” points out in its introduction, this has necessitated a real change in intellectual focus for algorithm researchers. In the past, improving the performance of FFT (Fast Fourier Transform: a family of fast algorithms for computing the Discrete Fourier Transform – DFT – of a finite sequence) implementations was largely a matter of ever more sophisticated arithmetic applied to exactly the right data; newer approaches instead take a path to optimization that seeks to minimize communication cost – sometimes by accepting a step or two backward in arithmetic sophistication.
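To ground the terminology: the DFT of a length-N sequence can be evaluated directly in O(N^2) operations, and an FFT computes exactly the same result in O(N log N). A minimal sketch in Python with NumPy (purely illustrative, not code from the paper):

```python
import numpy as np

def naive_dft(x):
    """Direct O(N^2) evaluation of the Discrete Fourier Transform."""
    N = len(x)
    n = np.arange(N)
    # W[k, n] = e^{-2*pi*i*k*n/N}: the full DFT matrix
    W = np.exp(-2j * np.pi * np.outer(n, n) / N)
    return W @ x

# An FFT library routine computes the same transform, just far faster.
x = np.random.default_rng(0).standard_normal(16)
assert np.allclose(naive_dft(x), np.fft.fft(x))
```

The two results agree to floating-point precision; the difference is purely in the operation count and, in parallel settings, the data movement.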
In the case of the Fourier Transform, the key insight, made in the early 1800s by Carl Friedrich Gauss and reinvented for computers by James Cooley of IBM and John Tukey of Princeton in the mid-60s, was that you can speed up DFT calculation by factoring a highly composite transform size N and solving smaller DFTs for each factor, reducing computation time from O(N^2) to O(N log N). The problem is that standard parallel algorithms following the Cooley-Tukey model decompose the computation into smaller-scale FFT tasks whose results must be recombined via three all-to-all transposes – a paradigm called (by those capable of understanding the math) ‘triple all-to-all.’
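The decomposition itself can be sketched with NumPy: splitting N = N1·N2 turns one length-N FFT into column FFTs, twiddle-factor multiplies, and row FFTs. Each switch of axis is exactly where a distributed implementation must transpose data across processors – the all-to-alls in question. (This is a generic serial illustration of the classic decomposition, not the Intel paper’s algorithm.)

```python
import numpy as np

def cooley_tukey_2step(x, N1, N2):
    """Length N1*N2 FFT via the classic Cooley-Tukey decomposition.
    The reshape/axis switches are the points where a parallel code
    would need all-to-all transposes of the distributed data."""
    A = x.reshape(N1, N2)        # view x[n1*N2 + n2] as an N1 x N2 matrix
    B = np.fft.fft(A, axis=0)    # N2 independent FFTs of length N1
    k1 = np.arange(N1)[:, None]
    n2 = np.arange(N2)[None, :]
    B = B * np.exp(-2j * np.pi * k1 * n2 / (N1 * N2))  # twiddle factors
    C = np.fft.fft(B, axis=1)    # N1 independent FFTs of length N2
    return C.reshape(-1, order="F")   # output index k = k1 + N1*k2

x = np.random.default_rng(0).standard_normal(64) + 0j
assert np.allclose(cooley_tukey_2step(x, 8, 8), np.fft.fft(x))
```

On one node the reshapes are nearly free; spread the matrix rows across a cluster and each axis change becomes an all-to-all exchange over the fabric, which is precisely the cost the Intel team set out to reduce.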
What the Intel team figured out was a new way of managing the decomposition: convolving the sample to increase the number of sample points, then sparsely processing this ‘inflated’ sample conventionally, and combining results via a single all-to-all transpose. Minimal additional communication overhead is incurred in the convolution, because only nearest-neighbor processors need talk to one another in that step.
The result – pretty thrillingly – is a family of single all-to-all, in-order, O(N log N) DFT factorizations that are between half-again and twice as fast as well-known competing algorithms on a range of cluster test-beds. The best speed increases exploit an additional capability of the new method (called SOI) to work faster if requirements for accuracy are relaxed.
The paper is well worth reading – even non-mathematicians (like the current author) can follow the general argument; and for those with the math background to follow the equations, the authors present a rigorous theoretical foundation for their work before anatomizing it. So the paper serves as a great backgrounder for those seeking a state-of-the-art understanding of FFT, as well as for those with specific performance problems to solve.