Over the past several days, nearly every high-performance computing (HPC) website on the Internet has covered the basics of the Department of Energy (DOE) award of two initial research subcontracts, totaling $19 million, to Intel Federal LLC, part of the Extreme-Scale Computing Research and Development “FastForward” program managed for DOE by Lawrence Livermore National Security, LLC (LLNS) (see asc.llnl.gov/fastforward for the RFP and appendices).
FastForward is the initial, two-year program funding basic R&D on CPUs, memory, storage, and I/O in support of creating, by 2020 or thereabouts, computers a thousand times more powerful than we have today. Reaching the exascale by decade’s end is also a major goal for Intel, and its overall strategy, particularly in light of its recent acquisitions of QLogic’s InfiniBand interconnect business and of Cray’s Aries interconnect technology, has made increasingly clear its intention of assembling a complete IP, product, and talent portfolio in pursuit of the exascale goal. Its announcement, four days ago, of the acquisition of WhamCloud, a leading developer of the Lustre file system used in a majority of the world’s fastest supercomputers, emphasizes the point, particularly since WhamCloud was also a FastForward awardee.
The relatively small value of the initial FastForward awards belies the criticality and scale of the national effort that will surely follow this initial R&D phase: a Moon-shot-scale undertaking, modeled by its participants as a dynamic public/private partnership. The acute need for such an effort is revealed in DOE’s Statement of Work for FastForward, available along with other documents as a .zip archive.
A must-read document
The document, called “04_FastForward_SOW_FinalDraftv3.docx,” deserves wide attention. Exceptionally well-written and representing significant expertise and vision, it clarifies both the stakes in reaching exascale capability, and the physical and economic roadblocks this effort will need to overcome. And it begins by explaining why DOE is spearheading this effort, in tandem with the National Nuclear Security Administration.
The need in every field of science and technology is pretty clear. Between 2008 and 2010, DOE program offices sponsored a series of workshops to identify what they call ‘grand challenges’ for exascale modeling and simulation. Reports on those workshops are required reading for anyone hoping to chart the near- or intermediate-term future of HPC, and evaluate its potential for changing the game in different fields of study and their attendant markets.
The bottom-line lesson common to all those reports, however, is that with present technologies, the energy required for computing, data transfer, memory, and storage will be the major stumbling block to performing an exaflop of number-crunching on an exabyte of data. Doing that much computing that fast would require more than $2.5 billion in expenditure per year to support a gigawatt of constant load (the SOW laconically notes that this is ‘more than many power plants currently produce’).
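A quick back-of-the-envelope check (mine, not the SOW’s) makes the scale of that bill concrete; the electricity rate below is an assumed figure, used only for illustration:

```python
# Back-of-the-envelope cost of a gigawatt of constant load.
# The electricity rate is an illustrative assumption, not a figure from the SOW.

HOURS_PER_YEAR = 24 * 365            # 8,760 hours
GIGAWATT_IN_KW = 1_000_000           # 1 GW expressed in kilowatts
ASSUMED_RATE_USD_PER_KWH = 0.10      # assumed bulk electricity rate

annual_kwh = GIGAWATT_IN_KW * HOURS_PER_YEAR              # ~8.76 billion kWh
annual_cost_usd = annual_kwh * ASSUMED_RATE_USD_PER_KWH   # ~$876 million

print(f"Energy per year at 1 GW constant load: {annual_kwh:,.0f} kWh")
print(f"Utility bill at ${ASSUMED_RATE_USD_PER_KWH:.2f}/kWh: ${annual_cost_usd:,.0f}")
```

Even at that modest rate the bill approaches a billion dollars a year; the SOW’s figure of more than $2.5 billion presumably folds in cooling, power distribution, and the rest of the facility overhead on top of the raw per-kWh charge.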
The DOE (and, obviously, years of research across the broader HPC community) has determined that there’s a factor-of-five shortfall in power efficiency that needs to be closed for exascale computing to be feasible. On top of that physics problem sits an economic/business problem, which is simply that most current technology roadmaps a) fail to close this gap, and b) will keep failing to close it, because plenty of profitable HPC applications exist below the exascale, so market forces alone won’t push vendors to the needed efficiency. This turns out to be an interesting situation in which the perfect is the enemy of the good, and where non-market stimulus to fundamental research can materially help overcome the limits of the market forces that, under less stringent circumstances, serve as adequate incentives and guides to progress.
In other words, it’s all about getting exascale’s power draw, and the electric bill that comes with it, down to about 20 MW, which at this point looks hard to do.
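It’s worth spelling out what that 20 MW goal means per operation. Dividing the power budget by an exaflop’s 10^18 operations per second (a sketch of my own, using only the figures already cited) gives a budget of roughly 20 picojoules per operation:

```python
# What the 20 MW goal implies per operation. The 20 MW budget and the
# exaflop (10^18 operations per second) come from the article and SOW;
# the division is the only step added here.

TARGET_POWER_W = 20e6        # 20 MW system power budget
OPS_PER_SECOND = 1e18        # one exaflop

joules_per_op = TARGET_POWER_W / OPS_PER_SECOND            # 2e-11 J
print(f"Energy budget per operation: {joules_per_op * 1e12:.0f} pJ")   # ~20 pJ
```

And that 20 pJ has to cover not just the floating-point arithmetic but fetching operands from memory and moving data across the machine, which is where most of the factor-of-five shortfall has to be made up.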
The second of DOE’s foci is concurrency, and its analysis of why and how exascale computing changes the nature and energetics of concurrency and parallelism is worth a serious read. Here the ‘energy question’ centers on how data-transport energy grows as a share of total energy use as the number of processing nodes explodes, and on how the vastly magnified latencies of cross-system operations such as global sums demand yet more layers of efficiency planning to keep CPUs busy. The amusing takeaway line here is “… the flattening of clock rates has one positive effect in that such latencies will not get dramatically worse by themselves,” which is the first time I’ve heard anyone find anything positive to say about hitting Moore’s Wall.
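To get a rough feel for those cross-system sum latencies, here is a toy model of my own (not the SOW’s): a global sum performed over a binary reduction tree, with the per-hop network latency and the core clock rate both assumed for illustration:

```python
import math

# Toy model of a tree-based global sum across N nodes. The per-hop latency
# and the clock rate below are illustrative assumptions, not SOW figures.

PER_HOP_LATENCY_S = 1e-6    # assumed ~1 microsecond per network hop
CORE_CLOCK_HZ = 2e9         # assumed ~2 GHz core clock (flat, as the SOW notes)

for nodes in (10_000, 100_000, 1_000_000):
    hops = math.ceil(math.log2(nodes))       # depth of the reduction tree
    latency_s = hops * PER_HOP_LATENCY_S     # time for one global sum
    idle_cycles = latency_s * CORE_CLOCK_HZ  # cycles a waiting core sits idle
    print(f"{nodes:>9,} nodes: {hops:2d} hops, {latency_s * 1e6:4.0f} us, "
          f"~{idle_cycles:,.0f} cycles idle per global sum")
```

The point of the exercise: multiplying the node count a hundredfold adds only a handful of hops, but every global sum still parks a core for tens of thousands of cycles, so hiding that latency, rather than eliminating it, becomes the design problem. And because clock rates have flattened, at least the cycle count doesn’t keep climbing, which is presumably the ‘positive effect’ the SOW has in mind.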
Anyone interested in the future of HPC will be well served by reading through this exceptionally clear SOW, and its tributary reports. Meanwhile, in light of the challenges it faces, it is heartening to see that the public/private effort begun in FastForward looks to be guided by people who really understand these challenges.
John Jainschigg is a Geeknet contributing editor and CEO of World2Worlds, Inc., a digital agency focused on immersive technology and gaming. John’s first introduction to concurrency came via interrupt and re-entrancy programming at the assembler level on Z80- and 68000-based systems. He wrote concurrent, time-critical packet-switching applications on HP-UX RISC machines in the late 1980s, and has since worked up and down the client-server stack in Java, C++, PHP, and other conventional and scripting languages, and more recently in task-specific, state-based, radically concurrent languages like LSL.