<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Go Parallel &#187; Tune</title>
	<atom:link href="http://goparallel.sourceforge.net/tune/feed/" rel="self" type="application/rss+xml" />
	<link>http://goparallel.sourceforge.net</link>
	<description>Translating Multicore Power into Application Performance</description>
	<lastBuildDate>Wed, 22 May 2013 01:06:57 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>Mere Mortals: Compile Fortran, C, C++ like a Ninja</title>
		<link>http://goparallel.sourceforge.net/mere-mortals-compile-fortran-c-c-like-a-ninja/</link>
		<comments>http://goparallel.sourceforge.net/mere-mortals-compile-fortran-c-c-like-a-ninja/#comments</comments>
		<pubDate>Wed, 15 May 2013 00:05:07 +0000</pubDate>
		<dc:creator>gpmcarollo</dc:creator>
				<category><![CDATA[Home]]></category>
		<category><![CDATA[Tune]]></category>

		<guid isPermaLink="false">http://goparallel.sourceforge.net/?p=3231</guid>
		<description><![CDATA[Highly optimized computing routines are often associated with low-level programming. Assembly code, intrinsic functions, OS-level multithreading interfaces and other sharp weapons used by &#8220;ninja programmers&#8221; are believed necessary for penetrating deep into hardware and snatching every FLOP, especially when optimization is performed for a computing accelerator. In this case, the &#8220;common man&#8221; with only [...]]]></description>
				<content:encoded><![CDATA[<p>Highly optimized computing routines are often associated with low-level programming. Assembly code, intrinsic functions, OS-level multithreading interfaces and other sharp weapons used by &#8220;ninja programmers&#8221; are believed necessary for penetrating deep into hardware and snatching every FLOP, especially when optimization is performed for a computing accelerator. In this case, the &#8220;common man&#8221; with only high-level language skills need not apply.</p>
<p>Traditional HPC languages (Fortran, C and C++) offer little native control over hardware capabilities such as SIMD operations, multi-core availability and prefetch instructions. The burden of optimization therefore falls either on the expert programmer, who must optimize general-purpose library routines with close-to-hardware coding, or on the compiler, which tries to automatically arm the high-level code with some knowledge of the hardware architecture.</p>
<p>This all changed last fall, however, with the arrival of a new approach to automatic optimization on computing accelerators: Intel’s Many Integrated Core (MIC) architecture. Unlike GPGPUs, the MIC architecture can be programmed in standard Fortran, C and C++, and it supports common HPC parallel frameworks such as OpenMP and MPI. Most importantly, the new compiler suite knows how to compile Fortran, C or C++ code written by a &#8220;mere mortal&#8221; to run on the coprocessor as if optimized by a &#8220;ninja&#8221;.</p>
<p>Demonstrated automatic optimization capabilities open doors to scientists and engineers wishing to boost the performance of their general-purpose functions using the MIC architecture. A mathematical function, empirical functional relationship, differential equation solution – all can now be expressed in a high-level language and entrusted to the compiler for optimization.</p>
<p>As a bonus, a library function implemented in a high-level language will scale forward to future computing architectures in the blink of an eye. That is, in a swing of the compiler&#8217;s &#8220;ninjato&#8221;.</p>
<p><strong>Learn More</strong></p>
<p>This automatic optimization capability is outlined in a new <a href="http://research.colfaxinternational.com/post/2013/05/03/Fast-Library-Xeon-Phi.aspx" target="_blank">paper</a> published by Colfax Research.</p>
<p>You’ll see step by step how to construct a library of special functions and make it offloadable to an Intel Xeon Phi coprocessor. Using a C++ language extension, the authors inform the compiler that certain functions are candidates for automatic vectorization in user applications. Finally, they tune the high-level language code of the function so that the compiler can optimize it as well as possible. As a result, their implementation of the Gauss error function performs on par with the highly optimized vendor implementation.</p>
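<p>To give a flavor of the idea, here is a minimal sketch, not the paper’s code. It assumes the OpenMP 4.0 <code>declare simd</code> pragma as a standard stand-in for the language extension mentioned above, and it uses the textbook Abramowitz-Stegun approximation 7.1.26 for the error function. The names <code>fastErf</code> and <code>erfArray</code> are hypothetical.</p>

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical library function: erf(x) via the Abramowitz-Stegun
// approximation 7.1.26 (absolute error of roughly 1.5e-7).
// "declare simd" asks an OpenMP 4.0+ compiler to also emit a vector version;
// compilers built without OpenMP support simply ignore the pragma.
#pragma omp declare simd
double fastErf(double x) {
    const double sign = (x < 0.0) ? -1.0 : 1.0;
    const double ax = std::fabs(x);
    const double t = 1.0 / (1.0 + 0.3275911 * ax);
    // Horner evaluation of the degree-5 polynomial in t.
    const double poly =
        t * (0.254829592 +
        t * (-0.284496736 +
        t * (1.421413741 +
        t * (-1.453152027 +
        t * 1.061405429))));
    return sign * (1.0 - poly * std::exp(-ax * ax));
}

// User code: a plain loop that the compiler may vectorize by calling the
// vector version of fastErf.
void erfArray(const std::vector<double>& in, std::vector<double>& out) {
    assert(in.size() == out.size());
    const std::size_t n = in.size();
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
        out[i] = fastErf(in[i]);
}
```

<p>The function body stays ordinary high-level C++; whether the loop is actually vectorized depends on the compiler and its options.</p>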
]]></content:encoded>
			<wfw:commentRss>http://goparallel.sourceforge.net/mere-mortals-compile-fortran-c-c-like-a-ninja/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Improving Your Coding AND Professional Craft</title>
		<link>http://goparallel.sourceforge.net/improving-your-coding-and-professional-craft/</link>
		<comments>http://goparallel.sourceforge.net/improving-your-coding-and-professional-craft/#comments</comments>
		<pubDate>Fri, 26 Apr 2013 18:14:20 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Home]]></category>
		<category><![CDATA[Tune]]></category>
		<category><![CDATA[Video]]></category>

		<guid isPermaLink="false">http://goparallel.sourceforge.net/?p=3101</guid>
		<description><![CDATA[Becoming a better parallel coder can also make you smarter in your professional field, according to Michael D’Mello, Program Manager, Tools Immersion Program at Intel. Increasing mastery of programming tools not only ups your development game, but over time can become “an integral part of your workflow and thought process” that yields new insights. [...]]]></description>
				<content:encoded><![CDATA[
<p>Becoming a better parallel coder can also make you smarter in your professional field, according to Michael D’Mello, Program Manager, Tools Immersion Program at Intel. Increasing mastery of programming tools not only ups your development game, but over time can become “an integral part of your workflow and thought process” that yields new insights. Find out how. Watch the interview here. (12:57)</p>
<p><iframe src="http://www.youtube.com/embed/7WD4mMc1so8?rel=0" frameborder="0" width="560" height="315"></iframe></p>
<div>
<p><strong>Key topics:</strong></p>
<ul>
<li>Progressing from Good to Very Well to Optimal</li>
<li>New features in Intel&reg; VTune&trade; Amplifier XE 2013 and Intel&reg; Parallel Inspector XE</li>
<li>New ways to identify hotspots, memory and threading errors</li>
</ul>
<h3><strong>Related Posts</strong></h3>
<p><strong><a href="http://goparallel.sourceforge.net/real-world-verification-done-fluidly/" target="_blank">Real-World Verification, Done Fluidly</a></strong></p>
<p><a href="http://goparallel.sourceforge.net/how-developers-can-handle-the-new-hardware-complexity/" target="_blank"><strong>How Developers Can Handle the New Hardware Complexity</strong></a></p>
<p><a href="http://goparallel.sourceforge.net/verifying-parallelization-of-c-code-with-parallel-inspector/" target="_blank"><strong>Verifying Parallelization of C++ Code with Parallel Inspector</strong></a></p>
<h3><strong>Downloads</strong></h3>
<p><a href="https://makebettercode.com/cbsi/cluster_parallel/index.php?utm_source=Geeknet+-+Go+Parallel+Portal+Text+Links+-+Text&amp;utm_medium=Link&amp;utm_content=VTune+Amplifier+XE+2013+&amp;utm_campaign=2013_Intel_DPD"><strong>Intel&reg; VTune&trade; Amplifier XE 2013</strong></a></p>
<p>Powerful threading and performance profiler helps improve application performance and scalability.</p>
<p><a href="https://makebettercode.com/cbsi/cluster_parallel/index.php?utm_source=Geeknet+-+Go+Parallel+Portal+Text+Links+-+Text&amp;utm_medium=Link&amp;utm_content=Inspector+XE+2013+&amp;utm_campaign=2013_Intel_DPD"><strong>Intel&reg; Parallel Inspector XE 2013</strong></a></p>
<p>Advanced memory and thread checker helps easily find memory leaks, corruption, data races, and more.</p>
<h3><strong>SPEAKER BIO</strong></h3>
<p><strong><img class="alignleft size-full wp-image-3102" title="1" src="http://goparallel.sourceforge.net/wp-content/uploads/2013/04/12.jpg" alt="" width="90" height="90" />Michael D’Mello</strong></p>
<p><strong>Program Manager, Tools Immersion Program, Intel</strong></p>
<p><em>Analysis: Measuring System Performance, Workload Balance, and Code Efficiency—Intel&reg; VTune&trade; Amplifier XE 2013</em></p>
<p><em>Debug for Correctness—Find Memory and Thread Errors Using Intel&reg; Parallel Inspector XE</em></p>
<p>For the last ten years, Michael has been focused on tool-based approaches to software optimization. Prior to joining Intel in 2003, Michael held various technical positions at the Hewlett-Packard Company, Convex Computer Corporation, and Thinking Machines Corporation. He has more than 20 years of experience in the parallel computing industry. Michael received a Ph.D. in chemical physics from the University of Texas at Austin.</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://goparallel.sourceforge.net/improving-your-coding-and-professional-craft/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Meeting New Challenges of Scaling Parallelism Across and Within Cores</title>
		<link>http://goparallel.sourceforge.net/meeting-new-challenges-of-scaling-parallelism-across-and-within-cores/</link>
		<comments>http://goparallel.sourceforge.net/meeting-new-challenges-of-scaling-parallelism-across-and-within-cores/#comments</comments>
		<pubDate>Tue, 23 Apr 2013 19:14:44 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Home]]></category>
		<category><![CDATA[Tune]]></category>
		<category><![CDATA[Video]]></category>

		<guid isPermaLink="false">http://goparallel.sourceforge.net/?p=3078</guid>
		<description><![CDATA[With today’s powerful new multi-core processors, if you’re not vectorizing – breaking data into chunks – you risk leaving 8x or 16x of Intel Xeon Phi’s theoretical peak performance on the table and not scaling code efficiently. The solution, according to Intel’s Ron Green, is for programmers to start looking at internal parallelization of [...]]]></description>
				<content:encoded><![CDATA[
<p>With today’s powerful new multi-core processors, if you’re not vectorizing – breaking data into chunks – you risk leaving 8x or 16x of Intel Xeon Phi’s theoretical peak performance on the table and not scaling code efficiently. The solution, according to Intel’s Ron Green, is for programmers to start looking at parallelism within a processor as well as across processors. Join Ron and Go Parallel Editor Joe Maglitta for a thought-provoking discussion about how you can get the most from today’s powerful new parallel hardware. (11:27)</p>
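<p>Where the 8x and 16x come from can be sketched in a few lines. The Intel Xeon Phi coprocessor has 512-bit vector registers, which hold 8 doubles or 16 floats. The helper names below (<code>simdLanes</code>, <code>addInChunks</code>) are hypothetical, and the chunked loop only illustrates how a vectorizing compiler splits the work; it does not by itself guarantee vector instructions.</p>

```cpp
#include <cassert>
#include <cstddef>

// Number of elements of type T that fit in one vector register of the given
// width. 512 bits is the vector width of the Intel Xeon Phi coprocessor.
template <typename T>
constexpr std::size_t simdLanes(std::size_t registerBits = 512) {
    return registerBits / (8 * sizeof(T));
}

// Element-wise a[i] += b[i], written as whole chunks plus a scalar remainder
// to make explicit how a vectorizing compiler splits the loop.
void addInChunks(float* a, const float* b, std::size_t n) {
    assert(a != nullptr && b != nullptr);
    constexpr std::size_t lanes = simdLanes<float>();
    std::size_t i = 0;
    // Each pass over a full chunk is a candidate for a single vector add.
    for (; i + lanes <= n; i += lanes)
        for (std::size_t j = 0; j < lanes; ++j)
            a[i + j] += b[i + j];
    // Leftover elements that do not fill a whole chunk.
    for (; i < n; ++i)
        a[i] += b[i];
}
```

<p>Code that processes one element per instruction uses only one of those 8 or 16 lanes, which is the gap the paragraph above refers to.</p>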
<p><iframe src="http://www.youtube.com/embed/BtHnZCScA_0?rel=0" frameborder="0" width="560" height="315"></iframe></p>
<p><strong>Key topics</strong></p>
<ul>
<li>Growing role of vectorization</li>
<li>Automatic vectorization</li>
<li>Intel Parallel Advisor 2013 – What’s New</li>
<li>Finding hotspots</li>
<li>Defining race conditions and errors</li>
<li>Moving away from “high handholding”</li>
</ul>
<p><strong>Related posts</strong></p>
<p><a href="http://goparallel.sourceforge.net/how-developers-can-handle-the-new-hardware-complexity/" target="_blank">How Programmers Can Handle the New Hardware Complexity (video)</a></p>
<p><a href="http://goparallel.sourceforge.net/identifying-and-modeling-parallel-software-intels-mark-davis-previews-topics-and-tips-from-the-2013-intel-software-conference-road-show/" target="_blank">Identifying and Modeling Parallel Software (video)</a></p>
<p><a href="https://makebettercode.com/cbsi/cluster_parallel/index.php?utm_source=Geeknet+-+Go+Parallel+Portal+Text+Links+-+Text&amp;utm_medium=Link&amp;utm_content=Advisor+XE+2013&amp;utm_campaign=2013_Intel_DPD" target="_blank">Product Information: Intel Parallel Advisor 2013 (product)</a></p>
<p><a href="http://goparallel.sourceforge.net/new-parallel-programming-guide/" target="_blank">New Parallel Programming Guide (book)</a></p>
<p><a href="http://goparallel.sourceforge.net/new-bible-of-high-performance-parallel-programming" target="_blank">New Must Read on Parallel Programming (book)</a></p>
<p>SPEAKER BIO:</p>
<p><strong>Ronald W. Green</strong><br />
Manager, HPC and Fortran Compiler Support, Intel</p>
<p><strong>Conference:</strong> Houston, TX</p>
<p><strong>Sessions</strong></p>
<p><em>Identifying and Modeling Parallel Software: Intel&reg; Advisor XE 2013</em></p>
<p>Ronald specializes in massively parallel software development and systems architectures. He has been active in technical computing and high performance computing since 1987, and has participated in getting three systems into the Top 5 of the HPC TOP500 list over the course of his career. Ronald joined Intel in 2005, and currently manages compiler support from Intel&#8217;s Rio Rancho, New Mexico facility. Aside from his management responsibilities, Ronald helps run compiler and tools beta programs, moderates Intel&reg; Software User Forums, contributes to compiler online documentation and samples, and helps with future product definition for Intel&reg; Compiler products. Most recently, he has been assisting with the early test and launch of the compiler products supporting the Intel&reg; Xeon Phi&trade; coprocessor. He holds an M.S. in computer engineering from the University of Southern California.</p>
]]></content:encoded>
			<wfw:commentRss>http://goparallel.sourceforge.net/meeting-new-challenges-of-scaling-parallelism-across-and-within-cores/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Tuning OpenMP Applications</title>
		<link>http://goparallel.sourceforge.net/better-performance-analysis-of-hpc-apps-tuning-openmp-applications/</link>
		<comments>http://goparallel.sourceforge.net/better-performance-analysis-of-hpc-apps-tuning-openmp-applications/#comments</comments>
		<pubDate>Thu, 28 Mar 2013 21:21:32 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Tune]]></category>

		<guid isPermaLink="false">http://goparallel.sourceforge.net/?p=2978</guid>
		<description><![CDATA[High performance computing (HPC) has a long history and today is critical to business, research, and science. Clusters consisting of thousands of machines help enable many advances of modern science with both theoretical and practical implications, working 24/7 to enrich the lives of every person on earth. This article from the latest issue of [...]]]></description>
				<content:encoded><![CDATA[
<p>High performance computing (HPC) has a long history and today is critical to business, research, and science. Clusters consisting of thousands of machines help enable many advances of modern science with both theoretical and practical implications, working 24/7 to enrich the lives of every person on earth.</p>
<p>This article from the latest issue of Intel <em>Parallel Universe</em> magazine describes how to analyze the performance of an HPC program using Intel VTune Amplifier XE. The techniques are illustrated using an Intel Xeon Phi coprocessor, but apply equally well to an Intel&reg; Xeon&reg; system. One nice benefit is that tuning to improve the parallelism in your application usually yields performance benefits when running on both Intel Xeon processors and Intel Xeon Phi coprocessors: a double win!</p>
<h3><strong><em><a href="http://goparallel.sourceforge.net/wp-content/uploads/2013/03/7625_2_IN_ParallelMag_Issue13_TuneOpenMP.pdf" target="_blank">Read the complete article here.</a></em></strong></h3>
<p><img class="aligncenter size-full wp-image-2979" title="Untitled" src="http://goparallel.sourceforge.net/wp-content/uploads/2013/03/Untitled.png" alt="" width="798" height="863" /></p>
]]></content:encoded>
			<wfw:commentRss>http://goparallel.sourceforge.net/better-performance-analysis-of-hpc-apps-tuning-openmp-applications/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Chess Puzzle: Learning to Love Fast Rejection</title>
		<link>http://goparallel.sourceforge.net/the-chess-puzzle-learning-to-love-fast-rejection/</link>
		<comments>http://goparallel.sourceforge.net/the-chess-puzzle-learning-to-love-fast-rejection/#comments</comments>
		<pubDate>Mon, 18 Mar 2013 14:53:28 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Home]]></category>
		<category><![CDATA[Tune]]></category>

		<guid isPermaLink="false">http://goparallel.sourceforge.net/?p=2887</guid>
		<description><![CDATA[There is much to be said for fast rejection. It saves time and effort that can be better spent searching elsewhere. This article discusses a parallel algorithm for solving a chess puzzle that exploits fast rejection. It is a good demonstration of basic Intel&#174;&#160;Cilk&#8482; Plus programming to solve an interesting puzzle. The puzzle is whether a [...]]]></description>
				<content:encoded><![CDATA[<p>There is much to be said for fast rejection. It saves time and effort that can be better spent searching elsewhere. This article discusses a parallel algorithm for solving a chess puzzle that exploits fast rejection. It is a good demonstration of basic <a href="http://cilkplus.org/" target="_blank">Intel&reg;&nbsp;Cilk&trade; Plus</a> programming to solve an interesting puzzle.</p>
<p>The puzzle is whether a player’s eight chess pieces (excluding pawns) can attack all squares on a chess board, assuming that the two bishops must be on opposite-color squares.</p>
<p>Some others and I published a <a href="http://comjnl.oxfordjournals.org/content/32/6/567.full.pdf" target="_blank">serial algorithm</a> for the problem in 1989. The algorithm relies on an interesting rejection test that quickly rejects large portions of the search space. It places more than eight pieces on the board at once and checks whether all squares are under attack,<em> ignoring the blocking effects of the pieces</em>. If not, then no subset of those pieces can attack all squares. The paper’s opening paragraph is worth reading for Skiena’s sly remark about pruning; we were surprised that editors of an academic journal kept it.</p>
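<p>The rejection test can be sketched with one 64-bit mask per piece. This is an illustration only, not the code from the paper or from the attached file: the names <code>Piece</code>, <code>attacks</code>, <code>reject</code> and <code>countBits</code> are hypothetical, and the sketch assumes that a piece does not attack the square it stands on.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <vector>

using Bitboard = std::uint64_t;  // one bit per square, bit index = row * 8 + col

struct Piece {
    char kind;  // 'K', 'Q', 'R', 'B' or 'N'
    int row;
    int col;
};

// Squares attacked by a piece on an otherwise empty board, that is,
// ignoring the blocking effects of any other pieces.
Bitboard attacks(const Piece& p) {
    assert(p.row >= 0 && p.row < 8 && p.col >= 0 && p.col < 8);
    Bitboard mask = 0;
    for (int r = 0; r < 8; ++r) {
        for (int c = 0; c < 8; ++c) {
            const int dr = std::abs(r - p.row);
            const int dc = std::abs(c - p.col);
            if (dr == 0 && dc == 0) continue;  // assumption: own square not attacked
            const bool line = (dr == 0 || dc == 0);
            const bool diag = (dr == dc);
            bool hit = false;
            switch (p.kind) {
                case 'R': hit = line; break;
                case 'B': hit = diag; break;
                case 'Q': hit = line || diag; break;
                case 'N': hit = (dr * dc == 2); break;
                case 'K': hit = (dr <= 1 && dc <= 1); break;
                default:  assert(false && "unknown piece kind");
            }
            if (hit) mask |= Bitboard(1) << (r * 8 + c);
        }
    }
    return mask;
}

// Fast rejection: if even the union of all superposed attack sets leaves a
// square uncovered, no subset of these pieces can cover the board.
bool reject(const std::vector<Piece>& superposed) {
    Bitboard covered = 0;
    for (const Piece& p : superposed) covered |= attacks(p);
    return covered != ~Bitboard(0);
}

// Number of set bits, used here only to inspect attack sets.
int countBits(Bitboard b) {
    int n = 0;
    for (; b != 0; b &= b - 1) ++n;
    return n;
}
```

<p>Because blocking is ignored, the union over-approximates what any real arrangement can attack, so a failed coverage check is a safe reason to prune.</p>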
<p>The paper notes that the original program took 75 minutes in 1988 on a Sun 3/360. Machines have gotten much faster since then. I lost the original code, but was able to rewrite a parallel version from scratch, without the one-level look-ahead mentioned in the paper. The parallel version can solve the same problem in less than two seconds on a high-end 16-core machine (a two-socket Intel(R) Xeon(R) Processor E5-2670L).</p>
<p>Here I will explain the parallelization of the algorithm. I’ll assume that you have already at least skimmed over sections 2-3 of the paper to understand the serial algorithm.</p>
<p>The code is attached. It is a single source file. I recommend reading it top to bottom. Two macros affect its behavior:</p>
<ul>
<li>Compile with -DPARALLEL=0 to compile as serial code.</li>
<li>Compile with -DBISHOPS_CAN_BE_ON_SAME_COLOR=0 to solve the original problem for which I stated times.</li>
</ul>
<p>Removing the bishop constraint approximately doubles the work. I made the harder problem (unconstrained bishops) the default because it enables the program to show some solutions, and modern machines are fast enough to solve it within my patience limit.</p>
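<p>For readers who have not used such switches, here is a minimal sketch of how compile-time macros with defaults are commonly wired up. It is not taken from the attached file; the default values are assumptions chosen to match the behavior described above, and the helper functions are hypothetical.</p>

```cpp
#include <cassert>
#include <string>

// Assumed defaults: parallel build and unconstrained bishops, unless
// overridden on the command line with -DPARALLEL=0 or
// -DBISHOPS_CAN_BE_ON_SAME_COLOR=0.
#ifndef PARALLEL
#define PARALLEL 1
#endif
#ifndef BISHOPS_CAN_BE_ON_SAME_COLOR
#define BISHOPS_CAN_BE_ON_SAME_COLOR 1
#endif

// True when two squares have the same color on a chess board.
bool sameColor(int row1, int col1, int row2, int col2) {
    return ((row1 + col1) % 2) == ((row2 + col2) % 2);
}

// Whether a pair of bishop squares is admissible under the active setting.
bool bishopsAllowed(int row1, int col1, int row2, int col2) {
    assert(row1 >= 0 && row1 < 8 && col1 >= 0 && col1 < 8);
    assert(row2 >= 0 && row2 < 8 && col2 >= 0 && col2 < 8);
#if BISHOPS_CAN_BE_ON_SAME_COLOR
    return true;                                  // harder, unconstrained problem
#else
    return !sameColor(row1, col1, row2, col2);    // original constrained problem
#endif
}

// Reports which variant was compiled.
std::string buildMode() {
    return PARALLEL ? "parallel" : "serial";
}
```

<p>The <code>#ifndef</code> guards are what let a <code>-D</code> option on the command line take precedence over the default in the source.</p>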
<p><strong>Parallelization</strong></p>
<p>Parallelizing with Cilk Plus requires only minor changes to the serial code. The following sections explain these changes.</p>
<p><strong>Fork-Join</strong></p>
<p>The algorithm performs recursive divide-and-conquer. See the <a href="http://comjnl.oxfordjournals.org/content/32/6/567.full.pdf">paper</a> for details. Here is the key routine:<br />
<pre>void Search( const Board&amp; b ) {
    if( !b.reject() ) {
        int i = b.chooseAxis();
        if( i&lt;0 ) {
            // Found a weak solution
            ++WeakCount;
            if( b.strongAttacks().isAll() )
                // Found a strong solution
                if( !b.hasSuperposition() )
                    // Found solution with no superposition.  Print it.
                    Output &lt;&lt; b &lt;&lt; std::endl;
        } else {
            // Unfold on axis i and search both halves in parallel
            cilk_spawn Search( Board(b,i,0) );
            Search( Board(b,i,1) );
        }
    }
    // implicit cilk_sync
}</pre><br />
To parallelize it, I had to indicate that the two recursive calls to Search can run in parallel. To do this, I prefixed the first call with cilk_spawn, which says that the caller can keep on going without waiting for the callee to return. I could have also inserted a cilk_sync after the two calls, which would say to wait until the spawned callee returns, but I didn’t since it would be redundant in this example. Cilk Plus always has an implicit cilk_sync at the end of a routine.</p>
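<p>If you do not have a Cilk Plus compiler at hand, the same fork-join shape can be approximated in standard C++ with <code>std::async</code>. This is a swapped-in analogue, not Cilk Plus: each <code>std::async</code> call here starts a thread, so the sketch only forks near the top of the recursion. The function <code>countLeaves</code> and its <code>spawnDepth</code> cutoff are hypothetical stand-ins for the search.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <future>

// Hypothetical stand-in for the search: counts the leaves of a complete
// binary tree of the given depth by recursive divide-and-conquer.
// Forks a new thread only for the top spawnDepth levels.
std::uint64_t countLeaves(int depth, int spawnDepth) {
    assert(depth >= 0);
    if (depth == 0) return 1;
    if (spawnDepth > 0) {
        // Fork: plays the role of cilk_spawn on the first recursive call.
        std::future<std::uint64_t> left =
            std::async(std::launch::async, countLeaves, depth - 1, spawnDepth - 1);
        // The caller keeps going with the second half in the meantime.
        const std::uint64_t right = countLeaves(depth - 1, spawnDepth - 1);
        // Join: plays the role of cilk_sync.
        return left.get() + right;
    }
    // Below the cutoff, recurse serially.
    return countLeaves(depth - 1, 0) + countLeaves(depth - 1, 0);
}
```

<p>For example, <code>countLeaves(20, 3)</code> forks seven threads in total and returns the same value as the fully serial call <code>countLeaves(20, 0)</code>.</p>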
<p><strong>Reducers</strong></p>
<p>There is more parallel magic in routine Search than meets the eye. Note the two lines “++WeakCount;” and “Output &lt;&lt; b &lt;&lt; std::endl;”. Both lines operate on global variables unprotected by locks. If I were writing ordinary multithreaded code, these lines would almost surely lead to missing updates to WeakCount and non-deterministic output. But the program is deterministic because I declared WeakCount and Output as reducers.</p>
<p>Reducers are Cilk Plus objects for which different threads get different “views”, and the views are automatically merged in a way that delivers the same result as the equivalent serial program. The views of WeakCount are partial sums that are automatically added together to get the correct total. The program checks that the total matches the value (8715) reported in the paper (bishops constrained). The reducer Output acts like a std::ostream, except that it cleverly merges partial output such that the final output is identical to what the serial version of the program prints.</p>
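<p>The idea of per-thread views merged in serial order can be imitated by hand in standard C++. The sketch below is not the Cilk Plus reducer implementation; it assumes a fixed number of worker threads, each owning a contiguous block of the input, and the names <code>View</code>, <code>Result</code> and <code>listMultiplesOf7</code> are hypothetical.</p>

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// One private "view" per worker: a partial count and a partial output buffer.
struct View {
    long count = 0;
    std::ostringstream out;
};

struct Result {
    long count;
    std::string text;
};

// Hypothetical workload: count and print the multiples of 7 in [0, n).
Result listMultiplesOf7(int n, int numWorkers) {
    assert(n >= 0 && numWorkers > 0);
    std::vector<View> views(numWorkers);
    std::vector<std::thread> workers;
    for (int w = 0; w < numWorkers; ++w) {
        workers.emplace_back([&views, n, numWorkers, w] {
            // Each worker takes a contiguous block, so views concatenated in
            // worker order reproduce the serial order.
            const int lo = static_cast<int>(static_cast<long long>(n) * w / numWorkers);
            const int hi = static_cast<int>(static_cast<long long>(n) * (w + 1) / numWorkers);
            View& mine = views[w];  // no other thread touches this view
            for (int i = lo; i < hi; ++i) {
                if (i % 7 == 0) {
                    ++mine.count;
                    mine.out << i << '\n';
                }
            }
        });
    }
    for (std::thread& t : workers) t.join();

    // Merge step: add the partial counts and concatenate the partial
    // outputs in worker order.
    Result merged{0, ""};
    for (const View& v : views) {
        merged.count += v.count;
        merged.text += v.out.str();
    }
    return merged;
}
```

<p>No locks are needed because no two threads ever write to the same view, and the result does not depend on how the threads are scheduled.</p>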
<p><strong>Scaling</strong></p>
<p>The code scales well because it has a lot of parallel slack (excess available parallelism) and is not memory intensive. If you have Cilk Plus on your system, I invite you to time the serial versus parallel versions of the code. Here are the recommended command lines for compiling it with the Intel compiler on Linux* or Windows*:</p>
<table border="1" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td valign="top" width="50">&nbsp;</td>
<td valign="top" width="287"><strong>Linux</strong></td>
<td valign="top" width="267"><strong>Windows</strong></td>
</tr>
<tr>
<td valign="top" width="50"><strong>serial</strong></td>
<td valign="top" width="287">icc -O2 -xHost chess-cover.cpp -lrt -DPARALLEL=0</td>
<td valign="top" width="267">icl /O2 /QxHost chess-cover.cpp /DPARALLEL=0</td>
</tr>
<tr>
<td valign="top" width="50"><strong>parallel</strong></td>
<td valign="top" width="287">icc -O2 -xHost chess-cover.cpp -lrt</td>
<td valign="top" width="267">icl /O2 /QxHost chess-cover.cpp</td>
</tr>
</tbody>
</table>
<p>The options presume that your compiler paths are set up for using TBB, which I used for its portable wall-clock timing facility. The option -xHost tells the compiler to optimize for the host machine’s processor. Using it gained me about a 15% improvement. <a href="http://software.intel.com/en-us/articles/what-the-is-parallelism-anyhow-1">Theoretical analysis</a> of the program’s parallel speedup requires two numbers:</p>
<ul>
<li><strong>Work</strong>: The total number of instructions executed.</li>
<li><strong>Span</strong>: The number of instructions on the critical path.</li>
</ul>
<p>The ratio work/span is a formal measure of parallelism in the program. For example, if work=span, the parallelism equals one; that is, the program is serial.</p>
<p>My program does relatively little work (only 1,832 instructions on average) between fork/join actions, so it is better to use something called “Burdened Span”, which accounts for synchronization overheads.</p>
<p>Since leaves in the search tree have different depths, estimating the span is a bit tricky. So I let the Cilk view scalability analyzer (you can get it from the Intel(R) Cilk(TM) Plus SDK at <a href="http://www.cilkplus.org/download">http://www.cilkplus.org/download</a>) do the work for me. It reports the following statistics for solving the problem with unconstrained bishops:<br />
<pre>   Work :                               162,169,367,032 instructions
   Span :                               58,705 instructions
   Burdened span :                      1,123,705 instructions
   Parallelism :                        2762445.57
   Burdened parallelism :               144316.67
   Number of spawns/syncs:              129,851,246
   Average instructions / strand :      416
   Strands along span :                 43</pre><br />
A strand is a sequence of serial instructions between synchronization operations (spawn or sync operations). The average strand is only 416 instructions. Given that small amount, Cilk Plus’s low-cost fork/join is helpful. The “burdened parallelism” shows the ratio “work”/“burdened span”. The value indicates that this program could theoretically scale to more than 100 thousand hardware threads. Of course, that is just an estimate for an ideal machine. However, I’ve seen the program speed up by 28x on a real 40-core machine.</p>
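<p>The two parallelism figures in the report follow directly from the work and span numbers, as this small sketch shows. The function names are hypothetical; the speedup bound is the standard work-span limit, under which speedup on p processors can exceed neither p nor the parallelism.</p>

```cpp
#include <algorithm>
#include <cassert>

// Parallelism is work divided by span (both measured in instructions).
double parallelism(double work, double span) {
    assert(work >= 0.0 && span > 0.0);
    return work / span;
}

// Upper bound on speedup with the given number of processors:
// it can exceed neither the processor count nor the parallelism.
double speedupBound(double work, double span, double processors) {
    assert(processors > 0.0);
    return std::min(processors, parallelism(work, span));
}
```

<p>With the reported values, <code>parallelism(162169367032.0, 58705.0)</code> is about 2,762,445.57 and <code>parallelism(162169367032.0, 1123705.0)</code> is about 144,316.67, matching the report. On a 40-core machine the bound is 40, so the limit there is the core count rather than the available parallelism.</p>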
<p><strong>Summary</strong></p>
<p>Intel Cilk Plus enabled speeding up the puzzle solver with a few changes, and the resulting program scales well and behaves deterministically.</p>
<p><em>Bio: Arch was the architect of Threading Building Blocks, and was the lead developer for KAI C++. At Shell he worked on seismic imaging on a 256 node nCUBE. He has a Ph.D. in computer science from the University of Illinois. Arch is one of the authors of the book Structured Parallel Programming: Patterns for Efficient Computation.</em></p>
<p><strong><em>For more information about Intel Cilk Plus, see the website&nbsp;</em></strong><a href="http://cilkplus.org/"><strong><em>http://cilkplus.org</em></strong></a><strong><em>. &nbsp;</em></strong></p>
<p><strong><em>For questions and discussions about Intel Cilk Plus, see the forum&nbsp;</em></strong><a href="http://software.intel.com/en-us/forums/intel-cilk-plus"><strong><em>http://software.intel.com/en-us/forums/intel-cilk-plus</em></strong></a></p>
<p><a href="http://software.intel.com/sites/default/files/article/366828/chess-cover.cpp" target="_blank">Download chess-cover.cpp here.</a></p>
]]></content:encoded>
			<wfw:commentRss>http://goparallel.sourceforge.net/the-chess-puzzle-learning-to-love-fast-rejection/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How Developers Can Handle the New Hardware Complexity</title>
		<link>http://goparallel.sourceforge.net/how-developers-can-handle-the-new-hardware-complexity/</link>
		<comments>http://goparallel.sourceforge.net/how-developers-can-handle-the-new-hardware-complexity/#comments</comments>
		<pubDate>Mon, 18 Mar 2013 14:24:01 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Home]]></category>
		<category><![CDATA[Tune]]></category>
		<category><![CDATA[Video]]></category>

		<guid isPermaLink="false">http://goparallel.sourceforge.net/?p=2883</guid>
		<description><![CDATA[More processors, more cores, more threads, wider registers. The latest generation of new processors introduces new complexity. Today’s “hardware explosion” requires a new way of thinking about architecting, building and tuning parallel programs to take advantage of powerful new capabilities. Join Intel Senior Engineer Gary Carleton and Go Parallel Editor Joe Maglitta for an informative [...]]]></description>
				<content:encoded><![CDATA[<p><span style="font-size: 16px;">More processors, more cores, more threads, wider registers. The latest generation of new processors introduces new complexity. Today’s “hardware explosion” requires a new way of thinking about architecting, building and tuning parallel programs to take advantage of powerful new capabilities.</span></p>
<p>Join Intel Senior Engineer Gary Carleton and <em>Go Parallel</em> Editor Joe Maglitta for an informative video chat about new challenges facing parallel developers today.</p>
<p><iframe src="http://www.youtube.com/embed/YdN12eWx5hA?rel=0" frameborder="0" width="560" height="315"></iframe></p>
<p>Key issues:</p>
<ul>
<li>How can developers more intelligently architect programs to take advantage of the explosion in hardware capability?</li>
<li>How can today’s performance analysis and other tools be used most effectively?</li>
<li>How can you avoid “parallelism paralysis”?</li>
<li>Why do we need to rethink the roles of developer and compiler?</li>
<li>What is “Carleton’s Question”? Why is it important to you?</li>
<li>How can parallel developers better deal with CPU “hot spots”?</li>
</ul>
<p>And much more. Tune in, weigh in and&nbsp;<a href="https://makebettercode.com/cbsi/cluster_parallel/index.php?utm_source=Geeknet+-+Go+Parallel+Portal+Text+Links+-+Text&amp;utm_medium=Link&amp;utm_content=VTune+Amplifier+XE+2013+&amp;utm_campaign=2013_Intel_DPD" target="_blank">check out the tools</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://goparallel.sourceforge.net/how-developers-can-handle-the-new-hardware-complexity/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Identifying, Modeling, Designing, Optimizing Parallelism: Intel’s Ronald Green previews topics and tips from the 2013 Intel Software Conference Road Show</title>
		<link>http://goparallel.sourceforge.net/identifying-modeling-designing-optimizing-parallelism-intels-ronald-green-previews-topics-and-tips-from-the-2013-intel-software-conference-road-show/</link>
		<comments>http://goparallel.sourceforge.net/identifying-modeling-designing-optimizing-parallelism-intels-ronald-green-previews-topics-and-tips-from-the-2013-intel-software-conference-road-show/#comments</comments>
		<pubDate>Thu, 28 Feb 2013 20:59:17 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Home]]></category>
		<category><![CDATA[Tune]]></category>
		<category><![CDATA[Video]]></category>

		<guid isPermaLink="false">http://goparallel.sourceforge.net/?p=2783</guid>
		<description><![CDATA[New processors with up to 61 cores and bigger vector units open up whole new realms for parallel developers. Join compiler and optimization whiz Ronald Green at the upcoming Intel Software Conference road show and learn how you can get the most from these exciting new capabilities. Ronald is part of an all-star team [...]]]></description>
				<content:encoded><![CDATA[<p><a href="http://goparallel.sourceforge.net/wp-content/uploads/2013/02/Untitled2.png" rel="wp-prettyPhoto[g2783]"><img class=" wp-image-2791 alignright" title="Untitled" src="http://goparallel.sourceforge.net/wp-content/uploads/2013/02/Untitled2.png" alt="" width="151" height="151" /></a>New processors with up to 61 cores and bigger vector units open up whole new realms for parallel developers. Join compiler and optimization whiz Ronald Green at the upcoming Intel Software Conference road show and learn how you can get the most from these exciting new capabilities. Ronald is part of an all-star team of Intel experts presenting at the full-day complimentary seminar aimed at C, C++ and Fortran developers. He’ll be speaking in Houston, TX (March 26).</p>
<p>Watch the video to learn more, read the full session abstract and bio, and <strong><a href="http://softwareproductconference.com/asmo/" target="_blank">register here</a>.</strong></p>
<p><iframe src="http://www.youtube.com/embed/UHLq55M9cRc?rel=0" frameborder="0" width="560" height="315"></iframe></p>
<p><strong>Sessions</strong></p>
<p><em>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Identifying and Modeling Parallel Software: Intel&reg; Advisor XE 2013</em></p>
<p><em>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Design: Optimize, Vectorize, and Parallelize—Intel&reg; Composer XE 2013</em></p>
<p><strong>&nbsp;</strong></p>
<p><strong>10:00 &#8211; 10:45 a.m. Identifying and Modeling Parallel Software: Intel&reg; Advisor XE 2013 Abstract: </strong>Intel&reg; Advisor XE 2013 is the first tool for novices and experts for adding parallelism to software. In this session, we demonstrate the straightforward Intel Advisor XE workflow. First, the tool surveys or collects performance data and identifies code regions likely to benefit from parallelism. Next, the code is annotated to indicate parallel regions to the Intel Advisor XE analysis utility. Intel Advisor XE then executes the annotated code to model and report the potential performance gains of running those regions in parallel. Finally, Intel Advisor XE models correctness to identify any potential data conflicts or data races so the developer can choose how to address them. Intel Advisor XE supports C, C++, Fortran, and C#.</p>
<p><strong>11:00 &#8211; 11:45 a.m. Design: Optimize, Vectorize, and Parallelize—Intel&reg; Composer XE 2013 Abstract: </strong>Intel&reg; Composer XE 2013 provides the industry&#8217;s most advanced optimizing compilers, along with easy-to-use parallel program models and performance libraries designed to extract the maximum performance available from today&#8217;s multicore architectures and the new Intel&reg; Many Integrated Core Architecture (Intel&reg; MIC). Intel Composer XE enables programmers of all skill levels to succeed in parallelization and vectorization, the two cornerstones of the multicore revolution. Unlike complex GPU models, parallel programming for multicore Intel&reg; architecture involves straightforward extensions to programmers&#8217; existing skills, making it accessible to everyone. In this presentation, we examine simple extensions to existing C, C++, and Fortran languages to enable parallelization and vectorization.</p>
<p><strong>Ronald W. Green, Manager, HPC and Fortran Compiler Support, Intel</strong></p>
<p>Ronald specializes in massively parallel software development and systems architectures. He has been active in technical computing and high performance computing since 1987, and has participated in getting three systems into the Top 5 of the HPC TOP500 list over the course of his career. Ronald joined Intel in 2005, and currently manages compiler support from Intel&#8217;s Rio Rancho, New Mexico facility. Aside from his management responsibilities, Ronald helps run compiler and tools beta programs, moderates Intel&reg; Software User Forums, contributes to compiler online documentation and samples, and helps with future product definition for Intel&reg; Compiler products. Most recently, he has been assisting with the early test and launch of the compiler products supporting the Intel&reg; Xeon Phi&trade; coprocessor. He holds an M.S. in computer engineering from the University of Southern California.</p>
]]></content:encoded>
			<wfw:commentRss>http://goparallel.sourceforge.net/identifying-modeling-designing-optimizing-parallelism-intels-ronald-green-previews-topics-and-tips-from-the-2013-intel-software-conference-road-show/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Get 2.5X performance improvement using Full Vectors versus Scalar</title>
		<link>http://goparallel.sourceforge.net/get-2-5x-performance-improvement-using-full-vectors-versus-scalar/</link>
		<comments>http://goparallel.sourceforge.net/get-2-5x-performance-improvement-using-full-vectors-versus-scalar/#comments</comments>
		<pubDate>Mon, 25 Feb 2013 18:34:46 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Tune]]></category>

		<guid isPermaLink="false">http://goparallel.sourceforge.net/?p=2762</guid>
		<description><![CDATA[To help the compiler generate better vector code, sometimes it helps to decompose complex data structures to allow the compiler to understand the available parallelism and vectorize the code. Decomposing data accesses may allow the compiler to use more advanced features like vector gather and scatter. Though adjacent data elements are preferred in [...]]]></description>
				<content:encoded><![CDATA[<p>To help the compiler generate better vector code, sometimes it helps to decompose complex data structures so the compiler can understand the available parallelism and vectorize the code.</p>
<p>Decomposing data accesses may allow the compiler to use more advanced features like vector <em>gather</em> and <em>scatter</em>. Though adjacent data elements are preferred in order to maximize the performance advantage of vector loads and stores, sometimes the requirement to access data through indices is unavoidable.</p>
<p>Though vector gathers and scatters are generally slower than vector loads and stores, they can be beneficial if the amount of vector computation following the creation of the vectors is enough to offset the vector gather time. The compiler will generate vector gathers and scatters as needed, but sometimes it can be challenged by complex data structures. Showing the compiler how to access these complex data structures can help.</p>
<p><strong><a href="http://software.intel.com/en-us/articles/bkm-coaxing-the-compiler-to-vectorize-structured-data-via-gathers" target="_blank">Read the rest of the article here.</a></strong></p>
]]></content:encoded>
			<wfw:commentRss>http://goparallel.sourceforge.net/get-2-5x-performance-improvement-using-full-vectors-versus-scalar/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Achieving Better Parallel Performance of Fortran Programs</title>
		<link>http://goparallel.sourceforge.net/achieving-better-parallel-performance-of-fortran-programs/</link>
		<comments>http://goparallel.sourceforge.net/achieving-better-parallel-performance-of-fortran-programs/#comments</comments>
		<pubDate>Tue, 19 Feb 2013 20:16:41 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Home]]></category>
		<category><![CDATA[Tune]]></category>

		<guid isPermaLink="false">http://goparallel.sourceforge.net/?p=2738</guid>
		<description><![CDATA[Learn how to identify hotspots (the most time-consuming program units), how to effectively use available cores, how to discover causes of ineffective utilization, and much more. This informational webinar shows how you can leverage parallelization technology to achieve better performance on multicore systems. It includes a high-level overview of Intel VTune Amplifier XE 2013 [...]]]></description>
				<content:encoded><![CDATA[<p>Learn how to identify hotspots (the most time-consuming program units), how to effectively use available cores, how to discover causes of ineffective utilization, and much more. This informational webinar shows how you can leverage parallelization technology to achieve better performance on multicore systems. It includes a high-level overview of Intel VTune Amplifier XE 2013 features and specific examples of performance tuning on a Fortran application. (39:57)</p>
<p><iframe src="http://www.youtube.com/embed/2kto9EjLyzI" frameborder="0" width="420" height="315"></iframe></p>
]]></content:encoded>
			<wfw:commentRss>http://goparallel.sourceforge.net/achieving-better-parallel-performance-of-fortran-programs/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Offload Runtime for the Intel(R) Xeon Phi Coprocessor</title>
		<link>http://goparallel.sourceforge.net/offload-runtime-for-the-intelr-xeon-phi-coprocessor/</link>
		<comments>http://goparallel.sourceforge.net/offload-runtime-for-the-intelr-xeon-phi-coprocessor/#comments</comments>
		<pubDate>Tue, 12 Feb 2013 14:37:10 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Tune]]></category>

		<guid isPermaLink="false">http://goparallel.sourceforge.net/?p=2687</guid>
		<description><![CDATA[The Intel&#174; Xeon Phi&#8482; coprocessor platform has a software stack that enables new programming models, including offload of computation from the host processor to the Intel&#174; Xeon Phi&#8482; coprocessor to improve response time and/or throughput. A new paper draws on insights from a multi-year, intensive development effort to answer common questions: why offload to a [...]]]></description>
				<content:encoded><![CDATA[<p>The Intel&reg; Xeon Phi&trade; coprocessor platform has a software stack that enables new programming models, including offload of computation from the host processor to the Intel&reg; Xeon Phi&trade; coprocessor to improve response time and/or throughput. A new paper draws on insights from a multi-year, intensive development effort to answer common questions: why offload to a coprocessor is useful, how it is specified, and the conditions under which offload is profitable. You’ll learn about the software architecture and design of the offload compiler runtime, its key performance features, and their impact on a set of directed micro-benchmarks and larger workloads.</p>
<p><strong><a href="http://software.intel.com/sites/default/files/article/366893/offload-runtime-for-the-intelr-xeon-phitm-coprocessor.pdf" target="_blank">Download the paper here</a></strong>. (pdf)</p>
]]></content:encoded>
			<wfw:commentRss>http://goparallel.sourceforge.net/offload-runtime-for-the-intelr-xeon-phi-coprocessor/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
