Offload Your Code to Your GPU: How to Get Started

At the time of this writing, most desktop computers have video cards with multiple processor cores that support advanced graphics. Most of the time, though, those cores sit idle, waiting for a graphically intense program to run. In the past, it took a fair amount of effort to take advantage of the extra cores on graphics cards. Intel’s Parallel Studio makes it much easier, and I plan to use the technology in many of my future software development projects. This post shows you the basics of offloading code with the Graphics Technology support in Parallel Studio.

Simple Example Program

To start with, we need a generic program with which we can explore implementing the Intel Graphics Technology. To keep things simple, I created a program that converts a buffer containing RGB values to a destination buffer containing grayscale values. The code is shown below in Listing 1.

Listing 1: The demonstration program that will be used to explore offloading to graphics processors.
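A minimal sketch of such a program might look like the following. The RGBToGray() name is the one used later in this article; everything else is an assumption made for illustration, including the ConvertBuffer() name, the interleaved 8-bit RGB layout, and the integer weights that approximate the common 0.299/0.587/0.114 luma formula.

```cpp
#include <cassert>
#include <cstdint>

// Converts one pixel to gray. The weights 77/150/29 sum to 256 and
// approximate 0.299 R + 0.587 G + 0.114 B (an assumed formula).
std::uint8_t RGBToGray(std::uint8_t r, std::uint8_t g, std::uint8_t b)
{
    return static_cast<std::uint8_t>((77 * r + 150 * g + 29 * b) >> 8);
}

// Converts pixelCount interleaved RGB pixels (3 bytes each) into
// pixelCount gray bytes. ConvertBuffer is a hypothetical name.
void ConvertBuffer(const std::uint8_t* rgb, std::uint8_t* gray, int pixelCount)
{
    assert(rgb != nullptr && gray != nullptr && pixelCount >= 0);

    // Plain serial loop: one iteration per pixel, no dependencies
    // between iterations.
    for (int i = 0; i < pixelCount; ++i)
    {
        gray[i] = RGBToGray(rgb[3 * i], rgb[3 * i + 1], rgb[3 * i + 2]);
    }
}
```

Because each iteration reads and writes only its own pixel, the loop is a natural candidate for parallel execution.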

We can optimize this considerably by parallelizing the for loop. In fact, a loop must be parallelized before it can be offloaded to the graphics processors. Listing 2 makes that change by replacing the serial for loop with cilk_for, which both provides the optimization and prepares the code for offloading.

Listing 2: Parallelizing the code in preparation for offloading.
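As a rough sketch of that change, the loop below uses cilk_for when the compiler supports Intel Cilk Plus and falls back to an ordinary for loop otherwise. The PARALLEL_FOR macro, the ConvertBuffer() name, and the pixel layout and weights are assumptions for illustration; the check on __cilk relies on Cilk Plus compilers predefining that macro.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__cilk)
  #include <cilk/cilk.h>          // defines the cilk_for keyword
  #define PARALLEL_FOR cilk_for   // iterations may run in parallel
#else
  #define PARALLEL_FOR for        // serial fallback without Cilk Plus
#endif

// Assumed weights approximating 0.299 R + 0.587 G + 0.114 B.
std::uint8_t RGBToGray(std::uint8_t r, std::uint8_t g, std::uint8_t b)
{
    return static_cast<std::uint8_t>((77 * r + 150 * g + 29 * b) >> 8);
}

// Hypothetical helper: interleaved RGB in, one gray byte per pixel out.
void ConvertBuffer(const std::uint8_t* rgb, std::uint8_t* gray, int pixelCount)
{
    assert(rgb != nullptr && gray != nullptr && pixelCount >= 0);

    // The only change from the serial version is the loop keyword.
    // Each iteration touches a distinct pixel, so there are no data races.
    PARALLEL_FOR (int i = 0; i < pixelCount; ++i)
    {
        gray[i] = RGBToGray(rgb[3 * i], rgb[3 * i + 1], rgb[3 * i + 2]);
    }
}
```

The fallback keeps the sketch buildable with any C++11 compiler; the results are identical either way because the iterations are independent.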

Synchronous Offload

We will look at two types of offloading: synchronous and asynchronous. Synchronous offloading takes two steps. First, the method that is called must be decorated with __declspec(target(gfx)). Second, #pragma offload must be added immediately before the parallel for loop. Listing 3 shows the amended code from Listing 1 that offloads code to the graphics processors.

Listing 3: The amended code from Listing 1 that offloads processing to the graphics processors.
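The two steps could be sketched as follows. Treat this as an illustration under stated assumptions, not as the exact listing: the pin clauses and their length() values are my recollection of Intel's offload pragma syntax and should be checked against the compiler documentation, the __INTEL_OFFLOAD check assumes the Intel compiler defines that macro when offload is enabled, and the GFX_TARGET and PARALLEL_FOR macros are hypothetical helpers that let the same source build without the Intel extensions.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__INTEL_OFFLOAD) && defined(__cilk)
  #include <cilk/cilk.h>
  #define GFX_TARGET __declspec(target(gfx))  // Windows spelling from the article
  #define PARALLEL_FOR cilk_for
  #define HAVE_GFX_OFFLOAD 1
#else
  #define GFX_TARGET                          // no-op without offload support
  #define PARALLEL_FOR for
#endif

// Step 1: the method called from the offloaded loop is marked for the
// graphics target. Weights are an assumed luma approximation.
GFX_TARGET
std::uint8_t RGBToGray(std::uint8_t r, std::uint8_t g, std::uint8_t b)
{
    return static_cast<std::uint8_t>((77 * r + 150 * g + 29 * b) >> 8);
}

// Hypothetical helper: interleaved RGB in, one gray byte per pixel out.
void ConvertBuffer(std::uint8_t* rgb, std::uint8_t* gray, int pixelCount)
{
    assert(rgb != nullptr && gray != nullptr && pixelCount >= 0);

    // Step 2: the offload pragma sits immediately before the parallel loop.
    // The pin clauses (assumed syntax) share both buffers with the GPU.
#ifdef HAVE_GFX_OFFLOAD
    #pragma offload target(gfx) pin(rgb : length(3 * pixelCount)) pin(gray : length(pixelCount))
#endif
    PARALLEL_FOR (int i = 0; i < pixelCount; ++i)
    {
        gray[i] = RGBToGray(rgb[3 * i], rgb[3 * i + 1], rgb[3 * i + 2]);
    }
}
```

The call is synchronous from the caller's point of view: ConvertBuffer() does not return until the loop has finished, whether it ran on the graphics processors or on the CPU fallback.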

Let’s review the steps for synchronous offloading. First, loops that are offloaded must be parallelized; the example code uses cilk_for. Second, the method that is offloaded must be decorated with __declspec(target(gfx)). Third, the offload pragma must be placed immediately before the parallel for loop.

Asynchronous Offload

From a software development standpoint, the biggest difference between synchronous and asynchronous offloading is that synchronous takes a pragma-based approach while asynchronous takes an API-based approach.

To get started, we need a method that contains the parallelized for loop; this method is what will be offloaded. Listing 4 shows the method used in the asynchronous example, which is preceded by __declspec(target(gfx_kernel)).

Listing 4: This entire method will be offloaded.

The RGBToGray() method from the first example continues to be preceded by __declspec(target(gfx)). Finally, the following code in Listing 5 performs everything necessary for the asynchronous offloading and execution.

Listing 5: This code performs the offloading and execution.
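Putting the asynchronous pieces together, a hedged sketch could look like this. The _GFX_share, _GFX_offload, _GFX_wait, and _GFX_unshare calls and the gfx/gfx_rt.h header are written from my recollection of Intel's offload runtime, so verify the exact names and signatures against the header before relying on them. ConvertKernel(), ConvertAsync(), and the helper macros are hypothetical names, and without the Intel extensions the sketch falls back to a plain direct call.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__INTEL_OFFLOAD) && defined(__cilk)
  #include <cilk/cilk.h>
  #include <gfx/gfx_rt.h>                     // assumed header for the _GFX_* API
  #define GFX_TARGET __declspec(target(gfx))
  #define GFX_KERNEL __declspec(target(gfx_kernel))
  #define PARALLEL_FOR cilk_for
  #define HAVE_GFX_OFFLOAD 1
#else
  #define GFX_TARGET
  #define GFX_KERNEL
  #define PARALLEL_FOR for
#endif

// Still marked target(gfx) because the kernel calls it.
GFX_TARGET
std::uint8_t RGBToGray(std::uint8_t r, std::uint8_t g, std::uint8_t b)
{
    return static_cast<std::uint8_t>((77 * r + 150 * g + 29 * b) >> 8);
}

// The whole method is the unit of offload, so it carries gfx_kernel.
GFX_KERNEL
void ConvertKernel(std::uint8_t* rgb, std::uint8_t* gray, int pixelCount)
{
    PARALLEL_FOR (int i = 0; i < pixelCount; ++i)
    {
        gray[i] = RGBToGray(rgb[3 * i], rgb[3 * i + 1], rgb[3 * i + 2]);
    }
}

// Hypothetical driver: share buffers, enqueue the kernel, wait, unshare.
void ConvertAsync(std::uint8_t* rgb, std::uint8_t* gray, int pixelCount)
{
    assert(rgb != nullptr && gray != nullptr && pixelCount >= 0);
#ifdef HAVE_GFX_OFFLOAD
    _GFX_share(rgb, 3 * pixelCount);          // sizes are in bytes
    _GFX_share(gray, pixelCount);
    auto task = _GFX_offload(&ConvertKernel, rgb, gray, pixelCount);
    // The host thread is free to do other work here.
    _GFX_wait(task);                          // block until the kernel is done
    _GFX_unshare(rgb);
    _GFX_unshare(gray);
#else
    ConvertKernel(rgb, gray, pixelCount);     // synchronous CPU fallback
#endif
}
```

The point of the API-based approach is the gap between the enqueue and the wait: the host can keep working while the graphics processors run the kernel, which the pragma-based version does not allow.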


As you can see, offloading code to the graphics processors is straightforward with Parallel Studio. It is a technology worth exploring if you want your applications to perform better.




Posted on June 25, 2015 by Rick Leinecker, Slashdot Media Contributing Editor