Bring New Life to Legacy C++ Code with Parallel Studio

In the software engineering class I teach, we discuss the need for and importance of refactoring code. Over time, it is inevitable that code must be reworked to meet new needs. Sometimes customers complain about performance; other times software engineers want cleaner, more maintainable code. Intel Parallel Studio gives you the chance to address both. It delivers libraries that let you add features and substantial performance increases, while giving you the opportunity to tidy and refactor code so that it becomes more maintainable. This blog gets you started reworking your mountains of legacy C++ code and bringing them up to modern standards.

Splitting Threads

Many performance issues can be traced back to loops, often nested loops, that take longer to execute than is acceptable. Sometimes the lag can be mitigated by executing the offending code during program initialization, when users expect to wait before the program is ready. But even this can pose a serious usability issue if the delay exceeds most users’ comfort threshold. Developers are inevitably faced with the task of speeding up the code in question. The problem is even more pronounced when refactoring legacy code that may have slow sections. Fortunately, Parallel Studio has a solution for many of these situations.

First, I will present a situation in which C++ code from software I developed about 10 years ago was reused directly in software I developed recently. At startup, the application generated a lookup table for Red, Green, and Blue (RGB) to Hue, Saturation, and Value (HSV) conversions. Building the table took long enough to warrant a progress bar that marched across the screen while it was generated. Creating the lookup table in advance is still the best option, since performing the conversions on the fly is too slow for real-time data processing. In today’s environment, though, users are far less tolerant of progress bars during program initialization. For that reason, I turned to Parallel Studio for help.

The code is a set of three nested for loops, one for each of the Red, Green, and Blue color channels. The innermost loop calls a function that converts each RGB triple to HSV and then stores the result in the lookup table. In this way, the conversion calculation does not need to be done during the real-time processing operations. The code can be seen in Listing 1 below.


// pHSVLookup and nIndex are declared elsewhere; the table cannot share the
// name of the RGB2HSV() function, so it is shown here as pHSVLookup.
for ( int r = 0; r < 256; r++ )
{
    for ( int g = 0; g < 256; g++ )
    {
        for ( int b = 0; b < 256; b++ )
        {
            WORD h, s, v;

            // Convert one RGB triple to HSV.
            RGB2HSV( (WORD)r, (WORD)g, (WORD)b, &h, &s, &v );

            // Pack the three HSV components into one DWORD table entry.
            pHSVLookup[nIndex++] =
                ( (DWORD)h << 16 ) | ( (DWORD)s << 8 ) | (DWORD)v;
        }
    }
}

Listing 1: The original code with three nested loops to convert from all RGB values to HSV values.
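The body of RGB2HSV() is not shown in this article. For readers who want something concrete to experiment with, here is a minimal sketch of one common byte-scaled RGB-to-HSV formulation; scaling hue and saturation to the 0–255 range is my assumption here, not necessarily what the original function did.


#include <algorithm>   // std::max, std::min
// WORD and DWORD are the Windows 16-bit and 32-bit unsigned types from <windows.h>.

void RGB2HSV( WORD r, WORD g, WORD b, WORD* h, WORD* s, WORD* v )
{
    int ri = r, gi = g, bi = b;
    int maxC  = std::max( ri, std::max( gi, bi ) );
    int minC  = std::min( ri, std::min( gi, bi ) );
    int delta = maxC - minC;

    *v = (WORD)maxC;                                          // value = largest channel
    *s = (WORD)( maxC == 0 ? 0 : ( 255 * delta ) / maxC );    // saturation scaled to 0-255

    if ( delta == 0 )
        *h = 0;                                               // gray: hue is undefined, use 0
    else if ( maxC == ri )
        *h = (WORD)( ( 43 * ( gi - bi ) / delta + 256 ) % 256 );
    else if ( maxC == gi )
        *h = (WORD)( 85 + 43 * ( bi - ri ) / delta );
    else
        *h = (WORD)( 171 + 43 * ( ri - gi ) / delta );
}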

Before doing any work, a benchmark timing was needed. I added code to obtain the number of milliseconds it took to execute this conversion code. With the Microsoft Visual C++ compiler in Release mode, this code executed in 608 milliseconds. This was already a vast improvement from when it ran on slower machines ten years ago.
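The timing harness itself is not shown in this post. A minimal sketch of how such a measurement might look with standard C++ <chrono> follows; the BuildHSVLookupTable() wrapper is hypothetical, and the original code may well have used a Windows timer such as GetTickCount() instead.


#include <chrono>
#include <cstdio>

void BuildHSVLookupTable();   // hypothetical wrapper around the loops in Listing 1

int main()
{
    auto start = std::chrono::steady_clock::now();

    BuildHSVLookupTable();

    auto stop = std::chrono::steady_clock::now();
    long long ms =
        std::chrono::duration_cast<std::chrono::milliseconds>( stop - start ).count();
    printf( "Lookup table built in %lld milliseconds\n", ms );
    return 0;
}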

The first improvement came when I compiled with the Intel compiler. The Release-mode timing was now 421 milliseconds, roughly a 30 percent improvement over the original time. So a simple switch to the Intel compiler delivered a substantial performance gain on its own.

The next thing I did, though, is where Intel’s technology really shines. Parallel Studio provides a way to break up for loops so that they run across multiple threads, and altering the code to do it is extremely easy. All that is required is to replace the for keyword on the outermost loop with the cilk_for keyword. After running my test program I was amazed to see an execution time of 125 milliseconds, a 79 percent improvement over the original time. Listing 2 shows an abbreviated version of the new source code, and the results of all three timing tests can be seen in Figure 1.


cilk_for ( int r = 0; r < 256; r++ )
{
    for ( int g = 0; g < 256; g++ )
    {
        for ( int b = 0; b < 256; b++ )
        {
            . . .
        }
    }
}

Listing 2: The original code with three nested loops to convert from all RGB values to HSV values, but replacing the for keyword in the first loop with the cilk_for keyword.
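One practical note: the cilk_for keyword is made available by including the cilk/cilk.h header, and the Cilk runtime normally picks the number of worker threads for you. If you want to experiment with a specific worker count, the runtime API in cilk/cilk_api.h can be used. The small sketch below assumes the Intel Cilk Plus runtime that ships with Parallel Studio; the loop body is just a placeholder.


#include <cilk/cilk.h>       // cilk_for, cilk_spawn, cilk_sync
#include <cilk/cilk_api.h>   // runtime control functions
#include <cstdio>

int main()
{
    // Optional: ask the Cilk runtime for a specific number of workers
    // (must be done before the first parallel construct runs).
    __cilkrts_set_param( "nworkers", "4" );

    printf( "Cilk workers: %d\n", __cilkrts_get_nworkers() );

    int squares[1000];
    cilk_for ( int i = 0; i < 1000; i++ )
    {
        squares[i] = i * i;   // each iteration writes its own slot, so there is no race
    }
    printf( "squares[999] = %d\n", squares[999] );
    return 0;
}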


Figure 1: Timings running the code with the Visual C++ compiler, the Intel compiler, and the Intel compiler with the cilk_for construct.

What is going on behind the scenes deserves some explanation. First, it is important to note that any number of threads may be provided to execute the loop in parallel, and this detail is taken care of for you. The number of threads given to your code typically depends on the number of processors and how busy the system currently is. If the code has been given four threads, the work of the loop is broken into four pieces that execute in parallel.

Conceptually, you can picture the division of work like this. The runtime determines how many threads are available, and each parallel execution of the loop body is associated with a thread number; with four threads, the thread numbers run from 0 to 3. Each thread starts the loop at the index corresponding to its number. For instance, thread 0 starts at index 0 and thread 1 starts at index 1. Each thread then advances its loop counter by the number of threads, which in this example is four. So thread 0 counts through the sequence 0, 4, 8, 12, 16, and so forth, while thread 1 counts through the sequence 1, 5, 9, 13, 17, and so forth. The following illustrations show this graphically.

Thread 0 starts counting at 0. Since there are four threads in this example, it then counts through the sequence 4, 8, 12, and so forth.

Thread 1 starts counting at 1, and then counts through the sequence 5, 9, 13, and so forth.

This process continues as each of the allocated threads traverses the loop from its unique starting point, moving to the next index in its sequence by adding the number of threads.

Thread 2 starts counting at 2, and then counts through the sequence 6, 10, 14, and so forth.

Thread 3 starts counting at 3, and then counts through the sequence 7, 11, 15, and so forth.
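To make that division of work concrete, here is a small sketch that applies the same striding idea to the outer loop of Listing 1 by hand, using standard std::thread. This is only an illustration of the concept described above, with a hypothetical table name and the RGB2HSV() routine assumed to be defined elsewhere; it is not what Parallel Studio generates, since cilk_for manages the partitioning and the thread pool for you.


#include <thread>
#include <vector>

typedef unsigned short WORD;    // stand-ins for the Windows types used above
typedef unsigned long  DWORD;

void RGB2HSV( WORD r, WORD g, WORD b, WORD* h, WORD* s, WORD* v );  // defined elsewhere

std::vector<DWORD> pHSVLookup( 256 * 256 * 256 );   // hypothetical lookup table

// Each thread handles outer-loop indices threadNum, threadNum + numThreads, ...
void BuildSlice( int threadNum, int numThreads )
{
    for ( int r = threadNum; r < 256; r += numThreads )
    {
        for ( int g = 0; g < 256; g++ )
        {
            for ( int b = 0; b < 256; b++ )
            {
                WORD h, s, v;
                RGB2HSV( (WORD)r, (WORD)g, (WORD)b, &h, &s, &v );

                // Compute the entry's index directly from r, g, and b
                // so that threads never write to the same slot.
                int nIndex = ( r << 16 ) | ( g << 8 ) | b;
                pHSVLookup[nIndex] =
                    ( (DWORD)h << 16 ) | ( (DWORD)s << 8 ) | (DWORD)v;
            }
        }
    }
}

void BuildHSVLookupTable()
{
    const int numThreads = 4;
    std::vector<std::thread> workers;
    for ( int t = 0; t < numThreads; t++ )
        workers.emplace_back( BuildSlice, t, numThreads );
    for ( std::thread& w : workers )
        w.join();
}

Note one consequence of running iterations in parallel: a shared running counter like the nIndex++ in Listing 1 would become a data race. Computing each entry's index directly from r, g, and b, as in the sketch, is one way to avoid that, and the same consideration applies when you switch the loop to cilk_for.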

Using the Math Library

The second thing in the Parallel Studio quiver that we can use is the math library. It provides a number of advanced math functions that are very useful in areas such as engineering, financial modeling, and statistical analysis. Before using the math library in Visual Studio, go to the project settings, then Configuration Properties, then Intel Performance Libraries, and set the Use Intel MKL drop-down.

Matrix Multiplication

Matrix multiplication is a staple of computer science that comes to us from linear algebra. Its uses are varied, including solving problems in graph theory, machine learning and prediction, and economic analysis. The Parallel Studio libraries would have been really helpful when I developed the software for my dissertation, because I could have used the data-fitting functions while establishing the relationship of data compression to entropy. (I won’t tell you how badly that code needs to be refactored!) The Intel Math Kernel Library (Intel MKL) supplies flexible and fast functions for math tasks. The one we will examine below is the cblas_dgemm() function, which multiplies two matrices together and gives you a third matrix with the results. As you will see, this function is very easy to use and well worth the effort.

First, we start by allocating the memory for the matrices. The mkl_malloc() function lets us align the memory on a 64-byte boundary, which gives better performance. The following code allocates three buffers for matrices named A, B, and C and calls a function that initializes the values contained in A and B. The C matrix will hold the result of multiplying A and B once cblas_dgemm() is called.


int nFirst = 2000, nSecond = 200, nThird = 1000;
double *A, *B, *C;

// 64-byte alignment helps MKL make the best use of the processor's vector units.
A = (double *)mkl_malloc( nFirst  * nSecond * sizeof( double ), 64 );   // A is nFirst x nSecond
B = (double *)mkl_malloc( nSecond * nThird  * sizeof( double ), 64 );   // B is nSecond x nThird
C = (double *)mkl_malloc( nFirst  * nThird  * sizeof( double ), 64 );   // C is nFirst x nThird
InitializeMatrixValues( A, B );
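The body of InitializeMatrixValues() isn’t shown in this post, and the exact values don’t matter for the API usage. A minimal sketch, assuming the dimensions above are visible to it (for example, as globals) and using arbitrary values, might look like this:


void InitializeMatrixValues( double *A, double *B )
{
    // Fill A and B with simple, predictable values (row-major order).
    for ( int i = 0; i < nFirst * nSecond; i++ )
        A[i] = (double)( i + 1 );

    for ( int i = 0; i < nSecond * nThird; i++ )
        B[i] = (double)( -i - 1 );
}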

Second, we call cblas_dgemm() to do the multiplication; you can see the code below. It is important to note that CblasRowMajor and CblasNoTrans are defined in mkl.h. The nFirst, nSecond, and nThird values provide the matrix dimensions, the three matrices are passed in along with their leading dimensions, and two scalar values representing alpha and beta are passed in (the routine computes C = alpha*A*B + beta*C). Figure 2 shows matrix A with its initial values, matrix B with its initial values, and matrix C with the resulting values.


// Compute C = alpha*A*B + beta*C. Since C starts out uninitialized,
// beta is 0.0 so that the prior contents of C are ignored.
cblas_dgemm( CblasRowMajor, CblasNoTrans, CblasNoTrans, nFirst,
             nThird, nSecond, 1.0, A, nSecond, B, nThird, 0.0, C, nThird );
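When you are finished with the matrices, buffers allocated with mkl_malloc() should be released with mkl_free(). A short sketch of the wrap-up, printing one element of the result just to show how the row-major layout is indexed, might look like this:


// Element (row, col) of the row-major result C is C[row * nThird + col].
printf( "C[0][0] = %f\n", C[0 * nThird + 0] );

// Buffers from mkl_malloc() must be released with mkl_free().
mkl_free( A );
mkl_free( B );
mkl_free( C );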


Figure 2: Matrix A and B initial values, and resulting matrix C values.

Conclusion

As you can see, Parallel Studio can bring new life to your legacy C++ code. Here you have seen how parallelizing code can usher in impressive performance gains, and how the math library provides fast functions that are easy to use. There is much more that I plan to cover in future blogs.

Posted on April 3, 2015 by Rick Leinecker, Slashdot Media Contributing Editor