Data Normalization with SIMD Vectorization

Data sets often represent collected values that reflect real-world situations. For instance, census data might contain the ages of all residents within a certain township. Another example is when schools aggregate grade averages among classrooms. Many times, though, a data collection has a few components whose magnitudes are so large that they overwhelm all the others. When this is the case, analysis of such a skewed data set may not provide usable results or conclusions. For this reason, many data sets are normalized before they are analyzed.

Normalizing the data in this way can provide more accurate conclusions and correlations once the data has been analyzed. This article discusses how to normalize data using both the Intel Math Kernel Library and the Intel SIMD pragma, both of which are part of Intel’s Parallel Studio.

Let’s start by looking at data that defies accurate analysis until it is normalized. Suppose we have a data set containing the number of people who enjoy jelly beans of different colors. Maybe the data was gathered via an online questionnaire. As you can see in Figure 1, almost everyone likes red the best. But the values for green, blue, and purple are so much smaller in comparison that their statistical significance is low when the data is analyzed. Figure 2 shows the data once it has been normalized.

040715_pic1.png
Figure 1: The magnitude of the red value is so much greater than green, blue, and purple that the statistical significance of those will be low.

040715_pic2.png
Figure 2: The normalized data gives the lower three values more statistical validity.

Looking at the initial data shows us that red is clearly dominant, but its dominance prevents a true statistical treatment of green, blue, and purple. To analyze the data with statistical integrity, we need to normalize it first. We will now look at two ways to accomplish this with Intel’s Parallel Studio.

Intel Math Kernel Library

Intel’s Math Kernel Library provides hundreds of mathematical functions to help in calculations, analysis, and statistics. What we will use for our data normalization comes from the Basic Linear Algebra Subprograms (BLAS) section. BLAS includes vector operations, matrix-vector operations, and matrix-matrix operations. All of the BLAS functions are optimized to take advantage of Intel single instruction, multiple data (SIMD) technology. What we will use are two vector operations that act upon vectors of the double type. There are two steps to the process of data normalization for our example: obtain a Euclidean norm from the vector, and then adjust the vector so that its data is normalized.

The Euclidean norm of a vector is the square root of the sum of the squares of its elements. It can be thought of as the vector’s length, or the distance from its endpoint to the origin. For instance, the Euclidean norm of the following double vector can be calculated by the formula that follows it.


double dInputData[] = { 5, 6, 7, 8, 9, 10 };
double dEuclideanNorm = sqrt( 5 * 5 + 6 * 6 + 7 * 7 + 8 * 8 +
9 * 9 + 10 * 10);
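
The same calculation can be written as a worked equation. The symbols x_i (the i-th element) and n (the element count) are notation assumed here for illustration and do not appear in the code above.

```latex
\[
\lVert x \rVert_2
  = \sqrt{\sum_{i=1}^{n} x_i^{2}}
  = \sqrt{25 + 36 + 49 + 64 + 81 + 100}
  = \sqrt{355}
  \approx 18.8414
\]
```

Dividing each element by this value gives the normalized vector; for example, 5 / 18.8414 is approximately 0.2654.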

We will discuss doing this using iterative C++ code in the next section. For now, though, let’s look at how we can get the Euclidean norm without writing any code except a single function call. The following shows how to use the cblas_dnrm2() function to obtain the norm.


double dInputData[] = { 5, 6, 7, 8, 9, 10 };
double dEuclideanNorm = cblas_dnrm2( 6, dInputData, 1 );

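Now we need to use the derived norm to normalize the vector data. This requires one more function call, to cblas_dscal(), which multiplies every element of the vector by a scalar. Because normalizing means dividing each element by the norm, the scalar we pass is the reciprocal of the norm, as shown below. After this function call, the data will be normalized as shown.

cblas_dscal( 6, 1.0 / dEuclideanNorm, dInputData, 1 );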
// dInputData normalized to:
// { 0.26537244621713763, 0.31844693546056513,
// 0.37152142470399269, 0.42459591394742019,
// 0.47767040319084769, 0.53074489243427525 }
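
To make the arguments of the two BLAS calls concrete, here is a minimal sketch of what they compute, written in plain C++ so it builds without MKL. The names refNrm2, refScal, and refNormalize are hypothetical stand-ins invented for this illustration; they are not MKL functions. The sketch assumes a positive stride, and unlike a production BLAS it makes no attempt to rescale values to avoid overflow.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical reference version of what cblas_dnrm2( n, x, incX ) computes:
// the Euclidean norm of n elements of x, stepping incX elements at a time.
// Assumes incX > 0 and does not rescale to avoid overflow.
double refNrm2(int n, const double* x, int incX)
{
    double sumOfSquares = 0.0;
    for (int i = 0; i < n; ++i)
    {
        const double v = x[static_cast<std::ptrdiff_t>(i) * incX];
        sumOfSquares += v * v;
    }
    return std::sqrt(sumOfSquares);
}

// Hypothetical reference version of what cblas_dscal( n, alpha, x, incX )
// does: multiplies n elements of x by alpha, in place.
void refScal(int n, double alpha, double* x, int incX)
{
    for (int i = 0; i < n; ++i)
        x[static_cast<std::ptrdiff_t>(i) * incX] *= alpha;
}

// Normalizes n contiguous elements of x to unit length.
// Returns false and leaves x untouched when the norm is zero,
// because the reciprocal would be undefined.
bool refNormalize(int n, double* x)
{
    const double norm = refNrm2(n, x, 1);
    if (norm == 0.0)
        return false;
    refScal(n, 1.0 / norm, x, 1);
    return true;
}
```

Calling refNormalize( 6, dInputData ) on the example vector follows the same two steps as the MKL calls above. The zero-norm check is a design choice of this sketch: an all-zero vector has no direction, so it is left alone instead of being filled with NaN values.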

Using the SIMD Pragma

There are times when you need more control over how the normalization is vectorized. Such cases occur when you know something about the data set that should influence how it is normalized. In these cases, you can write the loops yourself and use the SIMD pragma to tell the Intel compiler in Parallel Studio that you want them vectorized with SIMD instructions. Note that the reduction clause causes the compiler to give each SIMD lane its own private copy of dEuclideanNorm; the copies are combined at the end of the loop, which avoids race conditions.


double dEuclideanNorm = 0.0;
#pragma simd reduction (+:dEuclideanNorm)
for( int i=0; i<6; i++ )
{
    dEuclideanNorm += ( dInputData[i] * dInputData[i] );
}
dEuclideanNorm = sqrt( dEuclideanNorm );
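
To picture what the reduction clause arranges, the following sketch writes the private copies out by hand. It is only an illustration of the idea, not what the compiler literally emits: the function name sumOfSquaresByLanes and the lane count of four are assumptions chosen for readability, and the real width depends on the target instruction set.

```cpp
#include <cstddef>

// Hand-written illustration of a sum-of-squares reduction across lanes.
// kLanes is an assumed width picked for this example only.
double sumOfSquaresByLanes(const double* data, std::size_t count)
{
    constexpr std::size_t kLanes = 4;
    double partial[kLanes] = { 0.0, 0.0, 0.0, 0.0 };

    // Main body: each lane owns its own accumulator, so no two lanes
    // ever write to the same variable.
    std::size_t i = 0;
    for (; i + kLanes <= count; i += kLanes)
        for (std::size_t lane = 0; lane < kLanes; ++lane)
            partial[lane] += data[i + lane] * data[i + lane];

    // Combine step: fold the private accumulators into one total.
    double total = 0.0;
    for (std::size_t lane = 0; lane < kLanes; ++lane)
        total += partial[lane];

    // Remainder: elements left over when count is not a multiple of kLanes.
    for (; i < count; ++i)
        total += data[i] * data[i];

    return total;
}
```

Because the additions happen in a different order than in a plain sequential loop, a vectorized floating-point sum can differ from the sequential one in its last few bits. For the small integer values in the example vector the result is exact either way.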

Now that we have the norm, we need to loop through and normalize the data as follows.

#pragma simd
for( int i=0; i<6; i++ )
{
    dInputData[i] /= dEuclideanNorm;
}
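
The two loops can be wrapped into one reusable function, as in the sketch below. Several things here are assumptions and not part of the original example: the function name normalizeSimd, the zero-norm check, and the use of #pragma omp simd, which is the OpenMP-standard spelling that compilers other than Intel's also recognize. A compiler built without OpenMP SIMD support simply ignores the pragma and runs the loops as ordinary code.

```cpp
#include <cmath>
#include <cstddef>

// Normalizes data[0..count) to unit Euclidean length, in place.
// Returns the norm that was computed. If the norm is zero the data
// is left unchanged and 0.0 is returned.
double normalizeSimd(double* data, std::size_t count)
{
    double sumOfSquares = 0.0;

    // Each SIMD lane accumulates a private partial sum; the partial
    // sums are added together when the loop finishes.
    #pragma omp simd reduction(+:sumOfSquares)
    for (std::size_t i = 0; i < count; ++i)
        sumOfSquares += data[i] * data[i];

    const double norm = std::sqrt(sumOfSquares);
    if (norm == 0.0)
        return 0.0;

    // One division up front, then one multiplication per element.
    const double inverse = 1.0 / norm;
    #pragma omp simd
    for (std::size_t i = 0; i < count; ++i)
        data[i] *= inverse;

    return norm;
}
```

Calling normalizeSimd( dInputData, 6 ) on the example vector returns a norm of roughly 18.8414 and leaves the vector with unit length. Multiplying by a precomputed reciprocal instead of dividing inside the loop mirrors what the cblas_dscal() call does; it can change the last bit of a result compared with per-element division.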

Conclusion

As you can see, Intel Parallel Studio provides easy, fast, and efficient ways to normalize data. We will take a look at more of the Intel Math Kernel Library in the future since it offers an extensive arsenal of tools to bring your programs to a new level.

Posted on April 7, 2015 by Rick Leinecker, Slashdot Media Contributing Editor