Digging Deeper Into Vectorized Parallel Assembly Code

In my last blog post (“Digging Under the Hood with Vectorization and the Intel Compiler”), I wrote about how to get to the assembly language level of your code. If you went through the steps, you should have seen some assembly instructions that might not be familiar to you. I know when I learned Intel assembly eons ago, the instructions were single-register instructions that primarily consisted of moving data between memory and those registers, along with the usual comparison and jumps. Today, however, we see code that starts like this in C++:

sqrt(x)

…which gets compiled down to this assembly code:

sqrtps    xmm3, xmm0

This is a single instruction that performs a square root simultaneously on four single-precision values packed into one 128-bit XMM register. Contrast that with this instruction, which performs only a single square root, on the lowest-order double-word:

sqrtss    xmm1, xmm1

The sqrt portion of the names obviously stands for square root, while the ps stands for packed single-precision (some websites incorrectly state that the “p” stands for parallel, but it’s actually packed) and the ss stands for scalar single-precision.
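If you'd like to see the packed/scalar difference without reading assembly, here's a minimal sketch using the SSE intrinsics from <xmmintrin.h>. The helper names packed_sqrt and scalar_sqrt are my own, made up for illustration; compilers typically map _mm_sqrt_ps to sqrtps and _mm_sqrt_ss to sqrtss, though the exact output depends on your compiler and flags.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86 only)

// Packed: takes the square root of all four lanes at once (sqrtps).
// packed_sqrt is a hypothetical helper name, not from the original post.
void packed_sqrt(const float in[4], float out[4]) {
    __m128 v = _mm_loadu_ps(in);          // unaligned load of four floats
    _mm_storeu_ps(out, _mm_sqrt_ps(v));   // four square roots in one instruction
}

// Scalar: takes the square root of lane 0 only (sqrtss).
// Lanes 1 through 3 are copied through unchanged.
void scalar_sqrt(const float in[4], float out[4]) {
    __m128 v = _mm_loadu_ps(in);
    _mm_storeu_ps(out, _mm_sqrt_ss(v));
}
```

Feed both functions the same four-element array and only the packed version changes every element; the scalar one touches just the first.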

Processor Support Differences

But here's the catch: not all processors support the advanced instructions. The manuals state that the compiler generates both a vectorized function and a non-vectorized function, and that the non-vectorized function is for use on processors that don't have the advanced instructions. But what exactly is going on there? Let's dig into the assembly code and find out.

First, let’s start with some code like this:    

__declspec(vector) float domath(float op1, float op2) {
    return sqrt(op1) + sqrt(op2);
}

This tells the compiler to try to generate a vectorized version of the function for use in a SIMD loop. And indeed, when I compiled it, I saw this message: 

FUNCTION WAS VECTORIZED
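For context, here's a sketch of the kind of loop such a function gets called from. I'm making a few assumptions: the loop shape and the name apply_domath are mine, not taken from my actual test program, and I've used the OpenMP 4.0 pragmas (declare simd and simd) as a roughly equivalent, portable stand-in for the Intel-specific __declspec(vector). A compiler without OpenMP SIMD support will simply ignore the pragmas and produce a scalar loop. I've also used std::sqrt here, whose float overload stays in single precision.

```cpp
#include <cmath>
#include <cstddef>

// Ask the compiler to also generate a SIMD variant of this function.
// (Assumption: OpenMP 4.0 pragma used in place of __declspec(vector).)
#pragma omp declare simd
float domath(float op1, float op2) {
    return std::sqrt(op1) + std::sqrt(op2);
}

// Hypothetical driver loop: one domath call per element of the inputs.
void apply_domath(const float* a, const float* b, float* out, std::size_t n) {
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = domath(a[i], b[i]);
    }
}
```

The three pointers correspond to the two source arrays and the one destination array you'll see indexed in the assembly listings.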

Debugger Follow-Up

In my previous blog post, I explained how you can attach a debugger to the non-debug version of the running executable and trace through the code. Just a quick follow-up to that: there's another option that doesn't require starting a separate instance of Visual Studio. You can execute your program in Release configuration. Then, in the same instance of Visual Studio, click Debug -> Attach To Process. That works just as well. I encourage you to set breakpoints and step through the individual assembly instructions. Meanwhile, I'll show you the generated assembly and what it does.

Comments for Clarity

What’s nice is the assembly code has the original C++ code included as comments, which makes it easy to find what you need. However, remember that the code has been heavily optimized. That means, among other things, that function calls might get generated as inline, which makes the generated assembly code quite different from the C++ code without a one-to-one correspondence. But you can still trace through it and see what it’s doing.

Due to space, I’m not going to put the entire code here, but I’ll show you some of it. Indeed, there’s more than one version of the code inside my loop. The code does some tests, and depending on the results of the tests, runs this code:

movss     xmm1, DWORD PTR [edx+ebx*4]
movss     xmm0, DWORD PTR [ecx+ebx*4]
sqrtss    xmm1, xmm1
sqrtss    xmm0, xmm0
addss     xmm1, xmm0
movss     DWORD PTR [esi+ebx*4], xmm1

This is the simple scalar version. It loads the two operands, performs the two square roots, adds the results, and then stores the sum into the destination. Or, if the capabilities are present, it runs either this code:

movups    xmm0, XMMWORD PTR [edx+eax*4]
movups    xmm1, XMMWORD PTR [ecx+eax*4]
sqrtps    xmm2, xmm0
sqrtps    xmm3, xmm1
cvtps2pd  xmm6, xmm2
cvtps2pd  xmm4, xmm3
movhlps   xmm2, xmm2
movhlps   xmm3, xmm3
cvtps2pd  xmm7, xmm2
addpd     xmm6, xmm4
cvtps2pd  xmm5, xmm3
cvtpd2ps  xmm0, xmm6
addpd     xmm7, xmm5
cvtpd2ps  xmm7, xmm7
movlhps   xmm0, xmm7
movntps   XMMWORD PTR [esi+eax*4], xmm0

…or a very similar one, which I won't reprint due to space. This code is indeed doing a vectorized version. Sort of. The square roots are packed single-precision, but the code also converts the results to double precision for the addition and then back to single precision before storing them.
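To make the packed listing above easier to follow, here's my attempt at the same sequence written with SSE2 intrinsics, one step per group of instructions. Treat it as a reading aid rather than the compiler's actual source: the function name domath4 is made up, and I've used an unaligned store (_mm_storeu_ps) where the listing uses the non-temporal movntps, because the latter requires a 16-byte-aligned destination.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (also pulls in SSE)

// Hypothetical helper: computes sqrt(a[i]) + sqrt(b[i]) for i = 0..3,
// following the steps of the packed assembly listing.
void domath4(const float* a, const float* b, float* out) {
    // movups + sqrtps: load four floats from each input, take four roots.
    __m128 ra = _mm_sqrt_ps(_mm_loadu_ps(a));
    __m128 rb = _mm_sqrt_ps(_mm_loadu_ps(b));

    // cvtps2pd widens only the low two lanes to double; addpd adds them.
    __m128d lo = _mm_add_pd(_mm_cvtps_pd(ra), _mm_cvtps_pd(rb));

    // movhlps moves the high two lanes down so cvtps2pd can reach them.
    __m128d hi = _mm_add_pd(_mm_cvtps_pd(_mm_movehl_ps(ra, ra)),
                            _mm_cvtps_pd(_mm_movehl_ps(rb, rb)));

    // cvtpd2ps narrows each pair back to float; movlhps recombines them.
    __m128 result = _mm_movelh_ps(_mm_cvtpd_ps(lo), _mm_cvtpd_ps(hi));

    // The listing uses movntps here; an unaligned store is the safe stand-in.
    _mm_storeu_ps(out, result);
}
```

Seen this way, the 16 instructions break down into two loads, two packed square roots, four widenings, two double-precision adds, two narrowings, and some shuffling to get the halves into place.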

More Questions

This opens up additional questions. First, why the conversion between single and double precision? Also, did the program actually detect the processor's feature set at runtime and pick the correct approach to use? And if so, what about additional features, e.g. MMX, Advanced Vector Extensions (AVX), and so on? Can we write custom code to support those? Indeed we can. We'll tackle that next time.
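As a small preview of the detection question, here's a sketch of how a program can ask the processor what it supports, using the CPUID instruction. I'm not claiming this is what the Intel compiler's generated tests do; it's just one common way to check. The function name cpu_has_sse2 is my own, and the code assumes an x86 target with either <intrin.h> (Visual Studio and compatible compilers) or <cpuid.h> (GCC and Clang).

```cpp
#if defined(_MSC_VER)
#include <intrin.h>   // __cpuid
#else
#include <cpuid.h>    // __get_cpuid (GCC and Clang)
#endif

// Hypothetical helper: returns true if CPUID leaf 1 reports SSE2,
// which is bit 26 of the EDX feature flags.
bool cpu_has_sse2() {
#if defined(_MSC_VER)
    int regs[4];                 // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);
    return (regs[3] & (1 << 26)) != 0;
#else
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        return false;            // leaf 1 not available
    }
    return (edx & (1u << 26)) != 0;
#endif
}
```

A program could call a check like this once at startup and then branch to either a packed or a scalar code path, which is the general shape of the tests I saw guarding the two versions of my loop.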

Meanwhile, I encourage you to trace through your own assembly code to see what exactly is going on, and think about whether the code is optimal. And then share your thoughts in the comments below. And as always, have fun!

 _______________________________________________________________

Jeff Cogswell is a Geeknet contributing editor, and is the author of several tech books including C++ All-In-One Desk Reference For Dummies, C++ Cookbook, and Designing Highly Useable Software. A software engineer for over 20 years, Jeff has written extensively on many different development topics. An expert in C++ and JavaScript, he has experience starting from low-level C development on Linux, up through modern web development in JavaScript and jQuery, PHP, and ASP.NET MVC. 

Posted on by Jeff Cogswell, Geeknet Contributing Editor