In my last blog post (“Digging Under the Hood with Vectorization and the Intel Compiler”), I wrote about how to get to the assembly language level of your code. If you went through the steps, you probably saw some assembly instructions that weren’t familiar to you. When I learned Intel assembly eons ago, each instruction worked on a single value in a register, and the instruction set consisted primarily of moving data between memory and those registers, along with the usual comparisons and jumps. Today, however, we see code that starts like this in C++:
    sqrt(x)
…which gets compiled down to this assembly code:
    sqrtps xmm3, xmm0
This is a single instruction that performs a square root simultaneously on four single-precision numbers packed into one 128-bit XMM register. Contrast that with this instruction, which performs a single square root on the lowest-order double-word only:
    sqrtss xmm1, xmm1
The sqrt portion of the names obviously stands for square root, while the ps stands for packed single-precision (some websites incorrectly state that the “p” stands for parallel, but it’s actually packed) and the ss stands for scalar single-precision.
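To make the packed-versus-scalar distinction concrete, here is a minimal sketch using the SSE compiler intrinsics that map onto these two instructions. It assumes an x86 target with SSE available; the function names packed_sqrt and scalar_sqrt are made up for illustration and don’t come from the original code.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: _mm_sqrt_ps, _mm_sqrt_ss
#include <array>

// Packed: square root of all four floats in the register (sqrtps).
std::array<float, 4> packed_sqrt(const std::array<float, 4>& in) {
    __m128 v = _mm_loadu_ps(in.data());   // unaligned load of four floats
    __m128 r = _mm_sqrt_ps(v);
    std::array<float, 4> out;
    _mm_storeu_ps(out.data(), r);
    return out;
}

// Scalar: square root of the lowest float only (sqrtss).
// The upper three floats are passed through unchanged.
std::array<float, 4> scalar_sqrt(const std::array<float, 4>& in) {
    __m128 v = _mm_loadu_ps(in.data());
    __m128 r = _mm_sqrt_ss(v);
    std::array<float, 4> out;
    _mm_storeu_ps(out.data(), r);
    return out;
}
```

Feeding both functions the same four values shows the difference: the packed version changes every element, while the scalar version changes only element 0.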
Processor Support Differences
But here’s the catch: not all processors support the advanced instructions. The manuals state that the compiler generates both a vectorized and a non-vectorized version of the function, and that the non-vectorized version is for use on processors that don’t have the advanced instructions. But what exactly is going on there? Let’s dig into the assembly code and find out.
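Before looking at what the compiler does, it may help to see the general idea of keeping two versions of a routine and choosing between them at run time. The following is only a hand-written sketch of that idea, not the Intel compiler’s actual mechanism. It assumes an x86 target and GCC or Clang, whose __builtin_cpu_supports builtin is used for the feature check; on any other compiler the sketch conservatively falls back to the scalar path. All function names here are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics

// Non-vectorized version: one square root per iteration.
void sqrt_all_scalar(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = std::sqrt(in[i]);
}

// Vectorized version: four square roots per iteration, scalar for the leftovers.
void sqrt_all_sse(const float* in, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(out + i, _mm_sqrt_ps(_mm_loadu_ps(in + i)));
    for (; i < n; ++i)
        out[i] = std::sqrt(in[i]);
}

// Assumption: GCC/Clang builtin; other compilers report "not supported" here.
bool cpu_has_sse() {
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_cpu_supports("sse") != 0;
#else
    return false;
#endif
}

using SqrtFn = void (*)(const float*, float*, std::size_t);

// Pick the implementation once, based on what the processor reports.
SqrtFn pick_sqrt_impl() {
    return cpu_has_sse() ? sqrt_all_sse : sqrt_all_scalar;
}
```

A caller would write something like pick_sqrt_impl()(in, out, n). Both paths produce the same results; only the instructions used to get there differ.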
First, let’s start with some code like this:
    __declspec(vector) float domath(float op1, float op2) {
        return sqrt(op1) + sqrt(op2);
    }
This tells the compiler to try to generate a vectorized version of the function for use in a SIMD loop. And indeed, when I compiled it, I saw this message:
    FUNCTION WAS VECTORIZED
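For readers without the Intel compiler, here is a rough sketch of the same setup: a function marked for vectorization and a loop that calls it. It uses the OpenMP pragmas “declare simd” and “simd” as a stand-in for __declspec(vector), which is an assumption on my part rather than what the original code uses. Compilers without OpenMP SIMD support enabled simply ignore the pragmas and produce scalar code. The helper domath_all is hypothetical.

```cpp
#include <cmath>
#include <cstddef>

// Assumption: "#pragma omp declare simd" stands in for __declspec(vector).
// It asks the compiler to also generate a SIMD variant of the function.
#pragma omp declare simd
float domath(float op1, float op2) {
    return std::sqrt(op1) + std::sqrt(op2);
}

// Hypothetical driver loop: each iteration is independent, so the
// compiler is free to process several elements per instruction.
void domath_all(const float* a, const float* b, float* out, std::size_t n) {
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
        out[i] = domath(a[i], b[i]);
}
```

Whether the loop actually gets vectorized depends on the compiler and its flags, so it’s worth checking the vectorization report or the generated assembly rather than assuming.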
Debugger Follow-Up
In my previous blog post, I explained how you can attach a debugger to the non-debug version of the running executable and trace through the code. Just a quick follow-up to that: there’s another option that doesn’t require starting a separate instance of Visual Studio. You can execute your program in Release configuration. Then, in the same instance of Visual Studio, click Debug -> Attach To Process. That works just as well. I encourage you to set breakpoints and step through the individual assembly instructions. Meanwhile, I’ll show you the generated assembly and what it does.
Comments for Clarity
What’s nice is that the assembly code includes the original C++ code as comments, which makes it easy to find what you need. However, remember that the code has been heavily optimized. That means, among other things, that function calls might get inlined, so the generated assembly code can look quite different from the C++ code, with no one-to-one correspondence. But you can still trace through it and see what it’s doing.
Due to space, I’m not going to put the entire code here, but I’ll show you some of it. Indeed, there’s more than one version of the code inside my loop. The code does some tests, and depending on the results of the tests, runs this code:
    movss xmm1, DWORD PTR [edx+ebx*4]
    movss xmm0, DWORD PTR [ecx+ebx*4]
    sqrtss xmm1, xmm1
    sqrtss xmm0, xmm0
    addss xmm1, xmm0
    movss DWORD PTR [esi+ebx*4], xmm1
This is the simple scalar version. It performs the two square roots, adds them, and then moves the result into the destination. Or, if the capabilities are present, it runs either this code:
    movups xmm0, XMMWORD PTR [edx+eax*4]
    movups xmm1, XMMWORD PTR [ecx+eax*4]
    sqrtps xmm2, xmm0
    sqrtps xmm3, xmm1
    cvtps2pd xmm6, xmm2
    cvtps2pd xmm4, xmm3
    movhlps xmm2, xmm2
    movhlps xmm3, xmm3
    cvtps2pd xmm7, xmm2
    addpd xmm6, xmm4
    cvtps2pd xmm5, xmm3
    cvtpd2ps xmm0, xmm6
    addpd xmm7, xmm5
    cvtpd2ps xmm7, xmm7
    movlhps xmm0, xmm7
    movntps XMMWORD PTR [esi+eax*4], xmm0
…or a very similar one, which I won’t reprint due to space. This code is indeed a vectorized version. Sort of. But it also converts back and forth between single-precision and double-precision floating point numbers.
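If the vectorized listing is hard to follow, this sketch restates it with SSE/SSE2 intrinsics, one intrinsic per kind of instruction. It is my reading of the listing, not compiler output, and it assumes an x86 target with SSE2. The function name domath4 is hypothetical. One deliberate difference: the listing ends with movntps, a non-temporal store that requires a 16-byte-aligned address, while the sketch uses an ordinary unaligned store so it works on any buffer.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (also brings in the SSE ones)

// Computes out[i] = sqrt(a[i]) + sqrt(b[i]) for four floats at once.
void domath4(const float* a, const float* b, float* out) {
    // movups + sqrtps: load four floats from each input, take four square roots.
    __m128 ra = _mm_sqrt_ps(_mm_loadu_ps(a));
    __m128 rb = _mm_sqrt_ps(_mm_loadu_ps(b));

    // cvtps2pd + addpd: widen the low two floats to doubles and add them.
    __m128d lo = _mm_add_pd(_mm_cvtps_pd(ra), _mm_cvtps_pd(rb));

    // movhlps: bring the high two floats down to the low half, then repeat.
    __m128d hi = _mm_add_pd(_mm_cvtps_pd(_mm_movehl_ps(ra, ra)),
                            _mm_cvtps_pd(_mm_movehl_ps(rb, rb)));

    // cvtpd2ps: narrow each pair of doubles back to floats (in the low half).
    __m128 lo_f = _mm_cvtpd_ps(lo);
    __m128 hi_f = _mm_cvtpd_ps(hi);

    // movlhps: glue the two low halves together into one register of four floats.
    __m128 result = _mm_movelh_ps(lo_f, hi_f);

    // Plain unaligned store here; the listing uses movntps instead.
    _mm_storeu_ps(out, result);
}
```

Written this way, it’s easier to see that the square roots happen four at a time in single precision, while the additions happen two at a time in double precision.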
More Questions
This opens up additional questions. First, why the conversion between single and double precision? Also, did the processor actually detect the feature set and determine the correct approach to use? And if so, what about additional features, e.g. MMX, Advanced Vector Extensions (AVX), and so on? Can we write custom code to support those? Indeed we can. We’ll tackle that next time.
Meanwhile, I encourage you to trace through your own assembly code to see what exactly is going on, and think about whether the code is optimal. And then share your thoughts in the comments below. And as always, have fun!
_______________________________________________________________
Jeff Cogswell is a Geeknet contributing editor, and is the author of several tech books including C++ All-In-One Desk Reference For Dummies, C++ Cookbook, and Designing Highly Useable Software. A software engineer for over 20 years, Jeff has written extensively on many different development topics. An expert in C++ and JavaScript, he has experience starting from low-level C development on Linux, up through modern web development in JavaScript and jQuery, PHP, and ASP.NET MVC.






