What goes on under the hood—at the assembly level—when you add vectorization to your code? Because it’s a bit involved, I’m first going to show you how to get to the assembly code. Then in my next blog post, I’ll walk you through the assembly code.
Multiple Functions Generated
The documentation for vectorization suggests that multiple functions are generated, and that if the code runs on a processor that doesn’t support vectorization, a non-vectorized version will be called instead. Since the compiler only generates a single executable, and the code in that executable doesn’t change without re-compilation, the different versions of the function must all exist within that one executable. But how does that work? I was curious myself, so I looked at the assembly code.
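Before getting to the assembly, here is a rough sketch of the general idea of keeping two versions of a function in one executable and choosing between them at run time. This is only an illustration, not what the compiler literally emits: the names add_scalar, add_wide and pick_add are made up, the "wide" version is ordinary C++ standing in for real vector instructions, and the CPU capability is passed in as a plain flag rather than detected.

```cpp
#include <cstddef>

// Hypothetical plain version: one element per iteration.
void add_scalar(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

// Stand-in for the vectorized version: four elements per iteration,
// then a scalar loop for the leftovers. It is plain C++ so it runs
// anywhere; a vectorizing compiler would use SIMD instructions here.
void add_wide(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        out[i]     = a[i]     + b[i];
        out[i + 1] = a[i + 1] + b[i + 1];
        out[i + 2] = a[i + 2] + b[i + 2];
        out[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}

using AddFn = void (*)(const float*, const float*, float*, std::size_t);

// Assumption: the caller already knows whether the CPU has the needed
// instructions. Both versions live in the same binary; only the choice
// between them happens at run time.
AddFn pick_add(bool cpu_has_simd) {
    return cpu_has_simd ? add_wide : add_scalar;
}
```

Calling pick_add(true)(a, b, out, n) and pick_add(false)(a, b, out, n) should give the same results; the only difference is which copy of the loop runs.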
The easiest way to look at the assembly code is by setting a compiler option to view the assembly generated by the compiler. Right-click on the project, go to the properties page, and in the C/C++ section under Output Files, set “Assembler Output” to “Assembly With Source Code (/FAs)”. When you do that, you’ll get an .asm file in the same directory as the executable.
Try This Trick
Looking at the assembly listing is fine, but it’s even more useful to trace into the assembly code with the debugger so you can actually step through it. That turned out to be somewhat difficult, because I wanted to build an optimized, non-debug version of the code. Visual Studio was choking on me whenever I tried to launch the debugger in the Release configuration; I was getting an exception way down in the bowels of Windows, inside the ntdll.dll library. To get around this problem, I did a little trick, or hack, if you will. At the beginning of my _tmain, I put these two lines:
    char c;
    std::cin >> c;
Then, I launched the program without debugging. The program would pause, awaiting keyboard input. Next, I started a second instance of Visual Studio, which didn’t have a solution loaded. From that second instance, I attached to my running process (Debug -> Attach To Process). Now with the program frozen as it was awaiting keyboard input, from the debugger I did Debug -> Break All. A message box appeared that said “The process appears to be deadlocked,” along with some other information, which was no problem. But I was in my code. In the Call Stack window, I found the first function down that was inside the _tmain. I double-clicked it and was looking at the C++ code for my cin call.
Now the fun part. I scrolled up to my vectorized loop, and set a breakpoint. I pressed F5 to let the program resume. Then in the console window, I had to type a character and press Enter to get past the cin call.
My code then stopped at the right place. Perfect! Next, right-click and choose Go To Disassembly. (If you see a list of functions, just choose the first one.) There it is. Now you can trace through it using regular debugging techniques. I don’t have space here for a full explanation, but I do want to point out that you can look at the registers through Debug -> Windows -> Registers.
Finally, when you look at the assembly code, you’ll see that the code is intermixed with the original C++ code. That makes it a lot easier to know where you are in your code, as well as see the relationship between the original C++ code and the generated assembly code.
In the forthcoming Part Two of this blog, we’ll step through the code and see what it’s doing. But if you’re eager to go on ahead, pay close attention to some of the interesting op-codes and registers like these:
    00D6154F  movss   xmm1,dword ptr [ecx+eax*4]
    00D61554  movss   xmm0,dword ptr [edx+eax*4]
    00D61559  sqrtss  xmm1,xmm1
    00D6155D  sqrtss  xmm0,xmm0
    00D61561  addss   xmm1,xmm0
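These are SSE instructions working on the XMM registers. Strictly speaking, the ss suffix marks the scalar single-precision forms, which handle one float at a time; the packed forms that process four floats at once end in ps, such as sqrtps and addps. What’s going on here is what we’ll explore next time in Part Two.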
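If you want something to step through in the meantime, a loop of roughly the following shape should produce loads, square roots and an add along the lines of that listing. The original source isn’t shown in this post, so treat this as a guess reconstructed from the instructions: the function name sum_of_roots, the parameter names and the use of float arrays are all assumptions.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical reconstruction: names and signature are guesses,
// inferred only from the shape of the assembly listing.
void sum_of_roots(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        // Two loads, two square roots, one add per element.
        out[i] = std::sqrt(a[i]) + std::sqrt(b[i]);
    }
}
```

Whether the compiler emits scalar or packed instructions for a loop like this depends on the compiler version and optimization settings, so your disassembly may differ from the listing.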