When you start with a serial algorithm and convert it into a parallel algorithm, you want to use the tools in Intel Parallel Studio to do a full test on your running code to make sure there are no memory and thread errors. This is the “verify” stage of development, and where the Parallel Inspector tool can help.
Let’s suppose we have an algorithm that’s going to do a character-by-character comparison of two strings of the same length. It will build a third string of the same length as the other two, and fill in the letter T where the first two strings match, and a letter F where the first two strings differ. For example, given these two strings:
1 “parallel”2 “paxallxl”
the function will detect differences at the third and seventh positions, and generate the string
1 “TTFTTTFT”
A function like this is certainly a good candidate for a parallelism. Here’s a first take of the function in serial form:
01 std::string finddiffs(std::string s1, std::string s2) {02 if (s1.length() != s2.length()) {03 return "";04 }05 06 int len = s1.length();07 std::cout << len << std::endl;08 char * x = new char[len + 1];09 for (int i=0; i<len; i++) {10 if (s1[i] == s2[i]) {11 x[i] = 'T';12 }13 else {14 x[i] = 'F';15 }16 }17 x[len] = '\0';18 19 std::string result = std::string(x);20 delete[] x;21 return result;22 }
The function takes two strings as parameters. It checks that their lengths are the same; if not, it just returns an empty string. Next, it allocates an array of characters of the size in question. It begins looping through the two strings, comparing character-by-character. With each iteration, if the two strings have matching characters, it fills a T in the target array; otherwise it fills in an F. It then adds a null-terminator ‘\0’ to the end of the array, and finally creates a new std::string instance from the array.
I wanted to find out how long a function like this takes before creating the parallel version. To do that, I used the timing tools in Parallel Amplifier. Because I wanted to really test it out, I made two enormous strings (114 million characters each). After running the code, Parallel Amplifier told me the function took 333 milliseconds to run.
Converting this to a parallel function is easy; initially all we need to do is try changing our for loop to a cilk_for, like so:
01 std::string finddiffs2(std::string s1, std::string s2) {02 if (s1.length() != s2.length()) {03 return "";04 }05 06 int len = s1.length();07 std::cout << len << std::endl;08 char * x = new char[len + 1];09 cilk_for (int i=0; i<len; i++) {10 if (s1[i] == s2[i]) {11 x[i] = 'T';12 }13 else {14 x[i] = 'F';15 }16 }17 x[len] = '\0';18 19 std::string result = std::string(x);20 delete[] x;21 return result;22 }
Running the code again under Parallel Amplifier with the same 114-million-character strings on a processor with four cores shows an improvement. The time has been reduced to 77 milliseconds. But is this code correct? I used Parallel Inspector’s Inspect Threading Errors function, with Analysis Time set all the way to the top. The test seemed to go okay, except that a warning appeared:
1 Warning: Cross-thread stack access
A cross-thread stack access warning occurs when your threads are writing to data owned by other threads. We have exactly that situation happening here; we’re writing to the array that was allocated. This can sometimes be a problem, but not necessarily. It can especially be a problem if a loop relies on the results of previous loop, where “previous” refers to the previous one when running serially. In parallel, there’s no guarantee on the order that your loops run.
For example, if we have an array like we do in our example, but each thread reads the value of a different slot in the array, and that slot was calculated from a previous loop serially, then we would have a problem when running in parallel.
The fundamental test is this: Can the loops be run in a different order and still give you the same results? In our case, even though we’re sharing data among the threads (resulting in the warning), we can see that we’re not relying on other iterations. As such, we can run any iteration of our loop in any order and get the same results. And that means we can safely ignore this warning. (It is, after all, just a warning, not an error.)
In summary, we have successfully converted our serial algorithm to parallel, and we received an improvement in performance.
Have you run into similar verification situations when coding? Please tell me about it in the Comments section below.
_______________________________________________________________
Jeff Cogswell is a Geeknet contributing editor, and is the author of several tech books including C++ All-In-One Desk Reference For Dummies, C++ Cookbook, and Designing Highly Useable Software. A software engineer for over 20 years, Jeff has written extensively on many different development topics. An expert in C++ and JavaScript, he has experience starting from low-level C development on Linux, up through modern web development in JavaScript and jQuery, PHP, and ASP.NET MVC.






