ThreadSanitizer
Came across this data race detection tool called ThreadSanitizer today. It is based on Valgrind and yet another one that uses lockset and happen-before.
Did not have time to check it out. But the document seems pretty comprehensive.
Came across this data race detection tool called ThreadSanitizer today. It is based on Valgrind and yet another one that uses lockset and happen-before.
Did not have time to check it out. But the document seems pretty comprehensive.
In OpenCL C, a vector variable needs be aligned to the size of the vector in bytes. For example, a float4 variable needs to be aligned to a 16-byte boundary. And the ‘data’ field in the following struct also needs be aligned to a 16-byte boundary.
struct {
int flag;
float4 data;
} var;
An OpenCL C compiler will take care of the vector alignment for the variables defined within a OpenCL C program. However, it usually is not the case for variables defined in a host program. The OpenCL C vector type is not a native type in the host C, therefore the host compiler may not be aware of the alignment requirement. Mismatched access may happen when a struct variable is passed from the host to the device.
For example,
typedef struct
int flag;
float4 data;
} my_struct;
__kernel void foo(__global my_struct *in, ...)
{
... = in->data;
}
will probably break if the type of the passed in argument is defined as
typedef struct {
int flag;
float data[4];
} my_struct;
in the host program.
You need to manually pad the data structure to look like
typedef struct {
int flag;
int pad[3];
float data[4];
} my_struct;
or use the ‘cl_float4′ provided in the ‘cl_platform.h’, like
typedef struct {
cl_int flag;
cl_float4 data;
} my_struct;
[Note that some implementation may not align 'cl_float4' type correctly. You still need to RTFM :(]
There will be cases where you are not able to change the data structure. For example, you may be parallelizing a legacy program using OpenCL or you may be working with new existing data. In such cases, you can
[Following the style of my 'Common Mistakes in Using OpenMP' series, I am starting the new 'Common Mistakes in Using OpenCL'.]
In OpenCL, one can construct a vector from a set of scalars or vectors by writing a vector type followed by a parenthesized set of expressions. For example,
float4 fv4 = float4(1.0f, 2.0f, 3.0f, 4.0f);
int2 x = int2(1, 2);
int2 y = int2(3, 4);
int4 iv4 = int4(x, y);
The vector type in front of the parenthesis is important. Take a look at the following example.
int4 y = int4(10, 10, 10, 10);
int4 z = (1, 2, 3, 4) + y;
// z equals to int4(14, 14, 14, 14) at this point.
It took quite a while for a colleague and me to figure out why the vector ‘z’ was getting value int4(14, 14, 14, 14) after the second assignment.
Without a vector type, (1, 2, 3, 4) is a comma expression (OpenCL spec 1.0 rev 43, p.145 item l). Just like in C, the expression is evaluated from left to right and gets the value 4. Then the addition becomes a binary operator with a scalar operand and a vector operand. According to the promotion rule on p. 142 (item a), the scalar 4 is widened to vector int4(4, 4, 4, 4). Therefore int4(14, 14, 14, 14) is the sum!
MCFX was a skunk project I did with Liang Chen and Deepankar Bairagi at Sun. It is a C++ programming framework that aims to make writing a limited domain of parallel applications more efficiently. There are many different ways to describe what MCFX really is. Since the core of the MCFX framework is a task scheduler, I will explain what MCFX can do from the point of view of a task scheduler.
Before describing the task scheduler in MCFX, let’s take a look at Cilk, Thread Building Block and OpenMP 3.0. All of them have a task scheduler at the core. As pioneered by Cilk, these schedulers aims to improve the efficiency of parallel execution by using a unfair and cache friend scheduling scheme. Which task being picked to execute or stolen is, generally speaking, based on the execution path of the program and data affiliation. This scheme serves the purpose of being a runtime for general parallel applications well. However, tasks are individual items and their relationship that exist in the application level are not exposed to the scheduler.
MCFX scheduler is different. The scheduling decision of MCFX is based not only on the execution history of the program and cache behavior, but also on the relationship between the tasks. Tasks can have priorities; the execution of one task can cancel the execution of another task (or a set of tasks); tasks can have dependencies amongst themselves; certain tasks can be allowed to executed multiple times. These properties are exposed to the scheduler and become the critical part of the scheduling decision.
Why would we want to do this? Because in many applications, the jobs that can be executed concurrently do have such properties. And sometimes, these jobs can be executed correctly in parallel only when the non-trivial concurrency are being honored by the scheduler. Such jobs appear quite frequently in applications that use branch and bound algorithms, A* algorithms, etc..
MCFX is designed for such applications. It has a task scheduler for non-trivial parallel tasks at its core and provides high level abstract classes and templates for end users to describe the non-trivial tasks.
Liang is going to present our work at this year’s International Supercomputing Conference. A full description of MCFX will be available in the proceedings of this conference.
By now, you know that the Sun Studio Express 3/09 has this early access feature that does OpenMP 3.0 profiling.
Do you know that it also does OpenMP to CUDA transformation? To try it, simply use “-qoption iropt -Apcg:cuda41” when you compile your OpenMP program. It will generate a C source program that can be compiled and run on NVIDIA CUDA platform. The input can be any C, C++ or Fortran OpenMP program. Isn’t it cool? Currently it runs only on Linux though.
Oleg and I wrote a paper describing the techniques used to do the OpenMP 3.0 performance profiling which was mentioned in my previous blog. The paper is going to be presented at this year’s IWOMP.
In the meantime, Sun has just released Sun Studio Express 3/09.
A new collector for OpenMP 3 profiling is available, but not enabled by default. It can be enabled by setting the environment variable SP_COLLECTOR_NEWOMP, and works only for code compiled with the option -qoption iropt -Apcg:mfcxt. If you enable the new collector, the following features are available:
* The Analyzer GUI includes OMP Parallel Regions tab and OpenMP Tasks tab, which are available when OpenMP data is collected, and have entries for the source lines of each construct and metrics for those entries.
* Support has been added in the experiment format for OpenMP 3.0 profiling.
* Two new commands for OpenMP profiles, OMP_preg and OMP_task have been added.
* User mode presentation of OpenMP profiles has been changed so that the parallel loop functions are no longer shown.
Coming from the general CPU background, it is important to understand the difference in the thread execution model between CPU threads and CUDA threads. The major difference is in how the threads are scheduled. From the software’s point of view, CPU threads (no matter they are hyperthreads or vertical threads) are executed independently. CUDA threads are scheduled in a groups of warps. The threads within a warp are executed in a somewhat lock-step way called single-instruction multiple-thread (SIMT).
From the Nvidia Compute PTX ISA 1.2 manual (p.9)
Individual threads composing a SIMT warp start together at the same program address … A warp executes one common instruction at a time, …. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same execution path. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjointed code paths.
Notice that when there is a branch, the execution of the two branch paths (if both will be executed) are serialized. Say we have 32 threads in a warp and 16 of them will take branch A and the rest will take branch B, and processor chooses to execute A before B. Then none of the 16 threads on the B branch will be executed until those on branch A complete. Because of this hardware imposed ordering, one cannot assume the two branches will be executed concurrently!
As a result, programs that try to implement consumer/producer style communication within a warp between the two branches using busy-waiting loop may hang. For example, if the consumer branch is executed first, the consumer threads will loop forever because the producer threads never get a chance to execute.
A short video of me talking about compiler and tools support of parallel programing in Sun Studio while I was at Sun:
My co-authored paper “The Design of OpenMP Tasks” is now featured in the March issue of IEEE Transactions on Parallel and Distributed Systems.
This article describes what happens when a multi-threaded process is shutting down, what are the differences between exit(), _exit() and pthread_exit(), when and what cleanup routines will be called. Shutting down a multi-threaded application gracefully and cleanly is a challenging task. Sometimes you do not even want to do that, but you still need to know what are happening during the shutdown. Although the description here applies to pthread on Solaris, it may shed some light on trouble shooting threading applications on other OSes. You are welcome to add your findings in the comment section.
pthread_exit() can be called by any thread (including the main thread) explicitly. It is also implicitly called when a thread returns from the thread start routine.
When the main thread calls pthread_exit(), only the main thread will terminate and other threads will survive.
When pthread_exit() is called (explicitly or implicitly), any cancellation cleanup handlers that have been pushed and not yet popped are popped in the reverse order that they were pushed and then executed. Then, if the thread has any thread specific data, the associated thread specific data destruct functions will be called in an unspecified order. The above applies to all threads except for the last alive thread in the process.
pthread_exit() does NOT call any routine installed by atexit(). This applies to all threads except for the last thread in the process.
When the last thread in the process calls pthread_exit() (all other threads in the process have been terminated by calling pthread_exit()), the thread specific data destruct functions for this thread will NOT be called, and exit() will be called.
When the main thread returns from main(), exit() (instead of pthread_exit()) is called implicitly.
When exit() (no matter who calls it) is called, all threads will be terminated. Neither thread cleanup handlers nor thread specific data destruct functions will be called.
Calling exit() (or better _exit(), see below) is a quick and dirty way to shutdown the process. When the program is broken or severely corrupted, it might not be a good idea (may core-dump or hang the process) to attempt to shutdown cleanly/gracefully.
If you want to terminate the process without invoking any routine installed by atexit(), call _exit() instead of exit().
When exit() is called, functions that are registered via atexit() are called in the reverse order of their registration.
When multiple threads call exit(), then the behavior is not well specified. For example, on Solaris 9, all the registered function will be called by only one thread (if they were registered before any exit() is called), but they may not finish because another thread that calls exit() may terminate the process. On Solaris 10, all the registered functions will be called by only one thread (if they were registered before any exit() is called), other threads that call exit() wait until all functions are called. If any called function contains an infinite loop, then the process may not terminate.
While exit() is being called, if more functions are registered via atexit() by the same thread (inside any already registered function) or by some other thread, the newly registered functions maybe called by different threads.
A reader asked why concurrency programming is not a super-set of parallel programming since the parallel entities are also concurrent. Well, it is just like black-white vs color photography. Though black and white are two colors, the techniques in taking good black-white pictures are different from those for color pictures. One need to think and see differently in terms of contrast, texture, lighting and even composition.
Now back to our programming world. Recently while I was working on the OpenMP profiling, I fixed a concurrency bug that was related to asynchronous signals and had nothing to do with parallelism. I used a data structure to store the OpenMP context of a thread. Since an OpenMP context can be described in a tuple <current parallel region, current task region, OpenMP state, user callstack>, the data structure has several 64-bit long fields. One challenge is to update the context data structure atomically, i.e. when my program needs to report the OpenMP context, it should report a consistent context. For example, it should not report a thread is in a new parallel region but is still in an old task region. The atomicity here has nothing to do with parallelism here - the context data is thread private, so there is no sharing between different threads and there is no data race. The atomicity issue happens when a profiling signal (SIGPROF) comes while the program is in the middle of updating the fields of the context data structure. At the signal handler, the program needs to report the context and need to report them consistently. In the end, I had to crafted a way to update all the fields atomically (asynchronously safe) without masking out the SIGPROF.
Here is another interesting discussion on concurrency vs parallelism. I checked the manual. The exact wording used is “The maximum number of active threads per multiprocessor is 768″.
Last Tuesday at the OpenMP BOF of SC08, Oleg Mazurov presented our work on extending the OpenMP profiling API for OpenMP 3.0 (pdf slides).
The current existing API was first published in 2006 and was last updated in 2007. Since then, two more developments now beg for another update - one is for supporting the new OpenMP tasking feature, and the other is for supporting vendor specific extensions.
The extension for tasking support is straight forward. A few events that corresponding to the creation, execution, and termination of tasks are added. Also added are a few requests to get the task ID and other properties.
Vendor specific extensions are implemented essentially by sending a establishing-extension request with a vendor unique ID from the collector tool to the OpenMP runtime library. The OpenMP runtime library accepts the request if it supports the vendor, otherwise rejects it. After a successful rendezvous, the request establishes a new name space for subsequent requests and events.
One pending issue is how to support multiple vendor agents in one session. Not that a solution cannot be engineered, we are waiting for a use case to emerge.
During the execution of an OpenMP program, any arbitrary program event can be associated with
Because the execution of an OpenMP task may be asynchronous, and the executing thread may be different from the encountering thread, getting the user callstack of an event happened within a task becomes tricky.
At our Sun booth in SC08, we demoed a prototype Performance Analyzer that can present user callstacks in a cool way when OpenMP tasks are involved.
Take a simple quick sort code for an example.
void quick_sort ( int lt, int rt, float *data ) {
int md = partition( lt, rt, data );
#pragma omp task
quick_sort( lt, md - 1, data );
#pragma omp task
quick_sort( md + 1, rt, data );
}
The following figure shows the time line display of one execution of the program. The same original data are sorted three times, once sequential, once using two threads, and once using four threads.

The spikes in callstacks in the sequential sort show the recursive nature of the quick sort. But when you look at the parallel sort, the callstacks are flat. That’s because each call to
While these pieces of information are useful in showing the execution details, they do not help answering the question which tasks are actually being executing. Where was the current executing task created? In the end, the user needs to debug the performance problem in his/her code (not in the OpenMP runtime). Representing information close to the user program logic is crucial.
The following figure shows the time line and user callstacks in the user view constructed by our prototype tool. Notice the callstacks in the parallel run are almost the same as in the sequential run. In the time line, it is just like the work in the sequential run is being distributed among the threads in the parallel run. Isn’t this what happens intuitively when you parallelize a code using OpenMP?

The Sun Studio Express 11/08 is out by now and can be downloaded for free.
Among many interesting and important features it provides, here are a few I would like to list
Today, we are making available, as a free download, Sun Studio Express 07/08 Release. One of the most exciting things about this release is the beta-level support for OpenMP 3.0 in our C/C++/Fortran Compilers.
I feel really excited about this. One of the major 3.0 features supported is tasking, which was finalized in the language specification after a looooong labor. It expends a whole new dimension of what OpenMP can do. It is like a new piece of LEGO. We are looking forward to seeing innovative (or not :)) ways of using this new feature.
This is a functional beta release. We are still working on fixing a few bugs and improving performance. One of the best ways to give us feedback is using our online forum.
Here are two short articles that may help users jump-start using the tasking feature.
Gulf of Execution is a term used to describe the the difference between the steps one actually needs to take to achieve a goal and the steps that one perceives.
After learning this term, the example that quickly jumps into my mind is setting up those wifi-enabled devices, like Wii, PSP, NDS, Wireless gateway, etc. In my experience, the one with the narrowest gap is iPhone. The worst one is, well, some operating system.
Michael G Schwern had a blog >entry about this on Perl.
Who wide is the Gulf in your favorite parallel programming language/model/scheme/library?
When visiting the IBM booth at SC07, I was a little surprised to learn that my non-concurrency analysis technology for OpenMP programs had also been adopted and implemented in the Parallel Tools Platform.
Beth Tibbitts from IBM has kindly sent me the reference details: STMCS’07 program, paper, and presentation.
The technology is used by Sun Studio Compilers to do static error check for OpenMP programs.
Adam Kolawa (Parasoft) said in his recent article on DDJ,
“Many people … want tools to find these bugs automatically. After 20 years of examining how and why errors occur, I believe this is the wrong response. Only a small class of errors can be found automatically; most bugs are related to functionality and requirements, and cannot be identified with just the click of a button.”
and
“Our current mission is to address this problem by inventing technologies and strategies to support the brain as it performs this evaluation. We are building automated infrastructures that provide maximum automation for mundane tasks (compiling code, building/running regression test suites, checking adherence to policies, supporting code reviews, and so on) in such a way that each day the brain is presented with the minimal information needed to determine if yesterday’s code modifications negatively impacted the application.”
There is probably no magic button one can push to turn a piece of legacy code that is not thread-safe into a thread-safe code. A tool should offload the mundane tasks from human brain which can be set free to finish the magic touch.
The C10K problem refers to the problem of serving ten thousand clients simultaneously on a web server. This article written by Daniel Kegel contains some history and background information (and lots links) about threading on Linux, Solaris, BSD, MAC OS X, etc.
There is no synchronization between the threads in a team when they enter a worksharing construct. Many people assume there is a barrier before the threads enter a worksharing construct, especially when there is a FIRSTPRIVATE used in the worksharing construct. This is a common mistake.
For example, in the following code, assume two threads - thread 1 and thread 2 are in the team, and Read1 is executed by thread 1 and Read2 is executed by thread 2.
#pragma omp parallel
{
if (omp_get_thread_num()==0)
z = 1;
else
z = 2;
#pragma omp sections firstprivate(z)
{
#pragma omp section
{
... = z; // Read1
}
#pragma omp section
{
... = z; // Read2
}
}
}
What are the values of
If there were a synchronization before the worksharing construct, then the above (Read1:1, Read2:2) is not possible.
Now, look at the following example which has both FIRSTPRIVATE and LASTPRIVATE,
#pragma omp parallel
{
z = 1;
#pragma omp for firstprivate(z) lastprivate(z) nowait
for (i=0; i<n; i++) {
... = z; // Read1
z = 2; // Write1
}
}
What could be the value of
If a list item appears in both firstprivate and lastprivate clauses, the update
required for lastprivate occurs after all initializations for firstprivate.
So, the value of

In this blog entry, I will describe my experiment of the test cases with Helgrind. Helgrind is a data race detection module of Valgrind, which is pretty successful framework and tool suite for debugging and profiling Linux programs.
Unlike other runtime checking tools I will describe later (e.g. Intel’s Thread Checker and Sun’s DRDT), Valgrind is simulation based. One advantage of simulation based approach is the two active entities - the target application and the detection module are in different processes. They have different address spaces and name spaces. Therefore this approach can avoid many conflicts between the two entities. For example, the detection module can call any library routines that it monitors without worrying about re-entry problems. [Update: Valgrind actually runs in the same namespace as the target application. And the target application and the detection module are part of the same process. The detection module (core and tools) are designed carefully to avoid dependence on glibc.so.] One challenge of simulation based approach is dealing with system calls. The simulation based approach simulates only the execution of the user process, and it is NOT simulating the OS. A even more bigger challenge is to deal with threading calls. Valgrind is not multi-threaded itself, and all threading executions are serialized. I have not got a chance to study how it works. It must be very interesting. Valgrind’s manual claims it works with NPTL or LinuxThreads “well enough for significant threaded applications”.
Helgrind is based on the famous Eraser method enhanced with detection of thread creation and thread join. The method is very similar to that used in Compaq/HP’s Visual Threads (as described in Harrow’s paper). Lockset based methods (such as Eraser) tend to have a lot of false positives.
Currently Valgrind is at release 3.2.0. But the latest version that Helgrind works is 2.2.0. When I ran Helgrind in 3.2.0, I got
Helgrind is currently not working, because:
(a) it is not yet ready to handle the Vex IR and the use with 64-bit
platforms introduced in Valgrind 3.0.0
(b) we need to get thread operation tracking working again after
the changes added in Valgrind 2.4.0
If you want to use Helgrind, you'll have to use Valgrind 2.2.0, which is
the most recent Valgrind release that contains a working Helgrind.
Sorry for the inconvenience. Let us know if this is a problem for you.
Then I swithced to 2.2.0. First I tried with pthr_prime.c.
$ cc -g pthr_prime.c -lm -lpthread -o pthr_prime $ valgrind --tool=helgrind ./pthr_prime ==32368== Helgrind, a data race detector for x86-linux. ==32368== Copyright (C) 2002-2004, and GNU GPL'd, by Nicholas Nethercote et al. ==32368== Using valgrind-2.2.0, a program supervision framework for x86-linux. ==32368== Copyright (C) 2000-2004, and GNU GPL'd, by Julian Seward et al. ==32368== For more details, rerun with: -v ==32368== ==32368== Thread 2: ==32368== Possible data race writing variable at 0x80498B0 (total) ==32368== at 0x80486BD: work (pthr_prime.c:51) ==32368== by 0x1D4AFE79: thread_wrapper (vg_libpthread.c:867) ==32368== by 0xB0010EF3: (within /home/yl140942/vg2/lib/valgrind/stage2) ==32368== Address 0x80498B0 is in data section of /home/yl140942/tmp/vg/a.out ==32368== Previous state: shared RO, no locks ==32368== ==32368== Possible data race writing variable at 0x57EFE95C ==32368== at 0x804877D: main (pthr_prime.c:75) ==32368== Address 0x57EFE95C == &(i) at pthr_prime.c:75 ==32368== Previous state: shared RO, no locks ==32368== ==32368== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 2 from 2) ==32368== 4 possible data races found; 0 lock order problems
Helgrind finds the race access of total at line 51 and race access of i at line 75. Note that a data race is caused by a pair of accesses. Helgrind reports only one access of a pair. For the first one, the report is ok because line 51 reads and updates total, therefore it is fairly easy to guess what are the racing access pairs. For the second one (i at line 75), I would imagine it would take a fair large of amount of time for one to figure out the other race access of the pair is in line 46. Helgrind also misses several data races (e.g. write-write race at line 50, write-read race between line 50 and 76) due to the heuristic it uses.
Next, I tried with pthr_prime_fixed.c.
$ cc -g pthr_prime_fixed.c -lm -lpthread -o pthr_prime_fixed $ valgrind --tool=helgrind ./pthr_prime_fixed ==21596== Helgrind, a data race detector for x86-linux. ==21596== Copyright (C) 2002-2004, and GNU GPL'd, by Nicholas Nethercote et al. ==21596== Using valgrind-2.2.0, a program supervision framework for x86-linux. ==21596== Copyright (C) 2000-2004, and GNU GPL'd, by Julian Seward et al. ==21596== For more details, rerun with: -v ==21596== ==21596== Thread 2: ==21596== Possible data race writing variable at 0x804CA10 (pflag+16) ==21596== at 0x80486DC: is_prime (pthr_prime_fixed.c:34) ==21596== by 0x8048756: work (pthr_prime_fixed.c:50) ==21596== by 0x1D4EAE79: thread_wrapper (vg_libpthread.c:867) ==21596== by 0xB0010EF3: (within /home/yl140942/vg2/lib/valgrind/stage2) ==21596== Address 0x804CA10 is in BSS section of /home/yl140942/tmp/vg/pthr_prime_fixed ==21596== Previous state: exclusively owned by thread 1 ==21596== ==21596== Thread 2: ==21596== Possible data race writing variable at 0x804CA18 (pflag+24) ==21596== at 0x80486DC: is_prime (pthr_prime_fixed.c:34) ==21596== by 0x8048756: work (pthr_prime_fixed.c:50) ==21596== by 0x1D4EAE79: thread_wrapper (vg_libpthread.c:867) ==21596== by 0xB0010EF3: (within /home/yl140942/vg2/lib/valgrind/stage2) ==21596== Address 0x804CA18 is in BSS section of /home/yl140942/tmp/vg/pthr_prime_fixed ==21596== Previous state: exclusively owned by thread 1 <similar messages repeated for various pflag+offset> ==21596== Thread 2: ==21596== Possible data race writing variable at 0x804CAC0 (pflag+192) ==21596== at 0x80486DC: is_prime (pthr_prime_fixed.c:34) ==21596== by 0x8048756: work (pthr_prime_fixed.c:50) ==21596== by 0x1D4EAE79: thread_wrapper (vg_libpthread.c:867) ==21596== by 0xB0010EF3: (within /home/yl140942/vg2/lib/valgrind/stage2) ==21596== Address 0x804CAC0 is in BSS section of /home/yl140942/tmp/vg/pthr_prime_fixed ==21596== Previous state: exclusively owned by thread 1 ==21596== ==21596== ==21596== Possible data race reading variable at 0x80499B0 (total) ==21596== at 0x8048873: main (pthr_prime_fixed.c:80) ==21596== Address 0x80499B0 is in data section of /home/yl140942/tmp/vg/pthr_prime_fixed ==21596== Previous state: shared RW, locked by:0x80499B4(mutex) ==21596== ==21596== ERROR SUMMARY: 33 errors from 33 contexts (suppressed: 2 from 2) ==21596== 35 possible data races found; 0 lock order problems
This time Helgrind reports 32 races accesses of pflag[] at line 34. As explained in DRDT tutorial, these are benign data races. Helgrind also reports a false positive race that has an access of total at line 80.
Helgrind does a good job of reporting the name of the variable involved in the data races (e.g. total, pflag[] and i) and the lock variables (e.g. mutex). The Previous state gives a hint why Helgrind thinks an access might cause data race. For example, in the above experiment with pthr_prime_fixed.c, for the access of total at line 80, it says “Previous state: shared RW, locked by:0×80499B4(mutex)“. The accesses of total at lines 52-53 are protected by mutex locks. When Helgrind finds the read access of total is not protected by the same lock (or any lock in this case), it reports a possbile data race. The detection of the thread_join sometime did not work to get rid of the false positive though.
This blog entry begins to describe a couple of currently available tools that detect data races in multi-threaded C/C++/Fortran programs. These tools and the categories they can be roughly put into are
What not covered here are the tools from some research work. Some of them use combined static and runtime methods, and some use post-mortem based approaches.
I will reuse the following four code examples from the Tutorial of Using Sun Data Race Detection Tool. If you have downloaded and installed the Sun Studio Express June 2006, you should be able to find the example codes under
{installed-directory}/opt/SUNWspro/examples/rdt/prime.
All four codes find the prime numbers between 2 and 3000 using 4 threads. An OpenMP version and a Pthread version are provided,
Read the Tutorial to find out what the data races are and how the bugs are fixed.
omp_prime.c
...
12 #include <stdio.h>
13 #include <math.h>
14 #include <omp.h>
15
16 #define THREADS 4
17 #define N 3000
18
19 int primes[N];
20 int pflag[N];
21
22 int is_prime(int v)
23 {
24 int i;
25 int bound = floor(sqrt(v)) + 1;
26
27 for (i = 2; i < bound; i++) {
28 /* no need to check against known composites */
29 if (!pflag[i])
30 continue;
31 if (v % i == 0) {
32 pflag[v] = 0;
33 return 0;
34 }
35 }
36 return (v > 1);
37 }
38
39 int main(int argn, char **argv)
40 {
41 int i;
42 int total = 0;
43
44 #ifdef _OPENMP
45 omp_set_num_threads(THREADS);
46 omp_set_dynamic(0);
47 #endif
48
49 for (i = 0; i < N; i++) {
50 pflag[i] = 1;
51 }
52
53 #pragma omp parallel for
54 for (i = 2; i < N; i++) {
55 if ( is_prime(i) ) {
56 primes[total] = i;
57 total++;
58 }
59 }
60 printf("Number of prime numbers between 2 and %d: %d\n",
61 N, total);
62 for (i = 0; i < total; i++) {
63 printf("%d\n", primes[i]);
64 }
65 }
pthr_prime.c
...
12 #include <stdio.h>
13 #include <math.h>
14 #include <pthread.h>
15
16 #define THREADS 4
17 #define N 3000
18
19 int primes[N];
20 int pflag[N];
21 int total = 0;
22
23 int is_prime(int v)
24 {
25 int i;
26 int bound = floor(sqrt(v)) + 1;
27
28 for (i = 2; i < bound; i++) {
29 /* no need to check against known composites */
30 if (!pflag[i])
31 continue;
32 if (v % i == 0) {
33 pflag[v] = 0;
34 return 0;
35 }
36 }
37 return (v > 1);
38 }
39
40 void *work(void *arg)
41 {
42 int start;
43 int end;
44 int i;
45
46 start = (N/THREADS) * (*(int *)arg) ;
47 end = start + N/THREADS;
48 for (i = start; i < end; i++) {
49 if ( is_prime(i) ) {
50 primes[total] = i;
51 total++;
52 }
53 }
54 return NULL;
55 }
56
57 int main(int argn, char **argv)
58 {
59 int i;
60 pthread_t tids[THREADS-1];
61
62 for (i = 0; i < N; i++) {
63 pflag[i] = 1;
64 }
65
66 for (i = 0; i < THREADS-1; i++) {
67 pthread_create(&tids[i], NULL, work, (void *)&i);
68 }
69
70 i = THREADS-1;
71 work((void *)&i);
72
73 printf("Number of prime numbers between 2 and %d: %d\n",
74 N, total);
75 for (i = 0; i < total; i++) {
76 printf("%d\n", primes[i]);
77 }
78 }
omp_prime_fixed.c
...
12 #include <ststdio.h>
13 #include <math.h>
14 #include <pthread.h>
15
16 #define THREADS 4
17 #define N 3000
18
19 int primes[N];
20 int pflag[N];
21 int total = 0;
22 pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
23
24 int is_prime(int v)
25 {
26 int i;
27 int bound = floor(sqrt(v)) + 1;
28
29 for (i = 2; i < bound; i++) {
30 /* no need to check against known composites */
31 if (!pflag[i])
32 continue;
33 if (v % i == 0) {
34 pflag[v] = 0;
35 return 0;
36 }
37 }
38 return (v > 1);
39 }
40
41 void *work(void *arg)
42 {
43 int start;
44 int end;
45 int i;
46
47 start = (N/THREADS) * ((int)arg) ;
48 end = start + N/THREADS;
49 for (i = start; i < end; i++) {
50 if ( is_prime(i) ) {
51 pthread_mutex_lock(&mutex);
52 primes[total] = i;
53 total++;
54 pthread_mutex_unlock(&mutex);
55 }
56 }
57 return NULL;
58 }
59
60 int main(int argn, char **argv)
61 {
62 int i;
63 pthread_t tids[THREADS-1];
64
65 for (i = 0; i < N; i++) {
66 pflag[i] = 1;
67 }
68
69 for (i = 0; i < THREADS-1; i++) {
70 pthread_create(&tids[i], NULL, work, (void *)i);
71 }
72
73 i = THREADS-1;
74 work((void *)i);
75
76 for (i = 0; i < THREADS-1; i++) {
77 pthread_join(tids[i], NULL);
78 }
79
80 printf("Number of prime numbers between 2 and %d: %d\n",
81 N, total);
82 for (i = 0; i < total; i++) {
83 printf("%d\n", primes[i]);
84 }
85 }
pthr_prime_fixed.c
...
12 #include <stdio.h>
13 #include <math.h>
14 #include <pthread.h>
15
16 #define THREADS 4
17 #define N 3000
18
19 int primes[N];
20 int pflag[N];
21 int total = 0;
22 pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
23
24 int is_prime(int v)
25 {
26 int i;
27 int bound = floor(sqrt(v)) + 1;
28
29 for (i = 2; i < bound; i++) {
30 /* no need to check against known composites */
31 if (!pflag[i])
32 continue;
33 if (v % i == 0) {
34 pflag[v] = 0;
35 return 0;
36 }
37 }
38 return (v > 1);
39 }
40
41 void *work(void *arg)
42 {
43 int start;
44 int end;
45 int i;
46
47 start = (N/THREADS) * ((int)arg) ;
48 end = start + N/THREADS;
49 for (i = start; i < end; i++) {
50 if ( is_prime(i) ) {
51 pthread_mutex_lock(&mutex);
52 primes[total] = i;
53 total++;
54 pthread_mutex_unlock(&mutex);
55 }
56 }
57 return NULL;
58 }
59
60 int main(int argn, char **argv)
61 {
62 int i;
63 pthread_t tids[THREADS-1];
64
65 for (i = 0; i < N; i++) {
66 pflag[i] = 1;
67 }
68
69 for (i = 0; i < THREADS-1; i++) {
70 pthread_create(&tids[i], NULL, work, (void *)i);
71 }
72
73 i = THREADS-1;
74 work((void *)i);
75
76 for (i = 0; i < THREADS-1; i++) {
77 pthread_join(tids[i], NULL);
78 }
79
80 printf("Number of prime numbers between 2 and %d: %d\n",
81 N, total);
82 for (i = 0; i < total; i++) {
83 printf("%d\n", primes[i]);
84 }
85 }
Static checking tools find data races in a program without actually executing the program.
The static checking approach has three advantages, as compared with runtime based approachs.
Because of the above advantages, static checking can be used in situations where it is very difficult or impossible to get a runtime experiment or where it is very difficult or impossible to get a precise runtime experiment without altering the runtime result, such as OS kernels and device drivers.
The biggest disadvantage of static checking is the large amount of false positives it may generate. Static checking is always puzzled by imprecise information due pointer aliasing and vague execution paths.

Sun Studio provides a utility called LockLint, which analyzes the use of mutex and reader/writer locks, and reports data races and deadlocks due to inconsistent use of locking techniques.
LockLint reports a data race when accesses to a variable are not consistently protected by at least one lock, or accesses violate assertions about which locks protect the variable.
LockLint originates from WARLOCK, which was designed to detect data races and deadlocks in Solaris kernels and device drivers. Search for warlock in opensolaris.org, and you can still find the use of it there.
The following shows the result of using LockLint on pthr_prime.c. Notice the false positive at line 63, and false negative with respect to variable i.
$ cc -mt -Zll pthr_prime.c
$ lock_lint start
$ lock_lint load pthr_prime.ll
$ lock_lint analyze -v
* Warning: A main function was loaded with no annotations to indicate the
presence or absence of concurrency. Lock_lint will assume concurrency.
Please annotate source with:
NOTE(COMPETING_THREADS_NOW) or NOTE(NO_COMPETING_THREADS_NOW)
* Writable variable read while no locks held!
variable = :pflag
where = :is_prime [pthr_prime.c,30]
* Variable written while no locks held!
variable = :pflag
where = :is_prime [pthr_prime.c,33]
* Variable written while no locks held!
variable = :pflag
where = :main [pthr_prime.c,63]
* Writable variable read while no locks held!
variable = :total
where = :main [pthr_prime.c,74]
* Writable variable read while no locks held!
variable = :total
where = :main [pthr_prime.c,75]
* Writable variable read while no locks held!
variable = :primes
where = :main [pthr_prime.c,76]
* Writable variable read while no locks held!
variable = :total
where = :main [pthr_prime.c,77]
* Writable variable read while no locks held!
variable = :total
where = :work [pthr_prime.c,50]
* Variable written while no locks held!
variable = :primes
where = :work [pthr_prime.c,50]
* Variable written while no locks held!
variable = :total
where = :work [pthr_prime.c,51]
The following shows the result of using LockLint on pthr_prime_fixed.c. Notice that the data races in routine work() are now gone, but the false positives and the false negatives in the previous experiment with pthr_prime.c are still there.
$ cc -mt -Zll pthr_prime_fixed.c
$ lock_lint start
$ lock_lint load pthr_prime_fixed.ll
$ lock_lint analyze -v
* Warning: A main function was loaded with no annotations to indicate the
presence or absence of concurrency. Lock_lint will assume concurrency.
Please annotate source with:
NOTE(COMPETING_THREADS_NOW) or NOTE(NO_COMPETING_THREADS_NOW)
* Writable variable read while no locks held!
variable = :pflag
where = :is_prime [pthr_prime_fixed.c,31]
* Variable written while no locks held!
variable = :pflag
where = :is_prime [pthr_prime_fixed.c,34]
* Variable written while no locks held!
variable = :pflag
where = :main [pthr_prime_fixed.c,66]
* Writable variable read while no locks held!
variable = :total
where = :main [pthr_prime_fixed.c,81]
* Writable variable read while no locks held!
variable = :total
where = :main [pthr_prime_fixed.c,82]
* Writable variable read while no locks held!
variable = :primes
where = :main [pthr_prime_fixed.c,83]
* Writable variable read while no locks held!
variable = :total
where = :main [pthr_prime_fixed.c,84]
LockLint provides a rich set of source code notations and interactive subcommands that can be used to provide more precise information to LockLint so to improve the analysis.

Strickly, this is not a tool. It is a compile-time check option provided in Sun Studio Fortran and C compilers. The following is from the man page of the cc command.
-xvpara
Show parallelization warning messages
Issues warnings about potential parallel programming
related problems that may cause incorrect results when
using OpenMP or Sun/Cray parallel directives and prag-
mas.
Use with -xopenmp and OpenMP API directives, or with
-explictpar and MP parallelization directives.
Warnings are issued when the compiler detects the fol-
lowing situations:
o Loops that are parallelized using MP directives when
there are data dependencies between different loop
iterations
o Problematic use of OpenMP data sharing attributes
clauses, such as declaring a variable "shared" whose
accesses in an OpenMP parallel region may cause data
race, or declaring a variable "private" whose value in
a parallel region is used after the parallel region.
In short, when -xvpara is used as an option to compile an OpenMP program, the compiler is able to report problems in the source code caused by incorrect use of data sharing attribute clause. One typical problem is data race introduced by incorrectly declaring a variable “shared”.
When using vpara checking on the omp_prime.c, the compiler finds the data race between the write accesses to variable total at line 57 by different threads, as illustrated below. The checking analyzes the code enclosed lexically inside an OpenMP parallel region only, therefore it does not find data races in routine is_prime(). The checking also misses the data race on array primes[] due to a technique to reduce false positives. Unfortunately, the technique introduces a false negative here.
$ cc -xopenmp -xO3 -xvpara omp_prime.c -lm
"omp_prime.c", line 53: Warning: inappropriate scoping
variable 'total' may be scoped inappropriately as 'shared'
. write at line 57 and write at line 57 may cause data race
$ cc -xopenmp -xO3 -xvpara omp_prime_fixed.c -lm
$
The vpara compile-time checking is based on the static non-concurrency analysis techniques for OpenMP programs, which is also used by the OpenMP autoscoping feature provided in Sun Studio compilers.
This is the first of a series of blogs on understanding data races I am going to post.
With the release of Sun Studio Express (June 2006 Build), we are offering a run-time data race detection tool (DRDT) for developers on Sun’s platforms for FREE. It compliments other data race detection tools Sun already offers now.
If you have been bugged by data race problems in the past, you should give it a try. Go here (scroll to ‘How to get started’) to download it. And here is the page dedicated to the DRDT project.
I would like to start the series with understanding the role data race detection tools first.
Many mt programs have race conditions, the existence of which makes debugging mt programs very hard. One class of race conditions is data race condition or data race. (The difference between general race condtion and data race condition will be explained in another blog.)
Data race is a condition that happens in a program. People often think a data race is always a bug. This is not true. A data race could be the root cause of a bug; it could be caused by a bug; or it could be there because the programmer wants it there.
If a data race is the root cause of a bug, we want to find it. If a data race is caused by a bug, showing where the data race is can help the programmer locate the real bug. If a data race is there by design, we want to make sure it is there and we also want to make sure there is no unexpected data race.
The role of a data race detection tool is to check whether a program contains data races and pin-point the locations of them if there is any.
There are many ways of using a data race detection tool. Some use it as debugging tool: run it when there is a bug in the program. Someone use it as a sanity checking tool: run it as part of regression tests. And some use it as a programming assistance tool in parallelizing sequential programs: find thread unsafe routines and global variables that should be private to threads.
Sun’s OpenMP implementation supports true nested parallel regions - when nested parallelism is enabled, the inner parallel region can be executed by multiple threads concurrently.
We provide an environment variable called SUNW_MP_MAX_POOL_THREADS for users to control the total number of OpenMP slave threads in a process.
For example, if you have want a maximum of 16 threads to be used for a nest of parallel regions in your program, you can set SUNW_MP_MAX_POOL_THREADS to 15. That’s 15 slave threads (some of them may become masters in inner parallel regions) plus one user thread which is the master thread for the out-most parallel region.
Why did we design an environment variable like SUNW_MP_MAX_NUM_THREADS so that a user can set it to 16 in the above example? Intel’s implementation has KMP_ALL_THREADS and KMP_MAX_THREADS which do that.
Well, we were trying to have a scheme that works on more general cases, not just pure OpenMP codes. In particular, we think our scheme works better than others for mixed pthread and OpenMP thread code. The pool defines a set of threads that can be used as OpenMP slave threads. If the program has two pthreads and both will create a team, then both will try to grab slave threads from the same pool. The env var SUNW_MP_MAX_POOL_THREADS was NOT designed for users to control the total number of threads in a process. We cannot control that because of the use of pthreads. The env var is designed for users to control the total number of OpenMP slave threads.
The env var SUNW_MP_MAX_NUM_THREADS is documented here. We also have a short article “How Many Threads Does It Take?” if you want to understand it better.
More precisely, this mistake should be classified as a common mis-understanding of OpenMP.
When a worksharing construct, such omp for or omp sections, is encountered outside any explicit parallel region, the arising worksharing region is called orphaned worksharing region. A common mis-understanding is that in this case the worksharing construct is simply being ignored and the region is executed sequentially.
Orphaned worksharing constructs are not ignored. All the data sharing attribute clauses are honored. The worksharing regin is executed as if a team of only one thread is executing the region.
For example, in the following C++ code,
main()
{
class_type_1 a;
#pragma omp for private(a) schedule(dynamic)
for (i=1; i<100; i++) {
printf("%dn", i);
}
}
the default constructor for class_type_1 will be called, and a comforming implementation is not forced to execute the loop in the order of 1, 2, 3, …, 99.
In the danger of hairsplitting, …
Concurrency and parallelism are NOT the same thing. Two tasks T1 and T2 are concurrent if the order in which the two tasks are executed in time is not predetermined,
If two concurrent threads are scheduled by the OS to run on one single-core non-SMT non-CMP processor, you may get concurrency but not parallelism. Parallelism is possible on multi-core, multi-processor or distributed systems.
Concurrency is often referred to as a property of a program, and is a concept more general than parallelism.
Interestingly, we cannot say the same thing for concurrent programming and parallel programming. They are overlapped, but neither is the superset of the other. The difference comes from the sets of topics the two areas cover. For example, concurrent programming includes topic like signal handling, while parallel programming includes topic like memory consistency model. The difference reflects the different orignal hardware and software background of the two programming practices.
The coming International Workshop on OpenMP (IWOMP 2006) has a paper titled “Common Mistakes in OpenMP and How to Avoid Them” written by Michael Süß and Claudia Leopold (University of Kassel, Germany).
The result is based on a survey of two undergraduate courses. The authors of the paper kindly allow me to list the 15 common mistakes presented in their paper here,
For detail, please read the full paper.
The June 2006 issue (Vol 4, No 5) of ACM Queue features an aritcle by Michi Henning of ZeroC on the rise and fall of CORBA.
Technical issues and procedural issues contribute to the fall of CORBA. And the procedural problems are the root cause of the procedural problems. Many of the issues the article points out are alarming familiar!
The following is a list of lessons learnt in how to have a better standards process,
Read the whole article.
The following code finds good members in array
member[] and stores the indices of the good members in array
good_members[].
#define N 1000
struct data member[N];
int good_members[N];
int pos = 0;
void find_good_members()
{
for (i=0; i < N; i++) {
if (is_good(member[i])) {
good_members[pos] = i;
pos ++;
}
}
}
The following is a navie way of parallelizing the above code,
#define N 1000
struct data member[N];
int good_members[N];
int pos = 0;
void find_good_members()
{
#pragma omp parallel for
for (i=0; i < N; i++) {
if (is_good(member[i])) {
good_members[pos] = i; // line a
#pragma omp atomic
pos ++; // line b
}
}
}
In order to avoid data races between different updates of
global variable pos, the code puts the increment (at line b) in a
atomic construct. However, the code does not work, because there is a
data race between the read of pos at line a and write of pos at line b.
Changing the body of the if statement to the following gives the correct result.
int mypos;
#pragma omp critical
{
mypos = pos;
pos ++;
}
good_members[mypos] = i;
In OpenMP 2.5 (the latest Specification), inside a parallel region, the only place where you can safely get the
value of a variable that is updated in an atomic region is another
atomic region.
In C/C++, OpenMP directives are specified by using the #pragma mechanism; and in Fortran, they are specified by using special comments that are identified by unique sentinels.
This design allows users to write OpenMP programs that can be compiled with compilers that do not support OpenMP or compiled with OpenMP compiles with OpenMP support disabled.
However, if you do not follow the directive format, you might get a
program that compiles and runs but gives unexpected results, because
the compiler does not recognize your OpenMP directives and thinks they are non-OpenMP related pragmas (C/C++) or regular comments (Fortran).
Quiz:
How many “me”s does the following code print? Assume a team of 4 threads are executing the parallel region.
foo()
{
#pragma omp parallel
{
#pragma single
{
printf("men");
}
}
}
I will post a list of common mistakes found in parallel programs written using OpenMP.
Although it is always true that users of a language need to spend effort to
understand the language so to avoid mistakes, I wonder what it means to
the language designers if many many users keep making the same set of
mistakes again and again.
The following articles from the ACM Queue Microprocessors issue (vol. 3, no. 7 - September 2005) are must reads.

Multicore CPUs for the Masses
Mache Creeger, Emergent Technology Associates
Software and the Concurrency Revolution
Herb Sutter and James Larus, Microsoft
The Price of Performance
Luiz André Barroso, Google
Extreme Software Scaling
Richard McDougall, Sun Microsystems
The Future of Microprocessors
Kunle Olukotun and Lance Hammond, Stanford University