CUDA Thread, Warp and SIMT
Coming from a general CPU background, it is important to understand how the thread execution model differs between CPU threads and CUDA threads. The major difference is in how the threads are scheduled. From the software’s point of view, CPU threads (whether they are hyperthreads or vertical threads) execute independently. CUDA threads are scheduled in groups called warps. The threads within a warp execute in a roughly lock-step fashion called single-instruction, multiple-thread (SIMT).
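To make the grouping concrete, here is a minimal sketch of how a thread in a one-dimensional block can work out which warp it belongs to and where it sits inside that warp. The names warp_layout and run_warp_layout are made up for illustration, and the host-side check assumes a warp size of 32.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread of a 1-D block records its warp and lane.
__global__ void warp_layout(int *warp_id, int *lane_id)
{
    int t = threadIdx.x;
    warp_id[t] = t / warpSize;   // which warp of the block this thread is in
    lane_id[t] = t % warpSize;   // position of the thread inside that warp
}

// Launches one block of 64 threads (two warps, assuming warpSize == 32) and
// returns the number of mismatching entries: 0 on success, -1 on a CUDA error.
int run_warp_layout()
{
    const int n = 64;
    int h_warp[n], h_lane[n];
    int *d_warp = nullptr, *d_lane = nullptr;
    if (cudaMalloc(&d_warp, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_lane, n * sizeof(int)) != cudaSuccess) {
        cudaFree(d_warp);
        return -1;
    }

    warp_layout<<<1, n>>>(d_warp, d_lane);
    cudaError_t launch = cudaGetLastError();

    cudaError_t e1 = cudaMemcpy(h_warp, d_warp, sizeof(h_warp), cudaMemcpyDeviceToHost);
    cudaError_t e2 = cudaMemcpy(h_lane, d_lane, sizeof(h_lane), cudaMemcpyDeviceToHost);
    cudaFree(d_warp);
    cudaFree(d_lane);
    if (launch != cudaSuccess || e1 != cudaSuccess || e2 != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (h_warp[i] != i / 32 || h_lane[i] != i % 32) ++bad;
    return bad;
}
```

Calling run_warp_layout() from main and checking for a return value of 0 is enough to try it out. Threads 0 to 31 land in warp 0 and threads 32 to 63 in warp 1.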
From the Nvidia Compute PTX ISA 1.2 manual (p. 9):
Individual threads composing a SIMT warp start together at the same program address … A warp executes one common instruction at a time, …. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same execution path. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjointed code paths.
Notice that when there is a branch, the execution of the two branch paths (if both are taken) is serialized. Say we have 32 threads in a warp, 16 of which take branch A while the rest take branch B, and the processor chooses to execute A before B. Then none of the 16 threads on branch B will execute until those on branch A complete. Because of this hardware-imposed ordering, one cannot assume that the two branches will execute concurrently!
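The 16/16 split described above can be sketched as a small kernel. This is only an illustration: the kernel name divergent_halves, the helper run_divergent_halves and the arithmetic in each branch are made up, and the code shows only that both paths produce correct results, not the order in which the hardware runs them.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: lanes 0-15 of each warp take branch A and lanes 16-31
// take branch B, so a 32-thread warp has to run the two paths one after
// the other.
__global__ void divergent_halves(int *out)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % warpSize;

    if (lane < 16) {
        out[tid] = 2 * tid;      // branch A
    } else {
        out[tid] = tid + 100;    // branch B
    }
}

// Runs a single warp and returns the number of wrong elements:
// 0 on success, -1 if a CUDA call fails.
int run_divergent_halves()
{
    const int n = 32;
    int host[n];
    int *dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(int)) != cudaSuccess) return -1;

    divergent_halves<<<1, n>>>(dev);
    cudaError_t launch = cudaGetLastError();

    cudaError_t copy = cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (launch != cudaSuccess || copy != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < n; ++i) {
        int expect = (i < 16) ? 2 * i : i + 100;
        if (host[i] != expect) ++bad;
    }
    return bad;
}
```

The result is the same whichever branch runs first. The cost of divergence is that the warp spends time on both paths, and neither path may depend on the other making progress at the same time.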
As a result, programs that try to implement producer/consumer-style communication within a warp, between the two branches, using a busy-waiting loop may hang. For example, if the consumer branch is executed first, the consumer threads will loop forever because the producer threads never get a chance to execute.
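The hazard might look like the first kernel below, where lane 0 is the producer and the other lanes of the same warp spin on a flag. All names here are hypothetical, and whether the kernel actually hangs depends on the hardware and on which path it runs first, so the helper never launches it. The second kernel is one possible restructuring, assuming a single warp per block and a toolkit that provides __syncwarp() (CUDA 9 or later): the producer's write happens before a warp-level barrier instead of inside a branch that the consumers wait on.

```cuda
#include <cuda_runtime.h>

// Hazardous pattern, shown for illustration only. If the consumer path is run
// first, the producer lane may never get to execute and the warp spins forever.
__global__ void spin_within_warp(volatile int *flag, int *out)
{
    int lane = threadIdx.x % warpSize;
    if (lane != 0) {
        while (*flag == 0) { }       // consumer branch: busy-wait on the flag
        out[threadIdx.x] = *flag;
    } else {
        *flag = 42;                  // producer branch
    }
}

// Restructured sketch: no lane waits on another lane from inside a divergent
// branch. Assumes the block consists of exactly one full warp.
__global__ void handoff_within_warp(int *out)
{
    __shared__ int value;
    int lane = threadIdx.x % warpSize;
    if (lane == 0) {
        value = 42;                  // producer writes first
    }
    __syncwarp();                    // lanes reconverge; the write is visible after this
    out[threadIdx.x] = value;        // every lane consumes the value
}

// Launches only the restructured kernel and returns the number of wrong
// elements: 0 on success, -1 if a CUDA call fails.
int run_handoff()
{
    const int n = 32;
    int host[n];
    int *dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(int)) != cudaSuccess) return -1;

    handoff_within_warp<<<1, n>>>(dev);
    cudaError_t launch = cudaGetLastError();

    cudaError_t copy = cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (launch != cudaSuccess || copy != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (host[i] != 42) ++bad;
    return bad;
}
```

The restructured version works because the producer's work finishes before any consumer needs it, rather than both sides running at once.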


Thank you for such a great clarification of branches.
Now I know that if the condition of an «if» statement evaluates to different values in different threads within a warp, the execution time will be as if both branches were executed.
Thanks for such a great explanation of the concept of warps …
Though I have one question: if I launch a warp of 32 threads, and 30 of them follow branch A while only 2 follow branch B, is there some definite ordering, such as the 30 being executed first and then the 2, or vice versa, or is it totally random?
Thanks in advance for any help provided.