CUDA Thread, Warp and SIMT
Coming from a general CPU background, it is important to understand how the thread execution model differs between CPU threads and CUDA threads. The major difference is in how the threads are scheduled. From the software’s point of view, CPU threads (whether they are hyperthreads or vertical threads) are executed independently. CUDA threads are scheduled in groups called warps. The threads within a warp execute in a roughly lock-step fashion called single-instruction, multiple-thread (SIMT).
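To make the grouping concrete, here is a minimal sketch of how a thread can work out which warp it belongs to and its position (lane) inside that warp. It assumes a one-dimensional block; the kernel and helper names are made up for illustration, and only `warpSize`, `threadIdx` and the runtime calls come from CUDA itself.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Each thread records which warp of its block it belongs to and its
// lane (position) inside that warp. Assumes a 1-D thread block.
__global__ void record_warp_and_lane(int *warp_of, int *lane_of)
{
    int tid = threadIdx.x;
    warp_of[tid] = tid / warpSize;   // warpSize is a CUDA built-in variable
    lane_of[tid] = tid % warpSize;
}

// Hypothetical host helper: launches one block of n threads and copies
// the results into caller-provided host arrays of length n.
void warp_layout(int n, int *h_warp_of, int *h_lane_of)
{
    assert(n > 0 && n <= 1024);      // one block only in this sketch
    int *d_warp_of = nullptr, *d_lane_of = nullptr;
    cudaMalloc(&d_warp_of, n * sizeof(int));
    cudaMalloc(&d_lane_of, n * sizeof(int));

    record_warp_and_lane<<<1, n>>>(d_warp_of, d_lane_of);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_warp_of, d_warp_of, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    err = cudaMemcpy(h_lane_of, d_lane_of, n * sizeof(int),
                     cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);

    cudaFree(d_warp_of);
    cudaFree(d_lane_of);
}
```

Calling `warp_layout(64, w, l)` on hardware where `warpSize` is 32 should place threads 0 to 31 in warp 0 and threads 32 to 63 in warp 1, with lanes counting from 0 to 31 in each.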
From the NVIDIA Compute PTX ISA 1.2 manual (p. 9):
Individual threads composing a SIMT warp start together at the same program address … A warp executes one common instruction at a time, …. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same execution path. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjointed code paths.
Notice that when there is a branch, the execution of the two branch paths (if both are taken) is serialized. Say we have 32 threads in a warp, 16 of which take branch A while the rest take branch B, and the processor chooses to execute A before B. Then none of the 16 threads on branch B will execute until those on branch A complete. Because of this hardware-imposed ordering, one cannot assume the two branches will be executed concurrently!
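The 16/16 split described above can be sketched as a small kernel. This is only an illustration under some assumptions: the names are hypothetical, a single warp of 32 threads is launched, and `__activemask()` (available from CUDA 9 onward) is used to record which lanes were active alongside each thread at that point in its branch.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// One warp, two branch paths: lanes 0-15 take path A, lanes 16-31 take path B.
// Each lane records its path and the mask of lanes active with it.
__global__ void diverge(unsigned *active_mask, int *path)
{
    int lane = threadIdx.x % warpSize;
    if (lane < 16) {
        path[lane] = 'A';
        active_mask[lane] = __activemask();   // lanes executing path A right now
    } else {
        path[lane] = 'B';
        active_mask[lane] = __activemask();   // lanes executing path B right now
    }
}

// Hypothetical host helper: runs one warp and copies both arrays
// (length 32 each) back to the host.
void run_diverge(unsigned *h_mask, int *h_path)
{
    unsigned *d_mask = nullptr;
    int *d_path = nullptr;
    cudaMalloc(&d_mask, 32 * sizeof(unsigned));
    cudaMalloc(&d_path, 32 * sizeof(int));

    diverge<<<1, 32>>>(d_mask, d_path);

    cudaError_t err = cudaMemcpy(h_mask, d_mask, 32 * sizeof(unsigned),
                                 cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    err = cudaMemcpy(h_path, d_path, 32 * sizeof(int), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);

    cudaFree(d_mask);
    cudaFree(d_path);
}
```

If the two paths are serialized as described, one would commonly see the lanes on path A report a mask covering only lanes 0 to 15, and the lanes on path B a mask covering only lanes 16 to 31. The exact masks depend on the compiler and GPU, so treat them as an observation aid rather than a guarantee.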
As a result, programs that try to implement producer/consumer-style communication within a warp between the two branches using a busy-waiting loop may hang. For example, if the consumer branch is executed first, the consumer threads will loop forever because the producer threads never get a chance to execute.
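The hazardous pattern looks roughly like the sketch below, where lane 0 acts as the producer and the other 31 lanes of the same warp busy-wait on a flag. The names are hypothetical, and the `max_spins` bound is an addition of this sketch so that it always terminates; the pattern described above has no such bound, which is exactly what lets it hang.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Producer and consumers live in the SAME warp, on different branch paths.
__global__ void intra_warp_handoff(volatile int *flag, int *timed_out,
                                   int max_spins)
{
    int lane = threadIdx.x % warpSize;
    if (lane != 0) {
        // Consumer path: busy-wait until the producer sets the flag.
        // Without the max_spins bound this loop may never exit if the
        // hardware runs this path before the producer path.
        int spins = 0;
        while (*flag == 0 && spins < max_spins) {
            ++spins;
        }
        if (*flag == 0) {
            atomicAdd(timed_out, 1);   // this consumer gave up waiting
        }
    } else {
        // Producer path: publish the flag.
        *flag = 1;
        __threadfence();
    }
}

// Hypothetical host helper: runs one warp and returns how many consumers
// gave up before seeing the flag.
int count_timed_out_consumers(int max_spins)
{
    assert(max_spins > 0);
    int *d_flag = nullptr, *d_timed_out = nullptr;
    cudaMalloc(&d_flag, sizeof(int));
    cudaMalloc(&d_timed_out, sizeof(int));
    cudaMemset(d_flag, 0, sizeof(int));
    cudaMemset(d_timed_out, 0, sizeof(int));

    intra_warp_handoff<<<1, 32>>>(d_flag, d_timed_out, max_spins);

    int h_timed_out = -1;
    cudaError_t err = cudaMemcpy(&h_timed_out, d_timed_out, sizeof(int),
                                 cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);

    cudaFree(d_flag);
    cudaFree(d_timed_out);
    return h_timed_out;
}
```

A nonzero return value means some consumers spun through their whole budget without the producer having run, which is the schedule that would loop forever in the unbounded version. Which path runs first is up to the hardware and compiler, so the count can differ between GPUs.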


