You might have heard that many CUDA operators exhibit some kind of non-determinism, and that eliminating the randomness costs performance. Blog posts and framework documentation have warned us about this many times, but few of them explain in detail where the randomness actually comes from. This post explores the problem.
When talking about GPU computation, one may picture some super-fast hardware. The surprising speed comes from the massive parallelism of the architecture, which lets users run thousands of routines in parallel (compared with dozens on ordinary CPUs). These routines are called threads, and much like their namesakes in operating systems, they suffer from non-deterministic execution order and data races.
Non-deterministic execution order means that if we arrange the instructions of all threads into a single sequence, ordered by the time they occur, the sequence can vary greatly across runs. If two threads run in parallel, each with a single instruction, we cannot tell which one executes first. This is the fundamental origin of the randomness, and it is inevitable.
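This is easy to observe directly. As a minimal sketch (the kernel name whoRunsFirst is my own), a kernel in which each thread merely prints its index tends to show a different interleaving on each run:

```cuda
#include <cstdio>

__global__ void whoRunsFirst() {
    // Each thread prints its own index; nothing enforces an order,
    // so the interleaving can differ from one launch to the next.
    printf("thread %d\n", threadIdx.x);
}

int main() {
    whoRunsFirst<<<1, 8>>>();
    cudaDeviceSynchronize();  // wait for device-side printf to flush
    return 0;
}
```

Running this a few times may print the eight thread indices in different orders, even though the program itself never changes.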
A data race is one consequence of non-deterministic execution order. When threads manipulate a shared variable, and the manipulation is not atomic, i.e. it consists of an interruptible instruction sequence, the program may yield undesired results. Programs must be carefully designed to avoid races, with the help of locks or atomic operations. CUDA provides atomic arithmetic routines such as atomicMax() for safe access to shared memory.
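To make the contrast concrete, here is a minimal sketch (the kernel names unsafeSum and safeSum are my own). The first kernel updates a shared counter with a plain read-modify-write, which can lose updates under contention; the second uses atomicAdd() so each update is indivisible:

```cuda
#include <cstdio>

__global__ void unsafeSum(int *out, const int *data) {
    // Plain read-modify-write: the load, add, and store can interleave
    // across threads, so concurrent updates may overwrite each other.
    *out += data[threadIdx.x];
}

__global__ void safeSum(int *out, const int *data) {
    // atomicAdd performs the read-modify-write as one indivisible step.
    atomicAdd(out, data[threadIdx.x]);
}

int main() {
    const int n = 256;
    int h_data[n];
    for (int i = 0; i < n; ++i) h_data[i] = 1;  // correct total: 256

    int *d_data, *d_out;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_data, h_data, n * sizeof(int), cudaMemcpyHostToDevice);

    int h_out = 0;
    cudaMemcpy(d_out, &h_out, sizeof(int), cudaMemcpyHostToDevice);
    unsafeSum<<<1, n>>>(d_out, d_data);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("unsafe sum: %d\n", h_out);  // often less than 256

    h_out = 0;
    cudaMemcpy(d_out, &h_out, sizeof(int), cudaMemcpyHostToDevice);
    safeSum<<<1, n>>>(d_out, d_data);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("safe sum: %d\n", h_out);    // always 256

    cudaFree(d_data);
    cudaFree(d_out);
    return 0;
}
```

The racy version typically undercounts because many threads read the same old value before any of them writes back; the atomic version always produces the full total.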
So far we have seen that there does exist some randomness inside GPUs, and if it is not handled properly, our program will give incorrect results when working with shared variables. One may argue that since we have atomic operations like atomicAdd(), a program that correctly sums up the same collection of numbers should always return the same result, even if the order of the additions is not guaranteed. But this is wrong: some arithmetic operations do depend on the order of their operands. Take the following CUDA program as an example: