You might have heard that many CUDA operators contains some kind of nondeterminism, and to eliminate the randomness, one must pay for the degradation of performance. The warning occurs many times in blog posts or framework documentation, but few of them give a detailed explanation for the source of randomness. To this end, the post is going to explore the problem.
When talked about GPU computation, one might come up with a notion of some superfast hardwares. The surprising speed comes from intensive parallelism of the architecture, which allows users to run thousands of routines on parallel (compared to dozens on ordinary CPUs). The routines are called threads, and similar to the concept with the same name in operating systems, they suffer from nondeterministic execution order and data race condition.
Nondeterministic execution order means, if we arrange all instructions of different threads into a sequence, ordered by their occurrence time, the sequence could vary greatly across invocations. If two threads run on parallel, each with a single instruction, we cannot tell which one is executed first. This is the fundamental origin of randomness, and is inevitable.
Data race condition is one of the consequences of nondeterministic execution order. When the threads is manipulating some shared variables, and the manipulation is not atomic, i.e. consists of interruptible instruction sequence, the program might yield undesired results. Programs should be carefully designed to avoid race condition, with the help of locks or atomic operations. To alleviate, CUDA provides atomic arithmetic routines like atomicAdd()
or atomicMax()
for safe access to shared memory.
By far we have seen that there does exist some kind of randomness inside GPUs, and if not handled properly, our program will give incorrect results when working with shared variables. But one may argue that, we have atomic operations like atomicAdd()
. If a program correctly sums up the same collection of numbers, although the order might be messed, it should always returns the same result. Sadly this is wrong, since some arithmetic operations DOES rely on the order of operands! Let’s take the following CUDA program as an example:

In this simple program, we implement a parallelized summing function with atomicAdd()
. We sum up $\{0.0,0.1,0.2,\ldots,99.8,99.9\}$ for 10 times and print out the results. The results would look like:
49949.996094 
which is completely nondeterministic with some random small errors (the accurate answer should be 49950).
The key is that, floatingpoint arithemtics are nonassociative. The equation $(a + b) + c = a + (b + c)$ does NOT hold for floatingpoint numbers $a, b, c$.
To understand this, let’s consider an extreme example – a very simple decimal floatingpoint representation with no fraction digits, i.e., numbers are represented as $\pm b \times 10^{e}$, where $b \in \{0 \ldots 9\}$. We will calculate $(0.8+1.5)1.6$ and $0.8+(1.51.6)$ under this kind of “specification”. Below are their representation:
$$ \begin{aligned} 0.8 &\rightarrow 8 \times 10^{1} \\ 1.5 &\rightarrow 2 \times 10^{0} \\ 1.6 &\rightarrow 2 \times 10^{0} \\ \end{aligned} $$For $(0.8 + 1.5)1.6$, we first shift the base of $8 \times 10^{1}$ by 1 digit and get $0.8 \times 10^{0}$. Due to an underflow, it is truncated and becomes $0 \times 10^{0}$. Then sum up $0 \times 10^{0}$ (0.8) and $2 \times 10^{0}$ (1.5) and get $2 \times 10^{0}$. Further $2 \times 10^{0}  2 \times 10^{0}$ (1.6), results in $0 \times 10^{0}$.
For $0.8 + (1.51.6)$, first calculate $2 \times 10^0$ (1.5) $2 \times 10^0$ (1.6) and get $0 \times 10^{0}$. Then $8 \times 10^{1}$ (0.8) adding with $0 \times 10^0$, results in $8 \times 10^{1}$.
The example showcases a situation that floatingpoint underflow during summation could lead to extra numeric error, and further causes nonassociativity in the arithemtic. It is simple, but can be generalized to the current IEEE 754 floatingpoint numbers. The operands’ order matters in floatingpoint arithemtics!
So here’s the answer. When running parallelized routines, GPUs have some innate randomness which comes from nondeterministic execution order. If a program consists of orderdependent operations, it will give nondeterministic output. Floatingpoint arithemtics, used widely in CUDA algorithms, are also such kind of operations, though less known by people. Atomic arithemtics solve the race condition problem, but do not guarantee the order.
If one requires determinism, the program should sequentialize the operations by using single thread or special synchronization, which will lead to a performance degrade. There is always a tradeoff between determinism and performance.
OOPS!
A comment box should be right here...But it was gone due to network issues :(If you want to leave comments, make sure you have access to disqus.com.