Demystify the randomness in CUDA kernels

You might have heard that many CUDA operators contains some kind of non-determinism, and to eliminate the randomness, one must pay for the degradation of performance. The warning occurs many times in blog posts or framework documentation, but few of them give a detailed explanation for the source of randomness. To this end, the post is going to explore the problem.

When talked about GPU computation, one might come up with a notion of some super-fast hardwares. The surprising speed comes from intensive parallelism of the architecture, which allows users to run thousands of routines on parallel (compared to dozens on ordinary CPUs). The routines are called threads, and similar to the concept with the same name in operating systems, they suffer from non-deterministic execution order and data race condition.

Non-deterministic execution order means, if we arrange all instructions of different threads into a sequence, ordered by their occurrence time, the sequence could vary greatly across invocations. If two threads run on parallel, each with a single instruction, we cannot tell which one is executed first. This is the fundamental origin of randomness, and is inevitable.

Data race condition is one of the consequences of non-deterministic execution order. When the threads is manipulating some shared variables, and the manipulation is not atomic, i.e. consists of interruptible instruction sequence, the program might yield undesired results. Programs should be carefully designed to avoid race condition, with the help of locks or atomic operations. To alleviate, CUDA provides atomic arithmetic routines like atomicAdd() or atomicMax() for safe access to shared memory.

By far we have seen that there does exist some kind of randomness inside GPUs, and if not handled properly, our program will give incorrect results when working with shared variables. But one may argue that, we have atomic operations like atomicAdd(). If a program correctly sums up the same collection of numbers, although the order might be messed, it should always returns the same result. Sadly this is wrong, since some arithmetic operations DOES rely on the order of operands! Let’s take the following CUDA program as an example:

#include <stdio.h>

__global__ void sum(float *result) {
float i = (float)blockIdx.x;
atomicAdd(result, i / 10);

static const int N = 1000;

int main() {
float *cu_result;
cudaMalloc(&cu_result, sizeof(float));
float result;
int i;
for (i = 0; i < 10; i++) {
cudaMemset(cu_result, 0, sizeof(float));
sum<<<N, 1>>>(cu_result);
cudaMemcpy(&result, cu_result, sizeof(float), cudaMemcpyDeviceToHost);
printf("%f\n", result);
return 0;

In this simple program, we implement a parallelized summing function with atomicAdd(). We sum up $\{0.0,0.1,0.2,\ldots,99.8,99.9\}$ for 10 times and print out the results. The results would look like:


which is completely non-deterministic with some random small errors (the accurate answer should be 49950).

The key is that, floating-point arithemtics are non-associative. The equation $(a + b) + c = a + (b + c)$ does NOT hold for floating-point numbers $a, b, c$.

To understand this, let’s consider an extreme example – a very simple decimal floating-point representation with no fraction digits, i.e., numbers are represented as $\pm b \times 10^{e}$, where $b \in \{0 \ldots 9\}$. We will calculate $(0.8+1.5)-1.6$ and $0.8+(1.5-1.6)$ under this kind of “specification”. Below are their representation:

$$ \begin{aligned} 0.8 &\rightarrow 8 \times 10^{-1} \\ 1.5 &\rightarrow 2 \times 10^{0} \\ 1.6 &\rightarrow 2 \times 10^{0} \\ \end{aligned} $$

For $(0.8 + 1.5)-1.6$, we first shift the base of $8 \times 10^{-1}$ by 1 digit and get $0.8 \times 10^{0}$. Due to an underflow, it is truncated and becomes $0 \times 10^{0}$. Then sum up $0 \times 10^{0}$ (0.8) and $2 \times 10^{0}$ (1.5) and get $2 \times 10^{0}$. Further $2 \times 10^{0} - 2 \times 10^{0}$ (1.6), results in $0 \times 10^{0}$.

For $0.8 + (1.5-1.6)$, first calculate $2 \times 10^0$ (1.5) $-2 \times 10^0$ (1.6) and get $0 \times 10^{0}$. Then $8 \times 10^{-1}$ (0.8) adding with $0 \times 10^0$, results in $8 \times 10^{-1}$.

The example showcases a situation that floating-point underflow during summation could lead to extra numeric error, and further causes non-associativity in the arithemtic. It is simple, but can be generalized to the current IEEE 754 floating-point numbers. The operands’ order matters in floating-point arithemtics!

So here’s the answer. When running parallelized routines, GPUs have some innate randomness which comes from non-deterministic execution order. If a program consists of order-dependent operations, it will give non-deterministic output. Floating-point arithemtics, used widely in CUDA algorithms, are also such kind of operations, though less known by people. Atomic arithemtics solve the race condition problem, but do not guarantee the order.

If one requires determinism, the program should sequentialize the operations by using single thread or special synchronization, which will lead to a performance degrade. There is always a trade-off between determinism and performance.

Author: hsfzxjy.
Link: .
License: CC BY-NC-ND 4.0.
All rights reserved by the author.
Commercial use of this post in any form is NOT permitted.
Non-commercial use of this post should be attributed with this block of text.



A comment box should be right here...But it was gone due to network issues :-(If you want to leave comments, make sure you have access to