Where does the randomness of CUDA kernels come from?

You might have heard that many CUDA operators have some kind of non-determinism, and to eliminate the randomness, one must pay for the degradation of performance. We have been warned many times by blog posts or framework documentation, but few of them give a detailed explanation for the source of randomness. This post is going to explore the problem.

When talked about GPU computation, one may have an impression of some super-fast hardwares. The surprising speed comes from intensive parallelism of the architecture, which allows users to run thousands of routines on parallel (by contrast the number is dozens on ordinary CPUs). The routines are called threads, and similar to the concept with the same name in operating systems, they suffer from non-deterministic execution order and data race condition.

Non-deterministic execution order means, if we arrange all instructions of different threads into a sequence, ordered by their occurrence time, the sequence could vary greatly across executions. If two threads run on parallel, each with a single instruction, we cannot tell which one is executed first. This is the fundamental origin of randomness, and is inevitable.

Data race condition is one of the consequences of non-deterministic execution order. When the threads is manipulating some shared variables, and the manipulation is not atomic, i.e. consists of interruptible instruction sequence, the program might yield undesired results. Programs should be carefully designed to avoid race condition, with the help of locks or atomic operations. CUDA supports provides atomic arithmetic routines like atomicAdd() or atomicMax() for safe access to shared memory.

By far we have seen that there does exist some kind of randomness inside GPUs, and if not handled properly, our program will give incorrect results when working with shared variables. One may argue that, we have atomic operations like atomicAdd(), and if the program correctly sums up the same collection of numbers, although the order might not be guaranteed, it should always returns the same result. But this is wrong – some arithmetic operations does rely on the order of operands. Take the following CUDA program as an example:

Read More

Performant Bulk Mutations in IndexedDB

IndexedDB seems to be inefficient when performing bulk mutations, like dumping a huge list of items into an object store – at least I think so at the first sight on the MDN docs. It provides no explicit API for the job as SQL does (you may add as many records after the INSERT INTO ... VALUES clause), so all we can do is to use a loop from client, and thus cannot benefit from database internal optimization (if there’s any). The mutation requests, in addition, appear to be spawned sequentially only. The tutorial recommends a paradigm to raise a request within the success event callback of the previous request, which is in fact a sequential execution. Such code will be definitely slow.

We may conduct a quick benchmark on the above approach:

;(async () => {
await new Promise((resolve) => {
const r = indexedDB.deleteDatabase("test")
r.onsuccess = r.onerror = resolve
const items = Array.from({ length: 100000 }, (_, i) => ({ id: i }))
const store = await new Promise((resolve) => {
indexedDB.open("test", 1).onupgradeneeded = (event) => {
const db = event.target.result
const store = db.createObjectStore("store", { keyPath: "id" })
store.createIndex("id", "id")
await bulkAdd(store, items)

function bulkAdd(store, items) {
const failures = []
return new Promise((resolve) => {
function _perform(idx) {
const req = store.add(items[idx])
req.onsuccess = (event) => {
if (idx === items.length - 1) resolve(failures)
else _perform(idx + 1)
req.onerror = (event) => {

Practically, we concern more about failed records than the ones inserted successfully. We thus only take down the indices of those records, which improves the efficiency at least a little bit.

The timing is rather unstable, but on average, it takes 30~40 seconds to insert 100k records or 2000~3000 records per second, which is not an ideal speed.

Read More

Auto Rebuild .pyx Files with pyximport

Modules written in Cython usually comes with a setup.py script, compiling the Cython source codes into native shared libary. For whom not so familiar with Python’s packaging and distributing toolchains, such step is sometimes forbidding, and turns out to be a stumbling block for Cython freshmen. Moreover, the workflow, “run setup.py -> debug -> edit .pyx files -> run setup.py”, is also not convenient and troublesome for fast iterating projects.

pyximport is a handy tool from Cython official, to address the above problem. The module enables users to “directly import” .pyx files, with no explicit setup.py required. Let’s take an example here. Say we have two files residing in the same directory:

# main.py
import pyximport


import foo

# foo.pyx
cpdef int sqr(int x):
return x * x

The magically highlighted line, registers some import hooks to let Python recognize .pyx files. pyximport automatically compiles or re-compiles the .pyx files behind the scene, when they are imported for the first time, or modified in the future.

Read More

Cython and Threads

Pure Python sucks in the scene of parallel computing, due to the existence of the Global Interpreter Lock (aka GIL). GIL prevents codes from different threads to access or manipulate the interpreter. The mechanism alleviates the risk of race condition, but makes multi-threading program “single-threaded” as well. Sadly, there’s no way to release the lock from pure Python.

OK. So what about not using pure Python? Shall we make an extension to bypass the mechanism? The answer is yes, and that’s what most of scientific libaries do.

As of writing extensions, Cython is a good choice, less verbose, and more similar to Python syntactically. In Cython, one can release GIL temporarily for a block using the with nogil: syntax. Will it better utilize multi-core CPU? Let’s have a try.

Read More

Obtain a Random Unused TCP Port with Bash

We might sometimes want to randomly choose an unused TCP port on Linux. This occurs from time to time on a server, when the administrator wants to expose an HTTP port for each user. Or, you just need an available port for IPC. Let’s make it happen with bash only.

The function below does the right job:

function unused_port() {
comm -23 \
<(seq "1025" "65535" | sort) \
<(ss -Htan |
awk '{print $4}' |
cut -d':' -f2 |
sort -u) |
shuf |
head -n "$N"

We are going to step through the code next.

Read More

Information Theory: KL Divergence

Assume there are two hypotheses $H_1$ and $H_2$, r.v. $X$ ranged in alphabets $a_1,\ldots\,a_k$. Under hypothesis $H_i$, $X$ has pdf $p(X=a_j|H_i)=p_i(a_j)$. According to Law of Total Probability, we have:

$$ p(H_i|a_k) = \frac{p(H_i)p_i(a_k)}{p_1(a_k)p(H_1)+p_2(a_k)p(H_2)} $$

The formula can be transformed into:

$$ \log \frac{p_2(a_k)}{p_1(a_k)} = \log \frac{p(H_2|a_k)}{p(H_1|a_k)} - \log \frac{p(H_2)}{p(H_1)} $$

which implies that, $\log \frac{p_2(a_k)}{p_1(a_k)}$ equals the difference of log likelihood ratio before and after conditioning $X=a_k$. We define $\log \frac{p_2(a_k)}{p_1(a_k)}$ be the discrimination information for $H_2$ over $H_1$, when $X=a_k$. The expectation of discrimination information is KL divergence, denoted as:

$$D_{KL}(P_2||P_1) = \sum_k p_2(a_k) \log \frac{p_2(a_k)}{p_1(a_k)} $$

which sometimes denoted as $I(p2,p1;X)$, or simply $I(p2,p1)$ if without ambiguity.

KL Divergence can be interpreted as a measure of expected information for $X$ gained after distribution shifted from $p_1$ to $p_2$, where $p_1$ and $p_2$ regarded as prior and post-prior distributions.

Read More

Information Theory: Entropy and Mutual Information

Given a discrete r.v. $X$, where $X$ ranged in $\{a_1, \ldots, a_n\}$, $\mathbb{P}(X=a_k)=p_k$. Entropy $H(X)$ is defined as:

$$H(X)= - \sum_k p_k \log p_k$$

When regarded as a function of $\{p_k\}$, entropy satisfies the following properties:

  1. $H(p_1,\ldots,p_n)$ is continuous, and non-negative;
  2. $H(p_1,\ldots,p_n)$ is convex w.r.t. $(p_1,\ldots,p_n)$;
  3. $H(p_1,\ldots,p_n)$ has a unique maxima $(\frac{1}{n},\ldots,\frac{1}{n})$;
  4. $H(n):=H(\frac{1}{n},\ldots,\frac{1}{n})$ increases along with $n$;
  5. $H(p_1,\ldots,p_n)=H(p_1+\ldots+p_k,p_{k+1},\ldots,p_n)+(p_1+\ldots+p_k)H(p_{k+1}',\ldots,p_n')$.

Property 5 is so-called addictivity. That is, if we observe $X$ in two steps, firstly obtaining a value from $\{\hat{a},a_{k+1},\ldots,a_n\}$ and then another value from $\{a_1,\ldots,a_k\}$ if $\hat{a}$ selected, the entropy of the whole system should be sum of these two subsystems.

Note that a function satisfying property 1, 4, 5 must have a form of $H(\vec{p})= - C \sum_k p_k \log p_k$, which reveals that entropy function is unique.

Entropy measures the uncertainty of a random value. Intuitively, entropy reaches its maximum $\log n$ when all alphabets occur with same probability, and likewise has a minimum of $0$ if $p_k=1$ for some $k$.

Entropy also represents the smallest average length to encode a message. Say we have a message consisting of alphabets $a_1,\ldots,a_n$, occurring with probability $p_1,\ldots,p_n$. Now we want to assign a code (an $N$-ary string) to each alphabet, with no two codes sharing a same prefix. The length of the codes are denoted as $l_1,\ldots,l_n$. Shannon’s source coding theroem states that the average code length $\sum_k p_k l_k$ could not be less than $H(p_1,\ldots,p_n)$ (taking $N$ as logarithm base).

Read More










Read More

Proof of the Gumbel Max Trick


Assume that $\alpha_1, \alpha_2, \ldots, \alpha_n$ satisify $\sum_k\alpha_k=1$. Define


where $G_1,\ldots,G_n \text{ i.i.d.}\sim Gumbel(0,1)$, whose PDF and CDF are defined as

$$\begin{align} f(x)&=e^{-(x+e^{-x})} \\ F(x)&=e^{-e^{-x}}\end{align}$$

. Then $\mathbb{P}(Z=k)=\alpha_k$.


Set $u_k=\log{\alpha_k}+G_k$. We prove by direct calculations.

$$\begin{align} \mathbb{P}(Z=k)&=\mathbb{P}(u_k \geq u_j,\forall j \neq k) \\ &=\int_{-\infty}^\infty \mathbb{P}(u_k \geq u_j, \forall j \neq k|u_k)\mathbb{P}(u_k) du_k \\ &=\int_{-\infty}^\infty \prod_{j\neq k}\mathbb{P}(u_k \geq u_j|u_k)\mathbb{P}(u_k) du_k \\ &=\int_{-\infty}^\infty \prod_{j\neq k}e^{-e^{-u_k+\log \alpha_j}} e^{-(u_k-\log\alpha_k+e^{-(u_k-\log\alpha_k)})} du_k \\ &=\int_{-\infty}^\infty e^{-\sum_{j\neq k}\alpha_je^{-u_k}} \alpha_k e^{-(u_k+\alpha_k e^{-u_k})} du_k \\ &=\alpha_k \int_{-\infty}^\infty e^{-u_k-(\alpha_k+\sum_{j\neq k}\alpha_j)e^{-u_k}} du_k \\ &= \alpha_k \end{align}$$.


The trick is commonly used in DL to make sampling over a discrete distribution differentiable.