Cython and Threads

Pure Python sucks at parallel computing, due to the existence of the Global Interpreter Lock (aka GIL). The GIL prevents code in different threads from accessing or manipulating the interpreter concurrently. The mechanism alleviates the risk of race conditions, but it also makes multi-threaded programs effectively single-threaded. Sadly, there's no way to release the lock from pure Python.
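
To see this concretely (a quick sketch of mine, not from the post): two CPU-bound threads take about as long as doing the same work serially, because only one thread can hold the GIL at a time.

```python
import threading
import time

def spin(n):
    # Pure-Python busy loop: holds the GIL while it runs.
    while n:
        n -= 1

start = time.perf_counter()
threads = [threading.Thread(target=spin, args=(20_000_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Roughly the serial time of 2 x 20M iterations: no parallel speedup.
print(f"two threads took {time.perf_counter() - start:.2f}s")
```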

OK. So what about not using pure Python? Shall we write an extension to bypass the mechanism? The answer is yes, and that's what most scientific libraries do.

For writing extensions, Cython is a good choice: it is less verbose than C, and syntactically closer to Python. In Cython, one can temporarily release the GIL for a block using the `with nogil:` syntax. Will it make better use of multi-core CPUs? Let's have a try.
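
Here is a minimal sketch (a hypothetical kernel, not the post's actual benchmark) of the shape such code takes; the loop below touches no Python objects, so it may legally run without the GIL:

```cython
# busy.pyx -- hypothetical example; compile with cythonize.
def busy_sum(long n):
    cdef long i
    cdef double total = 0
    # No Python objects are touched inside this block, so the GIL
    # can be released and other threads may run concurrently.
    with nogil:
        for i in range(n):
            total += i
    return total
```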

Read More

Obtain a Random Unused TCP Port with Bash

We might sometimes want to randomly choose an unused TCP port on Linux. This comes up from time to time on a server, e.g. when the administrator wants to expose an HTTP port for each user, or when you just need an available port for IPC. Let's make it happen with bash only.

The function below does the job:

```bash
# Print N random unused TCP ports (N defaults to 1),
# e.g. `unused_port 3` prints three free ports.
function unused_port() {
    local N="${1:-1}"
    comm -23 \
        <(seq "1025" "65535" | sort) \
        <(ss -Htan |
            awk '{print $4}' |
            cut -d':' -f2 |
            sort -u) |
        shuf |
        head -n "$N"
}
```

We are going to step through the code next.

Read More

Information Theory: KL Divergence

Assume there are two hypotheses $H_1$ and $H_2$, and a r.v. $X$ ranging over the alphabet $a_1,\ldots,a_k$. Under hypothesis $H_i$, $X$ has pmf $p(X=a_j|H_i)=p_i(a_j)$. By Bayes' rule, with the denominator expanded by the Law of Total Probability, we have:

$$ p(H_i|a_k) = \frac{p(H_i)p_i(a_k)}{p_1(a_k)p(H_1)+p_2(a_k)p(H_2)} $$

Writing this for $i=1,2$, taking the ratio of the two posteriors (the denominators cancel) and then the logarithm, the formula can be rearranged into:

$$ \log \frac{p_2(a_k)}{p_1(a_k)} = \log \frac{p(H_2|a_k)}{p(H_1|a_k)} - \log \frac{p(H_2)}{p(H_1)} $$

which implies that $\log \frac{p_2(a_k)}{p_1(a_k)}$ equals the difference between the log-likelihood ratios after and before conditioning on $X=a_k$. We define $\log \frac{p_2(a_k)}{p_1(a_k)}$ to be the discrimination information for $H_2$ over $H_1$ when $X=a_k$. The expectation of the discrimination information under $p_2$ is the KL divergence, denoted as:

$$D_{KL}(P_2||P_1) = \sum_k p_2(a_k) \log \frac{p_2(a_k)}{p_1(a_k)} $$

which is sometimes denoted $I(p_2,p_1;X)$, or simply $I(p_2,p_1)$ when there is no ambiguity.

KL divergence can be interpreted as a measure of the expected information about $X$ gained when the distribution shifts from $p_1$ to $p_2$, where $p_1$ and $p_2$ are regarded as the prior and posterior distributions.
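
For a concrete feel (an illustrative example of mine, not from the post), take $p_1=(\frac{1}{4},\frac{3}{4})$ and $p_2=(\frac{1}{2},\frac{1}{2})$ over a binary alphabet, with base-2 logarithms:

$$D_{KL}(P_2||P_1) = \frac{1}{2}\log_2\frac{1/2}{1/4} + \frac{1}{2}\log_2\frac{1/2}{3/4} = \frac{1}{2} + \frac{1}{2}\log_2\frac{2}{3} \approx 0.208 \text{ bits}$$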

Read More

Information Theory: Entropy and Mutual Information

Given a discrete r.v. $X$ ranging over $\{a_1, \ldots, a_n\}$ with $\mathbb{P}(X=a_k)=p_k$, the entropy $H(X)$ is defined as:

$$H(X)= - \sum_k p_k \log p_k$$

When regarded as a function of $\{p_k\}$, entropy satisfies the following properties:

  1. $H(p_1,\ldots,p_n)$ is continuous and non-negative;
  2. $H(p_1,\ldots,p_n)$ is concave w.r.t. $(p_1,\ldots,p_n)$;
  3. $H(p_1,\ldots,p_n)$ attains its unique maximum at $(\frac{1}{n},\ldots,\frac{1}{n})$;
  4. $H(n):=H(\frac{1}{n},\ldots,\frac{1}{n})$ increases with $n$;
  5. $H(p_1,\ldots,p_n)=H(p_1+\cdots+p_k,p_{k+1},\ldots,p_n)+(p_1+\cdots+p_k)H(p_1',\ldots,p_k')$, where $p_j'=p_j/(p_1+\cdots+p_k)$.

Property 5 is the so-called additivity. That is, if we observe $X$ in two steps, first obtaining a value from $\{\hat{a},a_{k+1},\ldots,a_n\}$ and then, if $\hat{a}$ was selected, another value from $\{a_1,\ldots,a_k\}$, the entropy of the whole system should be the sum of those of the two subsystems (the second weighted by the probability of reaching it).
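
As a quick check (a numeric example of mine, not from the post), take $(p_1,p_2,p_3)=(\frac{1}{4},\frac{1}{4},\frac{1}{2})$ with base-2 logarithms and group the first two outcomes: $H(\frac{1}{4},\frac{1}{4},\frac{1}{2})=1.5$ bits, and indeed $H(\frac{1}{2},\frac{1}{2})+\frac{1}{2}H(\frac{1}{2},\frac{1}{2})=1+\frac{1}{2}=1.5$ bits.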

Note that a function satisfying properties 1, 4 and 5 must be of the form $H(\vec{p})= - C \sum_k p_k \log p_k$, which reveals that the entropy function is essentially unique.

Entropy measures the uncertainty of a random variable. Intuitively, entropy reaches its maximum $\log n$ when all alphabets occur with the same probability, and likewise attains its minimum $0$ if $p_k=1$ for some $k$.

Entropy also represents the smallest average length needed to encode a message. Say we have a message consisting of alphabets $a_1,\ldots,a_n$, occurring with probabilities $p_1,\ldots,p_n$. Now we want to assign a code (an $N$-ary string) to each alphabet, with no code being a prefix of another. Denote the code lengths by $l_1,\ldots,l_n$. Shannon's source coding theorem states that the average code length $\sum_k p_k l_k$ cannot be less than $H(p_1,\ldots,p_n)$ (taking $N$ as the logarithm base).
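
As a sanity check (an illustrative example of mine, not from the post), encode three alphabets with probabilities $(\frac{1}{2},\frac{1}{4},\frac{1}{4})$ in binary ($N=2$). The entropy is $\frac{1}{2}\log_2 2+\frac{1}{4}\log_2 4+\frac{1}{4}\log_2 4=1.5$ bits, and the prefix-free code $\{0, 10, 11\}$ attains exactly this bound: $\frac{1}{2}\cdot 1+\frac{1}{4}\cdot 2+\frac{1}{4}\cdot 2=1.5$.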

Read More


Proof of the Gumbel Max Trick


Assume that $\alpha_1, \alpha_2, \ldots, \alpha_n > 0$ satisfy $\sum_k\alpha_k=1$. Define

$$Z = \arg\max_k (\log \alpha_k + G_k)$$

where $G_1,\ldots,G_n \text{ i.i.d.}\sim \mathrm{Gumbel}(0,1)$, whose PDF and CDF are defined as

$$\begin{align} f(x)&=e^{-(x+e^{-x})} \\ F(x)&=e^{-e^{-x}}\end{align}$$

Then $\mathbb{P}(Z=k)=\alpha_k$.


Set $u_k=\log{\alpha_k}+G_k$. We prove the claim by direct calculation.

$$\begin{align} \mathbb{P}(Z=k)&=\mathbb{P}(u_k \geq u_j,\forall j \neq k) \\ &=\int_{-\infty}^\infty \mathbb{P}(u_k \geq u_j, \forall j \neq k|u_k)\mathbb{P}(u_k) du_k \\ &=\int_{-\infty}^\infty \prod_{j\neq k}\mathbb{P}(u_k \geq u_j|u_k)\mathbb{P}(u_k) du_k \\ &=\int_{-\infty}^\infty \prod_{j\neq k}e^{-e^{-u_k+\log \alpha_j}} e^{-(u_k-\log\alpha_k+e^{-(u_k-\log\alpha_k)})} du_k \\ &=\int_{-\infty}^\infty e^{-\sum_{j\neq k}\alpha_je^{-u_k}} \alpha_k e^{-(u_k+\alpha_k e^{-u_k})} du_k \\ &=\alpha_k \int_{-\infty}^\infty e^{-u_k-(\alpha_k+\sum_{j\neq k}\alpha_j)e^{-u_k}} du_k \\ &= \alpha_k \end{align}$$

The last step uses $\sum_j \alpha_j = 1$: the integrand reduces to $e^{-(u_k+e^{-u_k})}$, the $\mathrm{Gumbel}(0,1)$ density, which integrates to $1$.


The trick is commonly used in DL to make sampling over a discrete distribution differentiable (usually via its continuous relaxation, the Gumbel-softmax).
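
As a quick numerical check (an illustrative sketch of mine, not from the post), we can sample $Z$ with NumPy and compare the empirical frequencies against $\alpha$:

```python
import numpy as np

# Empirical check of the Gumbel-max trick: frequencies of the argmax
# should match the target distribution alpha.
rng = np.random.default_rng(0)
alpha = np.array([0.1, 0.2, 0.3, 0.4])
g = rng.gumbel(size=(100_000, alpha.size))   # i.i.d. Gumbel(0, 1) noise
z = np.argmax(np.log(alpha) + g, axis=1)     # Z = argmax_k (log alpha_k + G_k)
print(np.bincount(z) / z.size)               # close to [0.1, 0.2, 0.3, 0.4]
```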

Read More

Let’s consider the following function:

```rust
use std::ptr::NonNull;

fn transform<T>(option: &Option<NonNull<T>>) -> Option<&T> {
    option.as_ref().map(|x| unsafe { x.as_ref() })
}
```

The function `transform` takes an `Option<NonNull<T>>` as input, and converts the inner pointer to an immutable reference `&T` if available. The method `NonNull::as_ref()` is marked `unsafe`, so we need an `unsafe` block. The snippet causes a compilation error:

Read More

Rc and RefCell tricks

Say we need a type `Cursor<T>`, which contains a mutable reference to `T`. The type has a method, say `.dup()`, which returns a new instance of `Cursor<T>` that shares the interior reference with the original. Such a design is a common pattern in database driver libraries: users might expect to hold multiple cursors simultaneously, each of which holds a reference to the same connection object.

A possible implementation might look like:

```rust
struct Cursor<'a, T> {
    obj: &'a mut T,
}

impl<'a, T> Cursor<'a, T> {
    fn new(t: &'a mut T) -> Cursor<'a, T> {
        Cursor { obj: t }
    }

    // Reborrows the interior reference; the new cursor's lifetime
    // is tied to `&mut self`, not to `'a`.
    fn dup(&mut self) -> Cursor<T> {
        Cursor { obj: self.obj }
    }
}

fn main() {
    let mut i = 1;
    let mut cursor_a = Cursor::new(&mut i);
    let _cursor_b = cursor_a.dup();
}
```

Perfect and neat, and luckily the Rust compiler did not complain. For a freshman in Rust, it's not easy to keep the compiler silent all the time, especially when playing with references. But once I try to chain up the constructor and `.dup()`, the compiler screams:

Read More
