## Information Theory: KL Divergence

Assume there are two hypotheses $H_1$ and $H_2$, and a r.v. $X$ taking values in an alphabet $a_1,\ldots,a_n$. Under hypothesis $H_i$, $X$ has pmf $p(X=a_j|H_i)=p_i(a_j)$. By Bayes' rule (with the law of total probability in the denominator), we have:

$$p(H_i|a_k) = \frac{p(H_i)p_i(a_k)}{p_1(a_k)p(H_1)+p_2(a_k)p(H_2)}$$Taking the ratio of the two posteriors and then logarithms, the formula can be transformed into:

$$\log \frac{p_2(a_k)}{p_1(a_k)} = \log \frac{p(H_2|a_k)}{p(H_1|a_k)} - \log \frac{p(H_2)}{p(H_1)}$$which implies that $\log \frac{p_2(a_k)}{p_1(a_k)}$ equals the difference between the log likelihood ratios after and before conditioning on $X=a_k$. We define $\log \frac{p_2(a_k)}{p_1(a_k)}$ to be the discrimination information for $H_2$ over $H_1$ when $X=a_k$. The expectation of the discrimination information (under $p_2$) is the KL divergence, denoted as:

$$D_{KL}(P_2||P_1) = \sum_k p_2(a_k) \log \frac{p_2(a_k)}{p_1(a_k)}$$which is sometimes denoted $I(p_2,p_1;X)$, or simply $I(p_2,p_1)$ when there is no ambiguity.

KL divergence can be interpreted as the expected information about $X$ gained when the distribution shifts from $p_1$ to $p_2$, where $p_1$ and $p_2$ are regarded as the prior and posterior distributions.
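As a quick numerical sketch of the definition (the function name is mine; natural log, so the result is in nats):

```haskell
-- D_KL(P2||P1) = sum_k p2(a_k) * log (p2(a_k) / p1(a_k)).
-- Terms with p2(a_k) == 0 contribute 0 by the usual convention.
klDivergence :: [Double] -> [Double] -> Double
klDivergence p2 p1 = sum [ q * log (q / p) | (q, p) <- zip p2 p1, q > 0 ]

main :: IO ()
main = do
  print (klDivergence [0.5, 0.5] [0.5, 0.5])  -- identical distributions give 0
  print (klDivergence [0.9, 0.1] [0.5, 0.5])  -- positive for distinct distributions
```

Note the asymmetry: klDivergence p2 p1 generally differs from klDivergence p1 p2.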

## Information Theory: Entropy and Mutual Information

Given a discrete r.v. $X$ taking values in $\{a_1, \ldots, a_n\}$ with $\mathbb{P}(X=a_k)=p_k$, the entropy $H(X)$ is defined as:

$$H(X)= - \sum_k p_k \log p_k$$When regarded as a function of $(p_1,\ldots,p_n)$, entropy satisfies the following properties:

1. $H(p_1,\ldots,p_n)$ is continuous and non-negative;
2. $H(p_1,\ldots,p_n)$ is concave w.r.t. $(p_1,\ldots,p_n)$;
3. $H(p_1,\ldots,p_n)$ attains its unique maximum at $(\frac{1}{n},\ldots,\frac{1}{n})$;
4. $H(n):=H(\frac{1}{n},\ldots,\frac{1}{n})$ is increasing in $n$;
5. $H(p_1,\ldots,p_n)=H(p_1+\ldots+p_k,p_{k+1},\ldots,p_n)+(p_1+\ldots+p_k)H(p_1',\ldots,p_k')$, where $p_i'=p_i/(p_1+\ldots+p_k)$.

Property 5 is the so-called additivity. That is, suppose we observe $X$ in two steps: first obtain a value from $\{\hat{a},a_{k+1},\ldots,a_n\}$, and then, if $\hat{a}$ was selected, another value from $\{a_1,\ldots,a_k\}$. The entropy of the whole system should be the entropy of the first step plus the entropy of the second step, weighted by the probability $p_1+\ldots+p_k$ that it occurs.

Note that any function satisfying properties 1, 4 and 5 must have the form $H(\vec{p})= - C \sum_k p_k \log p_k$, which shows that the entropy function is essentially unique.

Entropy measures the uncertainty of a random variable. Intuitively, entropy reaches its maximum $\log n$ when all alphabets occur with the same probability, and attains its minimum $0$ if $p_k=1$ for some $k$.

Entropy also represents the smallest average length needed to encode a message. Say we have a message consisting of alphabets $a_1,\ldots,a_n$, occurring with probabilities $p_1,\ldots,p_n$. Now we want to assign a code (an $N$-ary string) to each alphabet, such that no codeword is a prefix of another. The lengths of the codes are denoted $l_1,\ldots,l_n$. Shannon’s source coding theorem states that the average code length $\sum_k p_k l_k$ cannot be less than $H(p_1,\ldots,p_n)$ (taking $N$ as the logarithm base).
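A small sketch of both facts (names are mine; base 2, so entropy is in bits):

```haskell
-- H(p) = - sum_k p_k * log2 p_k, with 0 * log 0 taken as 0.
entropy :: [Double] -> Double
entropy ps = negate (sum [ p * logBase 2 p | p <- ps, p > 0 ])

main :: IO ()
main = do
  print (entropy [0.25, 0.25, 0.25, 0.25])  -- uniform over 4 symbols: 2 bits
  print (entropy [1.0])                     -- deterministic: zero entropy
  -- The prefix-free binary code {0, 10, 11} for p = (0.5, 0.25, 0.25)
  -- achieves the bound: average length 0.5*1 + 0.25*2 + 0.25*2 = 1.5 = H(p).
  print (entropy [0.5, 0.25, 0.25])
```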

## Statement

Assume that $\alpha_1, \alpha_2, \ldots, \alpha_n$ are positive and satisfy $\sum_k\alpha_k=1$. Define

$$Z=\arg\max_k\{\log\alpha_k+G_k\}$$where $G_1,\ldots,G_n \text{ i.i.d.} \sim \mathrm{Gumbel}(0,1)$, whose PDF and CDF are given by

\begin{align} f(x)&=e^{-(x+e^{-x})} \\ F(x)&=e^{-e^{-x}} \end{align}Then $\mathbb{P}(Z=k)=\alpha_k$.

## Proof

Set $u_k=\log{\alpha_k}+G_k$. We prove the claim by direct calculation.

\begin{align} \mathbb{P}(Z=k)&=\mathbb{P}(u_k \geq u_j,\ \forall j \neq k) \\ &=\int_{-\infty}^\infty \mathbb{P}(u_k \geq u_j,\ \forall j \neq k \mid u_k)\,p(u_k)\, du_k \\ &=\int_{-\infty}^\infty \prod_{j\neq k}\mathbb{P}(u_k \geq u_j \mid u_k)\,p(u_k)\, du_k \\ &=\int_{-\infty}^\infty \prod_{j\neq k}e^{-e^{-(u_k-\log \alpha_j)}}\, e^{-(u_k-\log\alpha_k+e^{-(u_k-\log\alpha_k)})}\, du_k \\ &=\int_{-\infty}^\infty e^{-\sum_{j\neq k}\alpha_je^{-u_k}}\, \alpha_k e^{-(u_k+\alpha_k e^{-u_k})}\, du_k \\ &=\alpha_k \int_{-\infty}^\infty e^{-u_k-(\alpha_k+\sum_{j\neq k}\alpha_j)e^{-u_k}}\, du_k \\ &=\alpha_k \int_{-\infty}^\infty e^{-u_k-e^{-u_k}}\, du_k \\ &= \alpha_k \end{align}Here the second line conditions on the value of $u_k$ (with density $p(u_k)$), the third uses the independence of the $G_j$, the second-to-last uses $\sum_j \alpha_j = 1$, and the last integral equals $1$ since the integrand is exactly the $\mathrm{Gumbel}(0,1)$ density.

## Application

The trick is commonly used in deep learning to make sampling from a discrete distribution differentiable; the argmax itself is not differentiable, so in practice it is relaxed into a softmax (the Gumbel-softmax trick).
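A deterministic sketch of the sampling step (names are mine; the uniform draws are passed in explicitly for reproducibility, using the standard transform $G = -\log(-\log U)$ for $U \sim \mathrm{Uniform}(0,1)$):

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)

-- Z = argmax_k (log alpha_k + G_k), with G_k = -log(-log u_k).
gumbelMax :: [Double] -> [Double] -> Int
gumbelMax alphas us = fst (maximumBy (comparing snd) (zip [0 ..] scores))
  where
    gumbels = [ negate (log (negate (log u))) | u <- us ]
    scores  = zipWith (\a g -> log a + g) alphas gumbels

main :: IO ()
main =
  -- with equal uniforms the Gumbel noise is identical across indices,
  -- so the argmax falls on the largest alpha (index 2 here)
  print (gumbelMax [0.1, 0.2, 0.7] [0.5, 0.5, 0.5])
```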

## Option::as_ref

Let’s consider a function transform that takes an Option<NonNull<T>> as input and converts the inner pointer to an immutable reference &T when available. The method NonNull::as_ref() is marked unsafe, so we need an unsafe block. A naive attempt causes a compilation error.
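The original snippets are not shown; a working sketch of the idea (the name and signature are my reconstruction) borrows the Option with Option::as_ref before touching the pointer, instead of moving the Option or the pointee:

```rust
use std::ptr::NonNull;

// Hypothetical reconstruction: turn &Option<NonNull<T>> into Option<&T>.
// Option::as_ref borrows the inner NonNull instead of moving it out;
// NonNull::as_ref is unsafe, hence the unsafe block.
fn transform<T>(ptr: &Option<NonNull<T>>) -> Option<&T> {
    // SAFETY: assumes the pointer, if present, is valid and properly
    // aligned for the lifetime of the returned borrow.
    ptr.as_ref().map(|p| unsafe { p.as_ref() })
}

fn main() {
    let x = 42;
    let opt = Some(NonNull::from(&x));
    assert_eq!(transform(&opt), Some(&42));
    assert_eq!(transform::<i32>(&None), None);
}
```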

## Rc and RefCell tricks

Say we need a type Cursor<T> , which contains a mutable reference to T . The type has a method, say .dup() , which returns a new instance of Cursor<T> while sharing the interior reference across instances. Such a design is a common pattern in database driver libraries: users might expect to hold multiple cursors simultaneously, each of which owns a reference to the same connection object.

A possible implementation stores the mutable reference directly inside Cursor<T> .

Perfect and neat, and luckily the Rust compiler did not complain. As a newcomer to Rust, it’s not easy to keep the compiler silent all the time, especially when playing with references. But once I tried to chain up the constructor and .dup() , the compiler screamed.
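Neither the original implementation nor the error message survives here; a sketch of the Rc<RefCell<T>> version that sidesteps the borrow problem (all names are mine):

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Share the connection through Rc<RefCell<T>> instead of &mut T:
// each cursor owns a cheap reference-counted handle, so duplicates
// can outlive the scope that created the first cursor.
struct Cursor<T> {
    conn: Rc<RefCell<T>>,
}

impl<T> Cursor<T> {
    fn new(conn: T) -> Self {
        Cursor { conn: Rc::new(RefCell::new(conn)) }
    }

    // dup() only clones the handle; the connection itself is shared.
    fn dup(&self) -> Cursor<T> {
        Cursor { conn: Rc::clone(&self.conn) }
    }
}

fn main() {
    // Chaining the constructor and dup() now compiles fine.
    let c1 = Cursor::new(String::from("conn")).dup();
    let c2 = c1.dup();
    c2.conn.borrow_mut().push_str("-used");
    // Both cursors observe the same underlying connection.
    assert_eq!(*c1.conn.borrow(), "conn-used");
}
```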

## SS Client

/etc/ss.json :
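The config itself is not shown; a typical shadowsocks client config has this shape (every value below is a placeholder, not the original):

```json
{
    "server": "example.com",
    "server_port": 8388,
    "local_address": "127.0.0.1",
    "local_port": 1080,
    "password": "placeholder-password",
    "method": "aes-256-gcm"
}
```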

## proxychains

Clone the repository from https://github.com/rofl0r/proxychains-ng , then run make && sudo make install .

Append following lines to /etc/proxychains.conf :
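The appended lines are not shown; assuming a local SOCKS5 proxy listening on port 1080 (an assumption, matching the usual shadowsocks local port), the [ProxyList] entry would typically be:

```
# [ProxyList] entry: route traffic through the local SOCKS5 proxy
socks5 127.0.0.1 1080
```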

Usage: proxychains [command] .

SwitchyOmega.

return is trivial:

fmap can be defined as follows:
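The definitions themselves are lost; as a sketch of what such definitions look like for a concrete type, take Maybe (my choice of example; the original may have used a different type, so the names here are re-implementations to show the shape):

```haskell
-- A re-implementation of Maybe, to spell the definitions out.
data Maybe' a = Nothing' | Just' a deriving (Show, Eq)

-- return is trivial: it just boxes the value.
return' :: a -> Maybe' a
return' = Just'

-- fmap applies the function inside the box, preserving failure.
fmap' :: (a -> b) -> Maybe' a -> Maybe' b
fmap' _ Nothing'  = Nothing'
fmap' f (Just' x) = Just' (f x)

main :: IO ()
main = do
  print (fmap' (+ 1) (return' 41))  -- Just' 42
  print (fmap' (+ 1) Nothing')      -- Nothing'
```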

## Motivation

Pure functions look perfect, but they cannot model many tasks in the real world. This is because pure functions are well-defined mappings: a particular input always returns a unique output. This model falls short when facing tasks such as:

• Tasks that may fail, such as most IO.
• Tasks that depend on external state, such as (pseudo-)random number generators.
• Non-deterministic tasks, i.e. tasks that may produce multiple outputs for a given input. These are relatively rare in IP.
• Tasks that affect the outside world, such as most write operations.

## Extending the Codomain

• If f may fail, we can extend B to Err(B) ∪ { reasons of failures }, where the reasons of failures may be descriptions of exceptions, or something like a null value. Then f': A -> Err(B) is a well-defined mapping that behaves consistently with f. In fact, this is the Maybe Monad and the Either Monad.
• If f depends on external state, we define Pref(B) as the set of all mappings from the external state space to B; then f': A -> Pref(B) is a well-defined mapping that behaves consistently with f. In other words, for a particular input a, f'(a) returns a function that encodes how to obtain the result from each possible state once a is known. In fact, this is the State Monad.
• If f is non-deterministic, we extend B to Power(B), the power set of B. Then f': A -> Power(B) is a well-defined mapping that behaves consistently with f. In fact, this is the List Monad.
• If f depends on the real world, we extend B to IO(B), whose elements are pseudo-functions with codomain B that may affect the real world. These pseudo-functions fall outside the realm of pure functions, but it is perfectly fine to treat them as values. Then f': A -> IO(B) is a well-defined mapping that behaves consistently with f. In fact, this is the IO Monad.
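A sketch of the first extension in Haskell (safeDiv is an invented example name):

```haskell
-- Division may fail (division by zero); extending the codomain with a
-- failure case makes safeDiv a total, well-defined mapping into Maybe Int.
safeDiv :: Int -> Int -> Maybe Int
safeDiv _ 0 = Nothing           -- the "reason of failure" collapses to Nothing
safeDiv a b = Just (a `div` b)

main :: IO ()
main = do
  print (safeDiv 10 2)  -- Just 5
  print (safeDiv 1 0)   -- Nothing
```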

• Functions of type a -> m b are called monadic functions.
• Functions of type a -> b are called non-monadic functions.

The Functor assumption is easy to satisfy. Of course, we may define several different fmaps, and the resulting Monads will carry different semantics. The join assumption is also easy to satisfy, since m (m a) usually contains the same amount of information as m a. Hence this approach is practically feasible.

A Monad is a type constructor m such that:

• it is a Functor, i.e. there exists fmap :: (a -> b) -> m a -> m b;
• there is a trivial unboxing operation join :: m (m a) -> m a;
• there is a trivial boxing operation return :: a -> m a.

1. return . f == (fmap f) . return (the triviality of return);
2. join . fmap (fmap f) == (fmap f) . join (the triviality of join).

• join . (fmap join) == join . join
• join . fmap return == join . return == id
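These laws can be spot-checked on the list monad, where fmap = map, join = concat and return = (:[]):

```haskell
main :: IO ()
main = do
  let xs   = [1, 2, 3] :: [Int]
      xsss = [[[1], [2, 3]], [[4]]] :: [[[Int]]]
  -- join . (fmap join) == join . join
  print (concat (map concat xsss) == concat (concat xsss))
  -- join . fmap return == id
  print (concat (map (: []) xs) == xs)
  -- join . return == id
  print (concat ((: []) xs) == xs)
```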

## Motivation

Functor solves the problem of mapping regular one-parameter functions into a sub-category, but things are not as easy for functions with more than one parameter.

Let’s consider a function with two parameters f :: a -> b -> c, which can also be read as a -> (b -> c). Applying fmap to f, we get fmap f :: m a -> m (b -> c). There’s still some distance from what we want: f' :: m a -> m b -> m c. To get f', we need a transform from m (b -> c) to m b -> m c. Here we denote it as <*> :: m (b -> c) -> m b -> m c. We will later show that such a transform works uniformly for functions with more parameters.

Now consider a function with three parameters f :: a -> b -> c -> d. We are going to transform it into a wrapped-value version, with the help of fmap and <*>.

Here \a_ b_ c_ -> ((fmap f a_) <*> b_) <*> c_ has the desired type. Most of the time we want to apply the parameters directly rather than obtain the function itself, so the code can simply be written as ((fmap f a) <*> b) <*> c, where a, b and c are wrapped values. Parentheses can be omitted if precedences are set properly, which leads to a neat and easy-to-read form: fmap f a <*> b <*> c.

In Haskell, fmap has an infix name <$>. So finally we get: f <$> a <*> b <*> c.
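For instance, with Maybe as the wrapper:

```haskell
-- A three-parameter function, lifted over Maybe with <$> and <*>.
f :: Int -> Int -> Int -> Int
f x y z = x + y * z

main :: IO ()
main = do
  print (f <$> Just 1 <*> Just 2 <*> Just 3)   -- Just 7
  print (f <$> Just 1 <*> Nothing <*> Just 3)  -- Nothing
```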

## Applicative

Haskell pre-defines a type class Applicative, which captures the pattern of <*>. Any type that implements Applicative works well with <$> and <*>.

Note that an Applicative is also a Functor. Apart from <*>, there are some other helper functions or operators in Applicative.

pure is equivalent to the default value constructor of the type, e.g. (:[]) for List or Just for Maybe. This is handy when lifting an unwrapped value into a wrapped one.

liftA2 transforms a binary operator into the corresponding wrapped version. The function exists because binary operators are frequently passed among higher-order functions.
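For example, with the List and Maybe instances from the Prelude:

```haskell
import Control.Applicative (liftA2)

main :: IO ()
main = do
  print (liftA2 (+) (Just 1) (Just 2))  -- Just 3
  print (liftA2 (+) [1, 2] [10, 20])    -- all pairwise sums: [11,21,12,22]
```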

*> takes two wrapped parameters and simply returns the second one, sequencing the two wrapped values. This is quite useful for Applicatives with action semantics, such as IO. In fact, it’s so useful that Haskell introduces syntactic sugar for sequencing, known as the do-notation: a do block that discards intermediate results is equivalent to a chain of sequencing operators.
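The original side-by-side snippets are lost; a sketch of the equivalence (for IO, do-notation sequencing that ignores results desugars to (>>), which coincides with *>):

```haskell
-- Two wrapped actions, sequenced explicitly...
greet1 :: IO ()
greet1 = putStrLn "hello" *> putStrLn "world"

-- ...and the same actions in do-notation.
greet2 :: IO ()
greet2 = do
  putStrLn "hello"
  putStrLn "world"

main :: IO ()
main = greet1 *> greet2
```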

<* is similar. Both will be reviewed while studying Monad.

## Category Theory

A category has three components:

• A collection of objects.
• A collection of morphisms, each of which maps one object to another.
• A notion of composition of these morphisms, i.e. morphisms can be composed. If f: A -> B and g: B -> C are morphisms, f.g generates a new morphism A -> C.

Note that a morphism carries no specific mapping semantics; it simply links two objects together. Morphisms are also called Arrows.

### Examples

#### Set Category: Set

All sets and standard functions form a category. Functions need not be surjective, since morphisms have no mapping semantics.

#### Group Category: Grp

All groups and homomorphisms between groups form a category. A group has specific algebraic structure, which morphisms should preserve.

### Laws

Three laws must hold in a category:

• Composition must be associative.
• Composition must be closed in the category, i.e. if f: A -> B and g: B -> C, there must be an h: A -> C satisfying h = f . g.
• For each object A, there must exist an identity morphism id(A): A -> A s.t. for every f: A -> B, f = id(A) . f = f . id(B).

Note that:

• There may exist several morphisms between A and B.
• An identity has type A -> A, but a morphism of such a type need not be an identity.
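As a concrete check of the laws in Haskell, whose composition operator (.) is written in the opposite order to the f.g above (g . f means f first, then g):

```haskell
f, g, h :: Int -> Int
f = (* 2)        -- f : A -> B
g = (+ 3)        -- g : B -> C
h = subtract 1   -- h : C -> D

main :: IO ()
main = do
  -- associativity of composition
  print ((h . (g . f)) 5 == ((h . g) . f) 5)
  -- identity laws
  print ((f . id) 5 == f 5 && (id . f) 5 == f 5)
```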

### Functors in Category Theory

A functor maps a category to another category. It consists of two mappings, one for objects and one for morphisms, preserving composition and the category laws.

There’s a natural functor from Grp to Set, which maps each group to its underlying set, and each group homomorphism to the function that behaves the same but is defined on sets instead of groups.

It’s common to create new types that hold values of other types. The List[a] type constructor creates types that hold sequential values of the same type; Maybe[a] creates types that hold operation states (failure, or success with a returned value).

Usually we want the generated types to inherit functions from the types being wrapped. For example, List[Int] should have element-wise addition as Int has addition, and Maybe[Int] should have similar operations without the burden of re-wrapping and unwrapping. Such ‘inheritance’ is concerned only with the structure of types, not with specific functions, so it should be done automatically if possible.

The Haskell language itself forms a category, with all types being objects and functions being morphisms. This category is called Hask.

A parametric type implementing the class Functor is a category functor mapping Hask to one of its sub-categories, where the types m a form the object collection. The type constructor m maps objects, and the specific fmap defined on m maps the corresponding functions.
It’s worth noting that (a -> b) -> m a -> m b can also be read as (a -> b) -> (m a -> m b), as -> is right-associative. This may provide a clearer view of fmap, which takes a regular function in Hask and returns the corresponding function in the sub-category.
fmap (+1) :: List[Int] -> List[Int] generates the element-wise increment in List[Int] (note that fmap (+) would instead produce a wrapped function, since (+) takes two parameters).
fmap (+1) :: Maybe Int -> Maybe Int generates such a function:
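A sketch of that function (using (+1) so the types line up; the name is mine):

```haskell
-- fmap on Maybe applies the function in the Just case and
-- passes Nothing through untouched.
incMaybe :: Maybe Int -> Maybe Int
incMaybe = fmap (+ 1)

main :: IO ()
main = do
  print (incMaybe (Just 41))  -- Just 42
  print (incMaybe Nothing)    -- Nothing
```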