Information Theory: Entropy and Mutual Information

en

Given a discrete r.v. $X$, where $X$ ranged in $\{a_1, \ldots, a_n\}$, $\mathbb{P}(X=a_k)=p_k$. Entropy $H(X)$ is defined as:

$$H(X)= - \sum_k p_k \log p_k$$

When regarded as a function of $\{p_k\}$, entropy satisfies the following properties:

  1. $H(p_1,\ldots,p_n)$ is continuous, and non-negative;
  2. $H(p_1,\ldots,p_n)$ is convex w.r.t. $(p_1,\ldots,p_n)$;
  3. $H(p_1,\ldots,p_n)$ has a unique maxima $(\frac{1}{n},\ldots,\frac{1}{n})$;
  4. $H(n):=H(\frac{1}{n},\ldots,\frac{1}{n})$ increases along with $n$;
  5. $H(p_1,\ldots,p_n)=H(p_1+\ldots+p_k,p_{k+1},\ldots,p_n)+(p_1+\ldots+p_k)H(p_{k+1}',\ldots,p_n')$.

Property 5 is so-called addictivity. That is, if we observe $X$ in two steps, firstly obtaining a value from $\{\hat{a},a_{k+1},\ldots,a_n\}$ and then another value from $\{a_1,\ldots,a_k\}$ if $\hat{a}$ selected, the entropy of the whole system should be sum of these two subsystems.

Note that a function satisfying property 1, 4, 5 must have a form of $H(\vec{p})= - C \sum_k p_k \log p_k$, which reveals that entropy function is unique.

Entropy measures the uncertainty of a random value. Intuitively, entropy reaches its maximum $\log n$ when all alphabets occur with same probability, and likewise has a minimum of $0$ if $p_k=1$ for some $k$.

Entropy also represents the smallest average length to encode a message. Say we have a message consisting of alphabets $a_1,\ldots,a_n$, occurring with probability $p_1,\ldots,p_n$. Now we want to assign a code (an $N$-ary string) to each alphabet, with no two codes sharing a same prefix. The length of the codes are denoted as $l_1,\ldots,l_n$. Shannon’s source coding theroem states that the average code length $\sum_k p_k l_k$ could not be less than $H(p_1,\ldots,p_n)$ (taking $N$ as logarithm base).

READ MORE

Proof of the Gumbel Max Trick

en

Statement

Assume that $\alpha_1, \alpha_2, \ldots, \alpha_n$ satisify $\sum_k\alpha_k=1$. Define

$$Z=\arg\max_k\{\log\alpha_k+G_k\}$$

where $G_1,\ldots,G_n \text{ i.i.d.}\sim Gumbel(0,1)$, whose PDF and CDF are defined as

$$\begin{align} f(x)&=e^{-(x+e^{-x})} \\ F(x)&=e^{-e^{-x}}\end{align}$$

. Then $\mathbb{P}(Z=k)=\alpha_k$.

Proof

Set $u_k=\log{\alpha_k}+G_k$. We prove by direct calculations.

$$\begin{align} \mathbb{P}(Z=k)&=\mathbb{P}(u_k \geq u_j,\forall j \neq k) \\ &=\int_{-\infty}^\infty \mathbb{P}(u_k \geq u_j, \forall j \neq k|u_k)\mathbb{P}(u_k) du_k \\ &=\int_{-\infty}^\infty \prod_{j\neq k}\mathbb{P}(u_k \geq u_j|u_k)\mathbb{P}(u_k) du_k \\ &=\int_{-\infty}^\infty \prod_{j\neq k}e^{-e^{-u_k+\log \alpha_j}} e^{-(u_k-\log\alpha_k+e^{-(u_k-\log\alpha_k)})} du_k \\ &=\int_{-\infty}^\infty e^{-\sum_{j\neq k}\alpha_je^{-u_k}} \alpha_k e^{-(u_k+\alpha_k e^{-u_k})} du_k \\ &=\alpha_k \int_{-\infty}^\infty e^{-u_k-(\alpha_k+\sum_{j\neq k}\alpha_j)e^{-u_k}} du_k \\ &= \alpha_k \end{align}$$.

Application

The trick is commonly used in DL to make sampling over a discrete distribution differentiable.

References

READ MORE

Option::as_ref

en

Let’s consider the following function:

use std::ptr::NonNull;

fn transform<T>(option: &Option<NonNull<T>>) -> Option<&T> {
option.map(|x| unsafe { x.as_ref() })
}

The function transform takes an Option<NonNull<T>> as input, and converts the inner pointer to an immutable reference &T if possible. The method NonNull::as_ref() is marked unsafe so we need an unsafe block. The snippet causes an compilation error:

READ MORE

Rc, RefCell and Interior Mutability

en

Say we need a type Cursor<T> , which holds a mutable reference to T. A method .dup() duplicates the internal reference, wraps it in a new instance of Cursor<T> and returns. Such pattern exists commonly in database driver library. Users could hold multiple cursors simultaneously, with each owning a (mutable) reference to the same connection object.

One might implements with a primitive mutable reference:

struct Cursor<'a, T> {
obj: &'a mut T,
}

impl<'a, T> Cursor<'a, T> {
fn new(t: &'a mut T) -> Cursor<'a, T> {
Cursor { obj: t }
}

fn dup(&mut self) -> Cursor<T> {
Cursor { obj: self.obj }
}
}

fn main() {
let mut i = 1;
let mut cursor_a = Cursor::new(&mut i);
let _cursor_b = cursor_a.dup();
}

Perfect and neat, and luckily Rust compiler did not complain. Fresh Rustanceans would have to work hard for shutting up the compiler, especially when fighting with references.

The invocation of ::new() and .dup() are on separate lines. Now what about to chain up the constructor and .dup()? This time the compiler fails:

READ MORE

Haskell 笔记:State Monad

zh

一个依赖于外部状态 s 的伪函数 f' :: a -> b,我们可以将其改写为 f :: a -> s -> (b, s) 使其良定。即,在输入输出中显式传递状态 s。现在,我们需要利用 Monad 将状态传递过程隐藏起来。

注意到,输出值 (b, s) 中的末状态 s 不仅依赖于输入状态,更依赖于之前更改过状态的一系列函数及其逻辑。因此我们不能简单地将 Monad 定义为 (a, s) 类似的形式,否则两个函数用 >=> 结合的结果将与函数逻辑无关,这与我们的期望不符。

考虑如下定义:

newtype State s a = { runState :: s -> (a, s) }

由于 -> 的右结合性,f :: a -> s -> (b, s)f :: a -> State s b 等价。固定 s,则 State s 可以成为一个 Monad。一个类型为 State s a 的值通常也被称为一个 state processor。

现在尝试定义 (>>=) :: State s a -> (a -> State s b) -> State s b。若 p >>= f,则 p 蕴含了在此之前所有的状态处理逻辑,我们希望将 pf 的逻辑融合在一起,成为一个新的 state processor,并作为返回值。

p >>= f = 
(
State $ \s -> (b, s'')
where
(a, s') = (runState p) s
p2 = f a -- :: State s b
(b, s'') = (runState p2) s'
)

return 是平凡的:

return a = State $ (\s -> (a, s))

fmap 可以作如下定义:

fmap :: (a -> b) -> (State s a) -> (State s b)
fmap f =
(
\pIn -> (
\s -> (b, s')
where
(a, s') = (runState pIn) s
b = f a
)

如此一来,我们可以将一系列的依赖外部状态的函数串成一个依赖外部状态的函数,传以初始状态,便可得到结果。

READ MORE

Haskell 笔记:Monad 引论

zh

动机

pure functions 看似完美,但却不能模拟现实世界中的诸多任务。这是由于 pure functions 是良定的映射,对于特定的输入值会返回唯一的输出。这种模式在面对如下任务时会显得苍白无力:

  • 有可能失败的任务。如大多数的 IO。
  • 依赖外部状态的任务。如(伪)随机数生成器。
  • 非确定性任务,即对于确定的输入可能有多个输出。这种在 IP 中较为少见。
  • 对外界会造成影响的任务。如大多数的写入过程。

这些问题可以用数学中的域扩充技巧来解决。

域扩充

在数学中,当定义问题的范畴不足以容纳问题的解时,我们通常会对相关的范畴进行扩充。类似的技巧同样也可以应用在这里。

假设一个不良定的函数 f: A -> B

  • 如果 f 有可能失败,我们可以将 B 扩充为 Err(B) ∪ { reasons of failures },其中 reasons of failures 可能是对异常的描述,也可以是空值一类的东西。则 f': A -> Err(B) 是良定的映射,且与 f 行为一致。事实上,这就是 Maybe Monad 和 Either Monad。
  • 如果 f 依赖于外部状态,我们定义 Pref(B)从外部状态空间到 B 的映射的全体,则 f': A -> Pref(B) 为良定的映射,且行为和 f 一致。换言之,对于特定的输入 af'(a) 返回一个函数,其中蕴含了已知 a 时如何从各种不同状态得到结果的逻辑。事实上,这就是 State Monad。
  • 如果 f 具有非确定性,我们将 B 扩充为 Power(B),即 B 的幂集。则 f': A -> Power(B) 为良定的映射,且行为与 f 一致。事实上,这就是 List Monad。
  • 如果 f 依赖于真实世界,我们将 B 扩充为 IO(B),其中的元素为一些值域为 B伪函数,可能对真实世界有影响。这些伪函数已经脱离了 pure functions 的范畴,但将它们看成元素是没有问题的。如此一来 f': A -> IO(B) 为良定的映射,且行为与 f 一致。事实上,这就是 IO Monad。

以上操作都有一个共同点,即对一个不良定函数的值域做了扩充,使之变成良定函数。如果用 Haskell 语言描述,它们都有相似的型:f :: a -> m b,其中 m 为扩充规则。

一个问题随之而来:这样的新函数该怎么结合?为此我们要对相关逻辑进行抽象。这就是 Monad。

Monad

这里我们尝试从实际需求出发,导出一个 Type Constructor 成为 Monad 的必要条件。

约定两个名称:

  • a -> m b 型函数为 monadic function
  • a -> b 型函数为 non-monadic function

首先需要解决的是 monadic functions 如何结合的问题。这个问题具有重要的现实意义。monadic function 常常代表某种计算任务,它们之间的结合相当于把若干计算任务串行化,而后者是非常常见的需求。

我们希望有一种运算符有如下的类型 (b -> m c) -> (a -> m b) -> (a -> m c),在此记为 >=> (因其形状,常被叫做 fish operator)。一个自然的想法是,Monad m 需要某种平凡的拆箱操作 extract' :: m a -> a。所谓“平凡”,即 extract' 不应该丢失参数的任何信息。但这往往不能实现,因为 m a 通常会比 a 包含更多的信息,导致 extract' 无法构成良定的映射。例如 Maybe a 中的值 Nothing 就无法在 a 中找到对应的值。

而事实上,我们不需要条件这么强的拆箱操作。在 m 已是 Functor 的情况下,拆箱操作可以弱化为 join :: m (m a) -> m a。我们尝试用 fmapjoin 合成 >=>

f :: b -> m c
g :: a -> m b

fmap f :: m b -> m (m c)
(fmap f) . g :: a -> m (m c)
join . (fmap f) . g :: a -> m c

-- i.e.

f >=> g = join . (fmap f) . g

Functor 的假设是容易成立的。当然我们可以定义多个不同的 fmap,如此产生的 Monad 会有不同的语义。join 的假设也是容易成立的,m (m a) 通常和 m a 包含相同多的信息。故此做法是实际可行的。

我们再考虑 monadic function 和 non-monadic function 结合的问题。期望有如此一个运算:>.> :: (b -> c) -> (a -> m b) -> (a -> m c)。注意,此处返回值是 a -> m c 而不是 a -> c,因为我们不希望 a -> m b 产生的额外信息有所丢失。自然地,我们希望有一个平凡的装箱操作,return :: a -> m a。如此一来便可结合 >=> 完成上面的运算:

f :: b -> c
g :: a -> m b

return . f :: b -> m c
(return . f) >=> g :: a -> m c

-- i.e.

f >.> g :: (return . f) >=> g

non-monadic function 和 monadic function 另一个方向的结合是平凡的。

综上我们可以得到成为 Monad 的基本条件:

  • 是 Functor,存在 fmap :: (a -> b) -> m a -> m b
  • 有一个平凡的拆箱操作 join :: m (m a) -> m a
  • 有一个平凡的装箱操作 return :: a -> m a

为了描述平凡,我们要求三个函数必须满足如下公理(下面的 f 为 non-monadic function):

  1. return . f == (fmap f) . returnreturn 的平凡性)
  2. join . fmap (fmap f) == (fmap f) . joinjoin 的平凡性)

事实上在 Category Theory 中,还有另外两条公理:

  • join . (fmap join) == join . join
  • join . fmap return == join . return == id

以上四条公理描述了 Id(恒等 Functor)、mm^2m^3 之间的泛性质,并使图交换。

Monad Typeclass

以下为 Prelude 中的定义:

class Functor m => m a where

return :: a -> m a
(>>=) :: m a -> (a -> m b) -> m b

此处没有出现 join,也没有 fish operator,而是使用了一个更常用的算符 >>= (通常称为 bind operator)。这是因为在实际中我们不直接将函数结合,而是使用 non-pointfree 的写法。

此外,还有 >> :: m a -> m b -> m b 运算符。return>>=>> 三者是构成 do-notation 的基础。此处不再赘述。

References

READ MORE

Haskell 笔记:Applicative

en

Motivation

Functor solves the problem of mapping regular single-parameter functions into a sub-category, but that’s not easy for functions with more than one parameter.

Let’s consider a function with two parameters f :: a -> b -> c, which can also read as a -> (b -> c). Applying fmap on f will yield fmap f :: m a -> m (b -> c). That’s still distant from what we expect: f' :: m a -> m b -> m c. To get f', we need a transform from m (b -> c) to m b -> m c. Here we denote it as <*> :: m (b -> c) -> m b -> m c. We will later show that such transform is universal for functions with more parameters.

Now consider a function with three parameters f :: a -> b -> c -> d. We are going to transform it into a wrapped-value version, with the help of fmap and <*>.

f :: a -> b -> c -> d

(fmap f) :: m a -> m (b -> (c -> d))

\a_ b_ -> (fmap f a_) <*> b_
:: m a -> m b -> m (c -> d)

\a_ b_ c_ -> ((fmap f a_) <*> b_) <*> c_
:: m a -> m b -> m c -> (m d)

Here \a_ b_ c_ -> ((fmap f a_) <*> b_) <*> c_ is in the desired type. For most of the time, applying parameters directly is actually what we want, instead of the function itself, so the code could simply be written as ((fmap f a) <*> b) <*> c, where a, b and c are wrapped values. Parenthesis could be omitted if precedences are set properly, which leads to a neat and easy-to-read form:

f `fmap` a <*> b <*> c

In haskell, fmap has an infix name <$>. So finally we get: f <$> a <*> b <*> c.

Applicative

Haskell pre-defines a type class Applicative, which captures the pattern of <*>. Any type that implements Applicative works well with <$> and <*>.

class Functor f => Applicative (f :: * -> *) where
pure :: a -> f a
(<*>) :: f (a -> b) -> f a -> f b
GHC.Base.liftA2 :: (a -> b -> c) -> f a -> f b -> f c
(*>) :: f a -> f b -> f b
(<*) :: f a -> f b -> f a

Note that an Applicative is also a Functor. Apart from <*>, there are some other helper functions or operators in Applicative.

pure is equivalent to the default value constructor of f, e.g. (:[]) for List or Just for Maybe. This may be handful when lifting an unwrapped value to a wrapped one.

liftA2 transforms a binary operator to the corresponding version. The function exists as binary operators would be frequently passed among high-order functions.

*> takes two wrapped parameters and simply returns the second one, which sequence up two wrapped values. This is quite useful for Applicative with action semantics, such as IO. In fact, it’s so useful that Haskell introduces a syntax sugar for it, known as the do-notation. Particularly:

do
putStrLn "1"
putStrLn "2"

is equivalent to

putStrLn "1" *> putStrLn "2"

<* is similar. Both will be reviewed while studying Monad.

READ MORE

Haskell 笔记:Category Theory and Functor

en

Category Theory

A category consists of three parts:

  • A collection of objects.
  • A collection of morphisms, each of which map one object to another.
  • A composition operator of these morphisms, i.e. morphisms can be composed. If f: A -> B and g: B -> C are morphisms, f.g generates a new morphism A -> C.

Note that a morphism has no specific semantics of mapping, but simply links two objects together. Morphisms are also called Arrows.

Examples

Set Category: Set

All sets and standard functions form a category. Functions need not to be surjective, since morphisms have no mapping semantics.

Group Category: Grp

All groups and homomorphisms between groups form a category. A group has specific algebaric structure, which morphisms should preserve.

Laws

Three laws that a category should obey:

  • Composition should be associative.
  • Composition operation should be enclosed in the category, i.e. if f: A -> B and g: B -> C, there must be a h: A -> C satisfying h = f . g.
  • For each object A, there should exist an identity morphism id(A): A -> A s.t. for every f: A -> B, f = id(A) . f = f . id(B).

Note that:

  • There may exist serveral morphisms between A and B.
  • An identity has type A -> A, but a morphism with such type needs not to be an identity.

Functors in Category Theory

A functor maps a category to another category. It should contains two mappings for objects and for morphisms, with composition operation and category laws preserved.

There’s a trivial functor from Grp to Set, which maps groups to their underlying sets, and group morphisms to functions with same behavior but defined on sets instead of groups.

Paramateric Types in Haskell

It’s common to create new types that hold values of other types. List[a] type constructor creates types that holds sequential values of same type; Maybe[a] creates types that hold operation states (failure, or success with returned values).

Usually we expect derived types to inherit functions from types being wrapped. For example, List[Int] should have element-wise addition as Int does, and Maybe[Int] should have similar operations with no burden of re-wrapping and unwrapping. Such ‘inheritance’ should be done automatically if possible, since it is only concerned with the structure of types instead of specific functions.

Hask Category

Haskell language itself forms a category, with all types being objects, and functions being morphisms. Such category is called Hask.

Hask Functors

class Functor m where
fmap :: (a -> b) -> m a -> m b

A parameteric type implementing class Functor is a category functor, mapping Hask to one of its sub-category, where types m a are the object collection. The type constructor m maps objects, and specific fmap defined on m maps corresponding functions.

It’s worth noted that (a -> b) -> m a -> m b can also read as (a -> b) -> (m a -> m b), as -> is right-associative. This may provide a clearer view of fmap, which takes a regular function in Hask and returns the corresponding function in sub-category.

Examples:

fmap (+) :: List[Int] -> List[Int] generates element-wise addition in List[Int].

fmap (+) :: Maybe Int -> Maybe Int generates such function:

maybePlus :: Maybe Int -> Maybe Int
maybePlus _ Nothing = Nothing
maybePlus Nothing _ = Nothing
maybePlut (Just x) (Just y) = Maybe (x + y)
READ MORE

Haskell 笔记:data, type, newtype

zh

新类型有自己的 data constructor(literals 可以看成特殊的 data constructor),由这一点来区分是否创建了新类型。

  • data 创建了新类型,可以有多个 data constructor。
  • newtype 创建了新类型,只能有一个 data constructor,同时新类型的内存布局与原来的类型相同。
  • type 没有创建新类型,只是建立了 alias,没有新的 data constructor。

type

常用于语义化类型,是业务逻辑层的概念。

type ID = Int

a = 1 :: ID
b = a + 2 -- legal

showID :: ID -> IO ()
showID x = print x -- legal, since Int has already been an instance of class Show

-- illegal, since Int has already been instantiated
instance Show ID where
-- ...

newtype

在编译期创建新类型,但差异在运行期被抹去。带有一个构造器。

newtype ID' = ID' Int

a = ID' 1
b = a + 2 -- illegal, since Int and ID' are totally different types

showID' :: ID' -> IO ()
showID' x = print x -- illegal, since ID' is not an instance of Show

-- either
showID' (ID' x) = print x
-- or
instance Show ID' where
show (ID' x) = show x
READ MORE

Haskell 笔记:folds

zh

Prelude.foldl

foldl 为 left-associative folding。

foldl :: (b -> a -> b) -> b -> [a] -> b
foldl f acc [] = acc
foldl f acc (x:xs) = foldl f (f acc x) xs

foldl (+) 0 [1..3] 等价于 (((0 + 1) + 2) + 3)

  • 尾递归,因此有 strict 版本 foldl'
  • 求值时必须先到达栈底,遍历完列表,因此无法处理无穷列表

Data.List.foldl’

foldl'foldl 的 TRO 版本。

Prelude.foldr

foldr 为 right-associative folding。

foldr :: (a -> b -> b) -> b -> [a] -> b
foldr f acc [] = acc
foldr f acc (x:xs) = f x (foldr f acc xs)

foldr (+) 0 [1..3] 等价于 (0 + (1 + (2 + 3)))

  • 没有尾递归,有爆栈的危险。
  • 有向右展开的特点,而 Haskell 中许多数据结构都有向右递归的特点(如 Cons),因此可以很好地处理无穷递归的数据,从而更加通用。

Prelude.foldl1 && Prelude.foldr1

Helper functions。将 operator 限制为同一种类型,同时约去 accumulator。

foldl1 :: (a -> a -> a) -> [a] -> a
foldl1 f (x:xs) = foldl f x xs
foldl1 _ [] = error

foldr1 :: (a -> a -> a) -> [a] -> a
foldr1 f (x:xs) = foldr f x xs
foldr1 _ [] = error

即,foldr1 将列表的第一个值作为 accumulator,将剩余部分作为 list,传给 foldrfoldl 同理。

实践

用 folds 实现 reverse

reversel, reverser :: [a] -> [a]
reversel list = foldl (\acc x -> x : acc) [] list

reverser list = foldr (\x acc -> acc ++ [x]) [] list

用 foldr 实现 foldl

先归纳出 foldr 的泛性质。如果一个函数 g s.t.

g [] = v
g (x:xs) = f x (g xs)

g list === foldr f v list.

再看 foldl 的定义:

foldl f v [] = v
foldl f v (x:xs) = foldl f (f v x) xs

===>

foldl f v list = g list v
where
g [] v = v
g (x:xs) v = g xs (f v x)
-- 从左到右依次更新 v

===>

foldl f v list = g list v
where
g [] = id
g (x:xs) = \v -> g xs (f v x)

应有 g (x:xs) === k x (g xs),我们计算 k

g (x:xs)       === k x (g xs)
g (x:xs) v === k x (g xs) v
g xs (f v x) === k x (g xs) v
(g xs) (f v x) === k x (g xs) v
g' (f v x) === k x g' v
k === \x g' v -> g' (f v x)

所以

foldl f v xs =
(foldr
(\x g' v -> g' (f v x))
id
xs
) v
READ MORE