[Unravelling mocona] Part 0 - Preface

The early idea of hsfzxjy/mocona was come up with in late April. It was not until July that I figure out a reasonable design for the project. I finished most of my idea and released the first version approaching August. Nevertheless, there’s no chance for me to share the story behind the library. Now another month gone, it’s time to do some writing.

The series, as I planning, would cover the motivation of creating mocona, some technical details and usage, along with some critical thinking on the creative process. For whom interested in CPython internals or would like to extend the language, it is worth to read through.

Table of Contents


Understanding pickle in Python

The module pickle shipped in Python could be used for generic-purpose object serialization and de-serialization. It’s been widely adopted or recommended as backend in scenarios like persisting states or IPC.

Employed by many famous frameworks, though, the magic behind it still seems to be vague for daily users, especially guys fresh to the language. People come across “unpicklable” errors from time to time, but don’t know the reason; or re-invent state persistence by themselves, even if pickle could be competent. People sometimes write error-prone codes, merely because they are afraid of or unaware of pickle.

This post thus attempts to clarify the usage of pickle module in an easy understanding way, by answering three questions.

Read More


Rough Notes on Deploying Vaultwarden & NextCloud Bookmarks

I’ve been struggling for years on two things: synchronize passwords and blog posts I have read across devices. The problem kills me so much since my devices, an Android mobile, an Ubuntu laptop and an iPad, are less supported by big App companies. Aside, I want to gain control for all my data, so there should better exist a self-hosted solution. The problem are partially solved recently by deploying Vaultwarden and NextCloud on VPS. This blog post dictates the setup process and problems I met, in case anyone searching for this topic.

Install Vaultwarden and NextCloud on VPS

The two services are both luckily dockerized. To install there’s nothing more complicated than a command:

Read More


语言狂热者与实用主义者

编程语言界有着一场旷日持久的争论。人们会为了一种(或一类)语言,或是自己熟用的,或是自己所欣赏的,与他人吵得不可开交。而争论的起点,可能只是某人一句小小的抱怨。各方各为其主,剑拔弩张,俨然一次声势浩大的圣战。

尽管众语言不可一概而论,我们还是可以粗略地将争论的人群分为两个派别:语言狂热者与实用主义者。这是光谱的两个极端。语言狂热者关注语言本身,或是钟情于新的、现代化的语言特性,并据此评判一种语言;实用主义者则侧重于语言的工程实践,常会以语言的生态、业界使用率反驳他人。当然,也不乏两者兼具的人,对双方的意见各持有一定比例。

Read More


Demystify the randomness in CUDA kernels

You might have heard that many CUDA operators contains some kind of non-determinism, and to eliminate the randomness, one must pay for the degradation of performance. The warning occurs many times in blog posts or framework documentation, but few of them give a detailed explanation for the source of randomness. To this end, the post is going to explore the problem.

When talked about GPU computation, one might come up with a notion of some super-fast hardwares. The surprising speed comes from intensive parallelism of the architecture, which allows users to run thousands of routines on parallel (compared to dozens on ordinary CPUs). The routines are called threads, and similar to the concept with the same name in operating systems, they suffer from non-deterministic execution order and data race condition.

Non-deterministic execution order means, if we arrange all instructions of different threads into a sequence, ordered by their occurrence time, the sequence could vary greatly across invocations. If two threads run on parallel, each with a single instruction, we cannot tell which one is executed first. This is the fundamental origin of randomness, and is inevitable.

Data race condition is one of the consequences of non-deterministic execution order. When the threads is manipulating some shared variables, and the manipulation is not atomic, i.e. consists of interruptible instruction sequence, the program might yield undesired results. Programs should be carefully designed to avoid race condition, with the help of locks or atomic operations. To alleviate, CUDA provides atomic arithmetic routines like atomicAdd() or atomicMax() for safe access to shared memory.

By far we have seen that there does exist some kind of randomness inside GPUs, and if not handled properly, our program will give incorrect results when working with shared variables. But one may argue that, we have atomic operations like atomicAdd(). If a program correctly sums up the same collection of numbers, although the order might be messed, it should always returns the same result. Sadly this is wrong, since some arithmetic operations DOES rely on the order of operands! Let’s take the following CUDA program as an example:

Read More


Performant Bulk Mutations in IndexedDB

IndexedDB seems to be inefficient when working on bulk mutations, such as dumping a huge list of items into an object store – at least I think so at the first sight on the MDN docs. It provides no explicit API for the job as SQL does , so all we can do is to loop from client side, which cannot benefit from database internal optimization (if there’s any). The mutation requests, in addition, appear to be spawned sequentially – the tutorial recommends a paradigm to raise a request within the success event callback of the previous request, which is in fact a sequential execution. Such code will be definitely slow.

We may conduct a quick benchmark on the above approach:

;(async () => {
await new Promise((resolve) => {
const r = indexedDB.deleteDatabase("test")
r.onsuccess = r.onerror = resolve
})
const items = Array.from({ length: 100000 }, (_, i) => ({ id: i }))
const store = await new Promise((resolve) => {
indexedDB.open("test", 1).onupgradeneeded = (event) => {
const db = event.target.result
const store = db.createObjectStore("store", { keyPath: "id" })
store.createIndex("id", "id")
resolve(store)
}
})
console.time("bulkAdd")
await bulkAdd(store, items)
console.timeEnd("bulkAdd")
})()

function bulkAdd(store, items) {
const failures = []
return new Promise((resolve) => {
function _perform(idx) {
const req = store.add(items[idx])
req.onsuccess = (event) => {
if (idx === items.length - 1) resolve(failures)
else _perform(idx + 1)
}
req.onerror = (event) => {
failures.push(items[idx].id)
}
}
_perform(0)
})
}

Practically, we concern more about failed records than the ones inserted successfully. We thus take down only the indices of those records, which improves the efficiency at least a little bit.

The timing is rather unstable, but on average, it takes 30~40 seconds to insert 100k records or 2000~3000 records per second, which is not promising.

Read More


Auto Rebuild .pyx Files with pyximport

Modules written in Cython usually comes with a setup.py script that compiles Cython source codes into native shared libary. For whom not so familiar with Python’s packaging and distributing toolchains, such step is sometimes scary, and turns out to be a stumbling block for Cython freshmen. Moreover, the workflow, “run setup.py -> debug -> edit .pyx files -> run setup.py”, is also less convenient and troublesome for fast iterating projects.

pyximport is a handy tool from Cython official, provided to address the above problem. The module enables users to “directly import” .pyx files, with no explicit setup.py required. Let’s start from an example here. Say we have two files residing in the same directory:

# main.py
import pyximport

# hl: begin
pyximport.install(language_level=3)
# hl: end

import foo

print(foo.func(3))
# foo.pyx
cpdef int sqr(int x):
return x * x

The magical highlighted line registers some import hooks to let Python recognize .pyx files. When the .pyx files imported for the first time or modified later, pyximport compiles or re-compiles them behind the scene automatically.

Read More


Cython and Threads

Pure Python sucks in the scene of parallel computing, due to the existence of the Global Interpreter Lock (aka GIL). GIL prevents accessing or manipulating interpreter from different threads concurrently. The mechanism alleviates the risk of race condition, but sequentializes multi-threading program as well. Sadly, there’s no way to release the lock from pure Python.

Alright. So what about beyond pure Python? Shall we bypass the mechanism within an extension? The answer is yes, and that’s what most of scientific computing libaries do.

Cython is a good choice for writing extensions, less verbose, and more similar to Python syntactically. In Cython, one can release GIL temporarily for a code block using the with nogil: syntax. Will it release the true power of multi-core CPU? We should have a try.

Read More


Obtain a Random Available TCP Port with Bash

On Linux, we might sometimes want to choose an unused TCP port randomly. This occurs from time to time on a server, when the administrator wants to expose an HTTP port for a user. Or, you just need an available port for IPC. Let’s make it happen with pure bash scripting.

function unused_port() {
N=${1:-1}
comm -23 \
<(seq "1025" "65535" | sort) \
<(ss -Htan |
awk '{print $4}' |
cut -d':' -f2 |
sort -u) |
shuf |
head -n "$N"
}

We would take apart the function step by step in the following paragraphs.

Read More


Information Theory: KL Divergence

Assume there are two hypotheses $H_1$ and $H_2$, r.v. $X$ ranged in alphabets $\{a_1,\ldots\,a_k\}$. Under hypothesis $H_i$, $X$ has pdf $p(X=a_j|H_i)=p_i(a_j)$. According to Law of Total Probability, we have:

$$ p(H_i|a_k) = \frac{p(H_i)p_i(a_k)}{p_1(a_k)p(H_1)+p_2(a_k)p(H_2)} $$

The formula can be transformed into:

$$ \log \frac{p_2(a_k)}{p_1(a_k)} = \log \frac{p(H_2|a_k)}{p(H_1|a_k)} - \log \frac{p(H_2)}{p(H_1)} $$

which implies that, $\log \frac{p_2(a_k)}{p_1(a_k)}$ equals the difference of log likelihood ratio before and after conditioning $X=a_k$. We define $\log \frac{p_2(a_k)}{p_1(a_k)}$ be the discrimination information for $H_2$ over $H_1$, when $X=a_k$. The expectation of discrimination information is KL divergence, denoted as:

$$D_{KL}(P_2||P_1) = \sum_k p_2(a_k) \log \frac{p_2(a_k)}{p_1(a_k)} $$

which sometimes denoted as $I(p2,p1;X)$, or simply $I(p2,p1)$ if without ambiguity.

KL Divergence can be interpreted as a measure of expected information for $X$ gained after distribution shifted from $p_1$ to $p_2$, where $p_1$ and $p_2$ regarded as prior and post-prior distributions.

Read More