A Flaw of Promoting Complex Trait Bounds in Rust

Days ago, for some reason, I was trying to implement a function that can polymorphize over its return type. The solution is simple, but my brain was jammed at that time, trapped in some complicated typing tricks for hours.

During the struggling, I coincidently ran into something that is temporarily a flaw in the current Rust compiler implementation. In some cases, the compiler is not smart enough to promote known trait bounds, and we have to replicate them again and again. Although the problem is afterwards proved to be a useless “X-Y Problem”, I would still like to share the story.

Read More

Initialize Process Pool Worker with Individual Value

There could be scenes when you are using multiprocessing.pool.Pool and you want to perform some initialization for each worker before tasks are scheduled via Pool.map() or something alike. For example, you create a pool of 4 workers, each for one GPU, and expect tasks scheduled on Worker-i to precisely utilize GPU-i. In this case, Worker-i should be initialized with env var CUDA_VISIBLE_DEVICES=<i> set.

To initialize spawned workers, the constructor of Pool provides two arguments concerning the job 1initializer and initargs. initializer is expected to be a callable, and if specified, each worker process will call initializer(*initargs) when it starts.

import multiprocessing as mp
import multiprocessing.pool as mpp

def worker(arg1):

mpp.Pool(processes=2, initializer=worker, initargs=(42, ))
# 42
# 42

This is, however, slightly away from what we expect. The initializer is called with same arguments in each worker, while in our case, the arguments are expected to be different, like value 1 for Worker-0 and value 1 for Worker-1. There are two approaches to do the tricks.

Use a Queue

Queue and SimpleQueue types in module multiprocessing 2 implement multi-producer, multi-consumer FIFO queues under the multi-processing scenario. We may create and share a queue among parent and worker processes, send individual values from parent processes and read them from workers. Since the sending and receiving operations are synchronized, we won’t run into any race conditions.

def worker(q):

q = mp.SimpleQueue()
p = mpp.Pool(processes=2, initializer=worker, initargs=(q,))
for i in range(2):
# 0
# 1

Use a Value

Alternatively, we may use a lighter shared object other than a queue. The Value type in module multiprocessing 3 allows sharing simple values across multiple processes. It can also synchronize accesses to values to avoid race conditions if necessary. We can use a Value object to allocate an individual id for each worker process.

def worker(v):
with v.get_lock():
val = v.value
v.value += 1

v = mp.Value(ctypes.c_int32, 0, lock=True)
p = mpp.Pool(processes=2, initializer=worker, initargs=(v,))
# 0
# 1

Rust - Python FFI From Scratch

I was recently working on a side project that involves communication between binaries written in Rust and web interfaces written in Python. Moving a part of my project onto a language like Rust is under several considerations: 1) the logic is all about manipulating byte arrays, where Python has deficit and system language like Rust is superior; 2) the logic happens to be complicated, I need a static type system to ensure the correctness, and also the match expression of Rust is found helpful in getting things concise; 3) I was planning to develop CLI tools with Rust, which calls this fraction of functionality, and I don’t want to rewrite the stuff in the future.

Read More

[Extending Hexo For My Site] Part 1 - Better Mathjax Rendering

I am a heavy user of Mathjax. Mathjax is a library that renders Tex-compatible syntax into pretty equations in web scenarios. Hence I am always mixing up Markdown and Tex snippets in my writing. The annoying part is Tex snippets have low priority in my Markdown renderer, and are sometimes incorrectly rendered into Markdown elements. For instance, $a_1, a_2$ becomes $a1, a2$, where underscores within $...$ are mistakenly recognized as an emphasis element. A bunch of escaping is required to avoid the situation, which drives me mad. So I got to seek a permanant solution.

Read More

Debug a 'torch.tensor(1).cuda()' hanging

Today a user of our GPU cluster ran into a problem where executing python -c 'import torch; torch.tensor(1).cuda() would hang forever and could not be killed. The problem occured on a rather old Docker image (with torch == 0.4.0), and would disappear if newer images were used. It was caused by some far less known coincidents, which surprised me and I want to share in this post.

The Problem

The hanging program is spawned by following command:

/usr/bin/docker run --rm -u 1457:1457 \
--gpus '"device='0,1,2,3'"' \
-v /ghome/username:/ghome/username -v /gdata/username:/gdata/username \
-it --ipc=host --shm-size 64G \
-v /gdata1/username:/gdata1/username -v /gdata2/username:/gdata2/username \
-e HOME=/ghome/username \
-m 48G --memory-swap 48G --cpus 5 \
--name username2 \
bit:5000/deepo_9 \
python3 -c 'import torch; torch.tensor(1).cuda()'

the Docker image bit:5000/deepo_9 he used was built with CUDA-9, while the host has multiple 1080Ti GPU cards and CUDA upgraded to 11.4. Looks like there’s some binary incompatibility, considering the fact that the problem would gone with newer images.

Read More

[Unravelling mocona] Part 1 - Verbosity or Anti-Pattern

I was once working as an intern at MSRA around two years ago, at which I joined a research project and started developing upon a large codebase. It’s a practice in ML research fields to adopt an existing code repository as codebase, instead of crafting everything from scratch. Such codebases usually come with convenient “infrastructures” , so researchers would not have to implement them once again, which could be time-wasting and error-prone. All we need is to write our models and losses, and put them into experiments.

The flow works just fine if you are proposing minor improvement on algorithms. The codebase provides an easy approach to prove and iterate your idea. But things would get worse if your work goes beyond it, especially touching the encapsulated infrastructures. Those convenient parts would constraint you and enforce your code into spaghetti.

Read More

Understanding pickle in Python

The module pickle shipped in Python could be used for generic-purpose object serialization and de-serialization. It’s been widely adopted or recommended as backend in scenarios like persisting states or IPC.

Employed by many famous frameworks, though, the magic behind it still seems to be vague for daily users, especially guys fresh to the language. People come across “unpicklable” errors from time to time, but don’t know the reason; or re-invent state persistence by themselves, even if pickle could be competent. People sometimes write error-prone codes, merely because they are afraid of or unaware of pickle.

This post thus attempts to clarify the usage of pickle module in an easy understanding way, by answering three questions.

Read More

Rough Notes on Deploying Vaultwarden & NextCloud Bookmarks

I’ve been struggling for years on two things: synchronize passwords and blog posts I have read across devices. The problem kills me so much since my devices, an Android mobile, an Ubuntu laptop and an iPad, are less supported by big App companies. Aside, I want to gain control for all my data, so there should better exist a self-hosted solution. The problem are partially solved recently by deploying Vaultwarden and NextCloud on VPS. This blog post dictates the setup process and problems I met, in case anyone searching for this topic.

Install Vaultwarden and NextCloud on VPS

The two services are both luckily dockerized. To install there’s nothing more complicated than a command:

Read More

Demystify the randomness in CUDA kernels

You might have heard that many CUDA operators contains some kind of non-determinism, and to eliminate the randomness, one must pay for the degradation of performance. The warning occurs many times in blog posts or framework documentation, but few of them give a detailed explanation for the source of randomness. To this end, the post is going to explore the problem.

When talked about GPU computation, one might come up with a notion of some super-fast hardwares. The surprising speed comes from intensive parallelism of the architecture, which allows users to run thousands of routines on parallel (compared to dozens on ordinary CPUs). The routines are called threads, and similar to the concept with the same name in operating systems, they suffer from non-deterministic execution order and data race condition.

Non-deterministic execution order means, if we arrange all instructions of different threads into a sequence, ordered by their occurrence time, the sequence could vary greatly across invocations. If two threads run on parallel, each with a single instruction, we cannot tell which one is executed first. This is the fundamental origin of randomness, and is inevitable.

Data race condition is one of the consequences of non-deterministic execution order. When the threads is manipulating some shared variables, and the manipulation is not atomic, i.e. consists of interruptible instruction sequence, the program might yield undesired results. Programs should be carefully designed to avoid race condition, with the help of locks or atomic operations. To alleviate, CUDA provides atomic arithmetic routines like atomicAdd() or atomicMax() for safe access to shared memory.

By far we have seen that there does exist some kind of randomness inside GPUs, and if not handled properly, our program will give incorrect results when working with shared variables. But one may argue that, we have atomic operations like atomicAdd(). If a program correctly sums up the same collection of numbers, although the order might be messed, it should always returns the same result. Sadly this is wrong, since some arithmetic operations DOES rely on the order of operands! Let’s take the following CUDA program as an example:

Read More

Performant Bulk Mutations in IndexedDB

IndexedDB seems to be inefficient when working on bulk mutations, such as dumping a huge list of items into an object store – at least I think so at the first sight on the MDN docs. It provides no explicit API for the job as SQL does , so all we can do is to loop from client side, which cannot benefit from database internal optimization (if there’s any). The mutation requests, in addition, appear to be spawned sequentially – the tutorial recommends a paradigm to raise a request within the success event callback of the previous request, which is in fact a sequential execution. Such code will be definitely slow.

We may conduct a quick benchmark on the above approach:

;(async () => {
await new Promise((resolve) => {
const r = indexedDB.deleteDatabase("test")
r.onsuccess = r.onerror = resolve
const items = Array.from({ length: 100000 }, (_, i) => ({ id: i }))
const store = await new Promise((resolve) => {
indexedDB.open("test", 1).onupgradeneeded = (event) => {
const db = event.target.result
const store = db.createObjectStore("store", { keyPath: "id" })
store.createIndex("id", "id")
await bulkAdd(store, items)

function bulkAdd(store, items) {
const failures = []
return new Promise((resolve) => {
function _perform(idx) {
const req = store.add(items[idx])
req.onsuccess = (event) => {
if (idx === items.length - 1) resolve(failures)
else _perform(idx + 1)
req.onerror = (event) => {

Practically, we concern more about failed records than the ones inserted successfully. We thus take down only the indices of those records, which improves the efficiency at least a little bit.

The timing is rather unstable, but on average, it takes 30~40 seconds to insert 100k records or 2000~3000 records per second, which is not promising.

Read More