About two years ago I was an intern at MSRA, where I joined a research project and started developing on top of a large codebase. It is common practice in ML research to adopt an existing code repository as a codebase instead of crafting everything from scratch. Such codebases usually come with convenient “infrastructure”, so researchers do not have to implement it again, which would be time-consuming and error-prone. All we need to do is write our models and losses and plug them into experiments.

The flow works just fine if you are proposing a minor improvement to an algorithm. The codebase provides an easy way to test and iterate on your idea. But things get worse if your work goes beyond that, especially when it touches the encapsulated infrastructure. Those convenient parts then constrain you and force your code into spaghetti.

At that time we were working on a new algorithm for the image segmentation problem. The algorithm proposed a pipeline totally different from previous ones. To match it we had to introduce a new data preprocessor as well as a new training scheme. The codebase, however, was designed for previous algorithms and presumed a traditional pipeline. It was as solid as a rock, and we could hardly fit our customizations in.

We kept stuffing dozens of lines of code into the codebase. Most of it was badly designed, repetitive and tightly coupled. We were desperately trying to meet a conference deadline, putting every effort into finding the optimal setting. Any refactoring not directly tied to the experiments was considered time-consuming and risky. The development went on for months, and the code finally grew into a giant, terrifying monster. Here I would like to share two issues we encountered.


The first one arises in pairing a model with its corresponding data loader. A traditional segmentation algorithm takes images (img) as input and is supervised by ground truth segmentation maps (gtseg). The data loader in the codebase therefore defaults to yielding a tuple of (img, gtseg) for each training iteration. Our method, in contrast, expects two other kinds of supervision, gtdist and gtoffset, which require totally different logic for loading and pre-processing.

Okay. So now we have two kinds of data loader: one for traditional methods, another for our method. We reserve a configuration entry loader_type for selecting a specific loader. The configuration is first passed to a class Trainer, then to a DataLoaderBuilder that instantiates the chosen loader.

The class Trainer is fundamental in our program. It takes charge of instantiating all the main components, and maintains the logic of the training loop and evaluation. The design presents a hierarchy like

[Figure: hierarchy graph. user -> Trainer (loader_type); Trainer -> DataLoaderBuilder (loader_type); Trainer -> Model]

This is fine when there are only two kinds of loader. But things got complicated as the experiments proceeded. Over the months we tried dozens of model designs in search of an optimal one. Some of them had to be fed a combination of inputs different from the two above. More loaders popped up to support those models. We began to make mistakes, since it was a tedious nightmare to keep loader_type in sync with the model in each configuration file.
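The setup described above can be sketched roughly as follows. This is only an illustration under assumptions: the loader functions, the registry dictionary, the OurModel class and the config keys are hypothetical names, not the actual codebase. It shows how nothing ties loader_type to the model, so a stale entry only fails once training starts.

```python
from typing import Callable, Dict, Iterator


# Hypothetical loaders: each yields tuples of a different shape.
def seg_loader() -> Iterator[tuple]:
    yield ("img", "gtseg")


def dist_offset_loader() -> Iterator[tuple]:
    yield ("img", "gtdist", "gtoffset")


class DataLoaderBuilder:
    # Assumed registry mapping the loader_type entry to a loader factory.
    _registry: Dict[str, Callable[[], Iterator[tuple]]] = {
        "seg": seg_loader,
        "dist_offset": dist_offset_loader,
    }

    def build(self, loader_type: str) -> Iterator[tuple]:
        return self._registry[loader_type]()


class OurModel:
    def forward(self, data_batch: tuple) -> str:
        # The model silently assumes the three-item batch layout.
        img, gtdist, gtoffset = data_batch
        return "ok"


class Trainer:
    def __init__(self, config: dict) -> None:
        # The model and the loader are instantiated independently;
        # nothing checks that loader_type matches the model.
        self.model = OurModel()
        self.data_loader = DataLoaderBuilder().build(config["loader_type"])

    def train(self) -> list:
        return [self.model.forward(batch) for batch in self.data_loader]
```

With `{"loader_type": "dist_offset"}` the loop runs, while `{"loader_type": "seg"}` raises a ValueError inside OurModel.forward() when the two-item tuple is unpacked. The mismatch is invisible in the configuration file itself.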


The second is a rather common problem in training models. Say you have designed a multi-stage training pipeline in which you would like the model to switch its behavior at some point. In the first X iterations, we disable a component A of the model for warming up; after that, it is enabled again for normal training. The catch is: how do we make a deeply nested component aware of the iteration number?

Back to our codebase. We had a Trainer in charge of everything. It starts a training loop, in which the iteration number lives as a local variable. It also holds a reference to the model. The model has a hierarchical structure, and component A hides deep inside some of its layers.

    class Trainer:
        model: "Model"
        def train(self):
            for iter_num, data_batch in enumerate(self.data_loader):
                self.model.forward(data_batch)
                ...

    class Model:
        A: "ComponentA"
        def forward(self, data_batch):
            ...
            self.A.forward(data_batch)

    class ComponentA:
        def forward(self, data_batch):
            # How can I access iter_num?
            ...

This was implemented in a crude way at the time: we added a second argument to both Model.forward() and ComponentA.forward(), and passed iter_num down along the path.

    class Trainer:
        model: "Model"
        def train(self):
            for iter_num, data_batch in enumerate(self.data_loader):
                self.model.forward(data_batch, iter_num)
                ...

    class Model:
        A: "ComponentA"
        def forward(self, data_batch, iter_num):
            ...
            self.A.forward(data_batch, iter_num)

    class ComponentA:
        def forward(self, data_batch, iter_num):
            # iter_num is finally accessible here
            ...

Jesus, it is dirty. The argument passing “contaminates” every function it goes through. Whether they need it or not, they have to accept an extra argument. What if more components want to access the state? What if more states need to be passed? Every single change would require modifying a large area of code. Nobody would like that. At least I wouldn’t.


Now let’s move to a higher level for some deeper thoughts. In the first example, we chose to instantiate the model and the data loader separately. The crux is that they are not independent components. The choice of model decides what the input data looks like, and therefore determines the type of loader. We in fact have a graph like

[Figure: dependency graph. Trainer -> Model (initiate); Trainer -> DataLoaderBuilder (initiate); Model -> DataLoaderBuilder (loader_type)]

Ideally, DataLoaderBuilder should “contact” Model to obtain the information required for building the loader. But we couldn’t, due to the limitation of the hierarchy. The only possible path for message passing is Model -> Trainer -> DataLoaderBuilder. That, however, turns Trainer into a “god object” that passes messages around between its children. Having a god object is considered bad practice: components become tightly coupled to their parents, and maintenance becomes difficult. The second example is similar, except that we are making Model the broker between Trainer and ComponentA.

A more general version of the problem: in a system with a tree-like hierarchical structure, how should two non-adjacent components communicate?

[Figure: tree of components. A -> B; A -> C; C -> D; B -> D (??)]

It is not a novel research problem, but one already addressed in practice. Following the single-responsibility principle, we can use a standalone service responsible for managing the communication. This is much more common in modern web development, since web components are usually organized in a tree and pass messages more frequently. Mature, production-ready solutions exist, such as the event bus pattern or centralized state management, which are all instances of this design. Instead of relying on the target (or the path to the target), the components now depend only on the service object, and the system becomes less coupled.
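To make the idea concrete, here is a minimal sketch of an event bus applied to the warm-up example. It rests on assumptions: the EventBus class, the "iter_start" event name and the warmup_iters parameter are made up for illustration, and this is not the API of any particular library. Trainer announces the iteration number on the bus, ComponentA subscribes to it, and none of the forward() signatures has to change.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, Iterable, List


class EventBus:
    """Standalone service: components depend on it, not on each other."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Callable[[Any], None]]] = defaultdict(list)

    def on(self, event: str, handler: Callable[[Any], None]) -> None:
        self._handlers[event].append(handler)

    def emit(self, event: str, payload: Any) -> None:
        for handler in self._handlers[event]:
            handler(payload)


class ComponentA:
    def __init__(self, bus: EventBus, warmup_iters: int) -> None:
        self.warmup_iters = warmup_iters
        self.enabled = False
        # Subscribe once; the iteration number arrives via the bus.
        bus.on("iter_start", self._on_iter_start)

    def _on_iter_start(self, iter_num: int) -> None:
        self.enabled = iter_num >= self.warmup_iters

    def forward(self, data_batch: int) -> int:
        # Toy behavior: the component doubles its input once enabled.
        return data_batch * 2 if self.enabled else data_batch


class Model:
    def __init__(self, bus: EventBus, warmup_iters: int) -> None:
        self.A = ComponentA(bus, warmup_iters)

    def forward(self, data_batch: int) -> int:
        return self.A.forward(data_batch)  # no iter_num argument needed


class Trainer:
    def __init__(self, bus: EventBus, model: Model, data_loader: Iterable[int]) -> None:
        self.bus, self.model, self.data_loader = bus, model, data_loader

    def train(self) -> List[int]:
        outputs = []
        for iter_num, data_batch in enumerate(self.data_loader):
            self.bus.emit("iter_start", iter_num)  # broadcast the state
            outputs.append(self.model.forward(data_batch))
        return outputs
```

For instance, with `bus = EventBus()` and `Trainer(bus, Model(bus, warmup_iters=2), [1, 1, 1, 1]).train()`, the first two batches pass through unchanged and the later ones are doubled. The bus still has to reach ComponentA once at construction time, but that is a single dependency on the service rather than an extra argument threaded through every call.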

So why don’t we use these techniques? Well, if some libraries integrate them, we are glad to try; if not, we have to implement them ourselves, but sorry, we are running out of time.

Programmers in production teams care more about coupling, because otherwise maintenance becomes painful. They apply every best practice and design pattern they can find in textbooks or blog posts.

But as researchers, we might have taken a software engineering course, yet we seldom apply it. I’ve skimmed a lot of code released for papers on GitHub, and in most of it the logic for building the model, loading data and training is tightly coupled, fragile and leaves little room for extension. Such code might be enough to showcase the paper, but it is far from a good codebase. Yet sometimes we have no choice but to build upon it. There are indeed people making an effort to build well-designed and easy-to-extend codebases, but apparently they cannot cover every extension developers demand. From time to time we are limited by our codebase, whether badly designed or over-designed. What’s worse, we have a deadline ahead, and to rush the idea out we practice many anti-patterns: communicating via global variables or a god object, duplicating logic here and there, or writing meaningless boilerplate code. The codebase eventually grows into spaghetti. Badly-cooked spaghetti.

It was then that I began to think about why production practices can hardly be applied to a research project. The answer is that a research project is not production-ready; it evolves and iterates rapidly without a fixed target, more like a prototype. A prototype might grow into a product, but a research project won’t; it mostly ends after some paper deadlines. The dogma of design patterns is too verbose, and sometimes complicated. Researchers seldom use them, and reach for easy-to-use but dirty hacks and tricks instead.

And that’s the background of hsfzxjy/mocona. It implements patterns such as Dependency Injection and Event Emitter to address the problem of communication between components. The library is deliberately designed to be “magical”, that is, to do most of the heavy work behind the scenes while exposing a very simple interface or (self-made) “syntax” to users. It is evil and an anti-pattern to be implicit and magical in Python. But there are people who are tired of, or even more afraid of, verbosity, and who are willing to write even worse code to avoid it. If the library can help, they would be glad to make a trade-off between verbosity and anti-pattern.