Git sparse-checkout and partial clones for Mega-Repos

2023-05-05 | Memo

Recently I was planning to make a small contribution to the DefinitelyTyped/DefinitelyTyped project. It is a huge repository that hosts the type declaration files for thousands of npm packages, with a structure like the following:

types/
├── 11ty__eleventy-img
├── 1line-aa
├── 3box
├── ...and more of them...

Each sub-folder under the types/ directory corresponds to a specific npm package.

The package I am interested in is marked, whose type declaration files reside in the types/marked/ directory. For this purpose, cloning the whole repository and doing a full checkout would be wasteful, since disk space and network traffic would be eaten up by a bunch of irrelevant files.

Then I read a few lines of instructions in the README, which suggest pairing partial clone with sparse-checkout for a more efficient workflow. Those instructions target git 2.27 or newer and don’t work in my environment, which has git 2.25. After some research I found that the steps in Bring your monorepo down to size with sparse-checkout | GitHub Blog fit my case. If you want to apply the steps in this post, make sure to check which version of git you have.
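Since the commands differ between git releases, a quick version check can save some confusion. The snippet below is only a sketch: version_ge is a hypothetical helper name, and it assumes a sort that supports -V (version sort), as GNU coreutils does.

```shell
# Hypothetical helper: succeeds when version $1 is greater than or equal to version $2.
# Assumes `sort -V` (version sort) is available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# `git --version` prints something like "git version 2.25.1"; keep the third field.
current="$(git --version | awk '{print $3}')"

if version_ge "$current" 2.25.0; then
    echo "git $current: the sparse-checkout command should be available"
else
    echo "git $current: too old for the steps in this post"
fi
```

The 2.25.0 threshold reflects the version this post was written against; adjust it if you follow a different set of instructions.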

To start with, we clone the forked repository with some additional arguments:

$ git clone --filter=blob:none --depth=1 --no-checkout git@github.com:hsfzxjy/DefinitelyTyped

With the --filter=blob:none option, git fetches commits and trees up front but defers downloading file contents (blobs) until they are actually needed, for example at checkout. This helps when the repo contains a large number of files but those of interest make up only a small proportion.

The --depth=1 option creates a shallow clone with the commit history truncated to the latest commit, which speeds up cloning. If you execute git log afterwards, only a single commit is displayed.

And finally, the --no-checkout option tells git not to check out any files, leaving only a .git directory in the working area.
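To see what the three options did, the clone can be inspected afterwards. The following is a minimal sketch rather than an official recipe: describe_clone is a hypothetical helper name, the remote is assumed to be called origin, and the config key remote.origin.partialclonefilter is where the git versions I know of record the filter, so treat that key as an assumption to verify on yours.

```shell
# Hypothetical helper: print a few facts about the clone in the given
# directory (defaults to the current one). Assumes the remote is "origin".
describe_clone() {
    repo="${1:-.}"
    # "true" when the history was truncated by --depth
    echo "shallow: $(git -C "$repo" rev-parse --is-shallow-repository)"
    # number of commits reachable from HEAD
    echo "commits: $(git -C "$repo" rev-list --count HEAD)"
    # filter recorded for the remote by a partial clone
    echo "filter: $(git -C "$repo" config --get remote.origin.partialclonefilter)"
}
```

Running describe_clone DefinitelyTyped right after the clone command above should report a shallow repository with one commit and the blob:none filter.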

With the clone finished, we run a command to initialize the git sparse-checkout configuration:

$ cd DefinitelyTyped/
$ git sparse-checkout init --cone

The --cone option enables “Cone Mode”, which matches whole directories instead of arbitrary patterns and thereby improves performance during checkout. After that, we use git sparse-checkout set <pattern> to bring a specific part of the repository into the working area:

$ git sparse-checkout set types/marked

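Now the needed files should be in place: cone mode always includes the files directly under the root directory, and types/marked/ is checked out in full. (On newer git releases, a clone made with --no-checkout may stay empty at this point until you check out the branch with git checkout.) You can confirm with the tree command: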

$ tree
.
├── azure-pipelines.yml
├── dangerfile.ts
├── LICENSE
├── notNeededPackages.json
├── package.json
├── README.es.md
├── README.it.md
├── README.ja.md
├── README.ko.md
├── README.md
├── README.pt.md
├── README.ru.md
├── README.zh-Hans.md
└── types
    └── marked
        ├── index.d.mts
        ├── index.d.ts
        ├── marked-tests.ts
        ├── OTHER_FILES.txt
        ├── package.json
        ├── tsconfig.json
        ├── tslint.json
        └── v3
            ├── index.d.ts
            ├── marked-tests.ts
            ├── tsconfig.json
            └── tslint.json
3 directories, 24 files
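Besides tree, git itself can report how much of the repository is materialized. The sketch below assumes the sparse index feature is not enabled (it is off by default as far as I know); sparse_summary is a hypothetical helper name.

```shell
# Hypothetical helper: summarize how many tracked files are present in the
# working tree. Pass the clone's path, or run it from inside the clone.
sparse_summary() {
    repo="${1:-.}"
    # Every tracked file has one index entry, checked out or not.
    total="$(git -C "$repo" ls-files | grep -c '' || true)"
    # `git ls-files -t` prefixes each entry with a status tag; "S" marks
    # entries with the skip-worktree bit, i.e. files left out by sparse-checkout.
    skipped="$(git -C "$repo" ls-files -t | grep -c '^S ' || true)"
    echo "checked out: $((total - skipped)) of $total tracked files"
}
```

Note that listing the index does not download any file contents, so this stays cheap even in a blobless clone.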

From here on, daily git operations such as commits and pushes proceed as usual.
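One thing to keep in mind if another package directory is needed later: git sparse-checkout set replaces the whole list of directories rather than extending it. Newer git releases ship a git sparse-checkout add subcommand for this; for older ones, a small wrapper can feed the current list back in. This is a sketch under assumptions: sparse_include is a hypothetical helper name, it must be run inside the clone with cone mode enabled, and types/node in the usage line is only an illustrative second directory.

```shell
# Hypothetical helper: include more directories without dropping existing ones.
# Run inside the clone, with cone mode already initialized.
sparse_include() {
    # Nothing to add, nothing to do.
    [ "$#" -gt 0 ] || return 0
    # Capture the current directory list first, then rewrite it with the additions.
    current="$(git sparse-checkout list)" || return 1
    {
        if [ -n "$current" ]; then printf '%s\n' "$current"; fi
        printf '%s\n' "$@"
    } | git sparse-checkout set --stdin
}

# Usage, from inside the clone:
#   sparse_include types/node
```

With a blobless clone, the contents of the newly included directory are fetched on demand at this point, so the command needs network access.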

Author: hsfzxjy.
License: CC BY-NC-ND 4.0.
All rights reserved by the author.
Commercial use of this post in any form is NOT permitted.
Non-commercial use of this post should be attributed with this block of text.
