Scalability Issues
I am struggling with very slow operations in my cluster setup and looking for some advice.
Setup
My repository contains 24 million files with a total size of 2.3 terabytes. The files are divided into several subfolders so that the file count per directory stays below 20k. I am using the git-annex cluster setup with one gateway and 16 nodes.
The gateway repository stores a copy of all files.
The distribution of the files across the 16 nodes should be random but deterministic. Therefore, I want to use the filehash property (SHA-256) and store all files whose hash starts with one of the 16 characters 0-f on the corresponding one of the 16 nodes. For example, node-01 stores all files whose filehash starts with 0... and node-16 stores all files whose filehash starts with f... .
Since I did not find a ready-made preferred content expression for this special distribution method, I decided to add the hash as metadata to every annexed file using a pre-commit hook. In the git-annex configuration, I define for each node that it only stores files whose hash starts with one of those characters:
wanted node-01 = metadata=hash=0*
wanted node-02 = metadata=hash=1*
...
wanted node-16 = metadata=hash=f*
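These expressions can be set with git annex wanted; a rough sketch of doing it in one go, assuming node-01 through node-16 are the repository names known to git-annex:

# Sketch: assign each node its hash prefix via a preferred content expression.
# Assumes the nodes are known to git-annex as node-01 ... node-16.
prefixes=(0 1 2 3 4 5 6 7 8 9 a b c d e f)
for i in $(seq 1 16); do
    node=$(printf 'node-%02d' "$i")
    git annex wanted "$node" "metadata=hash=${prefixes[$((i-1))]}*"
done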
Issues
Adding files
While adding more and more files with git-annex, I noticed that (recording state in git...) takes a very long time and my whole memory (8GB) is used by git-annex. Therefore, increasing the annex.queuesize at the expense of more memory is not possible [1].
Is there anything else to adjust? Why does it use so much RAM? Is my number of files out of scope for git?
Committing files
Furthermore, I use the following pre-commit-annex hook to add the filehash as metadata to the files.
However, adding the metadata takes almost one second per file.
I am trying to adjust the script to support batching as well, but maybe there is a different way to distribute the files across the nodes by their filehash?
#!/bin/bash

# Figure out what to diff against: HEAD, or the empty tree for the initial commit.
if git rev-parse --verify HEAD >/dev/null 2>&1; then
    against="HEAD"
else
    # Initial commit: diff against an empty tree object
    against="4b825dc642cb6eb9a060e54bf8d69288fbee4904"
fi

process() {
    local file_path="$1"
    local file_hash
    # The annexed file is a symlink; tail -c 65 keeps the 64-character SHA-256
    # hash at the end of its target (plus readlink's newline, which $() strips).
    file_hash=$(readlink "$file_path" | tail -c 65)
    if [[ -n "$file_hash" ]]; then
        git -c annex.alwayscommit=false annex metadata --set "hash=$file_hash" --quiet -- "$file_path"
    fi
}

if [[ $# -gt 0 ]]; then
    for file_path in "$@"; do
        process "$file_path"
    done
else
    # Read NUL-delimited filenames from the index so unusual filenames are handled safely.
    git diff-index -z --name-only --cached "$against" |
        while IFS= read -r -d '' file_path; do
            process "$file_path"
        done
fi
[1] https://git-annex.branchable.com/scalability/
I would be glad for any feedback.
24 million files is a lot of files for a git repository, and is past the threshold where git generally slows down, in my experience.
With that said, the git-annex cluster feature is fairly new, you're the first person I've heard from using it at scale, and if it has its own scalability problems, I'd like to address that. And I generally think it's cool you're using a cluster for such a large repo!
Have you looked at the "balanced" preferred content expression? It is designed for the cluster use case and picks nodes so content gets evenly balanced among them, without needing the overhead of setting metadata.
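A rough sketch of what that could look like, assuming the 16 nodes are put into a group (the group name is just an example, and the exact balanced= syntax should be checked against the preferred content documentation):

# Sketch: put all 16 nodes into one group and let git-annex balance content
# among them. "clusternodes" is an example group name.
for i in $(seq 1 16); do
    node=$(printf 'node-%02d' "$i")
    git annex group "$node" clusternodes
    git annex wanted "$node" "balanced=clusternodes"
done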
The reason your pre-commit-annex hook script is slow is that running git-annex metadata has to update the .git/annex/index file, which you'll probably find is quite a large file. And git, by default, updates index files by rewriting the whole file. Needing to rewrite the index file is also probably a lot of the slowdown of "(recording state in git...)".
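A quick way to see how large those files have grown (run from the top of the working tree):

# Check the size of git's index and git-annex's index.
ls -lh .git/index .git/annex/index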
There are ways to make git update index files more efficiently, e.g. switching to v4 index files. Enabling split index mode can also help in cases where the index file is being written repeatedly. Do bear in mind that you would want to make these changes both to .git/index and to .git/annex/index.
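A sketch of applying both changes, run from the top of the working tree; pointing GIT_INDEX_FILE at .git/annex/index to reach git-annex's index is my assumption, so verify it on a test repository first:

# Switch the main index to version 4 and enable split index mode.
git update-index --index-version 4
git config core.splitIndex true
git update-index --split-index

# Apply the same settings to git-annex's own index file (assumption:
# GIT_INDEX_FILE can be pointed at .git/annex/index for this).
GIT_INDEX_FILE=.git/annex/index git update-index --index-version 4
GIT_INDEX_FILE=.git/annex/index git update-index --split-index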
Your pre-commit-annex hook is running git-annex metadata once per file, so the index gets updated once per file. Rather than running git-annex metadata in a loop, that can also be sped up by using --batch and feeding it JSON, and it will only need to update the index once.
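An untested sketch of that batch approach; the JSON input shape ({"file": ..., "fields": {"hash": [...]}}) is my reading of the git-annex-metadata man page and should be double-checked:

#!/bin/bash
# Sketch: batch variant of the hook above. Feeds one JSON object per staged
# file into a single git-annex metadata --batch --json process, so
# .git/annex/index only needs to be updated once.

if git rev-parse --verify HEAD >/dev/null 2>&1; then
    against="HEAD"
else
    against="4b825dc642cb6eb9a060e54bf8d69288fbee4904"
fi

git diff-index -z --name-only --cached "$against" |
    while IFS= read -r -d '' file_path; do
        file_hash=$(readlink "$file_path" | tail -c 65)
        # Note: filenames containing " or \ would need proper JSON escaping here.
        if [[ -n "$file_hash" ]]; then
            printf '{"file":"%s","fields":{"hash":["%s"]}}\n' "$file_path" "$file_hash"
        fi
    done |
    git -c annex.alwayscommit=false annex metadata --batch --json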
Thanks a lot, joey, for your help.
I gave it another try without setting the metadata and using a v4 index.
Instead of directly adding all files to the gateway repository, I distributed the files equally across the 16 nodes to make use of their resources. On each node I added its portion of the files to a git-annex repository, in order to merge them later via the gateway repository. Adding the files on each node worked very well using the --jobs="cpus" flag.
However, once I tried to merge all 16 repos using git-annex sync --no-content --allow-unrelated-histories --jobs="cpus", all of the nodes crashed due to out-of-memory during this step:

remote: (merging synced/git-annex bigserver/git-annex into git-annex...)
I assume that you are right and that I simply have too many files.
Unfortunately, I currently cannot spend more time on investigating the issues. Thanks again for your help.