Situation
Since yesterday evening (18 hours ago), `git annex repair --verbose` has been repairing a repository from a remote. This is on a fast machine (i7, 4 physical cores, 8 logical CPUs @ 2.6 GHz, sitting idle; 16 GB RAM, mostly unused; hard drive with a measured 111 MB/s sustained throughput). The `.git` folder being repaired grew to 8 GB while the remote was only about 640 MB.
What `git annex repair` does
Currently, `git annex repair` appears to:
- make a complete local clone from the remotes it finds,
- expand all packs into individual object files,
- then pour (with `rsync`) all those objects into the repository,
- and, I guess, end with a `git fsck`/`gc`/whatever to glue things back.
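For illustration, here is a rough shell analogue of those observed steps. This is not git-annex's actual implementation, and `$REMOTE_URL` and `$REPO` are hypothetical placeholders:

```sh
# Rough analogue of the observed repair steps; not git-annex's real code.
git clone --mirror "$REMOTE_URL" /tmp/fullclone.git       # 1. complete local clone
git init --bare /tmp/exploded.git                         # scratch repo for loose objects
for p in /tmp/fullclone.git/objects/pack/pack-*.pack; do
    git -C /tmp/exploded.git unpack-objects < "$p"        # 2. expand packs into loose objects
done
rsync -a /tmp/exploded.git/objects/ "$REPO/.git/objects/" # 3. pour the objects in
git -C "$REPO" fsck                                       # 4. check the glued-together result
```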
The expected result (a complete repaired repo) is great, but it didn't work without help, and the performance is disappointing.
Suggested room for improvement
I would be willing to contribute some patches; although I have respectable experience in programming, including some functional languages, I'm not savvy enough in Haskell at this time.
(1) A complete clone in this case means between one and two hours of download, and it easily gets interrupted, losing all efforts (just like a plain `git clone`). Actually I tried several times and it never completed. I worked around this by doing a `git clone` on the server, `rsync`ing that to local storage, and adding that locally as a git remote.
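A sketch of that workaround, with hypothetical host and path names:

```sh
# On the server: make a plain clone next to the damaged remote repository.
ssh server 'git clone --mirror /srv/repo.git /tmp/repo-clone.git'

# Pull it down with rsync, which can resume after interruptions (-P).
rsync -aP server:/tmp/repo-clone.git /local/storage/repo-clone.git

# Register the local copy as a remote of the repository being repaired.
git -C /path/to/broken-repo remote add localclone /local/storage/repo-clone.git
```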
(2) Even when a local git remote is available, `git annex repair` tried the network one first instead. Perhaps it would be better if it sorted git remotes and first tried the ones that appear to be available locally (no URL, or a `file://` URL scheme)? Workaround: manually break the transfer so that `git annex repair` switches to the next remote.
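The classification such sorting could rely on might look like this sketch (treating plain paths and `file://` URLs as local is my assumption):

```sh
# List remotes, classifying plain paths and file:// URLs as local.
git remote | while read -r r; do
    url=$(git config --get "remote.$r.url")
    case "$url" in
        /*|file://*) echo "local:   $r  $url" ;;
        *)           echo "network: $r  $url" ;;
    esac
done
```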
(3) Does `git annex repair` really have to explode the repository into individual objects? In my case it took about one hour to create 1,454,978 (about one million four hundred thousand) object files for a total of 6.8 GB. (As a workaround I should have pointed `TEMP` and `TMPDIR` to SSD-based storage, or even dared a tmpfs.) `git-annex` then runs an `rsync` that has been thrashing the disk (I can hear and feel it) for seven and a half hours, with an expected total time of 8 hours 20 minutes. That's a very inefficient way to copy 6.8 GB (incidentally, `rsync` does it in alphabetical order of hash, as shown by `strace` and confirmed by the man page and here and there). There must be a more efficient way, right?
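For reference, the temporary-storage workaround mentioned above amounts to something like this (paths are examples):

```sh
# Point git-annex's temporary files at fast storage before repairing.
mkdir -p /mnt/ssd/tmp                          # or a tmpfs such as /dev/shm/tmp
export TMPDIR=/mnt/ssd/tmp TEMP=/mnt/ssd/tmp
git annex repair --verbose
```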
As a sidenote, I don't know how a repo containing about 300k files jumped to 1400k git objects within the last 2 months.
Any feedback welcome, thanks.
Introduction
There has been some progress. I managed to have `git-annex-repair` somehow succeed while complaining (see details in git-annex-repair claims success then failure). I've had a look at `git-annex-forget`; perhaps I'll use it after I've rebuilt the repo.

Approach: how to speed up git-annex-repair with a warm filesystem cache
With a rather "violent" approach I could have it run in 78 minutes instead of thrashing for tens of hours. The approach is (a combined sketch follows this list):
- copy the `.git` folder (without `.git/annex/objects`; even so, the copied `.git/objects` takes about 9 GB) to an SSD,
- repeatedly make a `tar` of that to keep all the data warm in the filesystem cache (it repeatedly made a 4.3 GB `tar` to `/dev/null` nearly always in under one minute, even down to 7 seconds quite a few times),
- `mkdir /dev/shm/tmp/ ; export TMPDIR=/dev/shm/tmp/ ; export TEMP=/dev/shm/tmp/`,
- run `git annex repair --force`.
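Put together, the whole procedure looks roughly like this; paths are hypothetical, and the background loop is my reading of the repeated `tar` step:

```sh
# 1. Copy the .git folder (minus annexed objects) to an SSD.
rsync -a --exclude=annex/objects /hdd/repo/.git/ /ssd/repo/.git/

# 2. Keep the data warm in the filesystem cache with a background tar loop.
( while true; do tar -C /ssd/repo -cf /dev/null .git; sleep 30; done ) &

# 3. Send temporary files to a tmpfs.
mkdir -p /dev/shm/tmp
export TMPDIR=/dev/shm/tmp TEMP=/dev/shm/tmp

# 4. Run the repair.
cd /ssd/repo && git annex repair --force
```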
This approach is intended to ensure the fastest processing time for `git-annex-repair` by providing it a fully warm filesystem cache. Since no L1, L2, or L3 cache is gigabyte-sized, that effectively means all this runs at RAM speed.
Performance result
That machine, capable of processing (making a `tar` of) the whole repo in 7 seconds, ran the `git-annex-repair` process for 78 minutes (never more than one core busy at a time) and completed. 78 minutes is enough to make between 70 and 600 `tar`s of the full content that git-annex is supposed to repair. IIRC the CPU was not active all the time.

In other words, currently, repairing a repository looks somehow as costly as reading it many tens or hundreds of times. Intuition says it probably doesn't have to be, even considering costs like computations and launching external processes.
Quick analysis
I have seen with `strace` that `git-annex-repair` runs a number of `git show` commands. Actually, most of the processing time appears to be in those `git show` commands.
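For what it's worth, this can be confirmed with something like the following (the trace file name is arbitrary):

```sh
# Trace the child processes git-annex spawns during the repair.
strace -f -e trace=execve -o repair.trace git annex repair --force
grep execve repair.trace | grep 'git.*show' | head
```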
Hunch about what happens
- `git show` (at least sometimes) first walks a lot of (all?) `.git/objects` content.
- `git show` (at least sometimes) spits tens of megabytes of data, including full-text patches of changed files.

Adding `--stat` to the `git show` command makes it much, much quicker.

A few figures
I took one example `git show` command and ran it separately; a measurement sketch follows the list.

- With `--stat`, cold cache: 41 seconds to produce 10 MB.
- With `--stat`, warm cache: 2.5 seconds to produce 10 MB.
- Without `--stat`: 502 seconds to produce 57 MB.
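Timings of this kind can be reproduced with something like the following sketch (`$COMMIT` is a placeholder; dropping the page cache for the cold-cache figure needs root on Linux):

```sh
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches   # cold cache
time git show --stat "$COMMIT" > /dev/null           # cold, with --stat
time git show --stat "$COMMIT" > /dev/null           # warm, with --stat
time git show "$COMMIT" > /dev/null                  # full patch output
```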
Summary
- Does `git-annex-repair` really need to ask git to reconstruct text-diff-style information from compressed data? Is that the source of the bad performance?
- What does `git-annex-repair` actually need in the `git show` output?
- Could `git-annex-repair` be made faster (much faster?) by tuning the way it calls git?