Situation
Since yesterday evening (18 hours ago), `git annex repair --verbose` has been repairing a repository from a remote. This is on a fast machine (i7, 4 physical cores, 8 logical CPUs @ 2.6 GHz, sitting idle; 16 GB RAM, mostly unused; hard drive with a measured 111 MB/s sustained throughput). The `.git` folder being repaired grew to 8 GB while the remote was only about 640 MB.
What `git annex repair` does
Currently, `git annex repair` appears to:
- make a complete local clone from the remotes it finds,
- expand all packs into individual object files,
- then pour (with `rsync`) all those objects into the repository,
- and, I guess, end with a `git fsck`/`gc`/whatever to glue things back.
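For illustration, here is a rough shell analogue of those observed steps. This is not git-annex's actual implementation, and `$REMOTE_URL` and `$REPO` are hypothetical placeholders:

```sh
# Rough analogue of the observed repair steps; not git-annex's real code.
git clone --mirror "$REMOTE_URL" /tmp/fullclone.git       # 1. complete local clone
git init --bare /tmp/exploded.git                         # scratch repo for loose objects
for p in /tmp/fullclone.git/objects/pack/pack-*.pack; do
    git -C /tmp/exploded.git unpack-objects < "$p"        # 2. expand packs into loose objects
done
rsync -a /tmp/exploded.git/objects/ "$REPO/.git/objects/" # 3. pour the objects in
git -C "$REPO" fsck                                       # 4. check the glued-together result
```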
The expected result (a complete repaired repo) is great, but it didn't work without help, and the performance is disappointing.
Suggested room for improvement
I would be willing to contribute some patches; although I have respectable experience in programming, including some functional languages, I'm not savvy enough in Haskell at this time.
(1) A complete clone in this case means between one and two hours of download, and it easily gets interrupted, losing all efforts (just like a plain `git clone`). Actually I tried several times and it never completed. I worked around this by doing a `git clone` on the server, `rsync`ing that to local storage, and adding that locally as a git remote.
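A sketch of that workaround, with hypothetical host and path names:

```sh
# On the server: make a plain clone next to the damaged remote repository.
ssh server 'git clone --mirror /srv/repo.git /tmp/repo-clone.git'

# Pull it down with rsync, which can resume after interruptions (-P).
rsync -aP server:/tmp/repo-clone.git /local/storage/repo-clone.git

# Register the local copy as a remote of the repository being repaired.
git -C /path/to/broken-repo remote add localclone /local/storage/repo-clone.git
```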
(2) Even when a local git remote is available, `git annex repair` tried the network one first instead. Perhaps it would be better if it sorted git remotes and first tried the ones that appear to be available locally (no URL, or a `file://` URL scheme)? Workaround: manually break the transfer so that `git annex repair` switches to the next remote.
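The classification such sorting could rely on might look like this sketch (treating plain paths and `file://` URLs as local is my assumption):

```sh
# List remotes, classifying plain paths and file:// URLs as local.
git remote | while read -r r; do
    url=$(git config --get "remote.$r.url")
    case "$url" in
        /*|file://*) echo "local:   $r  $url" ;;
        *)           echo "network: $r  $url" ;;
    esac
done
```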
(3) Does `git annex repair` really have to explode the repository into individual objects? In my case it took about one hour to create 1,454,978 (about one million four hundred thousand) object files for a total of 6.8 GB. (As a workaround I should have pointed `TEMP` and `TMPDIR` to SSD-based storage, or even dared a tmpfs.) `git-annex` then runs an `rsync` that has been thrashing the disk (I can hear and feel it) for seven and a half hours, with an expected total time of 8 hours 20 minutes. That's a very inefficient way to copy 6.8 GB (incidentally, `rsync` does it in alphabetical order of hash, as shown by `strace` and confirmed by the man page and here and there). There must be a more efficient way, right?
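For reference, the temporary-storage workaround mentioned above amounts to something like this (paths are examples):

```sh
# Point git-annex's temporary files at fast storage before repairing.
mkdir -p /mnt/ssd/tmp                          # or a tmpfs such as /dev/shm/tmp
export TMPDIR=/mnt/ssd/tmp TEMP=/mnt/ssd/tmp
git annex repair --verbose
```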
As a sidenote, I don't know how a repo containing about 300k files jumped to 1400k git objects within the last 2 months.
Any feedback welcome, thanks.
Introduction
There has been some progress. I managed to have `git-annex-repair` somehow succeed while complaining (see details in git-annex-repair claims success then failure). I've had a look at `git-annex-forget`; perhaps I'll use it after I've rebuilt the repo.

Approach: how to speed up git-annex-repair with a warm filesystem cache
With a rather "violent" approach I could have it run in 78 minutes instead of thrashing for tens of hours. The approach is (a combined sketch follows this list):
- copy the `.git` folder (without `.git/annex/objects`; even so, the copied `.git/objects` takes about 9 GB) to an SSD,
- repeatedly make a `tar` of that to keep all the data warm in the filesystem cache (it repeatedly made a 4.3 GB `tar` to `/dev/null` nearly always in under one minute, even down to 7 seconds quite a few times),
- `mkdir /dev/shm/tmp/ ; export TMPDIR=/dev/shm/tmp/ ; export TEMP=/dev/shm/tmp/`,
- run `git annex repair --force`.
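Put together, the whole procedure looks roughly like this; paths are hypothetical, and the background loop is my reading of the repeated `tar` step:

```sh
# 1. Copy the .git folder (minus annexed objects) to an SSD.
rsync -a --exclude=annex/objects /hdd/repo/.git/ /ssd/repo/.git/

# 2. Keep the data warm in the filesystem cache with a background tar loop.
( while true; do tar -C /ssd/repo -cf /dev/null .git; sleep 30; done ) &

# 3. Send temporary files to a tmpfs.
mkdir -p /dev/shm/tmp
export TMPDIR=/dev/shm/tmp TEMP=/dev/shm/tmp

# 4. Run the repair.
cd /ssd/repo && git annex repair --force
```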
This approach is intended to ensure the fastest processing time for `git-annex-repair` by providing it a fully warm filesystem cache. Since no L1, L2, or L3 cache is gigabyte-sized, that effectively means all this runs at RAM speed.
Performance result
That machine, capable of processing (making a `tar` of) the whole repo in 7 seconds, ran the `git-annex-repair` process for 78 minutes (never more than one core busy at a time) and completed. 78 minutes is enough to make between 70 and 600 `tar`s of the full content that git-annex is supposed to repair. IIRC the CPU was not active all the time.

In other words, currently, repairing a repository looks somehow as costly as reading it many tens or hundreds of times. Intuition says it probably doesn't have to be, even considering costs like computations and launching external processes.
Quick analysis
I have seen with `strace` that `git-annex-repair` runs a number of `git show` commands. Actually, most of the processing time appears to be in those `git show` commands.
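For what it's worth, this can be confirmed with something like the following (the trace file name is arbitrary):

```sh
# Trace the child processes git-annex spawns during the repair.
strace -f -e trace=execve -o repair.trace git annex repair --force
grep execve repair.trace | grep 'git.*show' | head
```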
Hunch about what happens
- `git show` (at least sometimes) first walks a lot of (all?) `.git/objects` content.
- `git show` (at least sometimes) spits tens of megabytes of data, including full-text patches of changed files.

Adding `--stat` to the `git show` command makes it much, much quicker.

A few figures
I took one example `git show` command and ran it separately; a measurement sketch follows the list.

- With `--stat`, cold cache: 41 seconds to produce 10 MB.
- With `--stat`, warm cache: 2.5 seconds to produce 10 MB.
- Without `--stat`: 502 seconds to produce 57 MB.
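Timings of this kind can be reproduced with something like the following sketch (`$COMMIT` is a placeholder; dropping the page cache for the cold-cache figure needs root on Linux):

```sh
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches   # cold cache
time git show --stat "$COMMIT" > /dev/null           # cold, with --stat
time git show --stat "$COMMIT" > /dev/null           # warm, with --stat
time git show "$COMMIT" > /dev/null                  # full patch output
```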
Summary
- Does `git-annex-repair` really need to ask git to reconstruct text-diff-style information from compressed data? Is that the source of the bad performance?
- What does `git-annex-repair` actually need in the `git show` output?
- Could `git-annex-repair` be made faster (much faster?) by tuning the way it calls git?