Hi,
git-annex seems to have a problem with many files.
A sync operation takes 8 hours. Here is a sample run syncing to my local remote server (sbackup):
start at 20:12 / end at 04:13 / total time = ~ 8 hours
git annex sync sbackup
[2015-04-13 20:12:26 CEST] call: git ["--git-dir=.git","--work-tree=.","push","sbackup","+git-annex:synced/git-annex","master:synced/master"]
Counting objects: 792155, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (789727/789727), done.
Writing objects: 100% (792155/792155), 75.73 MiB | 2.35 MiB/s, done.
Total 792155 (delta 449604), reused 1 (delta 0)
To partage@192.168.253.53:/data/samba/git-annex/docshare
ae182f0..fad3aca git-annex -> synced/git-annex
e0e67fe..5226a6f master -> synced/master
[2015-04-14 04:13:05 CEST] read: git ["--git-dir=.git","--work-tree=.","push","sbackup","git-annex","master"]
ok
Another problem: I do not know exactly how many files I own (I use find . | wc -l):
.git = 1250633
documents = 61124
medias = 199504
So it seems I own ~260,000 files, but .git contains 1.25 million files.
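To separate the working-tree count from git's internal files, you can prune .git out of the traversal (a generic sketch, not a command from this thread):

```shell
# Count files in the working tree, excluding git's internal store.
find . -path ./.git -prune -o -type f -print | wc -l
# Count files inside .git separately.
find .git -type f | wc -l
```

The second count includes loose objects, pack files, and git-annex's own bookkeeping, which is why it can dwarf the working tree.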
The following command is also very slow:
git annex info
What are the best practices for using git-annex with many files (> 500,000)? Are there any maintenance or cleanup methods?
Thanks
If you want a file count:
git annex find | wc -l
is probably the best measure.
I have an annex with several million files in it, and it is slow, but not as slow as you are describing. Have you done a repack/gc cycle?
@CandyAngel Thank you for the git annex find tip. But git gc does not seem to help: after running git gc, git annex info still takes 1h 45m to return a result:
git annex info 83,95s user 50,68s system 2% cpu 1:45:01,70 total
Can you explain exactly which git gc or git repack parameters you use to optimize git-annex performance?
Thanks
git annex info has to check every file (not sure whether it traverses .git/annex/objects specifically or not) to get the "local annex" information. You can improve its performance by improving directory traversal in general (a different filesystem, or changing the hashing method so it isn't Xx/Yy/KEY/FILE).
The repack/gc speeds up operations for the git side of things, like syncing (pull/push), cloning and committing.
Here's what I used:
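(The exact commands were lost from this copy of the thread. Given the later remark about running git repack "without -ad" going forward, it was presumably a full repack along these lines; this reconstruction is an assumption, not CandyAngel's verbatim recipe:)

```shell
# Rewrite every reachable object into a single pack (-a),
# delete redundant packs and now-packed loose objects (-d),
# and recompute deltas from scratch (-f).
git repack -adf
# Then garbage-collect unreachable objects and refresh auxiliary files.
git gc
```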
This took git actions down from 1 hour+ to ~10 minutes (for a repo with 5.6 million objects).
Thanks @CandyAngel,
Your tips do indeed reduce the time of some git-annex commands. I will see in the long term whether it keeps working well.
For example, git annex sync now completes in 45s!
Thanks
My pleasure, glad it is working for you!
Going forward, you should run git repack (without -ad) every now and again to pack new objects into pack files. You can use git count-objects -v to find out how many unpacked objects you have.
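The routine maintenance described here might look like the following (a sketch; "without -ad" is read as an incremental repack, which is an assumption):

```shell
# Show loose ("count") and packed ("in-pack") object counts.
git count-objects -v
# Pack new loose objects into a fresh pack without rewriting
# existing packs (no -a), and delete the now-redundant loose copies.
git repack -d
```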
I've just come across something else which has sped up git operations on the index a lot.
My $annex/.git/index was over 500M, which made operations on the index (git add/rm etc.) really slow, as the whole file gets rewritten every time!
brought it down to 200M, making everything much, much faster!
There is a warning in the git documentation that alternative git implementations might not understand it, so bear that in mind if you want to use it.
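(The command itself is missing from this copy of the thread, but the size drop and the quoted compatibility warning match git's index format version 4, which prefix-compresses path names. If that is what was used, the invocation would be:)

```shell
# Check the current index size.
ls -lh .git/index
# Rewrite the index in format version 4 (prefix-compressed paths).
# Warning: some alternative git implementations cannot read v4 indexes.
git update-index --index-version 4
ls -lh .git/index
```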