I have a directory with 6TB of data in it. I tried to use git annex to back it up to three 3TB drives. I didn't want to use RAID as it sucks, and I didn't want to use tar because I wanted my files easily available.
I added my remotes successfully, then I ran git annex add .
That mostly worked, although it understandably took ages, and it missed several GB of files here and there.
Next I tried to do git commit -a -m added, hoping that this would copy all of my files to the remotes. It didn't; it just died with this error:
fatal: No HEAD commit to compare with (yet)
fatal: No HEAD commit to compare with (yet)
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize -RTS' to increase it.
So I freaked out and decided to undo the mess and just go with tar instead, since at this point every git command takes multiple minutes and fails with the same error as above.
I tried to run
git annex unannex .
but I got this error:
unannex GWAS/by-download-number/27081.log.gz fatal: No HEAD commit to compare with (yet)
So now it seems I can't do anything without committing the files, and I somehow need to grow the git cache, but when I search online for `+RTS -Ksize -RTS', I get nothing.
Does anyone know how to increase the cache size, or how to unannex the files without this HEAD error?
Thanks,
Mike
OK, this is a little frustrating. I found this post from three years ago: http://git-annex.branchable.com/forum/Problems_with_large_numbers_of_files/ and I decided to try a newer version of git-annex.
I uninstalled ghc and haskell from Scientific Linux because all of these Red Hat based distros have ancient packages.
I installed the latest git from source, the latest ghc linux x86_64 binary, and then the latest haskell platform from source. Then I used cabal to install all dependencies for git annex with
cabal install --only-dependencies git-annex
Finally I installed git annex from source. I then tried to run
git-annex add .
in my directory and got the same stack space overflow error as before. OK, I was hoping that the latest version would just work; no luck. So I did what it told me to:
git-annex +RTS -K1000000 -RTS add .
That gave the error:
git-annex: Most RTS options are disabled. Link with -rtsopts to enable them.
Grr.
So I went into the Makefile and added the line
-rtsopts -with-rtsopts="-K1000m"
after every call to ghc I could find. I also added
ghc-options: -with-rtsopts="-K100000"
to my ~/.cabal/config file. Now when I run
make
I get another error. Do I have to manually compile the entire Haskell Platform with the -rtsopts flag in order to get this to work?
I can't find any easy-to-follow information anywhere that shows me how to just increase the memory limit. My server has 48 cores, 192GB of memory, over 1TB of scratch space, and over 60TB of storage. I really want to be able to use git-annex to easily move files from our large RAID arrays onto archive drives, and be able to intelligently get that data back whenever I want. I don't understand why I am being limited to 8MB of memory for this.
Any advice would be fantastic, thank you.
Hi Joeyh,
Thanks for the reply. I am using git version 2.0.0.390.gcb682f8; I'm not sure what version of git-annex, but I downloaded it from github about 20 minutes ago.
Thanks!
-Mike
You can find out the version of git-annex by running: git-annex version
You can find out if your repository is in direct or indirect mode by running: git config annex.direct
git-annex version
returns:
git config annex.direct
exits with error code 1 and doesn't return any information. However, I never explicitly set direct mode, and the repository is all symlinked, so my assumption is that it is in indirect mode. Would direct mode be better for such a large repo?
OK, so the repository is in indirect mode, and this rules out a large number of problems that could have been caused by direct mode (no, I don't recommend using direct mode).
If you want to build git-annex with the +RTS option enabled, you just need to pass -rtsopts to ghc when building git-annex. (Not -with-rtsopts ...) That might let you pump up the memory and bypass whatever the problem is, or at least find out how much memory it's trying to allocate, which might be a useful clue. But I would be much more interested in debugging and fixing the actual problem, since git-annex should not normally need to allocate an 8+ MB chunk of memory.
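For example, something along these lines should do it with a cabal-based build (this is just a sketch of the idea, not official build instructions):
    # rebuild git-annex with RTS options enabled
    cabal install git-annex --ghc-options=-rtsopts
    # then a larger stack limit can be passed at run time
    git-annex +RTS -K1000m -RTS add .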
The "No HEAD commit to compare with (yet)" failure mode was removed from git in 2011. You must have been using old versions of git and git-annex before you upgraded. Perhaps they have left the repository in some broken state.
What size does
du -hsc .git/objects
report? How about
du -h .git/index
? Are git commands that do not involve git-annex still taking a long time to run or failing in some way? (Note that git commit has a hook that runs git-annex; you can bypass that with git commit --no-verify.)
The git available through yum is git 1.7.1, which looks like it was released in 2010 or earlier. (I really wish I had a different version of linux on this server.) It is possible that it in some way screwed up the repo.
I figured out how to compile cabal and git-annex with rtsopts, so I can now set higher memory levels, but I am happy to help debug the problem too, as I would really love a fully functional git-annex.
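One thing that might help with the debugging, now that -rtsopts is compiled in: the GHC runtime can print its own memory statistics when the command exits, which is more precise than watching htop. For example (the -K value is just what I have been using):
    # -s makes the runtime print GC and peak memory statistics to stderr on exit
    git-annex +RTS -K1000m -s -RTS add .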
git commands now run quickly, thanks to the new git I think.
du -hsc .git/objects
returns:
8.1G .git/objects
du -h .git/index
returns:
437M .git/index
I am currently running the command
git-annex +RTS -K1000m -RTS add .
It is chugging away doing something, but it is not printing any messages yet after 11 minutes of running. It is a 6TB directory though, and there are a lot of concurrent IO operations on that disk right now. I am also running
du -h --max-depth=1
on the root repo directory, and also find | wc -l, so that I can tell you the exact size of the dir and the total number of files too. These operations combined may take more than an hour though; I will send details when the commands complete. Let me know if you want me to stop the
git-annex +RTS -K1000m -RTS add .
command and run git-annex some other way.
Both of those du's look extremely large. How many files are listed by
git ls-files --cached | wc -l
? I don't think that there's any point in running
git annex add
while you're still having some problem. I am curious, though, how much memory the git-annex add you have running has used. If I were you, I'd look in .git/objects for large files (> 100kb, say).
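For example, assuming GNU findutils is available, something like this will list them:
    # list loose git objects larger than 100kb, with their sizes
    find .git/objects -type f -size +100k -exec ls -lh {} +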
git ls-files --cached | wc -l
returns: 1882028
As far as I can tell, the largest objects in .git/objects are 65kb; there are just a bunch of them (257). Also, my repo contains 1,886,125 files and directories total, most in a single directory (after git annex add completed, that one directory contained 8.3GB of symlinks).
git-annex add .
just completed successfully. I am now running git commit -a -m added and it is chugging away and taking its time.
Is there an obvious upper limit to the number of files or the total size of files that git annex can handle? For example, is 1 million files too many? How about 6TB? Or 9TB? For this repo I think I have a little less than 2m files, and the total size of the repo is greater than 6TB. Is that too much? Should I split it into multiple repos?
I also have a question just about the utility of git-annex for this purpose. I don't need to back up this data; I just want to have it off the big hard drive and onto a bunch of small drives. I have added 3 4TB drives as remotes and I want all of the data stored on them; I will take them offline and put them in a safe. Ideally my file and directory structure will remain intact as symlinks, and then when I want to access a file in the future, I can run
git annex get <file>
connect the drive that git annex tells me to, get that file, use it, and then drop it when I am done. From all of my reading it seems like that is a good usage for git annex, but I want to check with you and see if that makes sense to you. Also, can I just run
git annex drop --auto --numcopies=1
to get git annex to move all of the files to my remote repositories?
Thanks for all of your help, and let me know if there are any other debug steps you would like me to run. I am still waiting for git commit to finish, and for an exact repo size for you.
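To spell out the workflow I have in mind (the file name is just a placeholder):
    git annex whereis GWAS/some-file    # shows which drive holds the content
    # plug in the drive it names, then:
    git annex get GWAS/some-file
    # ... use the file, then release the local copy:
    git annex drop GWAS/some-file       # needs to be able to verify another copy still exists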
Ok, my suspicion is that the root problem is having a large number of files in a single directory. That would cause the git tree objects to get big, and it may be that git-annex somewhere buffers a whole tree object in memory, although I cannot think of where off the top of my head.
git-annex scales to any size of files (limited only by checksumming time unless you use the WORM backend to avoid checksumming). git-annex tries to scale at least as well as git does to a large quantity of files in a repository. git doesn't handle a million files in a repository very fast, due to a number of issues including how the index works. I have never tested git-annex with more than 1 million files, and not all in the same directory.
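For a rough sense of how much git metadata a repository like this is carrying, plain git can report object counts and their on-disk size; for example:
    git count-objects -v -H    # loose and packed object counts, plus disk usage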
Other than the number of files, your use case seems reasonable.
git annex drop
will drop files that you have already copied to enough of the remotes (using, eg, git-annex copy).
Above you show a git-annex add failing after 5 files. I suspect you truncated that output, and it processed rather more files (git-annex only says "(Recording state in git...)" once it's added all the files, or after it's added around 10 thousand files and still has more to do). It seems to have failed at the point where the files are staged into the index.
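Going back to the drop point above, a minimal sketch of the copy-then-drop pairing (mito_backup1 is a remote name that comes up later in the thread):
    # copy the content to an archive drive first...
    git annex copy . --to mito_backup1
    # ...then drop only removes local copies it can verify exist on enough remotes
    git annex drop .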
I'm building a 2 million file in one directory repo on a fast server now to see if I can reproduce this.
Yes, you are right.
git-annex add
got through almost all of the files in the first run, which I did a week ago. I am not sure how long it took; several days I think (which is fine, time isn't that important here). I re-ran
git-annex add .
yesterday after having trouble with git commit, which is when I uncovered these problems. I think you are right that the problem appears when git-annex is staging files into the index. No problems occurred during checksumming or moving files. Also, the repo isn't as large as I thought; it is 4.1TB, so it makes sense that the issue is the number of files, not file size.
git add
and git commit are now working fine, and all git operations (e.g. git status) are now taking around 30s to 1 min, which is acceptable.
I am going to try to move the data to the remotes now. Is there anything special I need to do since the remotes are smaller than the current repo? The remotes are just single drives with ext4 filesystems and an empty repo on them. I ideally want to fill each drive as much as possible and have the current repo contain no files; how do I do that? Can I just run
git-annex move --to mito_backup1
and then, when it is full, run a second command of git-annex move --to mito_backup2? Is it better to use git-annex copy instead of move and then use drop after?
Thanks!
It will be faster to use
git annex move
assuming you want to only have 1 copy of each file, and not more. git-annex will stop storing files on a drive once it gets close to full (annex.diskreserve), and you can safely interrupt it and switch to the next drive.
Do you have any special git configuration? In particular I'm curious about any annex.queuesize setting, which if set to something really high would make
git annex add
buffer a lot of filenames and stage them all at once. (However, I just noticed that annex.queuesize didn't cause as large a queue to be used as intended, so it would need to have been set to some really enormous value to run it out of stack space.)
Also, see scalability.
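To illustrate the drive-at-a-time filling described above, a minimal sketch (the remote names are the ones from earlier in the thread):
    # git-annex stops storing on the drive when it gets close to full (tunable via annex.diskreserve)
    git annex move . --to mito_backup1
    # when that drive fills up, repeat with the next one
    git annex move . --to mito_backup2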
As far as I know I am using the defaults; I didn't customize anything. I am thinking of switching to the WORM backend though, as I think it will make things a little faster, but I haven't done that yet.
Also, I actually compiled cabal-install with the ghc flag
-rtsopts
and git-annex with flags
-rtsopts -with-rtsopts=-K1000m
Due to the amount of memory available, I am not worried if git-annex leaks memory and uses 1GB of memory during operations, but I have been watching it with htop, and its memory usage is usually very small.
Everything seems to be working fine now, but I have another question:
Is there any way to speed up the copying of many small files? It looks like git-annex is calling rsync for each individual file, which is very fast for large files, but on my directories with many small files the total speed works out to just a few MB an hour - it has only transferred 1GB in the last 4 hours.
I am using the WORM backend with the
-b WORM
flag, but I wonder if there is a different move method implemented? For example, many calls to mv will be much faster than many calls to rsync.
Thanks!
git-annex could be made to always use cp for local transfers; see Remote/Git.hs rsyncOrCopyFile and change
ifM (sameDeviceIds src dest) (docopy, dorsync)
to just docopy.
However, I doubt that will be a significant speedup. It's more likely that the overhead around copying a file and updating the location tracking etc adds up with millions of small files.
You can manually move files and use
git annex fsck
but it is not likely to be any faster.
After letting a 2 million file import run while I was away on vacation, I came back to it, and it indeed ran out of memory: