I have a directory with 6TB of data in it. I tried to use git annex to back it up to three 3TB drives. I didn't want to use RAID as it sucks, and I didn't want to use tar because I wanted my files easily available.
I added my remotes successfully, then I ran git annex add .
That mostly worked, although it understandably took ages, and it missed several GB of files here and there.
Next I tried to do git commit -a -m added, hoping that this would copy all of my files to the remotes. It didn't; it just died with this error:
fatal: No HEAD commit to compare with (yet)
fatal: No HEAD commit to compare with (yet)
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize -RTS' to increase it.
So I freaked out and decided to undo the mess and just go with tar instead, since at this point every git command takes multiple minutes and fails with the same error as above.
I tried to run
git annex unannex .
but I got this error:
unannex GWAS/by-download-number/27081.log.gz fatal: No HEAD commit to compare with (yet)
So now it seems I can't do anything without committing the files, and I somehow need to grow the git cache, but when I search online for `+RTS -Ksize -RTS', I get nothing.
Does anyone know how to increase the cache size, or how to unannex the files without this HEAD error?
Thanks,
Mike
OK, this is a little frustrating. I found this post from three years ago: http://git-annex.branchable.com/forum/Problems_with_large_numbers_of_files/ and I decided to try a newer version of git-annex.
I uninstalled ghc and haskell from Scientific Linux because all of these Red Hat based distros have ancient packages.
I installed the latest git from source, the latest ghc linux x86_64 binary, and then the latest haskell platform from source. Then I used cabal to install all dependencies for git annex with
cabal install --only-dependencies git-annex
Finally I installed git annex from source. I then tried to run
git-annex add .
in my directory and got the same stack space overflow error as before. OK, I was hoping that the latest version would just work; no luck. So I did what it told me to:
git-annex +RTS -K1000000 -RTS add .
That gave the error:
git-annex: Most RTS options are disabled. Link with -rtsopts to enable them.
Grr.
So I went into the Makefile and added the line
-rtsopts -with-rtsopts="-K1000m"
after every call to ghc I could find. I also added
ghc-options: -with-rtsopts="-K100000"
to my ~/.cabal/config file. Now when I run
make
I get another error. Do I have to manually compile the entire Haskell Platform with the -rtsopts flag in order to get this to work?
I can't find any easy-to-follow information anywhere that shows me how to just increase the memory limit. My server has 48 cores, 192GB of memory, over 1TB of scratch space, and over 60TB of storage. I really want to be able to use git-annex to easily move files from our large RAID arrays onto archive drives, and be able to intelligently get that data back whenever I want. I don't understand why I am being limited to 8MB of memory for this.
Any advice would be fantastic, thank you.
Hi Joeyh,
Thanks for the reply. I am using git version 2.0.0.390.gcb682f8; I'm not sure what version of git-annex, but I downloaded it from github about 20 minutes ago.
Thanks!
-Mike
You can find out the version of git-annex by running: git-annex version
You can find out if your repository is in direct or indirect mode by running: git config annex.direct
git-annex version
returns:
git config annex.direct
exits with error code 1 and doesn't return any information. However, I never explicitly set direct mode, and the repository is all symlinked, so my assumption is that it is in indirect mode. Would direct mode be better for such a large repo?
OK, so the repository is in indirect mode, and this rules out a large number of problems that could have been caused by direct mode (no, I don't recommend using direct mode).
If you want to build git-annex with the +RTS option enabled, you just need to pass -rtsopts to ghc when building git-annex. (Not -with-rtsopts ...) That might let you pump up the memory and bypass whatever the problem is, or at least find out how much memory it's trying to allocate, which might be a useful clue. But I would be much more interested in debugging and fixing the actual problem, since git-annex should not normally need to allocate an 8+ MB chunk of memory.
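For example, something along these lines should do it with a cabal-based build (this is just a sketch of the idea, not official build instructions):
    # rebuild git-annex with RTS options enabled
    cabal install git-annex --ghc-options=-rtsopts
    # then a larger stack limit can be passed at run time
    git-annex +RTS -K1000m -RTS add .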
The "No HEAD commit to compare with (yet)" failure mode was removed from git in 2011. You must have been using old versions of git and git-annex before you upgraded. Perhaps they have left the repository in some broken state.
What size does
du -hsc .git/objects
report? How about
du -h .git/index
? Are git commands that do not involve git-annex still taking a long time to run or failing in some way? (Note that git commit has a hook that runs git-annex; you can bypass that with git commit --no-verify.)
The git available through yum is git 1.7.1, which looks like it was released in 2010 or earlier. (I really wish I had a different version of linux on this server.) It is possible that it in some way screwed up the repo.
I figured out how to compile cabal and git-annex with rtsopts, so I can now set higher memory levels, but I am happy to help debug the problem too, as I would really love a fully functional git-annex.
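One thing that might help with the debugging, now that -rtsopts is compiled in: the GHC runtime can print its own memory statistics when the command exits, which is more precise than watching htop. For example (the -K value is just what I have been using):
    # -s makes the runtime print GC and peak memory statistics to stderr on exit
    git-annex +RTS -K1000m -s -RTS add .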
git commands now run quickly, thanks to the new git I think.
du -hsc .git/objects
returns:
8.1G .git/objects
du -h .git/index
returns:
437M .git/index
I am currently running the command
git-annex +RTS -K1000m -RTS add .
It is chugging away doing something, but it is not printing any messages yet after 11 minutes of running. It is a 6TB directory though, and there are a lot of concurrent IO operations on that disk right now. I am also running
du -h --max-depth=1
on the root repo directory, and also find | wc -l, so that I can tell you the exact size of the dir and the total number of files too. These operations combined may take more than an hour though; I will send details when the commands complete. Let me know if you want me to stop the
git-annex +RTS -K1000m -RTS add .
command and run git-annex some other way.
Both of those du's look extremely large. How many files are listed by
git ls-files --cached | wc -l
? I don't think that there's any point in running
git annex add
while you're still having some problem. I am curious, though, how much memory the git-annex add you have running has used. If I were you, I'd look in .git/objects for large files (> 100kb, say).
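For example, assuming GNU findutils is available, something like this will list them:
    # list loose git objects larger than 100kb, with their sizes
    find .git/objects -type f -size +100k -exec ls -lh {} +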
git ls-files --cached | wc -l
returns: 1882028
As far as I can tell, the largest objects in .git/objects are 65kb; there are just a bunch of them (257). Also, my repo contains 1,886,125 files and directories total, most in a single directory (after git annex add completed, that one directory contained 8.3GB of symlinks).
git-annex add .
just completed successfully. I am now running git commit -a -m added and it is chugging away and taking its time.
Is there an obvious upper limit to the number of files or the total size of files that git annex can handle? For example, is 1 million files too many? How about 6TB? Or 9TB? For this repo I think I have a little less than 2m files, and the total size of the repo is greater than 6TB. Is that too much? Should I split it into multiple repos?
I also have a question just about the utility of git-annex for this purpose. I don't need to back up this data; I just want to have it off the big hard drive and onto a bunch of small drives. I have added 3 4TB drives as remotes and I want all of the data stored on them; I will take them offline and put them in a safe. Ideally my file and directory structure will remain intact as symlinks, and then when I want to access a file in the future, I can run
git annex get <file>
connect the drive that git annex tells me to, get that file, use it, and then drop it when I am done. From all of my reading it seems like that is a good usage for git annex, but I want to check with you and see if that makes sense to you. Also, can I just run
git annex drop --auto --numcopies=1
to get git annex to move all of the files to my remote repositories?
Thanks for all of your help, and let me know if there are any other debug steps you would like me to run. I am still waiting for git commit to finish, and for an exact repo size for you.
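To spell out the workflow I have in mind (the file name is just a placeholder):
    git annex whereis GWAS/some-file    # shows which drive holds the content
    # plug in the drive it names, then:
    git annex get GWAS/some-file
    # ... use the file, then release the local copy:
    git annex drop GWAS/some-file       # needs to be able to verify another copy still exists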
Ok, my suspicion is that the root problem is having a large number of files in a single directory. That would cause the git tree objects to get big, and it may be that git-annex somewhere buffers a whole tree object in memory, although I cannot think of where off the top of my head.
git-annex scales to any size of files (limited only by checksumming time unless you use the WORM backend to avoid checksumming). git-annex tries to scale at least as well as git does to a large quantity of files in a repository. git doesn't handle a million files in a repository very fast, due to a number of issues including how the index works. I have never tested git-annex with more than 1 million files, and not all in the same directory.
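For a rough sense of how much git metadata a repository like this is carrying, plain git can report object counts and their on-disk size; for example:
    git count-objects -v -H    # loose and packed object counts, plus disk usage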
Other than the number of files, your use case seems reasonable.
git annex drop
will drop files that you have already copied to enough of the remotes (using, eg, git-annex copy).
Above you show a git-annex add failing after 5 files. I suspect you truncated that output, and it processed rather more files (git-annex only says "(Recording state in git...)" once it's added all the files, or after it's added around 10 thousand files and still has more to do). It seems to have failed at the point where the files are staged into the index.
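Going back to the drop point above, a minimal sketch of the copy-then-drop pairing (mito_backup1 is a remote name that comes up later in the thread):
    # copy the content to an archive drive first...
    git annex copy . --to mito_backup1
    # ...then drop only removes local copies it can verify exist on enough remotes
    git annex drop .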
I'm building a 2 million file in one directory repo on a fast server now to see if I can reproduce this.
Yes, you are right.
git-annex add
got through almost all of the files in the first run, which I did a week ago. I am not sure how long it took; several days I think (which is fine, time isn't that important here). I re-ran
git-annex add .
yesterday after having trouble with git commit, which is when I uncovered these problems. I think you are right that the problem appears when git-annex is staging files into the index. No problems occurred during checksumming or moving files. Also, the repo isn't as large as I thought; it is 4.1TB, so it makes sense that the issue is the number of files, not file size.
git add
and git commit are now working fine, and all git operations (e.g. git status) are now taking around 30s to 1 min, which is acceptable.
I am going to try to move the data to the remotes now. Is there anything special I need to do since the remotes are smaller than the current repo? The remotes are just single drives with ext4 filesystems and an empty repo on them. I ideally want to fill each drive as much as possible and have the current repo contain no files; how do I do that? Can I just run
git-annex move --to mito_backup1
and then, when it is full, run a second command of git-annex move --to mito_backup2? Is it better to use git-annex copy instead of move and then use drop after?
Thanks!
It will be faster to use
git annex move
assuming you want to only have 1 copy of each file, and not more. git-annex will stop storing files on a drive once it gets close to full (annex.diskreserve), and you can safely interrupt it and switch to the next drive.
Do you have any special git configuration? In particular I'm curious about any annex.queuesize setting, which if set to something really high would make
git annex add
buffer a lot of filenames and stage them all at once. (However, I just noticed that annex.queuesize didn't cause as large a queue to be used as intended, so it would need to have been set to some really enormous value to run it out of stack space.)
Also, see scalability.
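To illustrate the drive-at-a-time filling described above, a minimal sketch (the remote names are the ones from earlier in the thread):
    # git-annex stops storing on the drive when it gets close to full (tunable via annex.diskreserve)
    git annex move . --to mito_backup1
    # when that drive fills up, repeat with the next one
    git annex move . --to mito_backup2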
As far as I know I am using the defaults; I didn't customize anything. I am thinking of switching to the WORM backend though, as I think it will make things a little faster, but I haven't done that yet.
Also, I actually compiled cabal-install with the ghc flag
-rtsopts
and git-annex with flags
-rtsopts -with-rtsopts=-K1000m
Due to the amount of memory available, I am not worried if git-annex leaks memory and uses 1GB of memory during operations, but I have been watching it with htop, and its memory usage is usually very small.
Everything seems to be working fine now, but I have another question:
Is there any way to speed up the copying of many small files? It looks like git-annex is calling rsync for each individual file, which is very fast for large files, but on my directories with many small files the total speed works out to just a few MB an hour - it has only transferred 1GB in the last 4 hours.
I am using the WORM backend with the
-b WORM
flag, but I wonder if there is a different move method implemented? For example, many calls to mv will be much faster than many calls to rsync.
Thanks!
git-annex could be made to always use cp for local transfers; see Remote/Git.hs rsyncOrCopyFile and change
ifM (sameDeviceIds src dest) (docopy, dorsync)
to just docopy.
However, I doubt that will be a significant speedup. It's more likely that the overhead around copying a file and updating the location tracking etc adds up with millions of small files.
You can manually move files and use
git annex fsck
but it is not likely to be any faster.
After letting a 2 million file import run while I was away on vacation, I came back to it, and it indeed ran out of memory: