Hi,
I have been seeing quite big overheads using git-annex. Is this normal?
The `.git/objects` folder is exploding on my system, often being larger than the content watched by git-annex. Here are the actual statistics for my git-annex folders, where the fourth column is calculated as col3/(col2-col3).
| folder | size | size of .git | relative size |
|---|---|---|---|
| conf.annex | 777536 | 720100 | 12.537433 |
| doc.annex | 20351624 | 11260204 | 1.2385528 |
| images.annex | 817064 | 435580 | 1.1418041 |
| misc.annex | 803328 | 572476 | 2.4798399 |
| music.annex | 23756116 | 9192740 | 0.63122314 |
That is, four of the five repos require more space for the `.git` folder than for the actual files. Most of this comes from the `objects` folder.
Number of files:
| folder | no. files | no. files in .git | relative number |
|---|---|---|---|
| conf.annex | 11350 | 9539 | 5.2672557 |
| doc.annex | 84954 | 66824 | 3.6858246 |
| images.annex | 92787 | 91285 | 60.775632 |
| misc.annex | 95461 | 95160 | 316.14618 |
| music.annex | 16414 | 13520 | 4.6717346 |
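
The two tables above can be reproduced with a small script along these lines. This is only a sketch: it assumes sizes come from `du -sk` and file counts from `find -type f`, which may not be how the numbers above were actually collected, and the function names `overhead_ratio` and `annex_overhead` are made up for illustration.

```sh
#!/bin/sh
# overhead_ratio TOTAL GIT
# Prints GIT / (TOTAL - GIT), i.e. col3/(col2-col3) in the tables above.
overhead_ratio() {
    awk -v total="$1" -v git="$2" 'BEGIN {
        if (total - git <= 0) { print "inf"; exit }
        printf "%.4f\n", git / (total - git)
    }'
}

# annex_overhead REPO
# Prints the size ratio and the file-count ratio for one repository.
# Assumption: sizes via du -sk (KiB), counts via find -type f.
annex_overhead() (
    repo=$1
    total_kb=$(du -sk "$repo" | cut -f1)
    git_kb=$(du -sk "$repo/.git" | cut -f1)
    total_n=$(find "$repo" -type f | wc -l | tr -d ' ')
    git_n=$(find "$repo/.git" -type f | wc -l | tr -d ' ')
    printf '%s size=%s files=%s\n' "$repo" \
        "$(overhead_ratio "$total_kb" "$git_kb")" \
        "$(overhead_ratio "$total_n" "$git_n")"
)
```

For example, `annex_overhead conf.annex` prints one line with both ratios. The ratio is unit-independent, so it does not matter whether sizes are in KiB or bytes as long as both columns use the same unit.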
I use the assistant web interface, and direct mode. I use two laptops running Linux that are synchronized directly over LAN at home or via a transfer repo on an ssh server where git-annex is installed. The latter is set up using the web interface and the gcrypt repo. [Mostly, the transfer repo isn't working and I often end up with only symlinks on the computer where I did not edit the file in question, but this is probably unrelated.]
I have previously tried to fix it using `git gc` or `git annex forget`, but it doesn't seem to significantly reduce the sizes, and whatever it gains doesn't last.
Is this kind of 'overhead' something that one must accept when using `git-annex`, or do such numbers indicate that something is wrong?
Thanks.

Have you run `git annex unused` to see if you have unused files in your repo? From the assistant, the option under configuration -> Unused files gives you an option to expire old files after a period of time so they get deleted from your repo.

Thanks for the help, Efraim.
I'm not sure this is it. On my other laptop, where the above statistics were not calculated, the `.git` folder of `doc.annex` is 26Gb (contents are 8.6Gb). Meanwhile, unused files are 0.6Gb. In `conf.annex` the `.git` folder is 3.2Gb, content is 70Mb and unused files are 2.2Mb. I used the web interface to find the size of unused files.

Because git-annex tracks all the events of an annexed file for each repo -- added, dropped, copied etc -- and it tracks these in one object per file in the git-annex branch, it does indeed create a lot of objects. To improve both space and performance I made sure to add `git gc --auto` as a post-commit hook, as the objects in my case can quickly reach the tens or even hundreds of thousands.

To further improve performance and space, you can choose to set `pack.window` and `pack.depth` to vastly higher values than the defaults (10 and 50, respectively), because there is a large number of objects with very similar content. I did a `git repack --window 2500 --depth 1000 -f -a -d` and brought down my repo from 3 GiB (packed!) to 300 MiB. Make sure to have a lot of memory and CPU available when doing this, or it will take forever. You can set `pack.window` ridiculously high if you like, as long as you limit it with `pack.windowMemory`, so that it makes use of all your available memory for comparing objects and finding the optimal delta.

Thanks for your tips, Claes. I wasn't really aware of
`git repack` and that set of parameters.

I didn't mention it, but sadly I'd run `git gc` on the repos just before collecting the above numbers.

I tried to repack two repositories -- `doc.annex` and `conf.annex` -- using the values you suggested. However, it did not have any measurable effect (less than 100mb in both cases).

The number of unused files seems to be (much) less than 500 in the repos.
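
For reference, the packing settings and the post-commit hook discussed above could be applied roughly like this. It is a sketch under assumptions: the helper names `tune_packing` and `install_gc_hook` are hypothetical, the window and depth defaults are simply the values quoted in this thread, and the `1g` memory limit is a placeholder that was not suggested by anyone here.

```sh
#!/bin/sh
# tune_packing REPO [WINDOW] [DEPTH] [WINDOW_MEMORY]
# Stores the packing settings in the repository's own config.
tune_packing() (
    repo=$1
    git -C "$repo" config pack.window "${2:-2500}"
    git -C "$repo" config pack.depth "${3:-1000}"
    # Assumption: 1g is a placeholder; pick a limit that fits your RAM.
    git -C "$repo" config pack.windowMemory "${4:-1g}"
)

# install_gc_hook REPO
# Adds a post-commit hook that runs `git gc --auto`.
# Refuses to overwrite an existing hook.
install_gc_hook() (
    hook="$1/.git/hooks/post-commit"
    if [ -e "$hook" ]; then
        echo "not overwriting existing $hook" >&2
        exit 1
    fi
    mkdir -p "$1/.git/hooks"
    printf '#!/bin/sh\nexec git gc --auto\n' > "$hook"
    chmod +x "$hook"
)
```

After `tune_packing doc.annex`, a full repack such as `git -C doc.annex repack -a -d -f` should pick the window and depth up from the config, so they need not be repeated on the command line.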

BTW: All of the extra size is in the `.git/objects/` folder. `.git/annex/` is quite small (always much less than 1GB). Would that indicate that large files are checked in with git sans annex somehow?

`git prune` worked wonders on my repos, getting rid of GBs of stuff in the `.git/objects` folders. I don't know why they weren't picked up by `git gc`. In retrospect, it was perhaps a bit careless of me to run `git prune` directly, but hopefully I will be OK. . .

`git prune` only worked as a temporary fix. My `doc.annex/.git/objects` is 3.6Gb after two days. I don't get why `git` sans `annex` is checking in stuff -- which I assume is the reason it's stored in `.git/objects`.

There are a few things that can cause git to leave unreachable objects. These include: Rebasing; interrupting a pull before it updates the refs; running git add on a file and then changing the file's content and adding it a second time before committing.
I can think of one case where this happens when using git-annex at the command line: `git annex add $file; git mv $file other-directory; git commit` will result in a dangling object storing the old symlink target before the file was moved.

It'd be useful to investigate, by using `git fsck --unreachable` to get a list of currently unreachable objects, and then use `git show` to look at the objects and try to determine where they came from. I.e., are they symlink targets or are they git-annex location log files (formatted as columns of timestamps and uuids)? Any unreachable commits would be the most useful to investigate.

I see a few loose objects here and there in my annexes, but not very many, and git-gc has cleaned up old ones (> 1 month old). Some of them seem to be location log files. I see those in both repositories where I use the assistant, and repositories where I use only command line git-annex. I was able to find 2 unreachable commits in a repository that runs the assistant full-time; both commits were "merging origin/synced/git-annex into git-annex". This suggests to me that perhaps the assistant merged the git-annex branch but that merge was overwritten by another thread that committed changes to the branch at the same time.
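
The investigation suggested above can be sketched as a pair of shell helpers that list unreachable objects with their type and size, largest first, and then count them per type. The names `list_unreachable` and `count_unreachable` are hypothetical, and the sketch relies on `git fsck --unreachable` printing lines of the form `unreachable <type> <sha>`.

```sh
#!/bin/sh
# list_unreachable [REPO]
# Prints "size type sha" for every unreachable object, largest first.
list_unreachable() (
    repo=${1:-.}
    git -C "$repo" fsck --unreachable --no-progress 2>/dev/null |
    awk '$1 == "unreachable" { print $2, $3 }' |
    while read -r type sha; do
        size=$(git -C "$repo" cat-file -s "$sha")
        printf '%s %s %s\n' "$size" "$type" "$sha"
    done | sort -rn
)

# count_unreachable [REPO]
# Prints "type count" for each object type that has unreachable objects.
count_unreachable() (
    list_unreachable "$1" |
    awk '{ n[$2]++ } END { for (t in n) print t, n[t] }' | sort
)
```

Running `list_unreachable doc.annex | head` would show the biggest offenders first; their hashes can then be passed to `git show`. The "add a file twice before committing" case mentioned above is an easy way to produce one unreachable blob in a scratch repository and try the helpers out.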
You should also check the size of inodes on your system; a thousand small loose objects in `.git/objects` does not normally take up gigabytes of space; with typical inode sizes it might use up a few megabytes. With 1 MB inodes, those same thousand files would use 1 GB.
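
One rough way to check whether per-file allocation overhead is a factor, without having to look up the filesystem's allocation unit, is to compare the bytes actually stored in the files with the space `du` reports as allocated. This is a sketch only; `disk_vs_apparent` is a made-up name, and the comparison says nothing about where the objects came from.

```sh
#!/bin/sh
# disk_vs_apparent DIR
# Prints the summed file contents in bytes next to the allocated space in KiB.
disk_vs_apparent() (
    dir=$1
    # Sum of file contents in bytes (reads every file, so slow on big trees).
    apparent=$(find "$dir" -type f -exec cat {} + | wc -c | tr -d ' ')
    # Space actually allocated, in KiB, as reported by du.
    disk_kib=$(du -sk "$dir" | cut -f1)
    printf 'apparent_bytes=%s disk_kib=%s\n' "$apparent" "$disk_kib"
)
```

If `disk_vs_apparent .git/objects` shows allocated space far above the apparent bytes, allocation overhead is worth a closer look; if the two are close, the space really is taken by object contents.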
Hi Joey,
Thanks for your careful reply.
Easy things first:
I never add anything from the terminal, though I may do checks and `git annex get`, since sometimes the assistant doesn't actually grab the updated files. Until recently I started git annex automatically on boot, but at the moment it simply renders my laptop useless for too long -- presumably due to the errors investigated here.

I use btrfs (don't ask me why). Searching online, I did not find a way to find the size of inodes, but I assume that it's sensible? tune2fs doesn't work, but as I understand it, it is designed for ext*.

What takes up space in my `.git/objects` is files of several Mb. So at the moment the `pack` folder is 700mb. In the next biggest folder there are three files that are 73,4mb and 8 files that are 4kb. This pattern repeats: a couple of large files (73,4 shows up quite a bit, as well as 45) and many small files.

I have an astonishing number of dangling objects. In `doc.annex`, `git rev-list HEAD --count` gives 27354. In this repo I have 1108 unreachable blobs and commits, respectively 569 and 539. This probably explains why `git prune` solves my problem, but I don't understand why all these large files reappear when I sync -- even after having run `git prune` on both laptops. Could they come from the `annex` on my remote server?

`git show` isn't nice on blobs, but here is an example of a dangling commit

There are several things I don't understand. Why is the author root? I never run `git annex` with `sudo` or as root. I think the date is bogus. I'm pretty sure I wasn't even running `git annex` in 2012, much less working with this repo. . . What is weird is that this is the date for all lost commits! (Same for Author). Over all lost commits there are 2352 binary files that differ. Of these there are 284 unique hashes. . . I don't know what this means other than my repo being seriously messed up. I don't understand what I did wrong to end up in this state, as I have been fairly careful in mainly using the `webapp`.

I wonder if the best way to proceed is to start over, or whether this repo can be recovered.
Thanks, Rasmus
That is a very strange commit by every metric. Weird author, weird date, weird filenames in it (not files that git-annex uses!), with apparently some weird binary content (which git-annex would not be committing). Even a weird commit message -- git-annex never makes a commit with a message of "Initial commit", and as far as I can tell using `git log -S`, it never has. (OTOH, it's a pretty common example message used in eg, git documentation.) So, I feel pretty sure that dangling commit was not made by git-annex.

I think you need to take a look at some of the 4+mb unreachable blobs, to get some idea of what these files are. One way is to use git-show on the hash of one of the blobs to get its content, and then, perhaps, pass it to `file` or `strings`. Or, you could stop the assistant, `git checkout 478425bef867782e8ff22aca24316e9421288c49`, and have a look at this strange tree that was apparently committed in 2012 to see what's in there.

It might be possible that the dangling commits come somehow from the remote server. I'm not 100% sure, but I think that a git pack can end up with dangling objects in it, and then git can pull down that pack to get other, non-dangling objects. You should use `git show` on the server on some of the dangling shas to see if they are present there.

In the meantime, I've been looking over the Annex.Branch code.
`stageJournal` is only ever called in code paths that commit the updated index, so those code paths cannot result in dangling objects unless git-annex is interrupted before it can commit. (This may explain some of my own repos having a few dangling refs, that were not commits; I could have ctrl-c'd git-annex.)

It's possible for a forced update of the local git-annex branch, done by eg a push from another repo, to overwrite a commit made to it. In this case, the git-annex index is merged with the branch, resulting in a new commit, and the old commit that was overwritten will indeed be dangling. However, `git annex sync` doesn't overwrite the git-annex branch; it pushes to synced/git-annex, or does a `taggedPush` to a private ref. It is the case that both those pushes are forced pushes, so can overwrite a branch ref and leave the old commit it pointed to dangling. In the case of `taggedPush`, the old commit should be a parent of the new, so it won't dangle. In the case of synced/git-annex being overwritten, the old commit could dangle, but only until whatever repo pushed it syncs again, at which time it should get incorporated as one of the parents of the new synced/git-annex it pushes. So, I don't see how long-term dangling commits could happen this way, except for in the case where a repository stops syncing/goes missing/rebases its git-annex branch (ie, git-annex forget is used). (This may explain the 2 dangling commits I found on elephant; we did delete some clones of that repository recently.)

At this point I'm not convinced that the dangling objects I found in my own repos are due to some systematic problem; the above seems like it could explain them, and the above is not a problem of the class of the one Rasmus is having. Of course, it's hard to be sure you've spotted all possible ways that a resource leak can happen, and that's what these dangling objects basically are.
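
The blob inspection suggested above might look like the following sketch. It uses `git cat-file` rather than `git show` to get at the object; either should print a blob's contents. The helper name `inspect_object` is hypothetical, and the `file` step is skipped when `file` is not installed.

```sh
#!/bin/sh
# inspect_object REPO SHA
# Prints the object's type, its size in bytes and, if file(1) is installed,
# a guess at what the content is.
inspect_object() (
    repo=$1
    sha=$2
    git -C "$repo" cat-file -t "$sha" || exit 1
    git -C "$repo" cat-file -s "$sha"
    if command -v file >/dev/null 2>&1; then
        git -C "$repo" cat-file -p "$sha" | file -b -
    fi
)
```

Calling `inspect_object doc.annex <sha>` on a few of the large unreachable blobs, both locally and on the server, should show whether they share a type and whether the same hashes exist in both places.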
Hi Joey,
Thanks for giving the thread a more appropriate title and thanks for the helpful messages.
Let me start with the easy points:
I don't have `etckeeper` on my system. So unless it could have entered through `annex`, I think we can rule that one out.

According to `git log`, the repos are from January 2014, when I restarted my repos.

When I start git repos I typically just use "init", so I don't think I did the 2012 commits.

With `file test.blob` it shows `test.blob: GPG symmetrically encrypted data (CAST5 cipher)`. But none of my normal passwords worked. Could such a gpg'ed file be from local network connections where the assistant asks for a passphrase? I'm pretty sure that my transfer repo has only been using `gcrypt`, and I believe I "restarted" my repos because I switched to `gcrypt` repos. Also, my transfer repo is 10Gb as well, which sounds big for a transfer repo.

I performed a similar "analysis" on the `conf.annex` repo, which should contain almost no binary files (some 16x16 pngs etc). `conf.annex` has 727 unreachable objects and 3477 commits in total. Of the unreachable objects, 338 are commits. Here's an example of a larger commit message of an unreachable commit.

Across all commits 6006 objects are mentioned, but only 371 are unique.

I checked out one blob and again `file` reports `GPG symmetrically encrypted data (CAST5 cipher)`. Interestingly, for `conf.annex` I get this line when trying to decrypt

For `doc.annex` I get

And on my other computer I see a third ID. I'm not sure if this means anything when files are symmetrically encrypted, though.
These objects are the ones written by git-remote-gcrypt when pushing to a remote. That's why the weird dates, root pseudo-commit, crazy filenames, and big gpg encrypted blobs. All countermeasures that git-remote-gcrypt uses to keep your encrypted git remote safe and not leak information about what's in it.
So, this is a bug in git-remote-gcrypt. It needs to clean these objects up after pushing them! (Also after failed pushes.)

Brilliant! Thanks for taking time to analyze the issue and taking the bug to `gcrypt`.

[I'm surprised that a different key than my git-annex key is used and that it's a symmetric key, but I will explore the technology on my own].