unreachable git objects

Hi,

I have been seeing quite a large overhead when using git-annex. Is this normal?

The .git/objects folder keeps ballooning on my system, often growing larger than the content tracked by git-annex. Here are the actual statistics for my git-annex folders, where the fourth column is calculated as col3/(col2-col3).

folder	size	size .git	relative size
conf.annex	777536	720100	12.537433
doc.annex	20351624	11260204	1.2385528
images.annex	817064	435580	1.1418041
misc.annex	803328	572476	2.4798399
music.annex	23756116	9192740	0.63122314
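For anyone who wants to reproduce the fourth column, here is a minimal sketch of the arithmetic. It assumes both size columns are plain numbers in the same unit (for example as printed by du -s); the function name relative_size is made up for illustration.

```shell
# relative_size TOTAL GIT
# Prints GIT / (TOTAL - GIT): the size of .git relative to everything else
# in the folder. Both arguments must use the same unit.
relative_size() {
    LC_ALL=C awk -v total="$1" -v gitdir="$2" \
        'BEGIN { printf "%.4f\n", gitdir / (total - gitdir) }'
}

relative_size 777536 720100     # conf.annex row
relative_size 23756116 9192740  # music.annex row
```

A value above 1 means the .git folder is larger than the rest of the folder.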

That is, four of five repos require more space for the .git folder than the actual files. Most of this comes from the objects folder.

Number of files:

folder	no. files	no. files .git	relative size
conf.annex	11350	9539	5.2672557
doc.annex	84954	66824	3.6858246
images.annex	92787	91285	60.775632
misc.annex	95461	95160	316.14618
music.annex	16414	13520	4.6717346

I use the assistant web interface, and direct mode. I use two laptops running Linux that are synchronized directly over LAN at home or via a transfer repo on an SSH server where git-annex is installed. The latter is set up using the web interface and the gcrypt repo. [Mostly, the transfer repo isn't working and I often end up with only symlinks on the computer where I did not edit the file in question, but this is probably unrelated.]

I have previously tried to fix it using git gc or git annex forget, but neither seems to reduce the sizes significantly, and what little they help does not last.

Is this kind of 'overhead' something that one must accept when using git-annex or do such numbers indicate that something is wrong?

Thanks.


comment 1

Have you tried running git annex unused from the command line to see if you have unused files in your repo? In the assistant, the option under Configuration -> Unused files lets you expire old files after a period of time so they get deleted from your repo.

Comment by Efraim — Sun Sep 7 14:09:05 2014


comment 2

Thanks for help, Efraim.

I'm not sure this is it. On my other laptop, where the above statistics were not calculated, the .git folder of doc.annex is 26 GB (the content is 8.6 GB). Meanwhile, unused files amount to 0.6 GB. In conf.annex the .git folder is 3.2 GB, the content is 70 MB and unused files are 2.2 MB. I used the web interface to find the size of unused files.

Comment by rasmus — Sun Sep 7 15:17:25 2014


repack parameters

Because git-annex tracks all the events of an annexed file for each repo -- added, dropped, copied etc -- and it tracks these in one object per file in the git-annex branch, it does indeed create a lot of objects. To improve both space and performance I made sure to add git gc --auto as a post-commit hook, as the objects in my case can quickly reach the tens or even hundreds of thousands.

To further improve performance and space, you can set pack.window and pack.depth to vastly higher values than the defaults (10 and 50, respectively), because there is a large number of objects with very similar content. I did a git repack --window 2500 --depth 1000 -f -a -d and brought down my repo from 3 GiB (packed!) to 300 MiB. Make sure to have a lot of memory and CPU available when doing this, or it will take forever. You can set pack.window ridiculously high if you like, as long as you limit it with pack.windowMemory, so that it makes use of all your available memory for comparing objects and finding the optimal delta.
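The repack described above could be wrapped roughly as follows. This is only a sketch: the function name repack_hard and its default values are illustrative assumptions, not recommendations, and pack.windowMemory is passed with -c for this one invocation rather than stored in the repository's config.

```shell
# repack_hard REPO [WINDOW] [DEPTH] [WINDOW_MEMORY]
# Rewrites all packs in REPO into one, recomputing deltas (-f) with a larger
# search window and deeper delta chains than git's defaults (10 and 50).
# pack.windowMemory caps the memory each thread may spend on the window.
# The default values below are illustrative only.
repack_hard() {
    repo=$1
    window=${2:-250}
    depth=${3:-250}
    window_memory=${4:-512m}
    git -C "$repo" -c pack.windowMemory="$window_memory" \
        repack -a -d -f --window="$window" --depth="$depth"
}
```

Calling repack_hard /path/to/doc.annex 2500 1000 1g (the path and the 1g limit are placeholders) would use the window and depth quoted above.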

Comment by Claes — Sun Sep 7 21:53:10 2014


Re: repack parameters

Thanks for your tips, Claes. I wasn't really aware of git repack and that set of parameters.

I didn't mention, but sadly I'd run git gc on the repos just before collecting the above numbers.

I tried to repack two repositories -- doc.annex and conf.annex -- using the values you suggested. However, it did not have any measurable effect (less than 100 MB in both cases).

The number of unused files seems to be (much) less than 500 in the repos.

BTW: All of the extra size is in the .git/objects/ folder. .git/annex/ is quite small (always much less than 1GB). Would that indicate that large files are checked in with git sans annex somehow?

Comment by rasmus — Mon Sep 8 13:20:36 2014


comment 5

So git prune worked wonders on my repos, getting rid of GBs of stuff in the .git/objects folders. I don't know why they weren't picked up by git gc. In retrospect, it was perhaps a bit careless of me to run git prune directly, but hopefully I will be OK. . .
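One possible reason git gc left these objects alone is that it only prunes loose unreachable objects older than gc.pruneExpire (two weeks by default), whereas a bare git prune removes them regardless of age. A more cautious approach is to preview the deletion first. The sketch below assumes nothing beyond stock git; the function name preview_prune is made up.

```shell
# preview_prune REPO
# Lists the loose unreachable objects that `git prune` would delete,
# one "<sha> <type>" line each, without deleting anything (-n is a dry run).
preview_prune() {
    git -C "$1" prune -n
}
```

If the list looks expendable, running git prune (or git gc --prune=now) in that repository removes the objects.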

Comment by rasmus — Mon Sep 8 13:48:03 2014


comment 6

Seems git prune only worked as a temporary fix. My doc.annex/.git/objects is 3.6Gb after two days. I don't get why git sans annex is checking in stuff -- which I assume is the reason it's stored in .git/objects.

Comment by rasmus — Wed Sep 10 12:07:53 2014


comment 7

There are a few things that can cause git to leave unreachable objects. These include: Rebasing; interrupting a pull before it updates the refs; running git add on a file and then changing the file's content and adding it a second time before committing.
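The last case in that list (adding a file twice before committing) is easy to reproduce with plain git. The following sketch builds a throwaway repository in a temporary directory; the function name, file name and committer identity are all placeholders.

```shell
# demo_unreachable_blob
# Builds a throwaway repository, stages a file twice with different content
# before committing, and prints what `git fsck --unreachable` reports.
demo_unreachable_blob() (
    set -e
    repo=$(mktemp -d)
    cd "$repo"
    git init -q .
    echo "first version" > file.txt
    git add file.txt              # blob for "first version" is written now
    echo "second version" > file.txt
    git add file.txt              # index entry now points at a new blob
    git -c user.name=demo -c user.email=demo@example.invalid \
        commit -q -m "only the second version is committed"
    out=$(git fsck --unreachable 2>/dev/null)
    cd /
    rm -rf "$repo"
    printf '%s\n' "$out"
)
```

The blob written by the first git add is never referenced by a commit, so fsck should report it as an unreachable blob.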

I can think of one case where this happens when using git-annex at the command line: git annex add $file; git mv $file other-directory; git commit will result in a dangling object storing the old symlink target before the file was moved.

It'd be useful to investigate, by using git fsck --unreachable to get a list of currently unreachable objects, and then using git show to look at the objects and try to determine where they came from. I.e., are they symlink targets, or are they git-annex location log files (formatted as columns of timestamps and uuids)? Any unreachable commits would be the most useful to investigate.
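A rough sketch of that investigation, assuming only stock git: list every unreachable object with its size so the largest ones can be looked at first. The function name list_unreachable is made up for illustration.

```shell
# list_unreachable REPO
# Prints "<size-in-bytes> <type> <sha>" for every unreachable object,
# largest first, so the biggest ones can be inspected with git show.
list_unreachable() {
    git -C "$1" fsck --unreachable 2>/dev/null |
    while read -r state type sha; do
        # Skip anything that is not an "unreachable <type> <sha>" line.
        [ "$state" = "unreachable" ] || continue
        printf '%s %s %s\n' "$(git -C "$1" cat-file -s "$sha")" "$type" "$sha"
    done |
    sort -rn
}
```

Any sha from the output can then be passed to git show in the same repository.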

I see a few loose objects here and there in my annexes, but not very many, and git-gc has cleaned up old ones (> 1 month old). Some of them seem to be location log files. I see those in both repositories where I use the assistant, and repositories where I use only command line git-annex. I was able to find 2 unreachable commits in a repository that runs the assistant full-time; both commits were "merging origin/synced/git-annex into git-annex". This suggests to me that perhaps the assistant merged the git-annex branch but that merge was overwritten by another thread that committed changes to the branch at the same time.

You should also check the size of inodes on your system; a thousand small loose objects in .git/objects does not normally take up gigabytes of space; with typical inode sizes it might use up a few megabytes. With 1 MB inodes, those same thousand files would use 1 GB.

Comment by joeyh.name — Wed Sep 17 20:20:40 2014


comment 8

Hi Joey,

Thanks for your careful reply.

Easy things first:

I never add anything from the terminal, though I may do checks and git annex get, since sometimes the assistant doesn't actually grab the updated files. Until recently I started git annex automatically on boot, but at the moment it simply renders my laptop useless for too long -- presumably due to the errors investigated here.

I use btrfs (don't ask me why). Searching online, I did not find a way to find the size of inodes, but I assume that it's sensible? tune2fs doesn't work but as I understand it is designed for ext*.

What takes up space in my .git/objects are files of several MB. At the moment the pack folder is 700 MB. In the next biggest folder there are three files of 73.4 MB and 8 files of 4 kB. This pattern repeats: a couple of large files (73.4 MB shows up quite a bit, as does 45 MB) and many small files.

I have an astonishing number of dangling objects. In doc.annex, git rev-list HEAD --count gives 27354. In this repo I have 1108 unreachable objects: 569 blobs and 539 commits. This probably explains why git prune solves my problem, but I don't understand why all these large files reappear when I sync -- even after having run git prune on both laptops. Could they come from the annex on my remote server?
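Counts like these can be derived from the fsck output with a small filter. This is a sketch, not necessarily how the numbers above were obtained; the function name count_unreachable_by_type is made up.

```shell
# count_unreachable_by_type
# Reads `git fsck --unreachable` output on stdin and prints one
# "<type> <count>" line per object type, sorted by type name.
count_unreachable_by_type() {
    awk '$1 == "unreachable" { n[$2]++ }
         END { for (t in n) print t, n[t] }' | sort
}
```

Typical use would be git fsck --unreachable | count_unreachable_by_type, run inside the repository.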

git show isn't nice on blobs, but here is an example of a dangling commit

commit 478425bef867782e8ff22aca24316e9421288c49
Author: root <root@localhost>
Date:   Mon Dec 31 19:00:01 2012 -0400

    Initial commit

diff --git a/6e5039464b41f39088a4aece64ced787aa2b04ec2dd5ac6f6c6ca4b9a06a99e5 b/6e5039464b41f39088a4aece64ced787aa2b04ec2dd5ac6f6c6ca4b9a06a99e5
new file mode 100644
index 0000000..af12763
Binary files /dev/null and b/6e5039464b41f39088a4aece64ced787aa2b04ec2dd5ac6f6c6ca4b9a06a99e5 differ
diff --git a/8ae4ee273eb540fb71b78152d10010ea2dd3d1bb82afe410ecf3d811cb72bd6d b/8ae4ee273eb540fb71b78152d10010ea2dd3d1bb82afe410ecf3d811cb72bd6d
new file mode 100644
index 0000000..0a6af91
Binary files /dev/null and b/8ae4ee273eb540fb71b78152d10010ea2dd3d1bb82afe410ecf3d811cb72bd6d differ
diff --git a/91bd0c092128cf2e60e1a608c31e92caf1f9c1595f83f2890ef17c0e4881aa0a b/91bd0c092128cf2e60e1a608c31e92caf1f9c1595f83f2890ef17c0e4881aa0a
new file mode 100644
index 0000000..26d921e
Binary files /dev/null and b/91bd0c092128cf2e60e1a608c31e92caf1f9c1595f83f2890ef17c0e4881aa0a differ
diff --git a/9f7728197cfcd9792eef1ff5930a4ab580e38e64291037130f1ad0914e34a1fc b/9f7728197cfcd9792eef1ff5930a4ab580e38e64291037130f1ad0914e34a1fc
new file mode 100644
index 0000000..2a92974
Binary files /dev/null and b/9f7728197cfcd9792eef1ff5930a4ab580e38e64291037130f1ad0914e34a1fc differ
diff --git a/ac801235d97275e761efa12a76ee009472cae8549a0835d5be8bd3f6657047fb b/ac801235d97275e761efa12a76ee009472cae8549a0835d5be8bd3f6657047fb
new file mode 100644
index 0000000..543430c
Binary files /dev/null and b/ac801235d97275e761efa12a76ee009472cae8549a0835d5be8bd3f6657047fb differ
diff --git a/d400d0f616a980ea5e3ef68a1f9d670d1eeccbd27f34d1cb7ea976e1f98e2fb7 b/d400d0f616a980ea5e3ef68a1f9d670d1eeccbd27f34d1cb7ea976e1f98e2fb7
new file mode 100644
index 0000000..7b7eadd
Binary files /dev/null and b/d400d0f616a980ea5e3ef68a1f9d670d1eeccbd27f34d1cb7ea976e1f98e2fb7 differ
diff --git a/e988a26fbabe3f498e2a564096948eafb289ccadfb186423c1f63c5a3b2c19db b/e988a26fbabe3f498e2a564096948eafb289ccadfb186423c1f63c5a3b2c19db
new file mode 100644
index 0000000..3bd1dfa
Binary files /dev/null and b/e988a26fbabe3f498e2a564096948eafb289ccadfb186423c1f63c5a3b2c19db differ

There are several things I don't understand. Why is the author root? I never run git annex with sudo or as root. I think the date is bogus. I'm pretty sure I wasn't even running git annex in 2012 much less working with this repo. . . What is weird is that this is the date for all lost commits! (Same for Author). Over all lost commits there are 2352 binary files that differ. Of these there are 284 unique hashes. . . I don't know what this means other than my repo being seriously messed up. I don't understand what I did wrong to end up in this state as I have been fairly careful in mainly using the webapp.

I wonder if the best way to proceed is to start over, or whether this repo can be recovered.

Thanks, Rasmus

Comment by rasmus — Thu Sep 18 11:28:45 2014


comment 9

That is a very strange commit by every metric. Weird author, weird date, weird filenames in it (not files that git-annex uses!), with apparently some weird binary content (which git-annex would not be committing). Even a weird commit message -- git-annex never makes a commit with a message of "Initial commit", and as far as I can tell using git log -S, it never has. (OTOH, it's a pretty common example message used in eg, git documentation.) So, I feel pretty sure that dangling commit was not made by git-annex.

I think you need to take a look at some of the 4+mb unreachable blobs, to get some idea of what these files are. One way is to use git-show on the hash of one of the blobs to get its content, and then, perhaps pass it to file or strings. Or, you could stop the assistant, git checkout 478425bef867782e8ff22aca24316e9421288c49 and have a look at this strange tree that was apparently committed in 2012 to see what's in there.
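For the first option, one way to get a blob out of the object store and into a regular file is git cat-file. A minimal sketch, with a made-up function name and caller-supplied paths:

```shell
# dump_blob REPO SHA OUTFILE
# Writes the raw content of blob SHA to OUTFILE so it can be examined with
# tools such as file(1) or strings(1).
dump_blob() {
    git -C "$1" cat-file blob "$2" > "$3"
}
```

After dump_blob . SHA test.blob (with SHA taken from the fsck output), file test.blob or strings test.blob can be run on the result.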

It might be possible that the dangling commits come somehow from the remote server. I'm not 100% sure, but I think that a git pack can end up with dangling objects in it, and then git can pull down that pack to get other, non-dangling objects. You should use git show on the server on some of the dangling shas to see if they are present there.
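Checking whether a given sha exists in a repository does not require printing it; git cat-file -e only sets the exit status. A small sketch (the function name has_object is made up), to be run on the server against the repository there:

```shell
# has_object REPO SHA
# Exits 0 if SHA exists in REPO's object store, non-zero otherwise.
has_object() {
    git -C "$1" cat-file -e "$2" 2>/dev/null
}
```

For example, has_object /path/to/repo 478425bef867782e8ff22aca24316e9421288c49 && echo present, where the path is a placeholder.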

Comment by joeyh.name — Thu Sep 18 16:54:10 2014


comment 10

In the meantime, I've been looking over the Annex.Branch code.

stageJournal is only ever called in code paths that commit the updated index, so those code paths cannot result in dangling objects unless git-annex is interrupted before it can commit. (This may explain some of my own repos having a few dangling refs, that were not commits; I could have ctrl-c'd git-annex.)

It's possible for a forced update of the local git-annex branch, done by eg a push from another repo, to overwrite a commit made to it. In this case, the git-annex index is merged with the branch, resulting in a new commit, and the old commit that was overwritten will indeed be dangling. However, git annex sync doesn't overwrite the git-annex branch; it pushes to synced/git-annex, or does a taggedPush to a private ref. It is the case that both those pushes are forced pushes, so can overwrite a branch ref and leave the old commit it pointed to dangling. In the case of taggedPush, the old commit should be a parent of the new, so it won't dangle. In the case of synced/git-annex being overwritten, the old commit could dangle, but only until whatever repo pushed it syncs again, at which time it should get incorporated as one of the parents of the new synced/git-annex it pushes. So, I don't see how long-term dangling commits could happen this way, except for in the case where a repository stops syncing/goes missing/rebases its git-annex branch (ie, git-annex forget is used). (This may explain the 2 dangling commits I found on elephant; we did delete some clones of that repository recently.)

At this point I'm not convinced that the dangling objects I found in my own repos are due to some systematic problem; the above seems like it could explain them, and it is not a problem in the same class as the one Rasmus is having. Of course, it's hard to be sure you've spotted all possible ways that a resource leak can happen, and that's what these dangling objects basically are.

Comment by joeyh.name — Thu Sep 18 17:27:36 2014


comment 11

I knew I had used "Initial commit" somewhere ... etckeeper uses that message. And commits as root. Could an etckeeper repo have somehow gotten merged into your git-annex repo? Seems strange, and the filenames and contents don't really look like /etc to me, but it otherwise somewhat fits.

Comment by joeyh.name — Thu Sep 18 17:32:09 2014


comment 12

Hi Joey,

Thanks for giving the thread a more appropriate title and thanks for the helpful messages.

Let me start with the easy points:

Looking at my log file of installed packages I have never used etckeeper on my system. So unless it could have entered through annex then I think we can rule that one out.

According to git log the repos date from January 2014, when I restarted my repos.

  commit 029a8e76ab5f66aa4390987130985550a1ccd69c
  Author: Rasmus <w530@domain.eu>
  Date:   Thu Jan 23 21:06:13 2014 +0100

  created repository

When I start git repos I typically just use "init" so I don't think I did the 2012 commits.
I checked out one of the 74 MB files. When I do file test.blob it shows test.blob: GPG symmetrically encrypted data (CAST5 cipher). But none of my normal passwords worked. Could such a gpg'ed file be from local network connections where the assistant asks for a passphrase? I'm pretty sure that my transfer repo has only been using gcrypt, and I believe I "restarted" my repos because I switched to gcrypt repos. Also, my transfer repo is 10 GB, which sounds big for a transfer repo.

I performed a similar "analysis" on the conf.annex repo, which should contain almost no binary files (some 16x16 PNGs etc.).

conf.annex has 727 unreachable objects and 3477 commits in total. Of the unreachable objects, 338 are commits. Here's an example of one of the larger unreachable commits.

commit 601c10f9512e8d3502d9dd52ef409560ebb5b7e0
Author: root <root@localhost>
Date:   Mon Dec 31 19:00:01 2012 -0400

     Initial commit

 diff --git a/6fbbea493cdec9d912d256374199cc4c012022d35524c8789a7aceeb953442a5 b/6fbbea493cdec9d912d256374199cc4c012022d35524c8789a7aceeb953442a5
 new file mode 100644
 index 0000000..ea5fcc3
 Binary files /dev/null and b/6fbbea493cdec9d912d256374199cc4c012022d35524c8789a7aceeb953442a5 differ
 diff --git a/91bd0c092128cf2e60e1a608c31e92caf1f9c1595f83f2890ef17c0e4881aa0a b/91bd0c092128cf2e60e1a608c31e92caf1f9c1595f83f2890ef17c0e4881aa0a
 new file mode 100644
 index 0000000..a86c1a9
 Binary files /dev/null and b/91bd0c092128cf2e60e1a608c31e92caf1f9c1595f83f2890ef17c0e4881aa0a differ
 diff --git a/9da3fcfc1635c674012c35d90c21adce3c35440e629d64fe117fe349a6b3e194 b/9da3fcfc1635c674012c35d90c21adce3c35440e629d64fe117fe349a6b3e194
 new file mode 100644
 index 0000000..ef1d71c
 Binary files /dev/null and b/9da3fcfc1635c674012c35d90c21adce3c35440e629d64fe117fe349a6b3e194 differ
 diff --git a/ad4ae79c29b3756f7e41257db7454f3c319112d06385a8bc12d28209a82f2594 b/ad4ae79c29b3756f7e41257db7454f3c319112d06385a8bc12d28209a82f2594
 new file mode 100644
 index 0000000..61d3e5b
 Binary files /dev/null and b/ad4ae79c29b3756f7e41257db7454f3c319112d06385a8bc12d28209a82f2594 differ
 diff --git a/bd0e9cb492077e0c090bc62892c8de438c51a956c8215b2c68de7caa7e2431cc b/bd0e9cb492077e0c090bc62892c8de438c51a956c8215b2c68de7caa7e2431cc
 new file mode 100644
 index 0000000..92e9bd7
 Binary files /dev/null and b/bd0e9cb492077e0c090bc62892c8de438c51a956c8215b2c68de7caa7e2431cc differ

Across all commits 6006 objects are mentioned, but only 371 are unique.

I checked out one blob and again file reports GPG symmetrically encrypted data (CAST5 cipher). Interestingly, for conf.annex I get this line when trying to decrypt:

gpg: DBG: cleared passphrase cached with ID: SBF83A0F822D0F664

For doc.annex I get

gpg: DBG: cleared passphrase cached with ID: S32DEAD1E8DD06A4D

And on my other computer I see a third ID. I'm not sure if this means anything when files are symmetrically encrypted, though.

Comment by rasmus — Fri Sep 19 00:43:56 2014


I know what it is now

These objects are the ones written by git-remote-gcrypt when pushing to a remote. That's why the weird dates, root pseudo-commit, crazy filenames, and big gpg encrypted blobs. All countermeasures that git-remote-gcrypt uses to keep your encrypted git remote safe and not leak information about what's in it.

So, this is a bug in git-remote-gcrypt. It needs to clean these objects up after pushing them! (Also after failed pushes.)

Comment by joeyh.name — Fri Sep 19 02:43:22 2014


comment 14

https://github.com/bluss/git-remote-gcrypt/issues/16

Comment by joeyh.name — Fri Sep 19 02:59:36 2014


comment 15

Brilliant! Thanks for taking time to analyze the issue and taking the bug to gcrypt.

[I'm surprised that a different key than my git-annex key is used and that it's a symmetric key, but I will explore the technology on my own].

Comment by rasmus — Fri Sep 19 06:29:58 2014

