Please describe the problem.
`git annex import` is dangerous when you have unused objects in your annex. You can potentially lose your filenames and be left with only the objects containing the data.
What steps will reproduce the problem?
echo "foo" > /tmp/foo
echo "bar" > /tmp/bar
echo "baz" > /tmp/baz
cd ~/annex
cp /tmp/{foo,bar,baz} .
git annex add ./{foo,bar,baz}
# I decide I want to abort this particular commit, so I reset
git reset --hard
# At this point, git reset removed the symlinks from our index, but the objects containing the file content still exist in the git store.
# A few days later I decide to import another backup of my data from this location: /tmp/myotherbackup/files/{foo,bar,baz}
# This command considers foo, bar and baz duplicates, because objects
# associated with their content already exist (even though those objects
# are unused), so it removes them from the source directory without
# adding symlinks for them:
git annex import --deduplicate /tmp/myotherbackup/files/*
# Alternatively, this command also treats foo, bar and baz as duplicates
# and removes them from the source directory. Either way, we are left with
# unused objects in the object store, but no idea what the filenames were:
git annex import --clean-duplicates /tmp/myotherbackup/files/*
What version of git-annex are you using? On what operating system?
$ ga version
git-annex version: 5.20140717
build flags: Assistant Inotify DBus TDFA
key/value backends: SHA256E SHA1E SHA512E SHA224E SHA384E SHA256 SHA1 SHA512 SHA224 SHA384 WORM URL
remote types: git gcrypt bup directory rsync web glacier ddar hook external
local repository version: 5
supported repository version: 5
upgrade supported from repository versions: 0 1 2 4
OS: Fedora 25
Please provide any additional information below.
Have you had any luck using git-annex before? (Sometimes we get tired of reading bug reports all day and a lil' positive end note does wonders)
I love git-annex! I used it to manage my Pictures.
A simpler way to get to the same end result, without using git-annex import:
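For instance, a sketch of the idea, reusing the `~/annex` repository from the steps above:

cd ~/annex
echo "foo" > foo
git annex add foo
# the reset deletes the staged symlink, but the content stays
# behind in .git/annex/objects
git reset --hard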
Now "foo" is only present in the git-annex object store, and we don't know what its filename(s) were.
A way to get to the same end result, without using git-annex:
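For instance, a sketch in a throwaway plain-git repository (the path is arbitrary):

git init /tmp/plain && cd /tmp/plain
git commit --allow-empty -m 'init'   # reset --hard needs an existing HEAD
echo "foo" > foo
git add foo
# the reset deletes the staged file; its blob stays in .git/objects
git reset --hard
git fsck --unreachable               # shows the now-dangling blob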
Now "foo" is only present in the git object store, and we don't know what its filename(s) were.
So, it's not `git-annex import`, or `git-annex add`, or `git add` that is dangerous here. It's `git reset --hard`. git will happily lose lots of data until you commit it. Once it's committed, it's safe. That's the rule of thumb. Nothing much that git-annex can do about that.
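To illustrate that rule of thumb, a small sketch (hypothetical file name; assumes at least one earlier commit exists): committed data stays reachable through the reflog even after a hard reset:

git add foo
git commit -m 'add foo'
git reset --hard HEAD^    # throw the commit away...
git reflog                # ...but the reflog still lists it
git show 'HEAD@{1}:foo'   # and foo's content is still retrievable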
One way to fix this would be to make `git reset --hard` first make a commit of the current state of the index, and store that commit in the reflog. Of course, that's quite similar to `git stash`, and probably most of us have just gotten into the habit of running `git stash` instead.
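For instance, a sketch of stashing instead of resetting (file names taken from the steps above):

git annex add ./{foo,bar,baz}
# instead of `git reset --hard`, stash the staged symlinks;
# git stash records a commit, so the filenames stay reachable
git stash
git stash list   # the stash entry still references the symlinks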
Thanks for the insight.
The `git stash` solution works assuming you are either:

a. going to keep it in your stash forever, or
b. going to commit your stash eventually.
I think there are situations where I want to completely abort a commit and not have to worry about it biting me down the road.
IMO, from an end-user perspective, the best solution would be for data to count as a duplicate only if it has a reachable file in your annex, for the `--deduplicate`, `--clean-duplicates` and `--skip-duplicates` options of `git annex import`.

What would be the downside to this?
Worst case, this re-wires some symlinks to once-dangling git objects. They still aren't duplicates; there will only be one symlink per formerly dangling git object. This seems better than data loss.
Thoughts?
If you don't want to have to worry about a `git reset --hard` biting you down the road the way that it works now, just make sure you clean up after yourself.
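For example, a sketch of that cleanup using the stock `git annex unused` and `git annex dropunused` commands, assumed to run right after the reset:

git reset --hard
# list annexed content not referred to by any branch or tag
git annex unused
# drop everything the unused pass found (review the list first;
# numeric ranges also work in place of "all")
git annex dropunused all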
The difficulty with checking whether the content to be imported is referred to somewhere in the working tree is that there's no inexpensive way to determine that. It would have to run `git log -n1 -S$KEY` for each file. That can take quite a long time in repositories with a lot of history; I clocked it at 12 seconds per file on an SSD, and it will be quite a lot slower on a spinning disc.

I suppose that check could be added, with a --fast option to skip it.
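A rough sketch of what that per-file check would look like, using `git annex calckey` to compute each file's key (illustrative only, not git-annex's actual implementation):

for f in /tmp/myotherbackup/files/*; do
    key=$(git annex calckey "$f")
    # search the whole history for a commit that added or removed this key
    git log -n1 --oneline -S"$key"
done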
PS, mbroadhead's is a good approach. Note though that the dropunused content will be considered a duplicate by import since git-annex version 6.20170214. Still, --deduplicate and --clean-duplicates won't delete the files from the import location in this case, since there are no copies of the content in the annex.
To the extent that this is a bug in git-annex import, it's solved by using the new feature of importing a tree from a directory special remote. When used that way, there's no --deduplicate or --clean-duplicates option that causes this problem. Instead it makes git commits tracking the content of the remote directory, and as long as you merge that remote into your master branch, the original filename will be preserved in git.
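For example, a sketch of that workflow against the directory from the original report (the remote name `mybackup` is an arbitrary choice):

git annex initremote mybackup type=directory directory=/tmp/myotherbackup/files encryption=none importtree=yes
git annex import master --from mybackup
# merging the generated branch records the original filenames in git
git merge mybackup/master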
So once "remove legacy import directory interface" happens, I guess this could be considered fixed, to the extent it's a bug in git-annex at all and not with using `git reset --hard`, which as I showed could lead to the same thing in vanilla git.