Please describe the problem.
`git annex import` is dangerous when you have unused objects in your annex. You can potentially lose your filenames and be left with only the objects containing the data.
What steps will reproduce the problem?
echo "foo" > /tmp/foo
echo "bar" > /tmp/bar
echo "baz" > /tmp/baz
cd ~/annex
cp /tmp/{foo,bar,baz} .
git annex add ./{foo,bar,baz}
# I decide I want to abort this particular commit, so I reset
git reset --hard
# At this point, git reset removed the symlinks from our index, but the objects containing the file content still exist in the git store.
# A few days later I decide to import another backup of my data from this location: /tmp/myotherbackup/files/{foo,bar,baz}
# This command considers foo, bar and baz duplicates, because objects
# associated with their content already exist (even though those objects
# are unused), so it removes them from the source directory without
# adding symlinks for them:
git annex import --deduplicate /tmp/myotherbackup/files/*
# Alternatively, this command also treats foo, bar and baz as duplicates
# and removes them from the source directory. Either way, we are left with
# unused objects in the object store, but no idea what the filenames were:
git annex import --clean-duplicates /tmp/myotherbackup/files/*
What version of git-annex are you using? On what operating system?
$ ga version
git-annex version: 5.20140717
build flags: Assistant Inotify DBus TDFA
key/value backends: SHA256E SHA1E SHA512E SHA224E SHA384E SHA256 SHA1 SHA512 SHA224 SHA384 WORM URL
remote types: git gcrypt bup directory rsync web glacier ddar hook external
local repository version: 5
supported repository version: 5
upgrade supported from repository versions: 0 1 2 4
OS: Fedora 25
Please provide any additional information below.
Have you had any luck using git-annex before? (Sometimes we get tired of reading bug reports all day and a lil' positive end note does wonders)
I love git-annex! I used it to manage my Pictures.
A simpler way to get to the same end result, without using git-annex import:
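For instance, a sketch of the idea, reusing the `~/annex` repository from the steps above:

cd ~/annex
echo "foo" > foo
git annex add foo
# the reset deletes the staged symlink, but the content stays
# behind in .git/annex/objects
git reset --hard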
Now "foo" is only present in the git-annex object store, and we don't know what its filename(s) were.
A way to get to the same end result, without using git-annex:
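For instance, a sketch in a throwaway plain-git repository (the path is arbitrary):

git init /tmp/plain && cd /tmp/plain
git commit --allow-empty -m 'init'   # reset --hard needs an existing HEAD
echo "foo" > foo
git add foo
# the reset deletes the staged file; its blob stays in .git/objects
git reset --hard
git fsck --unreachable               # shows the now-dangling blob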
Now "foo" is only present in the git object store, and we don't know what its filename(s) were.
So, it's not `git-annex import`, or `git-annex add`, or `git add` that is dangerous here. It's `git reset --hard`. git will happily lose lots of data until you commit it. Once it's committed, it's safe. That's the rule of thumb. Nothing much that git-annex can do about that.
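To illustrate that rule of thumb, a small sketch (hypothetical file name; assumes at least one earlier commit exists): committed data stays reachable through the reflog even after a hard reset:

git add foo
git commit -m 'add foo'
git reset --hard HEAD^    # throw the commit away...
git reflog                # ...but the reflog still lists it
git show 'HEAD@{1}:foo'   # and foo's content is still retrievable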
One way to fix this would be to make `git reset --hard` first make a commit of the current state of the index, and store that commit in the reflog. Of course, that's quite similar to `git stash`, and probably most of us have just gotten into the habit of running `git stash` instead.
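For instance, a sketch of stashing instead of resetting (file names taken from the steps above):

git annex add ./{foo,bar,baz}
# instead of `git reset --hard`, stash the staged symlinks;
# git stash records a commit, so the filenames stay reachable
git stash
git stash list   # the stash entry still references the symlinks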
Thanks for the insight.
The `git stash` solution works assuming you are either:

a. going to keep it in your stash forever, or
b. going to commit your stash eventually.
I think there are situations where I want to completely abort a commit and not have to worry about it biting me down the road.
IMO, from an end-user perspective, the best solution would be for data to count as a duplicate only if it has a reachable file in your annex, for the `--deduplicate`, `--clean-duplicates` and `--skip-duplicates` options of `git annex import`.

What would be the downside to this?
Worst case, this re-wires some symlinks to once-dangling git objects. They still aren't duplicates; there will only be one symlink per formerly dangling git object. This seems better than data loss.
Thoughts?
If you don't want to have to worry about a `git reset --hard` biting you down the road the way that it works now, just make sure you clean up after yourself.
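For example, a sketch of that cleanup using the stock `git annex unused` and `git annex dropunused` commands, assumed to run right after the reset:

git reset --hard
# list annexed content not referred to by any branch or tag
git annex unused
# drop everything the unused pass found (review the list first;
# numeric ranges also work in place of "all")
git annex dropunused all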
The difficulty with checking whether the content to be imported is referred to somewhere in the working tree is that there's no inexpensive way to determine that. It would have to run `git log -n1 -S$KEY` for each file. That can take quite a long time in repositories with a lot of history; I clocked it at 12 seconds per file on an SSD, and it will be quite a lot slower on a spinning disc.

I suppose that check could be added, with a --fast option to skip it.
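A rough sketch of what that per-file check would look like, using `git annex calckey` to compute each file's key (illustrative only, not git-annex's actual implementation):

for f in /tmp/myotherbackup/files/*; do
    key=$(git annex calckey "$f")
    # search the whole history for a commit that added or removed this key
    git log -n1 --oneline -S"$key"
done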
PS, mbroadhead's is a good approach. Note though that the dropunused content will be considered a duplicate by import since git-annex version 6.20170214. Still, --deduplicate and --clean-duplicates won't delete the files from the import location in this case, since there are no copies of the content in the annex.
To the extent that this is a bug in git-annex import, it's solved by using the new feature of importing a tree from a directory special remote. When used that way, there's no --deduplicate or --clean-duplicates option that causes this problem. Instead it makes git commits tracking the content of the remote directory, and as long as you merge that remote into your master branch, the original filename will be preserved in git.
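For example, a sketch of that workflow against the directory from the original report (the remote name `mybackup` is an arbitrary choice):

git annex initremote mybackup type=directory directory=/tmp/myotherbackup/files encryption=none importtree=yes
git annex import master --from mybackup
# merging the generated branch records the original filenames in git
git merge mybackup/master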
So once "remove legacy import directory interface" happens, I guess this could be considered fixed, to the extent it's a bug in git-annex at all and not with using `git reset --hard`, which as I showed could lead to the same thing in vanilla git.