I am storing the pictures I took over the years with git-annex. Frequently I come across messy old directories with lots of pictures inside and I want to know which ones were already annexed - or which ones were not. Is there a quick way to test whether the content of a given file is already annexed? I mean computing the key (hash) of the given file and testing whether it is already present among the annex objects.
I was going to suggest 'info', but see http://git-annex.branchable.com/bugs/git_annex_info_is_reporting_file_as_not_annexed_in_direct_mode/ . What about 'git annex whereis FILE'? It should produce empty output for a non-annexed file.
First of all, thank you for the answer and the bug report. Unfortunately 'git annex whereis' does not seem to be the answer to my problem, because it only works when queried on already annexed files, while I'd like to test the not-yet-annexed ones. Here's an example using whereis:
Hm.. having some kind of exposure of the key generation code on the command line would actually be pretty useful. So you can do something like:
Probably worth a TODO, really.
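In the meantime, here is a rough sketch of what such a check could look like by hand. It assumes GNU 'sha256sum' and 'readlink' are available, that the repository uses a SHA256 or SHA256E backend, and that annexed files are symlinks (so not direct mode). The function names are made up for illustration and are not part of git-annex.

```shell
#!/bin/sh
# Sketch only: these helpers are hypothetical, not git-annex commands.

# Print the key a file would get under the plain SHA256 backend.
# Keys follow the layout BACKEND-sSIZE--HASH.
annex_sha256_key() {
    file=$1
    size=$(wc -c < "$file" | tr -d ' ')
    hash=$(sha256sum < "$file" | cut -d ' ' -f 1)
    printf 'SHA256-s%s--%s\n' "$size" "$hash"
}

# Succeed if some symlink under REPO points at an annex object whose key
# contains the SHA256 of FILE. Matches both SHA256 and SHA256E keys,
# because only the "--HASH" part of the link target is compared.
content_is_annexed() {
    file=$1
    repo=$2
    hash=$(sha256sum < "$file" | cut -d ' ' -f 1)
    find "$repo" -path "$repo/.git" -prune -o -type l -exec readlink {} \; |
        grep -qF -e "--$hash"
}
```

Something like 'content_is_annexed old/IMG_0001.jpg ~/annex && echo "already annexed"' would then report files whose content is already referenced by the working tree (both paths are placeholders). It scans every symlink, so it is slow on big repositories.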
Regarding your situation, one way of doing it would be to recursively copy the photo directory as hardlinks, run 'git annex import --clean-duplicates' on the hardlinked copies, then diff the directories. This would give you a list of removals, and those removed files are already in the repo. Or if you just want to remove them, just run 'git annex import --clean-duplicates' on the original photos directory. NOTE: There was recently an issue with git-annex deleting files that it didn't have any known copies of, so a recent version is highly recommended if using --clean-duplicates.
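The hardlink approach could be scripted roughly as follows. This is a sketch assuming GNU 'cp' (for '-al') and that the copy lives on the same filesystem as the originals; the helper names and the paths in the usage comment are placeholders.

```shell
#!/bin/sh
# Sketch only: helper names and paths are illustrative.

# Make a recursive hardlinked copy of a directory (GNU cp assumed).
hardlink_copy() {
    cp -al "$1" "$2"
}

# List files that exist under ORIG but are missing under COPY,
# i.e. the ones that were removed from the copy as duplicates.
missing_in_copy() {
    orig=$1
    copy=$2
    (cd "$orig" && find . -type f) | sort | while IFS= read -r f; do
        [ -e "$copy/$f" ] || printf '%s\n' "${f#./}"
    done
}

# Intended use (not run here, needs git-annex and a real repository):
#   hardlink_copy ~/old-photos ~/old-photos.links
#   (cd ~/annex && git annex import --clean-duplicates ~/old-photos.links)
#   missing_in_copy ~/old-photos ~/old-photos.links
```

Since only the hardlinked copies get deleted, the original directory stays untouched, and the printed list names the files whose content is already in the annex.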
You can use either 'git annex import --deduplicate' or 'git annex import --skip-duplicates' to import files from a directory except for ones already in the repository. The former deletes the duplicate files, and the latter leaves them as-is. I don't think there's currently a good stand-alone way to check if a file is a duplicate of content already in the annex, before adding it. Would need a new command to be added.
Of course, if you simply 'git annex add' everything, regular hash-based deduplication will point the duplicate file to the same object used by the file when you added it earlier. So, you don't need to worry about adding duplicates wasting much space. They may make your repo more cluttered than you like, is all. See also this tip: finding duplicate files.