Maybe you had a lot of files scattered around on different drives, and you added them all into a single git-annex repository. Some of the files are surely duplicates of others.
While git-annex stores the file contents efficiently, it would still help in cleaning up this mess if you could find, and perhaps remove, the duplicate files.
Here's a command line that will show duplicate sets of files grouped together:
    git annex find --include '*' --format='${file} ${escaped_key}\n' | \
    sort -k2 | uniq --all-repeated=separate -f1 | \
    sed 's/ [^ ]*$//'
Here's a command line that will remove one of each duplicate set of files:
    git annex find --include '*' --format='${file} ${escaped_key}\n' | \
    sort -k2 | uniq --repeated -f1 | sed 's/ [^ ]*$//' | \
    xargs -d '\n' git rm
--Joey
Spaces and other special chars can make filename handling ugly. If you don't have a restriction on keeping the exact filenames, then it might be easiest just to get rid of the problematic chars.
Maybe you can run something like this before checking for duplicates.
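A sketch of one possibility (not the commenter's original command), assuming you just want spaces in tracked filenames replaced with underscores:

    # Rename tracked files whose basenames contain spaces, replacing each
    # space with an underscore; git mv keeps the renames tracked.
    git ls-files | grep ' ' | while IFS= read -r f; do
        dir=$(dirname "$f")
        base=$(basename "$f" | tr ' ' '_')
        [ "$f" = "$dir/$base" ] && continue   # the space was only in a directory name
        git mv "$f" "$dir/$base"
    done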
Is there any simple way to search for files with a given key?
At the moment, the best I've come up with is this:
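One similarly longwinded possibility (only a sketch, with <KEY> as a placeholder for the key, and not necessarily the exact command Chris used) is to list every file together with its key and grep for the one in question:

    git annex find --include '*' --format='${key} ${file}\n' | grep "^<KEY> " | cut -d' ' -f2-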
where <KEY> is the key. This seems like an awfully longwinded approach, but I don't see anything in the docs indicating a simpler way to do it. Am I missing something?

@Chris I guess there's no really easy way, because searching for a given key is not something many people need to do.
However, git does provide a way. Try
    git log --stat -S $KEY
Thanks. I have quite a lot of papers in PDF format. Now I'm saving space, I have them under control and synchronized with many devices, and I've found more than 200 duplicates. Is there a way to donate to the project? You really deserve it. Thanks.
@Juan the best thing to do is tell people about git-annex, help them use it, and file bug reports. Just generally be part of the git-annex community.
(If you really want to donate to me, http://campaign.joeyh.name/ is still open.)
I used the following shell pipeline to remove duplicate files in one go:
The cut option -f 4- ensures that dashes in the filename do not result in truncation. The awk option -vRS sets a blank line as the record separator, and -vFS sets newline as the field separator. The for-loop prints each field except the first, and those paths are passed to git rm.
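A rough equivalent that removes duplicates in a single pass (a sketch using awk to track keys it has already seen, rather than the cut/uniq combination described above) could be:

    # keep the first file seen for each key; print (and git rm) the rest
    git annex find --include '*' --format='${escaped_key} ${file}\n' | \
    awk 'seen[$1]++ { sub(/^[^ ]+ /, ""); print }' | \
    xargs -r -d '\n' git rm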
My method uses Perl to do a lot of the work, cutting out the need for sorting and for being careful about spaces and such. Below is an (untested) command line version (my version has the perl in ~/bin/annex-dupe.pl):
And the equivalent "one liner":
It works by getting a list of keys and paths and passing them to Perl, which prefixes the first instance of each key's path with a '#'. Those lines are then filtered out by grep, leaving only the duplicate paths to be passed to xargs and thus to 'git rm'.
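Based on that description, a hypothetical reconstruction (not the actual ~/bin/annex-dupe.pl) might look like:

    # prefix the first path seen for each key with '# ', drop those lines with
    # grep, and git rm the remaining (duplicate) paths
    git annex find --include '*' --format='${escaped_key} ${file}\n' | \
    perl -lne 'my ($k, $f) = split(/ /, $_, 2); print(($seen{$k}++ ? "" : "# ") . $f)' | \
    grep -v '^# ' | \
    xargs -r -d '\n' git rm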
This can be particularly handy as it lets you delete duplicates from specific subdirectories, just by adding another 'grep DIR/PATH' in front of xargs, without worrying you will lose all references if all instances are in DIR/PATH (because the first one will have been removed from the file list by the first grep!).
For example, after outputting all the duplicates to a file (~/tmp/annex_dupe.txt), I will then run a loop over it if I want more control over where things are removed from.
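For instance, a loop of that sort (with some/subdir/ as a purely illustrative path) might be:

    # ~/tmp/annex_dupe.txt already excludes the first copy of each key, so it
    # is safe to delete everything listed under the chosen directory
    grep '^some/subdir/' ~/tmp/annex_dupe.txt | while IFS= read -r f; do
        git rm -- "$f"
    done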
I think it is worth mentioning that in the script I showed above, the "uniqueness" key can be set to anything.
Just recently, I ran a de-duplication where I set that key to be the content key plus the file name, making it remove only files that have both the same content and the same filename. As I have a lot of files without their proper filenames (e.g. recovered with photorec), this prevents me from removing the version with the proper filename while keeping the recovered one.
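A sketch of that idea (hypothetical, not the commenter's actual script), using the content key plus the file's basename as the uniqueness key:

    # two files only count as duplicates when both the key and the basename match
    git annex find --include '*' --format='${escaped_key} ${file}\n' | \
    perl -lne 'my ($k, $f) = split(/ /, $_, 2); my ($b) = $f =~ m{([^/]+)$}; print $f if $seen{"$k $b"}++' | \
    xargs -r -d '\n' git rm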
Very useful (in my sorting-annex case anyway).
For anyone dealing with files with spaces, try this:
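Presumably something along the lines of Joey's first pipeline with ${escaped_file} substituted for ${file}; as a sketch:

    git annex find --include '*' --format='${escaped_file} ${escaped_key}\n' | \
    sort -k2 | uniq --all-repeated=separate -f1 | \
    sed 's/ [^ ]*$//'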
Using escaped_file escapes the filename, which avoids whitespace problems so the rest of the piped commands work correctly. You'll need to deal with the files being escaped in the final output, but you'll see them correctly. This worked for me.

I leave this here for people who understand Python. I wrote the output of Joey's first script to the file "duplicates". You'll want to comment out the last line while trying it, and add some print statements.
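A hypothetical sketch of such a script, assuming "duplicates" holds the blank-line-separated groups of duplicate paths from Joey's first pipeline and that the final line is what actually removes files:

    #!/usr/bin/env python3
    # Keep the first path in each duplicate group; git rm the rest.
    import subprocess

    with open("duplicates") as f:
        groups = f.read().strip().split("\n\n")

    to_remove = []
    for group in groups:
        paths = [line for line in group.splitlines() if line]
        to_remove.extend(paths[1:])   # keep the first copy of each group

    for path in to_remove:
        print("would remove:", path)

    # comment this out while trying things, as suggested above
    if to_remove:
        subprocess.run(["git", "rm", "--"] + to_remove, check=True)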