Maybe you had a lot of files scattered around on different drives, and you added them all into a single git-annex repository. Some of the files are surely duplicates of others.
While git-annex stores the file contents efficiently, it would still help in cleaning up this mess if you could find, and perhaps remove, the duplicate files.
Here's a command line that will show duplicate sets of files grouped together:
    git annex find --include '*' --format='${file} ${escaped_key}\n' | \
    sort -k2 | uniq --all-repeated=separate -f1 | \
    sed 's/ [^ ]*$//'
Here's a command line that will remove one of each duplicate set of files:
    git annex find --include '*' --format='${file} ${escaped_key}\n' | \
    sort -k2 | uniq --repeated -f1 | sed 's/ [^ ]*$//' | \
    xargs -d '\n' git rm
--Joey
Spaces and other special chars can make filename handling ugly. If you don't have a restriction on keeping the exact filenames, then it might be easiest just to get rid of the problematic chars.
Maybe you can run something like this before checking for duplicates.
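A sketch of one possibility (not the commenter's original command), assuming you just want spaces in tracked filenames replaced with underscores:

    # Rename tracked files whose basenames contain spaces, replacing each
    # space with an underscore; git mv keeps the renames tracked.
    git ls-files | grep ' ' | while IFS= read -r f; do
        dir=$(dirname "$f")
        base=$(basename "$f" | tr ' ' '_')
        [ "$f" = "$dir/$base" ] && continue   # the space was only in a directory name
        git mv "$f" "$dir/$base"
    done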
Is there any simple way to search for files with a given key?
At the moment, the best I've come up with is this:
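One similarly longwinded possibility (only a sketch, with <KEY> as a placeholder for the key, and not necessarily the exact command Chris used) is to list every file together with its key and grep for the one in question:

    git annex find --include '*' --format='${key} ${file}\n' | grep "^<KEY> " | cut -d' ' -f2-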
where <KEY> is the key. This seems like an awfully longwinded approach, but I don't see anything in the docs indicating a simpler way to do it. Am I missing something?

@Chris I guess there's no really easy way, because searching for a given key is not something many people need to do.
However, git does provide a way. Try
    git log --stat -S $KEY
Thanks. I have quite a lot of papers in PDF format. Now I'm saving space, I have them under control and synchronized with many devices, and I've found more than 200 duplicates. Is there a way to donate to the project? You really deserve it. Thanks.
@Juan the best thing to do is tell people about git-annex, help them use it, and file bug reports. Just generally be part of the git-annex community.
(If you really want to donate to me, http://campaign.joeyh.name/ is still open.)
I used the following shell pipeline to remove duplicate files in one go:
The cut option -f 4- ensures that dashes in the filename do not result in truncation. The awk option -vRS sets a blank line as the record separator, and -vFS sets newline as the field separator. The for-loop prints each field except the first, and those paths are passed to git rm.
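A rough equivalent that removes duplicates in a single pass (a sketch using awk to track keys it has already seen, rather than the cut/uniq combination described above) could be:

    # keep the first file seen for each key; print (and git rm) the rest
    git annex find --include '*' --format='${escaped_key} ${file}\n' | \
    awk 'seen[$1]++ { sub(/^[^ ]+ /, ""); print }' | \
    xargs -r -d '\n' git rm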
My method uses Perl to do a lot of the work, cutting out the need for sorting and for being careful about spaces and such. Below is an (untested) command line version (my version has the perl in ~/bin/annex-dupe.pl):
And the equivalent "one liner":
It works by getting a list of keys and paths and passing them to Perl, which prefixes the first instance of each key's path with a '#'. Those lines are then filtered out by grep, leaving only the duplicate paths to be passed to xargs and thus to 'git rm'.
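Based on that description, a hypothetical reconstruction (not the actual ~/bin/annex-dupe.pl) might look like:

    # prefix the first path seen for each key with '# ', drop those lines with
    # grep, and git rm the remaining (duplicate) paths
    git annex find --include '*' --format='${escaped_key} ${file}\n' | \
    perl -lne 'my ($k, $f) = split(/ /, $_, 2); print(($seen{$k}++ ? "" : "# ") . $f)' | \
    grep -v '^# ' | \
    xargs -r -d '\n' git rm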
This can be particularly handy as it lets you delete duplicates from specific subdirectories, just by adding another 'grep DIR/PATH' in front of xargs, without worrying you will lose all references if all instances are in DIR/PATH (because the first one will have been removed from the file list by the first grep!).
For example, after outputting all the duplicates to a file (~/tmp/annex_dupe.txt), I will then run a loop over it if I want more control over where things are removed from.
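For instance, a loop of that sort (with some/subdir/ as a purely illustrative path) might be:

    # ~/tmp/annex_dupe.txt already excludes the first copy of each key, so it
    # is safe to delete everything listed under the chosen directory
    grep '^some/subdir/' ~/tmp/annex_dupe.txt | while IFS= read -r f; do
        git rm -- "$f"
    done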
I think it is worth mentioning that in the script I showed above, the "uniqueness" key can be set to anything.
Just recently, I ran a de-duplication where I set that key to be the content key plus the file name, making it remove only files that have both the same content and the same filename. As I have a lot of files without their proper filenames (e.g. recovered with photorec), this prevents me from removing the version with the proper filename while keeping the recovered one.
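A sketch of that idea (hypothetical, not the commenter's actual script), using the content key plus the file's basename as the uniqueness key:

    # two files only count as duplicates when both the key and the basename match
    git annex find --include '*' --format='${escaped_key} ${file}\n' | \
    perl -lne 'my ($k, $f) = split(/ /, $_, 2); my ($b) = $f =~ m{([^/]+)$}; print $f if $seen{"$k $b"}++' | \
    xargs -r -d '\n' git rm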
Very useful (in my sorting-annex case anyway).
For anyone dealing with files with spaces, try this:
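Presumably something along the lines of Joey's first pipeline with ${escaped_file} substituted for ${file}; as a sketch:

    git annex find --include '*' --format='${escaped_file} ${escaped_key}\n' | \
    sort -k2 | uniq --all-repeated=separate -f1 | \
    sed 's/ [^ ]*$//'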
Using escaped_file escapes the filename, which avoids whitespace problems so the rest of the piped commands work correctly. You'll need to deal with the files being escaped in the final output, but you'll see them correctly. This worked for me.

I leave this here for people who understand Python. I wrote the output of Joey's first script to the file "duplicates". You'll want to comment out the last line while trying it, and add some print statements.
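A hypothetical sketch of such a script, assuming "duplicates" holds the blank-line-separated groups of duplicate paths from Joey's first pipeline and that the final line is what actually removes files:

    #!/usr/bin/env python3
    # Keep the first path in each duplicate group; git rm the rest.
    import subprocess

    with open("duplicates") as f:
        groups = f.read().strip().split("\n\n")

    to_remove = []
    for group in groups:
        paths = [line for line in group.splitlines() if line]
        to_remove.extend(paths[1:])   # keep the first copy of each group

    for path in to_remove:
        print("would remove:", path)

    # comment this out while trying things, as suggested above
    if to_remove:
        subprocess.run(["git", "rm", "--"] + to_remove, check=True)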