What is hashdeep

hashdeep is a handy tool that allows you to check file integrity across whole directory trees. It can detect renames and missing files, for example.

How to use it with git-annex

The general working principle of hashdeep is that it iterates over a set of files and produces a manifest that looks like this:

$ hashdeep -r *
%%%% HASHDEEP-1.0
%%%% size,md5,sha256,filename
## Invoked from: /home/jessek
## $ hashdeep -r archives bin lib doc

This manifest can later be used to check the consistency of the files. Because git-annex also uses hashes to identify files, it fits this pattern nicely, and I have used it to verify files that were outside of git-annex's control but originally came from the repository. First, we produce the manifest file:

(
echo '%%%% HASHDEEP-1.0'
echo '%%%% size,sha256,filename'
git annex find --format '${bytesize},${keyname},${file}\n' | sed 's/\.[^,]*,/,/'
) > manifest.txt
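
Before relying on the manifest, it may be worth checking that every entry really has the size,sha256,filename shape announced in the header, since a leftover extension in the key field would make the hash unusable. The following is only a sketch: check_manifest is a hypothetical helper name, and it assumes 64-character lowercase hexadecimal SHA256 hashes.

```shell
# Hypothetical helper: report manifest entries that do not look like
# "size,sha256,filename". Prints the offending lines and returns
# non-zero if any are found.
check_manifest() {
    awk -F, '
        /^[%#]/ { next }    # skip header and comment lines
        $1 !~ /^[0-9]+$/ || $2 !~ /^[0-9a-f]+$/ || length($2) != 64 || NF < 3 {
            printf "%s:%d: %s\n", FILENAME, NR, $0
            bad = 1
        }
        END { exit bad }
    ' "$1"
}
```

Running check_manifest manifest.txt should print nothing and succeed when all entries are well-formed.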

Then this can be used to verify an external fileset with the following command:

hashdeep -r -k manifest.txt -a -vv -e /mnt/ > result

This creates a listing of every file that was moved, is missing, and so on. I used this to audit the files on my phone's microSD card, where it turned out that about half of the files were corrupted for some mysterious reason:

hashdeep: Audit failed
   Input files examined: 0
  Known files expecting: 0
          Files matched: 0
Files partially matched: 0
            Files moved: 3411
        New files found: 2179
  Known files not found: 42117

The non-zero numbers are interesting: 3411 files were detected as intact, with only their filenames changed. 2179 files were "new", which means that they were not in the original set. Since files were supposed to come only from the original set, this means those files were corrupt. Actually, that's not completely true: some files (JPG image files, namely) were created in the external fileset, so I had to be careful to exclude those false positives by hand. The 42117 "known files not found" were files that were simply not transferred over to the phone for lack of space.

This way, I was able to quickly find which files were corrupt and remove them. The following command lists the files to remove:

grep 'No match' result | grep -v '\.jpg' | sed 's/: No match$//'

And I used the following loop to remove the files one by one:

grep 'No match' result | grep -v '\.jpg' | sed 's/: No match$//' | while IFS= read -r file; do rm -- "$file" ; done

Note that the above is quite dangerous: you might want to put an echo in front of the rm first to review what would be deleted, especially if you do not trust the filesystem.
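
As a rough sketch of such a dry run, the removal loop above can be split into two steps: one that lists the candidates and one that only prints the rm commands. The function names list_no_match and dry_run_remove are made up for illustration, and the sketch assumes that the result file marks unmatched files with a trailing ": No match", as in the commands above.

```shell
# Hypothetical helper: print the paths reported as "No match" in the
# given hashdeep result file, skipping .jpg files, one path per line.
list_no_match() {
    grep ': No match$' "$1" | grep -v '\.jpg: No match$' | sed 's/: No match$//'
}

# Dry run: print the rm commands instead of running them.
dry_run_remove() {
    list_no_match "$1" | while IFS= read -r file; do
        printf 'rm -- %s\n' "$file"
    done
}
```

Calling dry_run_remove result prints one rm line per candidate file; once the list looks right, the printf can be swapped for the real rm.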

How else this might work

Naturally, I could have imported all the files into git-annex and worked only with git-annex to do this. But because the files were renamed to some canonical form by the software transferring them (dSub and Airsonic), it would have been difficult to make a diff with the original set. This is on an (ex)FAT filesystem too, which might make git-annex operation difficult. Yet I can't help but think this is something that git-annex-export should be able to do, although I am not sure it could deal with the renames. And I must say I have found it a little inconvenient to have to initremote just to use what are essentially ephemeral storage mountpoints.

The above procedure reuses the best of both worlds: hashdeep does the fuzzy matching and git-annex provides the catalog of files.

Future improvements

It would be nice if git-annex-find allowed listing only the checksum, which would remove the potentially error-prone pattern substitution above (sed 's/\.[^,]*,/,/'). The substitution is necessary because ${keyname} includes the file extension, which is expected with the SHA256E backend but somewhat inconvenient to deal with. Of course, it would be pretty awesome if git-annex could output hashdeep-compatible catalogs out of the box: it would improve interoperability here... And the icing on the cake would be a git-annex command (a variation of git-annex-import?) that would audit an external, non-annexed repository for consistency in the same way.
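
Until then, one way to make the substitution a little less fragile is to touch only the second comma-separated field. The sed pattern replaces the first ".something," it finds anywhere on the line, so a key without an extension followed by a filename containing both a dot and a comma would be mangled. The sketch below uses awk instead; strip_key_extension is a hypothetical name, and it assumes the hash itself never contains a dot.

```shell
# Hypothetical replacement for the sed call: remove everything from the
# first dot onwards in the second field (the key name) only, leaving the
# size and the filename untouched.
strip_key_extension() {
    awk -F, -v OFS=, '{ sub(/\..*$/, "", $2); print }'
}
```

It reads the same size,keyname,filename lines on standard input, so it can take the place of the sed command in the pipeline that builds manifest.txt.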

Also note that hashdeep can operate in "chunk" mode which means that it can work across file boundaries, detecting partial matches, for example. This is something that, as far as I know, is impossible in git-annex as checksums are only file-based. This would be useful in eliminating the false positives by distinguishing the "this file is completely new" and "this file is corrupt" cases.


These notes were provided by anarcat, who would gladly welcome corrections and improvements.