This is not really a proper bug report, but I thought I should post this here in case someone can find any sane, non-supernatural reason for a strange case of data loss I have experienced with git-annex.

Some time ago I cloned a bunch of git-annex repos from an external drive (let's call it disk1) to a new computer (computer3). On one of my repos git-annex marked a bunch of files corrupt and moved them to .git/annex/bad. Oops, I thought, I must have a failing disk. Luckily I had offsite backups -- no less than two other external hard disks (disk2-3), each having a full copy of the repo in question. However, both of these had the same, corrupt files. The files have the correct size, but are filled with zeroes. Other files in the repo are fine, and so are other repos.

I have been trying to wrap my head around this but I can't think of any reason how this could occur. However the files have gotten corrupted in the first place, the corruption should have been picked up when copying the content to the external drives disk2 and disk3, right? I have to rule out NSA/MIB/aliens from messing with me because these files are not that valuable or sensitive.

The files in question were added to git-annex back in 2012, so the trail is cold on this one. Naturally, I have no idea on how to reproduce this, nor can I reliably say that git-annex is to blame. I can gather some hints though. The files were all added on the same commit in 2012, but not all files from that commit are corrupted. The corrupted files have consecutive file names. The files were never modified since (except for the corruption), and the content may have been copied via an encrypted rsync transfer repository. I have always used git-annex on Arch Linux and in indirect more. The files used the SHA-1 backend.

All these files have a similar tracking log that looks something like this (uuids replaced with symbolic names):

1356690700.542152s 1 computer1          <- first added
1356691074.253815s 1 disk1              <- copied to disk1
1356719321.145126s 1 rsync              <- copied to rsync repo
1358070999.435676s 1 rsync              <- copied to rsync repo (again?)
1362166895.310332s 1 disk2              <- copied to disk2
1362906850.555869s 1 computer2 (dead)   <- copied to another computer
1364926664.362195s 0 computer1          <- dropped from computer1 as enough copies in disks
1374412057.409496s 0 computer2 (dead)   <- dropped from computer2, now dead
1445691595.764108s 1 disk3              <- copied to disk3
1445770764.165792s 0 rsync              <- dropped from rsync repo to save space
1482077052.217353646s 0 disk1           <- first noticed as corrupted on disk1
1482741278.318274404s 0 disk3           <- WTF, also corrupted on disk3
1482926246.268440532s 0 disk2           <- double-WTF, also corrupted on disk2

The only thing that strikes odd to me is the double entry with the rsync remote. The non-corrupted files from the same commit do not seem to have such a double entry.

So my main question is, has there ever been a bug in git-annex that could have caused this behavior? Or is there any other realistic explanation for this? In case this is an existing bug, is there any other evidence I can gather? Needless to say, the lesson here is to run git annex fsck regularly even if you have offsite backups...

My diagnosis is that the file got corrupted before it was copied to disk2 and disk3. What repository they reached them via does not matter much. And indeed, 5 year old git-annex didn't verify the content of files it transferred to/from a remote. Current git-annex does, so I guess this is done. --Joey