Please describe the problem.
With the latest v10 from git, I'm seeing things like this:
get Pictures/0_New/5/20-03-31 21-12-12 0184.mp4 (from source...) file content has changed
Unable to access these remotes: source
No other repository is known to contain the file.
failed
Now, the file content really had not changed. That file looks like this:
$ cat 20-03-31\ 21-12-12\ 0184.mp4
/annex/objects/SHA256E-s55936559--0d418e4bb62cfc7ef5b053f8b622dd72794781a49931abc41bb9499acaf51b09.mp4
And on-disk, the file is:
$ sha256sum 20-03-31\ 21-12-12\ 0184.mp4
0d418e4bb62cfc7ef5b053f8b622dd72794781a49931abc41bb9499acaf51b09 20-03-31 21-12-12 0184.mp4
The result is that these files are left behind, not checked out.
This is in an unlocked, thin repository and it is pulling in data from a directory special remote with importtree=yes.
The problem seems to be triggered by a file that has a duplicate somewhere in the source repository (with the same sha256sum). I was able to find another file with the same sha256sum in the same directory. In the origin repo, where the given files were not present and which uses unlock-present, all instances of the file contained a broken symlink to the appropriate target. In the repo in question here, which uses plain unlock and is on NTFS (though Linux is the OS accessing it), I'm having this problem.
What steps will reproduce the problem?
Any command that causes it to load the file content from the special remote (e.g. git annex import or git annex get).
What version of git-annex are you using? On what operating system?
10.20220823 on Debian bullseye
Please provide any additional information below.
Original conversation was at Files recorded with other file's checksums.
Have you had any luck using git-annex before? (Sometimes we get tired of reading bug reports all day and a lil' positive end note does wonders)
Yes! I used it for a while with the assistant to sync a lot of files.
"file content has changed" is a message from the directory special remote when it sees that a file in the directory has a different mtime,size,inode tuple (content identifier) than the one that was recorded when importing the tree.
Maybe that has something to do with the duplicate files, since they would have different inodes. Except, it does try to support that; it can keep track of multiple content identifiers for a key, and when there are duplicate files, that works in my testing.
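The matching logic described above can be sketched as follows. This is a simplified Python model of the idea, not git-annex's actual Haskell implementation; the `ContentIdentifier` type and function names are illustrative only:

```python
import os
from collections import namedtuple

# A content identifier as the directory special remote uses it:
# (inode, size, mtime). The inode is 0 when ignoreinodes=yes was in effect.
ContentIdentifier = namedtuple("ContentIdentifier", ["inode", "size", "mtime"])

def current_cid(path):
    """Compute the content identifier of a file as it exists on disk."""
    st = os.stat(path)
    return ContentIdentifier(st.st_ino, st.st_size, int(st.st_mtime))

def content_unchanged(path, recorded_cids):
    """A file counts as unchanged if its current content identifier matches
    ANY identifier recorded for the key. With duplicate files, several cids
    (one per inode) can be recorded for a single key."""
    return current_cid(path) in recorded_cids
```

When none of the recorded cids match, the remote reports "file content has changed", which is the failure seen above.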
I notice that, if I have a directory special remote that was initially set up with ignoreinodes=no, and I've imported a tree from it that way, I can cause what looks like this problem by changing the remote to ignoreinodes=yes.
It seems like a bug that changing that setting is allowed at all, but I think the idea was that, after changing ignoreinodes, the user would re-run git-annex import, which would re-import all the files, since the content identifiers have changed. Still, I think that 3e2f1f73cbc5fc10475745b3c3133267bd1850a7 didn't consider that it could cause a get to fail.
I know you have recently upgraded from a version that defaulted to ignoring inodes. Maybe you only need to re-run git-annex import from the directory remote to fix the problem?

If I was on the wrong track with my first comment, it would be helpful to take a look at the content identifier that is recorded for the key, by running git show on the key's .log.cid file in the git-annex branch.
The part after the UUID in the output is the content identifier(s). If more than one, they are separated by colons. The format of a content identifier is "inode size mtime", and the inode will be 0 if it was generated with ignoreinodes=yes. So you could then compare the content identifiers with the stat of the files in the directory special remote and see if something has actually changed.
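Given that description, the cid field of a log line can be pulled apart like this. This is a sketch; the exact field layout is assumed from the description above, not lifted from git-annex's source:

```python
def parse_cids(field):
    """Parse the colon-separated content identifier field that follows
    the UUID on a cid log line. Each identifier has the assumed form
    "inode size mtime"; an inode of 0 means it was recorded with
    ignoreinodes=yes."""
    cids = []
    for part in field.split(":"):
        inode, size, mtime = part.split()
        cids.append((int(inode), int(size), float(mtime)))
    return cids
```

With the parsed tuples in hand, they can be compared directly against a stat of the file in the directory special remote.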
I've made the directory special remote treat content identifiers that differ only in one of them having the inode set to 0 as matching. That will avoid it failing in the situation I showed.
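That relaxed comparison amounts to the following rule. A hypothetical sketch, with cids modeled as (inode, size, mtime) tuples:

```python
def cid_matches(a, b):
    """Compare two (inode, size, mtime) content identifiers.
    Size and mtime must agree exactly; the inodes must agree too,
    unless either side is 0 (i.e. recorded with ignoreinodes=yes),
    in which case the inode is ignored."""
    a_ino, a_size, a_mtime = a
    b_ino, b_size, b_mtime = b
    if (a_size, a_mtime) != (b_size, b_mtime):
        return False
    return a_ino == b_ino or a_ino == 0 or b_ino == 0
```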
Examining the output of that git show command, and comparing it with the source material:
So it looks like the inode number changed. But, on a hunch, I thought: this only happens with duplicate files; what if there's a file with that sha256 and that inode number?
And sure enough:
I didn't set ignoreinodes on this remote.
Is that helpful?
The two files have the same two inodes that are recorded in the git-annex branch. The size and mtime also seem to match. Interesting; it must be a different problem than the one I fixed.
I wonder if it's expecting the first file to have the second one's inode, or vice-versa?
I don't know how that would have happened to you, but I was able to force it to happen, by importing a tree from a directory remote that had 2 duplicate files. Then I swapped the names of the 2 files in the directory. This causes git-annex get to fail with the same error message.
That must be a bug itself, because re-importing from the special remote generates the same tree (of course), and so get continues failing; there is no recovery possible except deleting or touching the files.
Aha: another indication that handling of duplicate imported files is broken is to import two files and then delete the first of them. This also causes git-annex get to fail, but now it always fails with "no such file or directory": it only tries to get the first file, never the second. Re-importing from the special remote teaches it about the deleted file, and that leads back to the "file content has changed" problem, because it seems to be looking for the deleted file's inode. So that's one way you could have gotten into this situation without the unlikely swapping of two duplicate files.
Ah, here's the smoking gun: In Remote.Helper.ExportImport, it gets the cids for a key. And promptly picks the first one to pass to retrieveExportWithContentIdentifier, ignoring all the rest. It also gets the first recorded export location and likewise passes it to retrieveExportWithContentIdentifier.
It seems like a fix would be to try retrieveExportWithContentIdentifier with each combination of cid and export location. But that would cause an O(N^2) explosion, and it's possible a remote has, say, 1000 empty files in it.

Maybe instead make retrieveExportWithContentIdentifier take a list of valid cids, and accept any one of them? Then it would only need to be tried on each export location in turn until one succeeds.
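The proposed interface change can be sketched like so. This is Python pseudocode for the idea only; `retrieve_one` stands in for a remote's retrieveExportWithContentIdentifier and its signature here is an assumption, not git-annex's real API:

```python
def retrieve_with_any_cid(locations, valid_cids, retrieve_one):
    """Try each recorded export location once, passing the full list of
    valid content identifiers down. retrieve_one(location, valid_cids) is
    expected to return the content only if the file at that location
    currently matches one of valid_cids, and None otherwise. This keeps
    the work linear in the number of locations, rather than trying every
    (location, cid) pair."""
    for loc in locations:
        content = retrieve_one(loc, valid_cids)
        if content is not None:
            return content
    raise FileNotFoundError("no export location had a matching content identifier")
```

The key design point is that the per-location check against the cid list is cheap (one stat and a membership test), so pushing the list down into the remote avoids the quadratic retry loop at the caller.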
(Hm, removeExportWithContentIdentifier and checkPresentExportWithContentIdentifier already use a list, so similar problems are avoided with them.)