Please describe the problem.
I have a special remote "source" set up as type=directory importtree=yes.
I pulled it into one repo in which none of the files were wanted. So far so good. I cloned that repo to a second, archive, repo. git annex sync worked (but took 2 hours). Then I did git annex get --auto
, most of the files came through OK. But some said things like this:
get Pictures/.dtrash/info/IMG_2979_v1-e9dced7b.dtrashinfo (from source...) (checksum...)
verification of content failed
Unable to access these remotes: source
No other repository is known to contain the file.
failed
(Both repos are unlocked via adjust --unlocked)
Upon looking at the file in the repo where it wasn't wanted, I saw this:
$ cat IMG_2979_v1-e9dced7b.dtrashinfo
/annex/objects/SHA256E-s144--cec5c7b6a9d97344e374e8395e02b74350678147ff65d6df091f5115cf18bf72
Interesting. So, in the source directory:
$ sha256sum IMG_2979_v1-e9dced7b.dtrashinfo
aca3ed7243def7a0bd5fcad542c66841b8e7d2a670b4cafe749eb27e032d8975 IMG_2979_v1-e9dced7b.dtrashinfo
That's not a match at all. Well, OK then:
$ sha256sum * | grep cec5c7
cec5c7b6a9d97344e374e8395e02b74350678147ff65d6df091f5115cf18bf72 IMG_2981_v1-5fc99c7a.dtrashinfo
Yikes. So for IMG_2979_v1-e9dced7b.dtrashinfo, git-annex recorded a checksum that belonged to IMG_2981_v1-5fc99c7a.dtrashinfo. Well then, what is this other file recorded as, back in the git-annex repo?
$ cat IMG_2981_v1-5fc99c7a.dtrashinfo
/annex/objects/SHA256E-s144--cec5c7b6a9d97344e374e8395e02b74350678147ff65d6df091f5115cf18bf72
OK, so two files that were not identical in the source directory got recorded with an identical checksum in git-annex somehow. And, when they were attempted to be imported via git annex get --auto
, this at least was detected there.
In this .dtrash/info directory, 436 files out of 719 were not loaded by git annex get
, presumably due to this issue.
In this directory, the source files were ranging in size from 140 to 227 bytes.
In a companion directory, .dtrash/files, 24 out of 719 files exhibited this issue. These files tended to be larger, but one that was 495MB triggered it also.
I have not yet seen it outside .dtrash, but it will be many hours until this get process completes fully, as it needs to copy about 1TB of data.
In case you are wondering if there is a race condition with .dtrash: no. The only application that writes to it isn't running, and the last time a file was modified in there was over a year ago. Also the content of the .info files is just JSON and a corresponding filename embedded in them, so it is very clear that the files on the filesystem are correct and the calculated checksums at issue here were never correct.
What steps will reproduce the problem?
I have laid that out as best I can above.
What version of git-annex are you using? On what operating system?
8;.20210223 on Debian
Please provide any additional information below.
Assistant is not being used.
Setup:
REPO=Pictures
cd /acrypt/git-annex/repos
mkdir $REPO
cd $REPO
git init
git config annex.thin true
git annex init 'local hub'
git annex wanted . "include=* and exclude=$REPO/*"
# Now initialize things.
touch mtree
git annex add mtree
git annex sync
git annex adjust --unlock
git annex initremote source type=directory directory=/acrypt/git-annex/bind-ro/$REPO importtree=yes encryption=none
git annex enableremote source directory=/acrypt/git-annex/bind-ro/$REPO
git config remote.source.annex-readonly true
git config remote.source.annex-tracking-branch main:$REPO
git config annex.securehashesonly true
git config annex.genmetadata true
git config annex.diskreserve 100M
Have you had any luck using git-annex before? (Sometimes we get tired of reading bug reports all day and a lil' positive end note does wonders)
done since this has pretty strongly been confirmed to be a bug that was already fixed in a newer git-annex version. --Joey
Concur that you need to upgrade. In particular, this seems likely to have been fixed by 3e2f1f73cbc5fc10475745b3c3133267bd1850a7.
If so, the two files in the source directory will turn out to have the same size and mtime, so were considered to have the same content identifier when importing from a directory with your version of git-annex.
Following up a bit - I had blown away the repos that were in use for this, but based on what I logged here, I can confirm that at least the files I mentioned with conflicts had the same size and mtime.
I may yet want to use ignoreinodes=yes. Eventually I want to be feeding git-annex from zfs clones to get a guaranteed stable source at any given moment, and I'm not really sure what that will do to inodes (it should, at minimum, have a different device number, I would assume).
With the latest v10 git, I'm seeing some things like this:
Now, the file content really had not chnaged. That file looks like this:
And on-disk, the file is:
So, this is something different, but it's producing the same result (files left around not checked out). I obvserve that in .dtrash/info, the directory that had such a problem before, zero files now do. But in .dtrash/files, the problem seems to be triggered by a file that has a duplicate somewhere in the source repository (with the same sha256sum). I was able to find another file with the same sha256sum as that one in the same directory. In the origin repo, in which the given files were not present and it's using unlock-present, all instances of the file contained a broken symlink to the appropriate target. In the repo in question here, which uses plain unlock and is on NTFS (though Linux is the OS accessing it), I'm having this problem.
Well, it seems that you have confirmed that the bug you were seeing is the known and already fixed problem involving files that had the same size and mtime.
I suggest filing a new bug report for the other problem, since it does not seem like the same problem. I don't like to have to re-read long threads about unrelated problems to tease out the "current" problem than a bug report is talking about. 1 bug report per problem saves everyone time and confusion.