Please describe the problem.
We have the dandiarchive S3 bucket with versioning turned on. Currently, after I changed the signature from anonymous and added the region, the remote config looks like:
dandi@drogon:/mnt/backup/dandi/dandiset-manifests$ git show git-annex:remote.log
09b87154-c650-46d1-a036-6e03c56c0b1a bucket=dandiarchive datacenter=US encryption=none fileprefix=dandisets/ host=s3.amazonaws.com importtree=yes name=s3-dandiarchive port=80 publicurl=https://dandiarchive.s3.amazonaws.com/ region=us-east-2 signature=v4 storageclass=STANDARD type=S3 timestamp=1764626152s
The bucket has had "trailing delete" enabled for a while (years).
Originally it was all open and we were importing on a cron job; the last merge was:
Date: 2025 Aug 27 21:23:09 -0400
Merge remote-tracking branch 's3-dandiarchive/master'
Recently-ish (Sep/Oct) the bucket policy got updated, so some keys on S3 became protected and require authentication. We had a good number of runs failing due to 403s, including ones where I had already specified AWS credentials but still had signature=anonymous and no region specified. Then (yesterday) I set signature to v4, had a run where it complained that the region needed to be us-east-2 instead of us-east-1 (not sure why it could not deduce that automagically), so I specified that too. And then the import run seemed to proceed fine!
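For reference, a change like that can be applied to an existing S3 special remote via enableremote; a sketch (the remote name is taken from the remote.log shown above, and the exact invocation used may have differed):

```shell
# Switch the remote from anonymous to signed (v4) requests and pin the
# bucket's actual region; other parameters in remote.log stay as they are.
git annex enableremote s3-dandiarchive signature=v4 region=us-east-2
```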
But git merge then failed:
dandi@drogon:/mnt/backup/dandi/dandiset-manifests$ git merge s3-dandiarchive/master
error: unable to read sha1 file of 000029/draft/dandiset.jsonld (f7c097994e60c2b58dae464633583b65a6691415)
error: unable to read sha1 file of 000029/draft/dandiset.yaml (1fa7abf602b540507c1a31e20da3d687e83ebfe6)
error: unable to read sha1 file of 000338/draft/assets.jsonld (4ad13ca757df0b39f2c20af47e5d3c9140ccfc7b)
error: unable to read sha1 file of 000338/draft/assets.yaml (08cca54d889faffc76c7911f5c700eb09c22e628)
error: unable to read sha1 file of 000338/draft/collection.jsonld (cf60b31aca7826a8d4993828e439af1f808cb17e)
...
and git fsck fails loudly with many missing blobs, etc.:
dandi@drogon:/mnt/backup/dandi/dandiset-manifests$ head .duct/logs/2025.12.02T08.19.22-3737239_stdout
broken link from tree 8c233f531c125ef0edbba48300d7c2ca914c1dac
to blob 513d0a3ba28460f1c7db74b2f4b4905a9942d903
broken link from tree 8c233f531c125ef0edbba48300d7c2ca914c1dac
to blob 2d3e42dc7935b136141f81f3113a6eac247aa570
broken link from tree 8c233f531c125ef0edbba48300d7c2ca914c1dac
to blob e88e9ef106f8c7cdce43378079416ab353593335
...
and also similar errors while trying to git log a sample file there:
dandi@drogon:/mnt/backup/dandi/dandiset-manifests$ git log s3-dandiarchive/master -- 000029/draft/dandiset.jsonld
commit 2fc1ff12
Author: DANDI Team <team@dandiarchive.org>
Date: 2025 Dec 01 16:56:17 -0500
import from s3-dandiarchive
commit 65c4ea5b
Author: DANDI Team <team@dandiarchive.org>
Date: 2025 Apr 24 16:23:07 -0400
import from s3-dandiarchive
commit 832893d3
Author: DANDI Team <team@dandiarchive.org>
Date: 2025 Apr 24 13:21:10 -0400
import from s3-dandiarchive
dandi@drogon:/mnt/backup/dandi/dandiset-manifests$ git log -p s3-dandiarchive/master -- 000029/draft/dandiset.jsonld
fatal: unable to read f7c097994e60c2b58dae464633583b65a6691415
commit 2fc1ff12
Author: DANDI Team <team@dandiarchive.org>
Date: 2025 Dec 01 16:56:17 -0500
import from s3-dandiarchive
As it fails on the most recently imported version, this suggests that git-annex is somehow not importing correctly?
I believe this was done with this version:
dandi@drogon:/mnt/backup/dandi/dandiset-manifests$ source ~/git-annexes/static-10.20250416.sh
dandi@drogon:/mnt/backup/dandi/dandiset-manifests$ git annex version | head
git-annex version: 10.20250416-static1
build flags: Pairing DBus DesktopNotify TorrentParser MagicMime Servant Benchmark Feeds Testsuite S3 WebDAV
dependency versions: aws-0.24.4 bloomfilter-2.0.1.2 crypton-1.0.4 DAV-1.3.4 feed-1.3.2.1 ghc-9.8.4 http-client-0.7.19 persistent-sqlite-2.13.3.0 torrent-10000.1.3 uuid-1.3.16
...
Please advise on how to mitigate (git reset --hard the s3-dandiarchive/master branch to its prior state from before yesterday and reimport with a newer git-annex, or ... ?)
I have now tried with the most recent release 10.20251114-geeb21b831e7c45078bd9447ec2b0532a691fe471, while operating on a copy from the backup.
Looking at the fact that it starts with the latter folders -- likely the "access restricted" ones -- while still making commits to earlier folders, I suspect it just somehow "manufactures" them for the public ones without fetching their keys?
Are the 000675/draft/ files you show it importing the ones that are access restricted? And when you replicated the problem from the backup, were you using it in the configuration where it cannot access those?
I notice that all the files affected seem to be probably smallish text files (yaml, jsonld). Do you have annex.largefiles configured in this repository, and are all of the affected files non-annexed files? If so, it would be worth retrying from the backup with the config changed so those files get annexed and see if that avoids the problem.
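A sketch of that retry, assuming the import command looks like the one used before (annex.largefiles=anything forces every imported file into the annex instead of directly into git):

```shell
# Check whether annex.largefiles is configured in this repository
git config --get annex.largefiles

# Hypothetical retry: override it for one run so all files get annexed
git -c annex.largefiles=anything annex import master --from s3-dandiarchive
```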
Simply resetting the remote tracking branch and re-importing won't cause an import to necessarily happen again. This is because git-annex tracks internally what has been imported from the remote. Running an import again when it's already imported files won't re-download those same files. And it will regenerate the same remote tracking branch.
So running in a clone from a backup is a better way to re-run the import.
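That could look roughly like this (a sketch with hypothetical paths; the point is to start from a copy whose state predates the failed imports, rather than resetting in place):

```shell
# Work in a fresh clone of the backup copy
git clone /mnt/backup/dandi/dandiset-manifests /tmp/manifests-retry
cd /tmp/manifests-retry
git annex enableremote s3-dandiarchive
git annex import master --from s3-dandiarchive
```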
yes -- that one is embargoed (can be seen by going to https://dandiarchive.org/dandiset/000675)
If I got the question right -- and since I do not recall now, judging from me using
( source .git/secrets.env; git-annex import master...
I think I was running with credentials allowing access to them (hence no errors while importing). Yes.
And it seems they all go into git; annex.largefiles is empty.
All being small files does make me think this bug is somehow specific to adding the files to git. So it would be very useful to re-run the reproducer again, with annex.largefiles this time configured so everything is annexed.
Well that's why I asked. It's not clear to me if it ever did show a failure, when used in the configuration where it couldn't access the files.
It seems equally likely that it somehow incorrectly thought it succeeded.
I was able to set up this same special remote myself (manually populating remote.log) and use with my own S3 creds (which of course have no special access rights to this bucket so it was all public access only), importing into a fresh repository.
Part of that import included:
But, the import ended with:
And did not create a branch, so I have not been able to reproduce the problem.
Digging into why it says "ok" there, that was unfortunately only a display problem. Corrected that.
One way I can see that this might happen is if git-annex forget has been used after a previous export/import. In that case, the content identifier database would be populated with a GIT key, which would be used instead of downloading the file to be imported. That results in a git sha being used which could not be present in the git repository, because while the git-annex branch usually gets imported/exported trees linked into it, git-annex forget erases that.
So a possible scenario:
That is worth trying to replicate. But it seems pretty unlikely to me that is what you actually did ...?
Leaving aside the possibility that git hash-object might be buggy and not record the object in the git repository, that's the only way I can find for this to possibly happen, after staring at the code for far too long.
I was wrong: git-annex forget cannot cause this, since 8e7dc958d20861a91562918e24e071f70d34cf5b in 8.20210428 made exported tree grafts be preserved through a forget. This leaves me with no scenario that might cause this problem, unless a git-annex version older than that were used.
I've reverted 69e6c4d024dcff7c2f8ea1a2ed3b483a86b2cc7d, which I had made to guard against the git-annex forget scenario, since it would slow down imports of trees that contain a lot of small files. It still seems possible that commit would have avoided the problem, but until I understand what actually caused it, I don't want to unnecessarily slow git-annex down with an unverified fix.
I think that a previous, failed import from the remote, run in a different clone of the repository than the import that later fails, could have caused the problem.
My thinking is, while import is downloading files, the content identifiers get recorded in the git-annex branch. Only once the import is complete does the imported tree get grafted into the git-annex branch. So, if the import fails (or is interrupted), this can leave content identifiers in the log. The git blobs for small files have already been stored in git, but no tree references them. If that git-annex branch gets pushed, then in a separate clone of the repository, running the import again would see those content identifiers. But the git blobs referenced by them would not have been pushed, and so would not be available.
We already know that the import was failing due to the S3 permissions, so the only other thing that would have been needed is for the git-annex branch to be pushed to origin, and then this same import tried later in a different clone.
@yarikoptic does this seem plausibly what could have happened?
Replicated this problem as follows:
made importKeys fail at the end of an import, so that GIT keys for the small files were recorded in the content identifier log without the imported tree being grafted in; then restored importKeys to its usual behavior and ran the import again in another clone. Result: the problem was replicated (a repository with broken links to missing git blobs).
Verified that 69e6c4d024dcff7c2f8ea1a2ed3b483a86b2cc7d does in fact avoid this problem: re-running the final import steps with that commit results in a non-broken repository.
Yay, solved!