https://github.com/OpenNeuroOrg/openneuro/issues/3446#issuecomment-2892398583
This is a case where a truncated file was exported as part of a tree to S3.
In particular, to a bucket with versioning=yes.
Note that git-annex export does not verify checksums before sending, and
so it's possible for this to happen if a corrupted object has somehow
gotten into the local repository. It might be possible to improve this to
deal better with object corruption, including object corruption that occurs
while exporting.
Currently there is no good way for a user to recover from this. Exporting a tree that deletes the corrupted file, followed by a tree that adds back the right version of the file, will generally work. But it will not work for a versioned S3 bucket, because removing an export from a versioned S3 bucket does not remove the recorded S3 versionId. While re-exporting the file will record the new versionId, the old one remains recorded, and when multiple versionIds are recorded for the same key, either may be used when retrieving it.
What needs to be done is to remove the old versionId. But it does not seem right to generally do this when removing an exported file from an S3 bucket, because usually, when the file is not corrupted, that versionId is still valid, and can still be used to retrieve that object.
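To make the retrieval problem concrete, here is a minimal standalone sketch, with made-up types and helpers that only stand in for git-annex internals: when several versionIds are recorded for one key, retrieval simply uses one of them, so a stale versionId pointing at truncated content can be the one that gets used.

    -- Hypothetical stand-ins; not git-annex's real types or API.
    newtype VersionId = VersionId String
    newtype Content   = Content String

    -- With versioning, retrieval uses one of the recorded versionIds.
    -- If the stale (truncated) versionId is still recorded alongside a
    -- good one, it may be the one chosen, and verification of the
    -- download then fails even though a good version exists in the bucket.
    retrieveVersioned
      :: (VersionId -> IO Content)  -- download a specific version
      -> (Content -> Bool)          -- checksum verification of the download
      -> [VersionId]                -- versionIds recorded for the key
      -> IO (Maybe Content)
    retrieveVersioned download verified vids = case vids of
      []      -> pure Nothing
      (v : _) -> do
        c <- download v
        pure (if verified c then Just c else Nothing)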
git-annex fsck --from=s3 will detect the problem, but it is unable to do
anything to resolve it, since it can only try to drop the corrupted key,
and dropping by key is not supported with an exporttree=yes remote.
Could fsck be extended to handle this? It should be possible for fsck to:
- removeExport the corrupted file, and update the export log to say that the export of the tree to the special remote is incomplete.
- Handle the special case of the versioned S3 bucket with, eg, a new Remote method that is used when a key on the remote is corrupted. In the case of a versioned S3 bucket, that new method would remove the versionId.
--Joey
Note that it would also be possible for a valid object to be sent, but then get corrupted in the remote storage. I don't think that's what happened here.
If that did happen, a similar recovery process would also be needed.
Which I think says that focusing on a recovery process, rather than on prevention, is more useful.
The OpenNeuro dataset ds005256 is an S3 bucket with versioning=yes, a publicurl set, and exporttree=yes. With that combination, when S3 credentials are not set, the versionId is used in the public url for downloading.
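For reference, S3 allows a specific version of an object to be fetched by appending a versionId query parameter to the GET URL, so presumably the public download URL is built along these lines. This is a sketch only; the function and exact URL shape are assumptions, not git-annex's actual code.

    -- Sketch: build a public download URL that pins a specific S3 object
    -- version via S3's standard versionId query parameter. (Assumed URL
    -- shape; not taken from git-annex's S3 remote.)
    publicVersionedUrl
      :: String        -- publicurl of the bucket, ending in "/"
      -> FilePath      -- name of the exported file within the bucket
      -> Maybe String  -- recorded S3 versionId, if any
      -> String
    publicVersionedUrl publicurl file mvid =
      publicurl ++ file ++ maybe "" ("?versionId=" ++) mvid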
Note that this first does a download that fails incomplete with "Verification of content failed". Then it complains "Unable to access these remotes: s3-PUBLIC". It's trying two different download methods; the second one can only work with S3 credentials set.
Note that this doesn't download, but fails at the checkPresent stage. At that point, the HTTP HEAD reports the size of the object, and it's too short.
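That kind of check can be sketched as a plain HTTP HEAD whose Content-Length is compared with the size the key is supposed to have. This is a standalone sketch using http-client, not git-annex's actual checkPresent code:

    {-# LANGUAGE OverloadedStrings #-}
    import qualified Data.ByteString.Char8 as B8
    import Network.HTTP.Client (httpNoBody, method, newManager, parseRequest, responseHeaders)
    import Network.HTTP.Client.TLS (tlsManagerSettings)
    import Network.HTTP.Types.Header (hContentLength)

    -- Do a HEAD request and compare the reported Content-Length with the
    -- expected object size. A shorter Content-Length is what shows up
    -- here at the checkPresent stage for the truncated object.
    headSizeMatches :: String -> Integer -> IO Bool
    headSizeMatches url expectedSize = do
      mgr  <- newManager tlsManagerSettings
      req  <- parseRequest url
      resp <- httpNoBody req { method = "HEAD" } mgr
      pure $ case lookup hContentLength (responseHeaders resp) of
        Just len -> B8.readInteger len == Just (expectedSize, "")
        Nothing  -> False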
If drop from export remote were implemented, that would take care of #1.
The user can export a tree that removes the file themselves. fsck even suggests doing that when it finds a corrupted file on an exporttree remote, since it's unable to drop it in that case.
But notice that the fsck run above does not suggest doing that. Granted, with an S3 bucket with versioning, exporting a tree won't remove the corrupted version of the file from the remote anyway.
It seems that dealing with #2 here is enough to recover the problem dataset, and #1 can be left to that other todo.
After a lot of thought and struggling with layering issues between fsck and the S3 remote, here is a design to solve #2:
Add a new method:

    repairCorruptedKey :: Key -> Annex Bool

fsck calls this when it finds a remote does not have a key it expected it to have, or when it downloads corrupted content.

If `repairCorruptedKey` returns True, it was able to repair the problem, and the Key should still be able to be downloaded from the remote. If it returns False, it was not able to repair the problem.

Most special remotes will make this `pure False`. For S3 with versioning=yes, it will download the object from the bucket, using each recorded versionId. Any versionId that does not work will be removed. It returns True if any download did succeed.

In a case where the object size is right, but it's corrupt, fsck will download the object, and then `repairCorruptedKey` will download it a second time. If there were 2 files with the same content, it would end up being downloaded 3 times! So this can be pretty expensive, but it's simple and will work.
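Here is a minimal sketch of what the versioned-S3 case of that method could look like. The lookup/download/record functions are parameters precisely because they stand in for git-annex internals, and the monad is plain IO rather than Annex; this illustrates the design above, it is not an implementation of it.

    import Control.Monad (filterM)

    -- Hypothetical stand-ins; not git-annex's real types.
    newtype Key = Key String
    type VersionId = String

    -- repairCorruptedKey for a versioned S3 remote, sketched in IO: try
    -- each recorded versionId, keep only those whose content still
    -- downloads and verifies, record the survivors, and report whether
    -- anything usable remains.
    repairCorruptedKey
      :: (Key -> IO [VersionId])        -- look up recorded versionIds
      -> (Key -> VersionId -> IO Bool)  -- download and verify one version
      -> (Key -> [VersionId] -> IO ())  -- record the surviving versionIds
      -> Key
      -> IO Bool
    repairCorruptedKey getVids fetchOk putVids key = do
      vids <- getVids key
      good <- filterM (fetchOk key) vids
      putVids key good
      pure (not (null good))

The helpers are passed as parameters only to keep the sketch self-contained; in git-annex the method would be part of the Remote interface, with the S3 remote supplying the versionId handling.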
#1 is not needed for the case of a versioned S3 bucket, because after `git-annex fsck --from S3` corrects the problem, `git-annex export --to S3` will see that the file is not in S3, and re-upload it.

In the general case, #1 is still needed. I think drop from export remote would solve this, and so no need to deal with it here.
I thought about making `git-annex export` checksum files before uploading, but I don't see why export needs that any more than a regular copy to a remote does. In either case, annex.verify will notice the bad content when getting from the remote, and fscking the remote will also detect it, and now, recover from it.

It seems unlikely to me that the annex object file got truncated before it was sent to ds005256 in any case. Seems more likely that the upload was somehow not of the whole file.