Recent comments posted to this site:
I thought about making git-annex export checksum files before uploading,
but I don't see why export needs that any more than a regular copy to a
remote does. In either case, annex.verify will notice the bad content when
getting from the remote, and fscking the remote will also detect it and
can now recover from it.
It seems unlikely to me that the annex object file got truncated before it was sent to ds005256 in any case; it seems more likely that the upload was somehow not of the whole file.
Currently git-annex fsck --from an export remote is unable to drop a key
if it finds corrupted data. Implementing this would also deal with that
problem.
#1 is not needed for the case of a versioned S3 bucket, because after
git-annex fsck --from S3 corrects the problem, git-annex export --to S3
will see that the file is not in S3, and re-upload it.
In the general case, #1 is still needed. I think drop from export remote would solve this, so there is no need to deal with it here.
Finally ran into this myself, and I observed several podcast hosts still not supporting EMS (TLS extended master secret) even now.
Implemented a config to solve this:
    git config annex.security.allow-insecure-https tls-1.2-no-EMS
I do caution against setting this globally, at least not without
understanding the security implications, which I can't say I do.
Even setting it in a single repo could affect other connections made by
git-annex to, eg, API endpoints used for storage.
Personally, I am setting it only when importing feeds from those hosts:
    git -c annex.security.allow-insecure-https=tls-1.2-no-EMS annex importfeed
Workaround: Make git-annex use curl for url downloads. Eg:
    git config annex.security.allowed-ip-addresses all
    git config annex.web-options --netrc
Note that using curl has other security implications, including letting git-annex download from IPs on the LAN.
repair branch.
After a lot of thought and struggling with layering issues between fsck and the S3 remote, here is a design to solve #2:
Add a new method repairCorruptedKey :: Key -> Annex Bool
fsck calls this when it finds a remote does not have a key it expected it to have, or when it downloads corrupted content.
If repairCorruptedKey returns True, it was able to repair a problem, and
the Key should still be downloadable from the remote. If it returns
False, it was not able to repair the problem.
Most special remotes will make this pure False. For S3 with versioning=yes,
it will download the object from the bucket, using each recorded versionId.
Any versionId that does not work will be removed, and it will return True
if any download succeeded.
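To make the shape of that concrete, here is a minimal sketch of the
versioned S3 case. getVersionIds, downloadAndVerify, and removeVersionId
are hypothetical helpers standing in for the real S3 remote plumbing, not
git-annex's actual API:

    import Control.Monad (forM, unless)

    -- Sketch only: Key and Annex are git-annex's types; the three helpers
    -- below are hypothetical stand-ins for the S3 remote's internals.
    repairCorruptedKey :: Key -> Annex Bool
    repairCorruptedKey key = do
        vids <- getVersionIds key            -- every versionId recorded for the key
        oks <- forM vids $ \vid -> do
            ok <- downloadAndVerify key vid  -- fetch that version, verify its checksum
            unless ok $
                removeVersionId key vid      -- forget versionIds that fail
            return ok
        return (or oks)                      -- True iff some version verified

For a remote without versioning, the whole method collapses to pure False,
matching the default above.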
In a case where the object size is right, but it's corrupt, fsck will download the object, and then repairCorruptedKey will download it a second time. If there were 2 files with the same content, it would end up being downloaded 3 times! So this can be pretty expensive, but it's simple and will work.
Rather than altering the exported git tree, it could removeExport and then update the export log to say that the export is incomplete.
That would result in a re-export putting the file back on the remote.
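As a sketch of that alternative (removeExport is the method mentioned
above; recordExportIncomplete is a made-up name for the export log
update):

    -- Sketch only: the plumbing is simplified, and recordExportIncomplete
    -- is a hypothetical name, not git-annex's actual export log API.
    repairByRemoval :: Remote -> Key -> ExportLocation -> Annex Bool
    repairByRemoval remote key loc = do
        removeExport (exportActions remote) key loc  -- delete the corrupt copy
        recordExportIncomplete remote                -- mark the export incomplete
        return True  -- a later git-annex export will re-upload the file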
It's not uncommon to eg want to git-annex move foo --from remote,
due to it being low on space, or to temporarily make it unavailable,
and later send the file back to the remote. Supporting drop from export
remotes in this way would allow for such a workflow, although with the
difference that git-annex export would be needed to put the file back.
It might also be possible to make sending a particular file to an export
remote succeed when the export to the remote is incomplete and the file is
in the exported tree. Then git-annex move foo --to remote would work to
put the file back.
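A sketch of the check that could gate such a send; both helpers are
made-up names for illustration:

    -- Sketch only: exportIncomplete and keyInExportedTree are hypothetical;
    -- the point is just that both conditions must hold before storing.
    canSendToExport :: Remote -> Key -> Annex Bool
    canSendToExport remote key = do
        incomplete <- exportIncomplete remote   -- export log says export is partial
        intree <- keyInExportedTree remote key  -- file is in the exported tree
        return (incomplete && intree)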
If drop from export remote were implemented, that would take care of #1.
The user can export a tree that removes the file themselves. fsck even suggests doing that when it finds a corrupted file on an exporttree remote, since it's unable to drop it in that case.
But notice that the fsck run above does not suggest doing that. Granted, with an S3 bucket with versioning, exporting a tree won't remove the corrupted version of the file from the remote anyway.
It seems that dealing with #2 here is enough to recover the problem dataset, and #1 can be left to that other todo.