I have an annex that has an s3 special remote. The s3 remote has been configured with shared encryption and it uses partsize (not chunking). Currently when I try to get a file from the s3 remote, it fails:
$ git annex get mybigfile.tbz.gpg
get mybigfile.tbz.gpg (from s3...)
76% 10.6MB/s 57sgpg: WARNING: encrypted message has been manipulated!
Unable to access these remotes: s3
Try making some of these repositories available:
15ac19e4-223a-4c81-b7f7-797b9b026b86 -- [s3]
(Note that these git remotes have annex-ignore set: origin)
failed
git-annex: get: 1 failed
The file is about 3GB. This happens consistently at 76%. No other copy of the file exists. Is there some way I can get the file from s3, either without git annex or just have git annex ignore the error, so that I can inspect the file locally and see if there is anything wrong with it?
This also happens with a second file in the same repo, although the download seems to complete on this one before the error. The file is about the same size.
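For reference, the remote was set up roughly along these lines (bucket and other details omitted):

git annex initremote s3 type=S3 encryption=shared partsize=1GiB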
What you could try is:

git config remote.s3.annex-gnupg-options "--ignore-mdc-error"
git config remote.s3.annex-verify false

That should make gpg ignore the error and proceed, and make git-annex not try to verify the download either. The resulting file will probably be corrupt in some way. It might have a bad chunk in the middle, or be truncated, or be garbage past a certain point.
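If the file was added with the default SHA256E backend, you can also compare whatever you end up with against the checksum recorded in the key, without involving git-annex:

# the symlink target embeds the expected SHA256 in the key name
readlink mybigfile.tbz.gpg
# hash the content that was actually downloaded and compare by eye
sha256sum "$(readlink -f mybigfile.tbz.gpg)"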
This sounds like something went wrong with the multipart upload to S3. What does
git annex info s3
say about the configuration of your S3 remote? I'd like to reproduce this problem, if possible. Since it hit two of your files, it seems to not have been some kind of one-off data corruption problem.

Thanks. I will try that and see if I can recover the files. Here is the output of git annex info s3:
So you have a lot of data in there. Are all files larger than 1 gb failing to download, or only two of them?
I still get the
gpg: WARNING: encrypted message has been manipulated!
error when getting the files after I set those options.

Yeah, it's a big one.
Those are the only 2 that are failing. I have a ~50GB file that downloads from s3 without issue. Others in the ~15GB range that download fine also.
There are probably only ~30 files in the repo that are multiple gigabytes. The rest are all small files.
I see now why remote.s3.annex-gnupg-options is not working: That is currently only passed to gpg when it's encrypting, not when it's decrypting.
So, instead, edit ~/.gnupg/gpg.conf and put "ignore-mdc-error" in a line of its own in that file.
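Something like this will do it:

echo ignore-mdc-error >> ~/.gnupg/gpg.conf

(And remember to take it back out afterwards, since it disables an integrity check you normally want.)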
I set
git config remote.s3.annex-verify false
and put
ignore-mdc-error
in my gpg.conf. That allowed me to successfully download the files. From there I was able to recover the bulk of the files in the archives. Thanks!

Any ideas what caused the files to become corrupted in the first place? Is there anything I can do differently when adding new files and copying them to the s3 special remote to prevent corruption?
My best guess at the moment is that the corruption must have something to do with using a partsize.
Since you were able to get at the corrupted files, can you characterize at all how they were corrupted?
In particular, did the corruption start at some multiple of 1 GiB into the file? If so, that would correlate with your 1 GiB part size.
Or was it just a small area of corruption?
Or is the corrupt file truncated?
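If you can work out the byte offset where the data first goes bad (e.g. where tar or bzip2 first starts complaining), it would be interesting to see whether it falls on a part boundary. A rough check, with a placeholder offset filled in:

# hypothetical offset where the corruption seems to begin
offset=3221225472
# a remainder of 0 means it sits exactly on a 1 GiB part boundary
echo $((offset % (1024 * 1024 * 1024)))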