I have an annex that has an s3 special remote. The s3 remote has been configured with shared encryption and it uses partsize (not chunking). Currently when I try to get a file from the s3 remote, it fails:
$ git annex get mybigfile.tbz.gpg
get mybigfile.tbz.gpg (from s3...)
76% 10.6MB/s 57sgpg: WARNING: encrypted message has been manipulated!
Unable to access these remotes: s3
Try making some of these repositories available:
15ac19e4-223a-4c81-b7f7-797b9b026b86 -- [s3]
(Note that these git remotes have annex-ignore set: origin)
failed
git-annex: get: 1 failed
The file is about 3GB. This happens consistently at 76%. No other copy of the file exists. Is there some way I can get the file from s3, either without git annex or just have git annex ignore the error, so that I can inspect the file locally and see if there is anything wrong with it?
This also happens with a second file in the same repo, although the download seems to complete on this one before the error. The file is about the same size.
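For reference, the remote was set up roughly along these lines (bucket and other details omitted):

git annex initremote s3 type=S3 encryption=shared partsize=1GiB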
What you could try is:

git config remote.s3.annex-gnupg-options "--ignore-mdc-error"
git config remote.s3.annex-verify false

That should make gpg ignore the error and proceed, and make git-annex not try to verify the download either. The resulting file will probably be corrupt in some way. It might have a bad chunk in the middle, or be truncated, or be garbage past a certain point.
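If the file was added with the default SHA256E backend, you can also compare whatever you end up with against the checksum recorded in the key, without involving git-annex:

# the symlink target embeds the expected SHA256 in the key name
readlink mybigfile.tbz.gpg
# hash the content that was actually downloaded and compare by eye
sha256sum "$(readlink -f mybigfile.tbz.gpg)"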
This sounds like something went wrong with the multipart upload to S3. What does
git annex info s3
say about the configuration of your S3 remote? I'd like to reproduce this problem, if possible. Since it hit two of your files, it seems to not have been some kind of one-off data corruption problem.

Thanks. I will try that and see if I can recover the files. Here is the output of git annex info s3:
So you have a lot of data in there. Are all files larger than 1 gb failing to download, or only two of them?
I still get the
gpg: WARNING: encrypted message has been manipulated!
error when getting the files after I set those options.

Yeah, it's a big one.
Those are the only 2 that are failing. I have a ~50GB file that downloads from s3 without issue. Others in the ~15GB range that download fine also.
There are probably only ~30 files in the repo that are multiple gigabytes. The rest are all small files.
I see now why remote.s3.annex-gnupg-options is not working: That is currently only passed to gpg when it's encrypting, not when it's decrypting.
So, instead, edit ~/.gnupg/gpg.conf and put "ignore-mdc-error" in a line of its own in that file.
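Something like this will do it:

echo ignore-mdc-error >> ~/.gnupg/gpg.conf

(And remember to take it back out afterwards, since it disables an integrity check you normally want.)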
I set
git config remote.s3.annex-verify false
and put
ignore-mdc-error
in my gpg.conf. That allowed me to successfully download the files. From there I was able to recover the bulk of the files in the archives. Thanks!

Any ideas what caused the files to become corrupted in the first place? Is there anything I can do differently when adding new files and copying them to the s3 special remote to prevent corruption?
My best guess at the moment is that the corruption must have something to do with using a partsize.
Since you were able to get at the corrupted files, can you characterize at all how they were corrupted?
In particular, did the corruption start at some multiple of 1 GiB into the file? If so, that would correlate with your 1 GiB part size.
Or was it just a small area of corruption?
Or is the corrupt file truncated?
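If you can work out the byte offset where the data first goes bad (e.g. where tar or bzip2 first starts complaining), it would be interesting to see whether it falls on a part boundary. A rough check, with a placeholder offset filled in:

# hypothetical offset where the corruption seems to begin
offset=3221225472
# a remainder of 0 means it sits exactly on a 1 GiB part boundary
echo $((offset % (1024 * 1024 * 1024)))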