This special remote type stores file contents in a bucket in Amazon S3 or a similar service.

See using Amazon S3 and Internet Archive via S3 for usage examples.

configuration

The standard environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are used to supply login credentials for Amazon. You need to set these only when running git annex initremote, as they will be cached in a file only you can read inside the local git repository.
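
For instance, a minimal sketch (the remote name "mys3" and the credential values are placeholders):

    # Placeholder values; substitute your own Amazon credentials.
    export AWS_ACCESS_KEY_ID="your-access-key-id"
    export AWS_SECRET_ACCESS_KEY="your-secret-access-key"
    git annex initremote mys3 type=S3 encryption=shared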

A number of parameters can be passed to git annex initremote to configure the S3 remote; a combined example follows the list.

  • encryption - One of "none", "hybrid", "shared", or "pubkey". See encryption.

  • keyid - Specifies the gpg key to use for encryption.

  • chunk - Enables chunking when storing large files. chunk=1MiB is a good starting point for chunking.

  • embedcreds - Optional. Set to "yes" to embed the login credentials inside the git repository, which allows other clones to also access them. This is the default when gpg encryption is enabled; the credentials are stored encrypted and only those with the repository's keys can access them.

    It is not the default when using shared encryption, or no encryption. Think carefully about who can access your repository before using embedcreds without gpg encryption.

  • datacenter - Defaults to "US". Other values include "EU", "us-west-1", "us-west-2", "ap-southeast-1", "ap-southeast-2", and "sa-east-1".

  • storageclass - Default is "STANDARD". If you have configured git-annex to preserve multiple copies, consider setting this to "REDUCED_REDUNDANCY" to save money.

  • host and port - Specify these in order to use a different, S3-compatible service.

  • bucket - S3 requires that buckets have a globally unique name, so by default, a bucket name is chosen based on the remote name and UUID. This can be specified to pick a bucket name.

  • partsize - Amazon S3 only accepts uploads up to a certain file size, and storing larger files requires a multipart upload process.

    Setting partsize=1GiB is recommended for Amazon S3 when not using chunking; this will cause multipart uploads to be done using parts up to 1GiB in size. Note that setting partsize to less than 100MiB will cause Amazon S3 to reject uploads.

    This is not enabled by default, since other S3 implementations may not support multipart uploads or may have different limits, but it can be enabled or changed at any time.

    NOTE: there is a bug that depends on the version of the AWS library in use. See this comment (the latest as of now).

  • fileprefix - By default, git-annex places files in a tree rooted at the top of the S3 bucket. When this is set, it's prefixed to the filenames used. For example, you could set it to "foo/" in one special remote, and to "bar/" in another special remote, and both special remotes could then use the same bucket.

  • x-amz-meta-* are passed through as http headers when storing keys in S3.
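
As a combined example, here is a sketch that uses several of the parameters above; the remote name, gpg key id, and fileprefix are placeholders:

    # Placeholder names throughout; adjust to your own setup.
    git annex initremote mys3 type=S3 \
        encryption=hybrid keyid=2512E3C7 embedcreds=yes \
        chunk=1MiB datacenter=EU \
        storageclass=REDUCED_REDUNDANCY fileprefix=foo/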

Just noting that the environment variables ANNEX_S3_ACCESS_KEY_ID and ANNEX_S3_SECRET_ACCESS_KEY seem to have been changed to AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
Comment by Matt Tue May 29 12:40:25 2012
Thanks, I've fixed that. (You could have too.. this is a wiki ;)
Comment by joeyh.name Tue May 29 19:10:46 2012
Thanks! Being new here, I didn't want to overstep my boundaries. I've gone ahead and made a small edit and will do so elsewhere as needed.
Comment by Matt Wed May 30 00:26:33 2012

it'd be really nice to be able to configure an S3 remote of the form <bucket>/<folder> (not really a folder, of course, just the usual prefix trick used to simulate folders in S3). The remote = bucket architecture does not scale at all in terms of the number of repositories.

how hard would it be to support this?

thanks, this is the only thing that's holding us back from using git-annex, nice tool!

Comment by Eduardo Thu Aug 9 10:52:07 2012
I guess this could be useful if you already have a lot of buckets in use at S3, or if you want to be able to have a lot of distinct S3 special remotes. Implemented the fileprefix setting. Note that I have not tested it beyond checking that it builds, since I let my S3 account expire. Your testing would be appreciated.
Comment by joeyh.name Thu Aug 9 18:01:06 2012

Any chance I could bribe you to set up Rackspace Cloud Files support? We are using them and would hate to have an S3 bucket only for this.

https://github.com/rackspace/python-cloudfiles

Comment by alan Thu Aug 23 21:00:11 2012
Joey, I'm curious to understand how future-proof an S3 remote is. Can I restore my files without git-annex?
Comment by Eric Sun Jan 20 09:21:50 2013

If encryption is not used, the files are stored in S3 as-is, and can be accessed directly. They are stored in a hashed directory structure, named by their keys rather than their original filenames. To get back to the original filenames, a copy of the git repo is also needed.
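
For example, here is a hedged sketch of such a recovery using the AWS command line tools, assuming an unencrypted remote; the bucket name is a placeholder:

    # Inside a clone of the git repo, map a filename to its git-annex key:
    key=$(git annex lookupkey path/to/file)
    # Locate the object in the bucket; the layout may involve a fileprefix
    # or hashed directories, so search for the key name:
    aws s3 ls --recursive "s3://bucketname/" | grep "$key"
    # Download the object directly, bypassing git-annex (here assuming the
    # object is stored at the top of the bucket under its key name):
    aws s3 cp "s3://bucketname/$key" path/to/file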

With encryption, you need the gpg key used in the encryption, or, for shared encryption, a symmetric key which is stored in the git repo.

See future proofing for non-S3 specific discussion of this topic.

Comment by joeyh.name Sun Jan 20 20:37:09 2013

How do I recover a special remote from a clone, please? I see that remote.log has most of the details, but my remote is not configured on my clone and I see no obvious way to do it. And I used embedcreds, but the only credentials I can see are stored in .git/annex/creds/, so they did not survive the clone. I'm confused because the documentation here for embedcreds says that clones should have access.

As a workaround, copying the remote definition over from .git/config, as well as the credentials from .git/annex/creds/, seems to work. Is there some other way I'm supposed to do this, or is this the intended way?

Comment by basak Wed May 22 18:32:05 2013

You can enable a special remote on a clone by running git annex enableremote $name, where $name is the name you used to originally create the special remote. (Older versions of git-annex used git annex initremote to enable the special remote on the clone.)
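
For example, assuming the special remote was originally created with the name "mys3":

    # If embedcreds was not used, supply the credentials again first:
    export AWS_ACCESS_KEY_ID="your-access-key-id"
    export AWS_SECRET_ACCESS_KEY="your-secret-access-key"
    git annex enableremote mys3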

(Just in case, I have verified that embedcreds does cause the cipher= to be stored in the remote.log. It does.)

Comment by joey Thu May 23 20:04:03 2013

Thanks Joey - initremote on my slightly older version appears to work. I'll use enableremote when I can.

(Just in case, I have verified that embedcreds does cause the cipher= to be stored in the remote.log. It does.)

This doesn't do what I expect. The documentation suggests that my S3 login credentials would be stored. I understand that the cipher would be stored; but isn't that a separate concept? Instead, I'm being asked to set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY; my understanding was that git-annex would keep them in the repository for me, so that I wouldn't have to set them again after running initremote. This works locally, but the credentials don't survive cloning. I'm using encryption=shared; does this affect anything? Or am I using a version of git-annex (3.20121112ubuntu3) that's too old?

Comment by basak Fri May 24 09:38:40 2013
Ah -- no, your AWS creds are not stored. While some other special remotes, like webdav, can store all the necessary credentials, it's not done for AWS. I didn't want git-annex to be responsible for someone accidentally publishing their AWS creds to their friends, since that could cost them a lot of money.
Comment by joey Fri May 24 15:33:12 2013

That's not what the documentation here says! It even warns me: "Think carefully about who can access your repository before using embedcreds without gpg encryption."

My use case:

Occasional use of EC2, and a desire to store some persistent stuff in S3, since the dataset is large and I have limited bandwidth. I want to destroy the EC2 instance when I'm not using it, leaving the data in S3 for later.

If I use git-annex to manage the S3 store, then I get the ability to clone the repository and destroy the instance. Later, I can start a new instance, push the repo back up, and would like to be able to then pull the data back out of S3 again.

I'd really like the login credentials to persist in the repository (as the documentation here says they should). Even if I have to add a --yes-i-know-my-s3-credentials-will-end-up-available-to-anyone-who-can-see-my-git-repo flag. This is because I use some of my git repos to store private data, too.

If I use an Amazon IAM policy as follows, I can generate a set of credentials that are limited to access to a particular prefix of a specific S3 bucket only - effectively creating a sandboxed area just for git-annex:

{
  "Statement": [
    {"Sid": "Stmt1368780615583",
     "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
     "Effect": "Allow",
     "Resource": ["arn:aws:s3:::bucketname/prefix/*"]},
    {"Sid": "Stmt1368781573129",
     "Action": ["s3:GetBucketLocation"],
     "Effect": "Allow",
     "Resource": ["arn:aws:s3:::bucketname"]}
  ]
}

Doing this means that I have a different set of credentials for every annex, so it would be really useful to be able to have these stored and managed within the repository itself. Each set is limited to what the annex stores, so there is no bigger compromise to worry about beyond the compromise of the data that the annex itself manages.

Comment by basak Fri May 24 15:47:14 2013

I apologise for incorrect information. I was thinking about defaults when using the webapp.

I have verified that embedcreds=yes stores the AWS creds, always.
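
So for the use case above, a sketch with placeholder names (note the warning in the embedcreds documentation: with encryption=shared, anyone who can read the git repository can read the embedded creds):

    git annex initremote mys3 type=S3 encryption=shared embedcreds=yes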

Comment by joey Fri May 24 16:45:25 2013
Is it possible to change the S3 endpoint host? I'm running a radosgw with S3 support, which I'd like to define as an S3 remote for git-annex.
Comment by Tobias Fri Aug 23 08:59:32 2013
Yes, you can specify the host to use when setting up the remote. It's actually documented earlier on this very page, if you search for "host". Any S3-compatible host will probably work -- the Internet Archive's S3 does, for example.
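
A hedged sketch for a radosgw or other S3-compatible endpoint (hostname, port, and names are placeholders):

    git annex initremote myrados type=S3 encryption=none \
        host=radosgw.example.com port=80 bucket=mybucket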
Comment by joeyh.name Fri Aug 23 17:39:56 2013