The OOM is S3 memory leaks; fixed in the s3-aws branch.
Yeah, GET of a bucket is doable. Another problem with it, though: if the bucket has a lot of contents, such as many files, or large files split into many chunks, all of that has to be buffered in memory or processed as a stream. It would make sense in operations where git-annex knows it wants to check every key in a bucket; git annex unused --from $s3remote is the case that springs to mind where it could be quite useful to do that. Integrating it with get, not so much.
I'd be inclined to demote this to a wishlist todo item to try to use bucket GET for unused. And/or rethink whether it makes sense for copy --to to run in --fast mode by default. I've been back and forth on that question before, but just from a runtime perspective, not from a 13 cents perspective.
Oh jeez, I screwed that up wrt HEAD and GET. Sorry. The cost per HEAD on Google is 1/10 the price of GET, so we're talking $.13 to HEAD my 130k-file annex, which is totally reasonable.
One can GET a bucket, which is what I was looking at. This returns up to 1000 elements of its contents (and there's a way to iterate over larger buckets). Of course this would only be useful if the majority of files in the bucket were of interest to git-annex, and it sounds like more trouble than it's worth at the prices I'm seeing.
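The iteration over larger buckets works by paging: each bucket GET returns at most 1000 keys plus a marker for the next page. A rough Python sketch of that loop (an illustration only, not git-annex code; fake_bucket_get and the key names are made up, standing in for the real S3 "GET Bucket" call):

```python
# Sketch of paging through a bucket listing, 1000 keys per request.
# fake_bucket_get stands in for the real S3 bucket GET.

def fake_bucket_get(keys, marker=None, max_keys=1000):
    """Return one page of keys after `marker`, and the next marker (or None)."""
    start = 0 if marker is None else keys.index(marker) + 1
    page = keys[start:start + max_keys]
    next_marker = page[-1] if start + max_keys < len(keys) else None
    return page, next_marker

def list_all_keys(keys):
    """Iterate the whole bucket, one page at a time."""
    marker = None
    while True:
        page, marker = fake_bucket_get(keys, marker)
        yield from page
        if marker is None:
            break

bucket = ["SHA256E-s%d--%04d" % (n, n) for n in range(2500)]
assert list(list_all_keys(bucket)) == bucket  # three pages: 1000 + 1000 + 500
```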
There might be a throughput improvement to be had by keeping the connection alive, although in my brief investigation, I think there may be a larger gain to be had by pipelining the various steps. Based on the fact that git-annex oomed when trying to upload a large file from my rpi, it seems like maybe the whole file is encrypted in memory before it's uploaded? And certainly the HEAD(s) appear not to be done in parallel with the upload.
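The fix for that kind of OOM would presumably be streaming: process the file in fixed-size chunks so peak memory stays near one chunk, instead of encrypting the whole file in memory first. A sketch of the idea (my guess at the approach, not git-annex's actual code; hashlib stands in for the encryption step):

```python
# Streaming a file through a transform chunk by chunk, so memory use is
# bounded by the chunk size rather than the file size.
import hashlib
import io

def stream_process(fileobj, chunk_size=64 * 1024):
    """Feed the file through a transform one chunk at a time."""
    digest = hashlib.sha256()
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)  # encrypt/upload would happen here instead
    return digest.hexdigest()

data = b"x" * (1024 * 1024)  # 1 MiB of data
streamed = stream_process(io.BytesIO(data))
assert streamed == hashlib.sha256(data).hexdigest()
```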
Sorry again for that HEAD/GET fail.
When it resumes, it will start at 0% but jump forward to the resume point pretty quickly, after verifying which chunks have already been sent.
If any full chunk gets transferred, I'd expect it to resume. It may not be very obvious that this is happening for smaller files.
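A sketch of how such a resume point might be computed (my guess at the logic, not git-annex's actual code; the function name and chunk numbering are made up):

```python
# Resume after the last contiguous full chunk already on the remote;
# a partially transferred chunk is simply retransmitted.

def resume_offset(file_size, chunk_size, chunks_present):
    """chunks_present: set of 0-based chunk indexes already on the remote.
    Returns the byte offset to resume from."""
    offset = 0
    index = 0
    while index in chunks_present and offset + chunk_size <= file_size:
        offset += chunk_size
        index += 1
    return min(offset, file_size)

# A 10 MiB file with 1 MiB chunks, first 3 chunks already uploaded:
assert resume_offset(10 * 2**20, 2**20, {0, 1, 2}) == 3 * 2**20
# Nothing uploaded yet: start at byte 0.
assert resume_offset(10 * 2**20, 2**20, set()) == 0
```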
I have been running git annex testremote against S3 special remotes today, and have not managed to reproduce this problem (using either the old S3 or the new AWS libraries). It could be anything, including a problem with your network or the network between you and the S3 endpoint. Have you tried using a different S3 region?
The S3 library that git-annex is using does not support the authentication method that this region uses.
It is supported by the aws library that git-annex uses in the s3-aws branch in git, and I already added the region there this morning.
I can't merge s3-aws yet; the necessary version of the aws library is not yet available in e.g. Debian. And even upgrading aws from cabal seems to result in dependency hell, due to its needing a newer version of scientific. This should all sort itself out in time.
If you need this region, you'll need to try to build git-annex's s3-aws branch, for now.
The man page documents this:
To avoid contacting the remote to check if it has every
file when copying --to the repository, specify --fast
As you've noted, this relies on the location tracking information being up-to-date; if it's not, it might skip copying a file that the tracking says the remote has but that the remote no longer actually has. Otherwise, it's fine to use copy --fast --to remote, or copy --not --in remote --to remote, which is functionally identical.
The check is not a GET request, it's a HEAD request, to check if the file is present. Does S3 have a way to combine multiple HEAD requests in a single http request? That seems unlikely. Maybe it is enough to reuse an open http connection for multiple HEADs? Anything other than a single HEAD request per key would not fit well into git-annex, but ways to do more caching of open http connections are being considered.
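The connection-reuse idea can be sketched in plain Python (an illustration only; git-annex itself is Haskell, and the throwaway local server below just stands in for the S3 endpoint):

```python
# Many HEAD requests over one persistent HTTP connection, instead of a new
# TCP handshake per key. A local server plays the role of the S3 endpoint.
import http.client
import http.server
import threading

class Handler(http.server.BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"          # keep-alive by default
    def do_HEAD(self):
        self.send_response(200 if self.path != "/missing" else 404)
        self.send_header("Content-Length", "0")
        self.end_headers()
    def log_message(self, *args):          # silence request logging
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
statuses = []
for key in ["/key1", "/key2", "/missing"]:
    conn.request("HEAD", key)              # reuses the same socket each time
    resp = conn.getresponse()
    resp.read()
    statuses.append(resp.status)
conn.close()
server.shutdown()
print(statuses)  # [200, 200, 404]
```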
I forgot to include the creds in git annex info for glacier; fixed that now.
It seems that changing the creds with enableremote did embed them into your git repository, but it neglected to update the .git/annex/creds/$remoteuuid file that caches the creds locally. So I think your old creds are still cached there and still being used, which explains why the file is not found in glacier: the wrong creds are being used to access it! You can work around this by deleting the .git/annex/creds/$remoteuuid file corresponding to the uuid of the glacier remote. (You can also look at that file and compare it with what the creds are supposed to be.) I have fixed git-annex enableremote to update that creds file.
Also, it looks like you did not fall afoul of the insecure embedded creds problem! If you had, this new version of git-annex would be complaining that it had detected that problem. If you want to double-check that, the s3creds= value is base64 encoded, and when run through base64 -d, it should yield a gpg encrypted file. If your repo did have that problem, it would instead decode to the creds in clear text.
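That double-check could be scripted roughly like so (the sample values are made up; the "gpg encrypted" heuristic here is just the OpenPGP formats: a binary packet byte with the high bit set, or the ASCII-armor header):

```python
# Decode an s3creds= value and guess whether it is gpg-encrypted or cleartext.
import base64

def looks_encrypted(creds_b64):
    raw = base64.b64decode(creds_b64)
    if raw.startswith(b"-----BEGIN PGP MESSAGE-----"):
        return True                           # armored gpg output
    return bool(raw) and raw[0] & 0x80 != 0   # binary OpenPGP packet header

# Cleartext creds (the insecure case) decode to readable text:
bad = base64.b64encode(b"AKIAEXAMPLE:secretkey")
assert not looks_encrypted(bad)
# A gpg-encrypted file starts with a high-bit packet byte, e.g. 0x85:
good = base64.b64encode(b"\x85\x02\x0c...ciphertext...")
assert looks_encrypted(good)
```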
I struggle to see how you could draw that conclusion from what I said.
git-annex will work fine in an existing git repository. You can mix regular git commands like git add, git push, git pull, git merge with git-annex commands like git annex add, git annex copy --to origin, git annex get, git annex merge, in the same repository.
The git annex sync command effectively runs git commit; git pull; git annex merge; git push; git annex copy --to origin; git annex get. If you don't want to run all those commands at once, don't run git annex sync; that will not prevent you from using git-annex in any way.