It appears that git-annex issues one GET request to S3 / Google Cloud for every file it tries to copy, if you don't pass --fast. (I could be wrong; I'm basing this on the fact that each "checking ..." line takes about the same amount of time, and that it's slow enough to be hitting the network.)
Amazon lets you GET 1000 objects in one GET request, and afaict a request that returns 1000 objects costs just as much as a request that returns 1 object. The cost of GETing every file in my annex is nontrivial: Google charges $0.01 per 1,000 GETs, and my repo has 130k objects, so that's $1.30 per run, compared to a monthly storage cost of under $10. This means that if I want to back up my files more than, say, once a week, I need to write a script that parses the JSON output of `git annex whereis` and uploads with --fast only the files that aren't already present in the cloud. It also means that I have to trust the output of whereis.
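For concreteness, here's roughly the script I mean, as a sketch: it assumes the JSON field names that `git annex whereis --json` emits on my version, and a remote named "cloud"; both may need adjusting.

```python
#!/usr/bin/env python3
# Sketch: copy only the files whose location log says they're not on REMOTE.
# Assumes `git annex whereis --json` emits one JSON object per line, with a
# "whereis" list describing which remotes have each file.
import json
import subprocess

REMOTE = "cloud"  # placeholder remote name

out = subprocess.run(
    ["git", "annex", "whereis", "--json"],
    capture_output=True, text=True, check=True,
).stdout

missing = []
for line in out.splitlines():
    record = json.loads(line)
    # Each whereis entry has a "description"; for a named remote this
    # is (or at least contains) the remote's name.
    if not any(REMOTE in r["description"] for r in record["whereis"]):
        missing.append(record["file"])

# Copy in batches to stay under argv length limits; --fast trusts the
# location log instead of checking the remote for each file.
for i in range(0, len(missing), 1000):
    subprocess.run(
        ["git", "annex", "copy", "--fast", "--to", REMOTE, "--"]
        + missing[i : i + 1000],
        check=True,
    )
```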
All those GETs also slow down the non-fast copy, and this also applies to other kinds of remotes.
There are a number of ways one could implement this. One would be a command that updates the whereis data from the remote, plus a parameter to `copy` (maybe you already have it) that's like --fast but skips files that are already present (maybe this is what --fast already does, but from a quick check it doesn't seem to). Because of the way git annex names files, I think it would be hard to coalesce GETs during a `copy` command, but it could be done.
Anyway, please don't consider this a high-priority request; I can get by as-is, and I <3 git annex.
The man page documents this: as you've noted, --fast has to rely on the location tracking information being up-to-date, so if it's not, it might miss copying a file to the remote that the remote doesn't currently have but used to. Otherwise, it's fine to use `copy --fast --to remote`, or `copy --not --in remote --to remote`, which is functionally identical.

The check is not a GET request, it's a HEAD request, to check whether the file is present. Does S3 have a way to combine multiple HEAD requests into a single HTTP request? That seems unlikely. Maybe it's enough to reuse an open HTTP connection for multiple HEADs? Anything that doesn't map to a single HEAD request per key would not fit well into git-annex, but ways to do more caching of open HTTP connections are being considered.
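To illustrate what the per-key check amounts to, a sketch in Python with boto3 (purely for illustration; the real check is Haskell inside git-annex's S3 remote):

```python
# Presence check sketch: one HEAD request per key. Reusing a single
# client also reuses its pooled HTTP connections across HEADs, which is
# the kind of connection caching mentioned above.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def key_present(bucket: str, key: str) -> bool:
    try:
        s3.head_object(Bucket=bucket, Key=key)  # a single HEAD request
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "404":
            return False
        raise
```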
Oh jeez, I screwed that up wrt HEAD and GET. Sorry. The cost per HEAD on Google is 1/10 the price of a GET, so we're talking $0.13 to HEAD my 130k-file annex, which is totally reasonable.
One can GET a bucket, which is what I was looking at. This returns up to 1000 elements of its contents (and there's a way to iterate over larger buckets). Of course this would only be useful if the majority of files in the bucket were of interest to git-annex, and it sounds like more trouble than it's worth at the prices I'm seeing.
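Here's what that iteration looks like, sketched with boto3 (the library and names are my illustration, not anything git-annex uses):

```python
# Listing sketch: each GET of the bucket returns up to 1000 keys, and
# the paginator follows continuation tokens to cover larger buckets.
import boto3

s3 = boto3.client("s3")

def bucket_keys(bucket: str):
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):  # one GET per ~1000 keys
        for obj in page.get("Contents", []):
            yield obj["Key"]
```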
There might be a throughput improvement to be had by keeping the connection alive, although from my brief investigation I think there may be a larger gain in pipelining the various steps. Based on the fact that git-annex OOMed when trying to upload a large file from my rpi, it seems like maybe the whole file is encrypted in memory before it's uploaded? And certainly the HEAD(s) appear not to be done in parallel with the upload.
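If the whole file really is being buffered, the streaming alternative on the S3 side looks something like this (again a boto3 sketch under that assumption, not git-annex's actual code; bucket and key names are placeholders):

```python
# Streaming upload sketch: upload_fileobj reads the source in chunks and
# does a multipart upload under the hood, so peak memory stays bounded
# rather than growing with the file size.
import boto3

s3 = boto3.client("s3")
with open("big.file", "rb") as src:
    s3.upload_fileobj(src, "my-bucket", "annex/KEY--example")
```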
Sorry again for that HEAD/GET fail.
The OOM is the S3 memory leaks bug; fixed in the s3-aws branch.
Yeah, GET of a bucket is doable. Another problem with it, though: if the bucket has a lot of contents, such as many files or large files split into many chunks, all of that has to be buffered in memory or processed as a stream. It would make sense in operations where git-annex knows it wants to check every key in a bucket.
`git annex unused --from $s3remote` is the case that springs to mind where it could be quite useful to do that. Integrating it with `get`, not so much.

I'd be inclined to demote this to a wishlist todo item to try to use bucket GET for `unused`. And/or rethink whether it makes sense for `copy --to` to run in --fast mode by default. I've been back and forth on that question before, but just from a runtime perspective, not from a 13-cents perspective.
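To make the `unused` idea concrete, it would amount to a set difference over a streamed bucket listing, something like this sketch (all names hypothetical; git-annex would do the equivalent internally in Haskell, and a real S3 remote may prefix its object names):

```python
# Hypothetical: diff the keys actually in the bucket against the keys
# the repository still references, streaming the listing page by page.
import subprocess
import boto3

s3 = boto3.client("s3")

def bucket_keys(bucket: str):
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            yield obj["Key"]

# Keys the repo still wants, via `git annex find` (--include=* lists all
# files, not just those present locally; assumes bucket object names are
# plain annex keys).
wanted = set(
    subprocess.run(
        ["git", "annex", "find", "--include=*", "--format=${key}\\n"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
)

unused = [k for k in bucket_keys("my-annex-bucket") if k not in wanted]
```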