ATM, parallel download/upload is implemented by running multiple external special remote processes. If such a special remote is also hungry for file ids etc., we have already seen cases where this approach was prohibitive or required various locking mechanisms.
Some external remotes can natively support transferring multiple files in parallel. My use case is globus. With globus I can queue up multiple transfers, which then happen in parallel, and only poll their status once in a while to see which have finished and to provide overall progress. Setting up a transfer queue and initiating a transfer has notable overhead with globus, so transferring many small files "one at a time" would also be very slow.
That is why I thought it would be quite cool if the special remote protocol supported "queuing up" transfers: the remote would report back to annex a transfer-id per file, then report back whenever any transfer-id finishes, and potentially report progress per ID and/or overall. That would remove the burden of parallel execution from external special remotes and allow very efficient transfer mechanisms in some cases.
Just an idea
It could look like this:
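A hypothetical exchange, using the message names discussed below (the exact syntax here is a sketch, not the final extension):

```
git-annex: TRANSFER RETRIEVE Key1 tmpfile1
remote:    TRANSFER-ASYNC
git-annex: TRANSFER RETRIEVE Key2 tmpfile2
remote:    TRANSFER-ASYNC
remote:    PROGRESS-ASYNC Key1 10240
remote:    PROGRESS-ASYNC Key2 20480
remote:    TRANSFER-SUCCESS RETRIEVE Key1
remote:    TRANSFER-SUCCESS RETRIEVE Key2
```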
Having separate job ids does not seem necessary since it has the key.
When both sides negotiate the ASYNC extension, the special remote should, for simplicity, always use the async protocol, and not mix in the non-async protocol.
From when git-annex sends TRANSFER until the special remote sends TRANSFER-ASYNC, the special remote can send whatever other special remote messages it needs in order to handle the transfer, eg DIRHASH and GETCREDS. git-annex will avoid starting any other transfers until after TRANSFER-ASYNC. This avoids what seems like it could be a lot of complexity in the special remote. (Imagine if git-annex sent another TRANSFER at the same time the special remote had sent DIRHASH; the special remote would need to defer handling that TRANSFER until it got the reply.) And after TRANSFER-ASYNC, the special remote should refrain from sending anything further for that transfer except for PROGRESS-ASYNC and TRANSFER-SUCCESS/TRANSFER-FAILURE.
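For instance, the setup window before TRANSFER-ASYNC might look like this (DIRHASH and VALUE are existing protocol messages; the interleaving shown is a sketch):

```
git-annex: TRANSFER STORE Key1 tmpfile1
remote:    DIRHASH Key1
git-annex: VALUE 7a1/e3f/
remote:    TRANSFER-ASYNC
git-annex: TRANSFER STORE Key2 tmpfile2
```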
I don't think the implementation in git-annex would be very hard: basically, make a worker thread that's used for async transfers, which takes an external process from the pool and keeps it for its own use as long as any transfers are running.
Note: This would be a further roadblock to implementing more extensive retries to mask transient failures, because there would be this one external process handling multiple transfers, and killing it would stop them all rather than just the one that is wanted to be stopped.
Unless the protocol included a way to cancel a single transfer, eg "TRANSFER CANCEL Key1". But then the external program would need to support canceling a transfer, which some protocols or libraries can't do cleanly, leading to the same kind of resource cleanup issues that are blocking that todo in git-annex.
Thank you Joey for considering implementing this! Some additional tune-up might be desired for optimal "performance" if possible. Having globus in mind: it is better to "stage" all the transfers to be done and then commence the transfer. So maybe it could be something like this: if "BATCH" is appended to the TRANSFER-ASYNC response from the remote, then at the end of "scheduling" all the transfers for now, git-annex would trigger the actual transfer with an empty TRANSFER-ASYNC.

Also, per-file progress might not be available (yet to check more details) -- there is only a total progress for a job (multiple files). So maybe BATCH could be a BATCH-ID of a kind, and TRANSFER-PROGRESS could be reported not for a key but for the BATCH-ID (the special remote could guarantee no overlap with KEYS values by using UUIDs or alike)?

Re retries and failures: maybe if the special remote would also return TRANSFER-SUCCESS RETRIEVE BATCH-ID at the end, annex could see if any KEY was not reported success on, and then retry only those? A sketch of such a batched exchange follows.
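A hypothetical batched exchange (all syntax here is speculative, including the BATCH-ID value):

```
git-annex: TRANSFER RETRIEVE Key1 tmpfile1
remote:    TRANSFER-ASYNC BATCH-7f3a
git-annex: TRANSFER RETRIEVE Key2 tmpfile2
remote:    TRANSFER-ASYNC BATCH-7f3a
git-annex: TRANSFER-ASYNC
remote:    TRANSFER-PROGRESS BATCH-7f3a 10240
remote:    TRANSFER-SUCCESS RETRIEVE Key1
remote:    TRANSFER-SUCCESS RETRIEVE Key2
remote:    TRANSFER-SUCCESS RETRIEVE BATCH-7f3a
```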
I see why you want this, but how is git-annex supposed to know when it has the right-size batch of transfers ready? It could look at the -J number, and wait until it has sent that many TRANSFER requests, and call that a batch. But it could be that some jobs run transfers while other jobs do other things (eg CHECKPRESENT, or checking that the content is locally present and skipping the download), leading to either a deadlock or a long stall before beginning any transfers.
One strategy that would work for git-annex is to start the first transfer immediately. While that transfer is running, hold off on starting any more, batching up the requests. Send each batch of transfers after the last batch finishes, with some messages framing a batch of transfers for remotes that care.
What that would naturally result in, at -Jn, is batches of size 1, n-1, n-1, ..., m (with m < n), unless transfers were happening faster than git-annex was able to queue up new ones. So that's pretty good. But I don't know if it's ideal for every special remote.
A special remote could implement the same strategy with no help from git-annex, and no changes to my proposed protocol. All you have to do is wait for that first TRANSFER request, call it a batch and start it, and gather the next batch, etc.
Or, if you know your remote works well with a certain batch size of transfers, you could gather up TRANSFER requests until you have the optimal number, or until a timeout, and then start the batch.
I don't know if that would work for globus, but it seems like a valid strategy for some hypothetical remotes. Since a remote can implement either strategy, maybe it's better to let them make use of remote-specific knowledge and not put the explicit batching in git-annex?
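To make that concrete, here is a minimal sketch of the remote-side batching strategy, in Python. It assumes the proposed TRANSFER-ASYNC / TRANSFER-SUCCESS messages from this discussion, and the batch size, timeout, and submit_batch backend call are all hypothetical:

```python
#!/usr/bin/env python3
# Sketch of an external special remote that batches TRANSFER requests
# itself, needing no batching support from git-annex. The async messages
# (TRANSFER-ASYNC, TRANSFER-SUCCESS) follow the proposal discussed here;
# the real extension's syntax may differ.
import sys
import queue
import threading

BATCH_SIZE = 20      # hypothetical sweet spot for the backend (eg globus)
BATCH_TIMEOUT = 2.0  # seconds: start a partial batch after this long

pending = queue.Queue()

def send(msg):
    sys.stdout.write(msg + "\n")
    sys.stdout.flush()

def submit_batch(batch):
    # Hypothetical backend call: stage all files as one bulk job, poll it
    # until done, then report per-key results back to git-annex.
    for direction, key, tmpfile in batch:
        send("TRANSFER-SUCCESS %s %s" % (direction, key))

def batcher():
    while True:
        batch = [pending.get()]  # block until the first request arrives
        while len(batch) < BATCH_SIZE:
            try:
                batch.append(pending.get(timeout=BATCH_TIMEOUT))
            except queue.Empty:
                break  # timed out: start with whatever has accumulated
        submit_batch(batch)

send("VERSION 1")
threading.Thread(target=batcher, daemon=True).start()

for line in sys.stdin:
    words = line.rstrip("\n").split(" ", 3)
    if words[0] == "TRANSFER" and len(words) == 4:
        send("TRANSFER-ASYNC")  # acknowledge; the transfer proceeds async
        pending.put((words[1], words[2], words[3]))
    # INITREMOTE, PREPARE, extension negotiation etc. omitted for brevity.
```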
Might not be really useful if that first transfer is tiny, so I would not have bothered with complicating the logic for that.
Yes, it could be done this way I guess, but it would complicate special remote implementations (a reasonable tradeoff IMHO, iff no easier solution is found).
Maybe if async remotes were not "parallelizable" (i.e. only a single such remote would start, even in the case of -J > 1), it would simplify this?

Isn't there some point in time when annex knows that it is done issuing all the TRANSFER requests it knew to issue, and could just send an "initiate TRANSFER" to those special remotes it knows are waiting for it? I'm assuming only one remote is used for transfers; others could be used for other things.
Not really; it starts as many threads as it's allowed to as soon as it has some files to work on, and as each thread finishes it starts a new one. The only time it knows it's done is after it has checked the whole tree of files, but that could be much, much later.
The async appendix has a draft protocol extension.
I improved on the design, so any and all requests can be handled async, or sequentially, as the external special remote prefers. Had to add async job ids, but the protocol simplicity was worth it.
(Implementation will be something like, a thread relaying to and from the special remote, with requests sent to it when it's not blocked, and with its async replies sent back to the corresponding requester based on the JobId.)
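As a rough illustration of that relay (not git-annex's actual code), a dispatcher keyed on JobId could look like this; the "J <jobid> <reply...>" framing is invented for illustration, the real syntax is in the async appendix:

```python
# One thread reads the external process's stdout and routes each async
# reply to the requester that owns its JobId.
import threading
import queue

class Relay:
    def __init__(self, proc_stdout):
        self.requesters = {}  # JobId -> queue of replies for that requester
        self.lock = threading.Lock()
        threading.Thread(target=self._reader, args=(proc_stdout,),
                         daemon=True).start()

    def register(self, jobid):
        # Called by a requester before it sends a request with this JobId.
        q = queue.Queue()
        with self.lock:
            self.requesters[jobid] = q
        return q

    def _reader(self, proc_stdout):
        for line in proc_stdout:
            words = line.split()
            if len(words) >= 2 and words[0] == "J":  # invented framing
                with self.lock:
                    q = self.requesters.get(words[1])
                if q is not None:
                    q.put(" ".join(words[2:]))
```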
Implementation is underway in the asyncexternal branch.

ASYNC extension is implemented. The protocol went through several iterations and ended up at about the simplest and cleanest possible way to do it.