Special remotes: support for MULTIREMOVE

Currently, when we use small chunks for large files, git annex drop can take a long time on external remotes. One possible solution seems to be that we batch multiple REMOVEs in a single call to the special remote so that it can be optimized by it if needed.

This is still not ideal, but seems reasonably safe and easy to do. Thoughts?

RSS Atom

comment 1

This extension to the protocol would only be useful when removing chunks, because otherwise git-annex doesn't have a way to build up a list of keys that are going to be removed, in a way that could usefully be sent to the external special remote together.

For chunks, it has a list of keys. So this is feasible.

I wonder if it's necessary to extend the protocol though. If an external special remote wants to, it can buffer a list of keys that it's been told to remove, and return REMOVE-SUCCESS to each request before actually doing the removal. It could then remove all the buffered keys in a single API call or whatever.

The risk of course is that if the removal fails, or it's interrupted before it can do the removal, it will have incorrectly claimed to remove the keys. And git-annex will have recorded incorrect information and wrongly indicated the removal succeeded. This would not be a good idea for non-chunk keys (although fsck --fast --from the remote could recover from it).

For a set of chunk keys that are all chunks of the same key, though, git-annex doesn't record anything until they've all been successfully removed. Also, it so happens that after asking for all the chunked keys to be removed, git-annex normally[1] then asks for the unchunked key to be removed too. So, a special remote could buffer chunked keys until it sees an unchunked key, and then remove them all efficiently, and reply to the removal of the unchunked key with the combined result of all the removals.

[1] The exception is that, if the special remote is not currently configured to use chunking, git-annex happens to remove the unchunked key first, followed by all the chunked keys. I don't think there is a good reason for this in removal; it's a useful optimisation for retrieving content that happens to affect removal too.

Comment by joey — Tue Sep 13 16:02:58 2022

Remove comment

comment 2

I don't know that the above comment is really a good idea for an external remote to try to implement. Needing to know about chunk keys is not too bad, but it also relies on details of git-annex's implementation.

But it seems worth considering possibilities like that, since this extension would only be used in such a relative corner case.

Or possibly considering ways to generalize the idea to be usable in more cases..

Along those lines, it occurs to me that the async extension to the protocol is somewhat similar, since git-annex can ask the external remote to do several things at the same time. Removals of chunk keys are not currently run concurrently, but they could be. An external remote could then gather together some number of concurrent remove requests and perform them all in a single API call (or whatever).

But how would the external remote know when it's seen all the remove requests for chunks of a key? It seems like it would need to use a heuristic, like no new requests in some amount of time means git-annex is waiting on it to remove everything it's been requested to remove.

So it might be that a protocol extension would be useful, some way for git-annex to indicate that it is blocked waiting on current requests to finish. That seems more general purpose than a MULTIREMOVE extension. For example, git-annex could also send it when retrieving chunks. (Although retrieving chunks is also not currently done concurrently.)

(There's also a question of how many concurrent removals of chunk keys it would make sense for git-annex to request at the same time. It could request removing all chunks concurrently but if the special remote needs to do much work or use resources for each request, that might not be good. It would probably be more natural to use something based on -J.)

Comment by joey — Tue Sep 13 16:26:06 2022

Remove comment

My current use case

Thanks for the response, Joey.

Let me start by providing more details the use case where I noticed this slowness.

I was using a slow remote with a lot of chunks. I stopped the upload and wanted to do a cleanup of the uploaded chunks. That's when I noticed that git-annex was requesting a removal of each chunk individually, even ones that never actually got uploaded.

In this particular case, I could "preload" the data since I knew which chunks were valid and which ones weren't to make it faster (though I actually just waited it out)

Also, like you mentioned, this MULTIREMOVE is most useful for this specific case so a more generic solution will definitely be much better.

Comment by prancewit — Tue Sep 13 19:45:23 2022

Remove comment

comment 4

At a high level, I see 2 possible ways in which special remotes can optimize for larger queries

Pre-fetch or cache state of existing keys (Mostly useful only for no-op requests. For instance, pre-fetch the list of keys in the remote enabling no-op REMOVEs, but hard to tell if there's been a separate change since the fetch)
Batch multiple requests in a single call. (Batching can be done before or after sending the SUCCESS response to git-annex with corresponding results)

So it might be that a protocol extension would be useful, some way for git-annex to indicate that it is blocked waiting on current requests to finish.

I can think of a few related ways to do this:

Have the remote send ACKs to notify that it's ready for the next request, and send SUCCESS when the request is actually completed. The remote can then have the flexibility to run them in whatever batch/async manner suitable. In the case of chunk-keys, git-annex could rapidly send successive keys in sequence since there's no additional lookup required making it pretty efficient.
Have git-annex send some kind of group identifier (all chunks of same key might be grouped together) or delimiter(eg: GROUP_COMPLETED). This acts as a hint that these requests could be batched together without any obligation from the remote to do so. Coupled with a guarantee that all items in one group will be sent sequentially, the first item that belongs to a different group provides a guarantee that the previous group is completed. In this case, the SUCCESS for the last item could be taken to mean that the entire group is completed. One issue here is that this could leak some information in encrypted repositories.
Define a CACHE_TIMEOUT_SECONDS. This could be used by the remote to decide if any pre-fetched or cached data can be trusted or if they should be re-checked. Git-annex would use this during merge/sync with other remotes to determine if there's a conflict that needs to be handled differently. (Seems too complicated TBH, but trying to see how we can make pre-fetch/caching work)

Comment by prancewit — Tue Sep 13 21:36:38 2022

Remove comment

comment 5

I think what's needed is essentially a way for the external remote, when it gets a request, to ask git-annex, "is there anything else I could do at the same time?"

This way the external remote can build up a list of requests of whatever size makes sense for it. And when git-annex answers "no, nothing else", the external remote knows that it needs to process the current list it has built up; git-annex is blocked waiting on the response.

That seems to capture what is needed in the simplest way possible (aside from just implementing a special purpose MULTIREMOVE that is!)

At the protocol level, it could look something like this:

REMOVE keya
ANY-MOREa
REMOVE keyb
ANY-MORE
REMOVE keyc
ANY-MORE
NO-MORE
REMOVE-SUCCESS keya
REMOVE-FAILURE keyb permission denied somehow
REMOVE-SUCCESS keyc

But, implementation of that is not generic. removeKey sends REMOVE and waits for a REMOVE-SUCCESS/FAILURE. So it would need to handle the ANY-MORE by sending the next key. So, removeKey would need to be changed to take a list of keys, and return a list of results. Which seems unsatisfying, since it's the same change that would be needed to implement MULTIREMOVE. Any other actions would also need to be changed to take lists in order to support ANY-MORE.

What if it were only used with the async protocol? Then it could look like this:

J 1 REMOVE keya
J 1 ANY-MORE
J 2 REMOVE keyb
J 1 SENT-MORE 2
J 2 ANY-MORE
J 3 REMOVE keyc
J 2 SENT-MORE 3
J 3 ANY-MORE
J 3 NO-MORE
J 1 REMOVE-SUCCESS keya
J 2 REMOVE-FAILURE keyb permission denied somehow
J 3 REMOVE-SUCCESS keyc

This is more complicated due to using the async protocol. And also it's complicating what the remote needs to do, because it has to parse the SENT-MORE to find the job number that the next request was sent to. (Needed since there could be other, unrelated jobs being started at the same time.)

The advantage is, it should be possible to implement that in git-annex without changing removeKey but only handleRequestKey. That will need a TVar that contains related requests. To handle ANY-MORE, it can get the next item from the TVar, start a thread to run it, and send the job number in SENT-MORE.

A separate function runMulti can then discover that TVar somehow, and call removeKey with the first key, populating the TVar with the rest, etc. Resulting in a list of removeKey responses.

Would each call to removeKey need to have the TVar be passed to it? Maybe not.. It seems to me that it would be ok to use the same TVar for all removeKeys for the same external remote, even ones that are handling different jobs. Eg, when git-annex -J2 is dropping two different keys, which each has a set of chunk keys, it's ok for the external remote to collect a set of keys to remove that combines both sets of chunk keys. It's even a win, because it lets the special remote batch more actions together!

And this is generalized; it could support checkPresent and storeKey and retrieveKeyFile, by using a separate TVar for each. Doing those operations on chunks would need further changes to use runMulti.

It might even be possible to get an external remote to remote bundle together 10 checkPresent calls when handling git-annex drop -J10 (for example). Which would allow for a nice speedup even when not using chunks. This would I suppose need an equivilant runSingle that's used when calling checkPresent? Something like that.

(I'm not sure how runMulti would discover the TVar to use for a given action and remote. Implementing that does not seem straightforward. One way, though hopefully not the only way, would be to make removeKey return a tuple of the TVar and an action that actually handles the removal.)

Comment by joey — Thu Sep 15 16:41:44 2022

Remove comment

comment 6

I love these ideas Joey. Couple of comments.

I think what's needed is essentially a way for the external remote, when it gets a request, to ask git-annex, "is there anything else I could do at the same time?"

100% agreed. ANY-MORE is super generic and will be useful in all case. There's a simpler but slightly less generic option too. Instead of git-annex asking the remote after every action, it could instead ask first about the number of actions the remote wants. Something like,

git-annex -- remote
-> REMOVE-COUNT
<- REMOVE-COUNT 3
-> REMOVE keya
-> REMOVE keyb
-> REMOVE keyc
<- REMOVE-SUCCESS keya
<- REMOVE-FAILURE keyb permission denied somehow
<- REMOVE-SUCCESS keyc
-> REMOVE-COUNT
<- REMOVE-COUNT 3
.
.
.

What we primarily lose here is that the remote does not have any information about the current action to base the decision on, and is instead a generic decision based on the action type (For instance, the remote could ask for 100 REMOVEs at a time, but only 10 ADDs). I'm not sure if there are use-cases where this additional information is useful to the remote, since remotes are typically "dumb" (What I mean is that by design the remotes typically doesn't understand the intent behind the action, they just need to know which file to add/remove/delete etc).

This is more complicated due to using the async protocol. And also it's complicating what the remote needs to do, because it has to parse the SENT-MORE to find the job number that the next request was sent to. (Needed since there could be other, unrelated jobs being started at the same time.)

I'm assuming you mean that it is difficult to batch related operations in a single job due to implementation details in git-annex. However, just in case I'm misunderstanding that, I'd like to see if there's a good way for us to batch all related operations in one job instead of splitting them across jobs. My main argument for this case would be that if we send related keys/operations to multiple jobs, the only performance benefit we get is in IO with git-annex. This might be worth losing for the simplicity in which each job is guaranteed to get all related keys/operations. Maybe something like,

git-annex -- remote
-> J 1 REMOVE-COUNT
-> J 2 REMOVE-COUNT
<- J 1 REMOVE-COUNT 3
<- J 2 REMOVE-COUNT 3
-> J 1 REMOVE keya
-> J 1 REMOVE keyb
-> J 1 REMOVE keyc
-> J 2 REMOVE keyd
<- J 1 REMOVE-SUCCESS keya
<- J 1 REMOVE-FAILURE keyb permission denied somehow
-> J 2 REMOVE keye
-> J 2 REMOVE keyf
<- J 1 REMOVE-SUCCESS keyc
<- J 2 REMOVE-SUCCESS keyd
<- J 2 REMOVE-SUCCESS keye
<- J 2 REMOVE-SUCCESS keyf

It could potentially have slightly less performance (For instance, all keys might belong to one batch, so we are effectively single threaded), but the simplicity seems useful. I'm not super clear on the haskell/git-annex implementation which might be why I'm not understanding this.

Comment by prancewit — Fri Sep 16 08:39:08 2022

Remove comment

comment 7

I'm not sure if there are use-cases where this additional information is useful to the remote

It would be useful when the actions transfer content, since the remote could look at the key sizes and accumulate a reasonable amount of data for it to handle in one operation.

This is more complicated

By that I meant, it's more complicated for the poor author of the external special remote, who has to learn about the async protocol.

But not using the async protocol will always have the problem I discussed in comment #5, since the implementation in git-annex is always to send a command like REMOVE and wait for a response like REMOVE-SUCCESS. Using the async protocol lets it keep that implementation simple, by multiplexing the multiple removes into different jobs, which when demultiplexed look the same as a single remove looks now.

Comment by joey — Fri Sep 16 16:27:58 2022

Remove comment

Add a comment