Having very big files in chunked special remotes makes operations like fsck, move or drop extremely slow. E.g. a 20 GiB movie in a special remote with the recommended chunk size of 50 MiB results in around 400 lookups each time the presence of the key is checked. One could increase the chunk size, but that has other disadvantages: a) chunks are apparently still buffered in memory, b) one would lose more transfer progress if the special remote does not implement resumable transfers, and c) it would render git annex inprogress useless for streaming purposes. Letting the remote know about all the keys that are about to be checked would allow it to optimise the request to the remote server.
So I propose a protocol extension, CHECKPRESENT-MULTI Key1 Key2..., to let the remote do the optimised lookup, after which it can reply with a list of present keys: PRESENT Key1 Key2... This list can be empty if none of the keys are present: PRESENT
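To make the proposed exchange concrete, here is a minimal sketch of how an external special remote might answer such a request. It assumes the message names from the proposal above (CHECKPRESENT-MULTI and PRESENT are not part of the current protocol), and `list_present_keys` is a hypothetical callable standing in for whatever optimised lookup the remote can do.

```python
def handle_checkpresent_multi(line, list_present_keys):
    """Answer one proposed CHECKPRESENT-MULTI request line.

    `list_present_keys` is a hypothetical callable: it receives the requested
    keys and returns those stored on the remote, ideally in one backend request.
    """
    parts = line.split()
    if not parts or parts[0] != "CHECKPRESENT-MULTI":
        return "UNSUPPORTED-REQUEST"
    keys = parts[1:]
    present = set(list_present_keys(keys))
    # Keep the request order; the reply is a bare "PRESENT" if nothing matched.
    return " ".join(["PRESENT"] + [key for key in keys if key in present])


# Illustration with an in-memory stand-in for the remote's storage.
stored = {"Key1", "Key3"}
reply = handle_checkpresent_multi(
    "CHECKPRESENT-MULTI Key1 Key2 Key3",
    lambda keys: [key for key in keys if key in stored],
)
print(reply)  # PRESENT Key1 Key3
```
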
--Lykos
gsutil ls to list the present files, instead of calling gsutil stat on each file individually.
Export remotes don't use chunks, and use CHECKPRESENTEXPORT rather than CHECKPRESENT. And git-annex tries to not buffer the whole worktree in memory, but stream through it, so it can support very large worktrees. So this idea, which I do think is a good idea, seems limited to checking chunks.
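The difference between one listing and many per-file checks can be sketched as follows. This is only an illustration under assumptions: `list_names` is a hypothetical callable that performs a single listing request (for instance something built on gsutil ls) and returns the object names it finds; no real gsutil call is made here.

```python
def all_chunks_present(chunk_keys, list_names):
    """Return True only if every chunk key shows up in a single listing.

    `list_names` is a hypothetical callable that does one listing request
    and returns the names of the objects stored on the remote.
    """
    listed = set(list_names())
    return all(key in listed for key in chunk_keys)


# Fake remote for illustration; it records how many listing requests happen.
requests = []

def fake_listing():
    requests.append("ls")
    return ["chunk1", "chunk2", "chunk3"]

print(all_chunks_present(["chunk1", "chunk2", "chunk3"], fake_listing))  # True
print(all_chunks_present(["chunk1", "chunk4"], fake_listing))  # False
print(len(requests))  # 2
```

Each call costs one listing request regardless of how many chunk keys are checked, whereas a per-file check would cost one request per key.
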
(I'd probably want to make the chunk handling code only include up to 1 million or so chunk keys in a request, again to avoid using too much memory. 1 million chunk keys needs at most 160 MB of RAM, typically 80 MB or less.)
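Capping the number of keys per request amounts to streaming the keys in fixed-size batches. A rough sketch, assuming a hypothetical helper name `batches` and using the 1 million figure above as the default limit (which works out to about 160 bytes per key at the quoted maximum):

```python
from itertools import islice


def batches(keys, limit=1_000_000):
    """Yield lists of at most `limit` keys without holding all keys at once."""
    iterator = iter(keys)
    while True:
        batch = list(islice(iterator, limit))
        if not batch:
            return
        yield batch


# Small limit for illustration only.
for batch in batches(["k1", "k2", "k3", "k4", "k5"], limit=2):
    print(batch)
```
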
At least for purposes of chunks, the reply to CHECKPRESENT-MULTI only needs to say if all the keys are present. If even one chunk is missing, the object as a whole is not present in the remote. That seems like a useful simplification.
Internally, Remote.checkPresent should probably change to taking a [Key] list. Simpler than adding a whole other method for this.
Remote.External could use CHECKPRESENT when there's one key in the list, and CHECKPRESENT-MULTI when there are multiple, falling back to CHECKPRESENT on an UNSUPPORTED-REQUEST reply. But I think it ought to be an extension to the protocol, to avoid that extra roundtrip.
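The fallback described above might look roughly like this on the requesting side. It is a sketch in Python rather than git-annex's own Haskell, and it rests on assumptions: `send` is a hypothetical callable that sends one protocol line and returns the reply line, the multi-key reply uses the PRESENT Key1 Key2... form from the original proposal, and `try_multi` stands for whether the extension was negotiated (passing False skips the extra roundtrip).

```python
def check_present(keys, send, try_multi=True):
    """Return True only if every key is present on the remote.

    `send` is a hypothetical callable: it sends one request line and
    returns the remote's reply line.
    """
    if try_multi and len(keys) > 1:
        reply = send("CHECKPRESENT-MULTI " + " ".join(keys))
        if reply != "UNSUPPORTED-REQUEST":
            # Assumed reply format from the proposal: "PRESENT Key1 Key2..."
            return set(keys) <= set(reply.split()[1:])
    # Single key, or the remote does not know the multi-key request.
    for key in keys:
        reply = send("CHECKPRESENT " + key)
        if reply == "CHECKPRESENT-FAILURE " + key:
            return False
        if reply != "CHECKPRESENT-SUCCESS " + key:
            raise RuntimeError("presence unknown: " + reply)
    return True


# Fake remote that only understands the existing single-key request.
sent = []

def old_remote(line):
    sent.append(line)
    command, *args = line.split()
    if command != "CHECKPRESENT":
        return "UNSUPPORTED-REQUEST"
    status = "SUCCESS" if args[0] in {"k1", "k2"} else "FAILURE"
    return "CHECKPRESENT-" + status + " " + args[0]

print(check_present(["k1", "k2"], old_remote))  # True
print(len(sent))  # 3
```

With the old remote, checking two keys costs three requests: the rejected multi-key request plus one CHECKPRESENT per key. That extra request is the roundtrip a negotiated extension would avoid.
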
I agree about the simplification. However, when resuming an upload with, say, 400 chunks where only 10 are missing, after CHECKPRESENT-MULTI-FAILURE, we'd need to CHECKPRESENT another 390 keys until we can continue. Sure, the remote could cache the replies, but another idea would be for the remote to reply with the last key in the list that is present.
Example: