universal batch mode

It would help if there were a universal batch mode, where git-annex command lines are given as lines in an input file and are executed as a batch. A batch could contain different git-annex commands (as opposed to different parameters for one command). git-annex could intelligently group, reorder and parallelize the execution, as long as the overall effect of the batch is unchanged. (I.e. commands affecting different keys/paths could be run in parallel; commands repeatedly doing the same thing could be collapsed; git command batching could span different git-annex commands; etc.) I find myself implementing something like that in Python on top of git-annex, but it would be much more efficient and robust if supported natively. Maybe the DataLad project would also find this useful?

comment 1

DataLad did request something like this, but IIRC I shot the idea down with a good reason. I can't remember why or where we discussed it; @yoh might.

Comment by joey — Tue Mar 19 17:28:45 2019

found it

see datalad/issues/139. Quoting a part of it:

But I'd like to investigate adding --batch to individual commands first, since this seems more git-like, and also simpler. It would probably be helpful to talk about the specific commands you need to call a lot.

Things like git annex lookupkey --batch, git-annex readpresentkey --batch etc should be able to be spun up and run as long-duration servers, which you could query as needed, not batched up all at once. This is how git-annex uses git cat-file --batch etc.
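To make the "long-duration server" idea concrete, here is a minimal Python sketch of a wrapper that keeps one line-oriented helper process open and queries it as needed. The class name `BatchProcess` and its methods are hypothetical, and the sketch assumes the helper answers every input line with exactly one output line, which has to be checked per command.

```python
import subprocess


class BatchProcess:
    """Keep one long-running, line-oriented helper open and query it on demand.

    Hypothetical wrapper: it assumes the helper replies to each input line
    with exactly one output line.
    """

    def __init__(self, argv):
        # Text-mode, line-buffered pipes so each request is sent promptly.
        self.proc = subprocess.Popen(
            argv,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,
        )

    def ask(self, line):
        """Send one request line and return the single reply line."""
        self.proc.stdin.write(line + "\n")
        self.proc.stdin.flush()
        reply = self.proc.stdout.readline()
        if reply == "":
            # EOF on stdout means the helper went away.
            raise RuntimeError("batch helper exited unexpectedly")
        return reply.rstrip("\n")

    def close(self):
        """Close stdin so the helper sees EOF, then wait for it to exit."""
        self.proc.stdin.close()
        returncode = self.proc.wait()
        self.proc.stdout.close()
        return returncode
```

With git-annex installed and run inside a repository, something like `BatchProcess(["git", "annex", "lookupkey", "--batch"]).ask("path/to/file")` should return the key for that file, or an empty string if the file has no key. The staleness and buffering caveats discussed in this thread apply to any such long-lived process.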

There's some potential for such a long-running command to either buffer stale data so it doesn't answer with the current state of the repository, or for it to buffer changes and not commit them to disk immediately. For example, a git annex add --batch would have the latter problem.

That is actually an argument for only adding --batch mode to specific commands though, since that would be an opportunity to check their behavior. A single git-annex shell interface would expose any such problems in all commands.

Comment by yarikoptic — Fri Sep 20 18:56:52 2019

comment 3

Thanks for digging that up.

Hmm, if the goal was to check each command for such problems when adding --batch, it didn't stop git-annex add --batch from being added, despite it indeed having such buffering behavior. You can currently shoot yourself in the foot by combining that with git annex readpresentkey --batch, the same way as you could with a hypothetical universal batch mode that let you run add followed by readpresentkey.

I don't see a universal batch mode being really able to detect and avoid such problems either. How is it supposed to know that an add of "dir/" will, among other things, add the content of key FOO, which was not present before, and so a readpresentkey FOO should be delayed until after the add, and the add's buffer flushed? It would have to model the behavior of commands and insert barriers/flush points, and the modeling necessarily could not be that fine-grained, so it would need to flush the add buffer every time before readpresentkey. But there are surely ways to combine batch use of add and readpresentkey that you know won't be affected by the add buffering, and that would make those unnecessarily slow.
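A rough Python sketch of what such coarse-grained modeling would amount to. The two command sets and the `insert_barriers` function are hypothetical, and `flush` is only a marker here, not an existing git-annex command.

```python
# Hypothetical classification; real commands would each need auditing.
BUFFERING = {"add"}                            # may hold changes until flushed
READS_STATE = {"readpresentkey", "lookupkey"}  # answer from repository state


def insert_barriers(commands):
    """Insert a "flush" before any state-reading command that follows a
    buffering command, without looking at which keys or paths are involved."""
    planned = []
    dirty = False  # True while a buffering command may have unflushed changes
    for line in commands:
        if not line.strip():
            continue  # ignore blank lines
        name = line.split(None, 1)[0]
        if name in READS_STATE and dirty:
            planned.append("flush")
            dirty = False
        planned.append(line)
        if name in BUFFERING:
            dirty = True
    return planned
```

Given `["add dir/", "readpresentkey FOO"]` this inserts a flush between the two lines whether or not dir/ contains the content of FOO. That is exactly the unnecessary slowdown described above, since the planner cannot see which cases are safe.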

Anyway, looking at the implementation of --batch for different commands, sometimes it's trivial enough to wish it were generalized, but other times there is batch-specific behavior. add --batch errors out if --update is also used. checkpresentkey --batch outputs status codes rather than the command's normal behavior of exiting 1/0. So we need batch-specific implementations.

idea: What might be good is a mode that lets any batch-capable commands be combined together, not trying to support every possible command, and perhaps with some added commands that the user can use to flush buffers etc between operations as desired. Eg:

git annex batch <<EOF
add dir/
add whatever
flush
readpresentkey FOO
EOF
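As a sketch of how a front end for such a mode might read its input (the script format is only the idea above, and `group_batch_script` is a hypothetical name), consecutive lines for the same command can be grouped so each group could be fed to one batch-capable command, with `flush` kept as an explicit step:

```python
def group_batch_script(text):
    """Group consecutive lines that use the same command.

    Returns a list of (command, [arguments]) steps. A "flush" line always
    becomes its own step with no arguments and ends the current group.
    """
    steps = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        name, _, arg = line.partition(" ")
        if name != "flush" and steps and steps[-1][0] == name:
            # Same command as the previous step: extend that group.
            steps[-1][1].append(arg.strip())
        else:
            steps.append((name, [] if name == "flush" else [arg.strip()]))
    return steps
```

For the example script this yields an add step with two paths, then the flush, then a readpresentkey step. Nothing is reordered, so the user stays in control of where the buffers get flushed.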

Comment by joey — Mon Jan 6 18:53:02 2020
