Recent comments posted to this site:
@joey:
I do hope I'm not closing off the design space from such differences by dropping a compute special remote right into git-annex. But I also expect that having a standard and easy way for at least simple computations will lead to a lot of contributions as others use it.
I think it's excellent to have something like this in git-annex. I didn't have the opportunity to try it out yet, but I am definitely looking forward to seeing how things can work in practice and comparing the implementations.
And there could be some generic "helper" (or a number of them) which would then provide desired CLI interfacing over arbitrary command
Absolutely!
You do need to use "--" before your own custom dashed options.
And bear in mind that "field=value" parameters passed to initremote will be passed on to the program. So you can have a generic helper that is instantiated with a parameter like --command=, which then gets used automatically when running addcomputed:
git-annex initremote foo type=compute program=git-annex-compute-generic-helper -- --command='convert {inputs} {outputs}'
git-annex addcomputed --to=foo -- -i foo.jpeg -o foo.gif
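A generic helper along these lines might expand the command template roughly as follows. This is only a sketch: `expand_command` is a hypothetical name, and treating `{inputs}` and `{outputs}` as whole-word placeholders is an assumption made for illustration, not something git-annex defines.

```python
import shlex

def expand_command(template, inputs, outputs):
    """Expand a template like 'convert {inputs} {outputs}' into an argv list.

    Hypothetical helper: each placeholder word is replaced by the
    corresponding list of file names; all other words pass through.
    """
    argv = []
    for word in shlex.split(template):
        if word == "{inputs}":
            argv.extend(inputs)
        elif word == "{outputs}":
            argv.extend(outputs)
        else:
            argv.append(word)
    return argv

print(expand_command("convert {inputs} {outputs}", ["foo.jpeg"], ["foo.gif"]))
# prints ['convert', 'foo.jpeg', 'foo.gif']
```

Building an argv list rather than a shell string avoids quoting problems when file names contain spaces.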
it was more flexible to have a more freeform command line, which the compute program parses
Agree. And there could be some generic "helper" (or a number of them) which would then provide the desired CLI interfacing over an arbitrary command, something like (mimicking the datalad-run interface here):
git-annex addcomputed --to=runcmd -i foo.jpeg -o foo.gif
as long as we can pass options like that, or after --, e.g.
git-annex addcomputed --to=runcmd -- -i foo.jpeg -o foo.gif -- convert {inputs} {outputs}
which would then
- ensure no stdout from convert
- follow the compute special remote interface to let git-annex know what inputs/outputs were
git-annex does know what both the input and the output files are. It learns this by running the compute program and seeing what INPUT and OUTPUT lines it emits.
I considered having some --input= option, but decided that it was more flexible to have a more freeform command line, which the compute program parses.
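To make the freeform approach concrete, here is a minimal sketch of how a compute program might parse its own -i/-o options and turn them into the INPUT and OUTPUT lines mentioned above. The function names and the -i/-o convention (borrowed from the example earlier in this thread) are assumptions for illustration; the compute special remote documentation is the authority on the real interface.

```python
def parse_freeform(argv):
    """Split a freeform argument list into input and output file names.

    Assumes a hypothetical convention: '-i NAME' marks an input and
    '-o NAME' marks an output.
    """
    inputs, outputs = [], []
    args = iter(argv)
    for arg in args:
        if arg not in ("-i", "-o"):
            raise ValueError(f"unexpected argument: {arg}")
        name = next(args, None)
        if name is None:
            raise ValueError(f"{arg} needs a file name")
        (inputs if arg == "-i" else outputs).append(name)
    return inputs, outputs

def declaration_lines(inputs, outputs):
    """Build the INPUT/OUTPUT lines the program would emit for these files."""
    return [f"INPUT {n}" for n in inputs] + [f"OUTPUT {n}" for n in outputs]

ins, outs = parse_freeform(["-i", "foo.jpeg", "-o", "foo.gif"])
for line in declaration_lines(ins, outs):
    print(line)
```

Because the program owns the parsing, a different program is free to use an entirely different option syntax.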

I've merged the compute special remote now. See compute, git-annex-addcomputed and git-annex-recompute.
I have opened compute special remote remaining todos with various ways that I want to improve it further, notably computing on inputs from submodules, which is not currently supported at all.
Here I'll go down mih's original and quite useful design criteria and see how the compute special remote applies to them:
Generate annex keys (that have never existed)
git-annex addcomputed --fast
Re-generate annex keys
git-annex addcomputed, optionally with the --reproducible option, followed by a later git-annex get
Another thing that fits under this heading is when one of the original
input files has gotten modified, and you want to compute a new version of
the output file from it, using the same method as was used to compute it
before. That's git-annex recompute $output_file
Worktree provisioning?
This is the main thing I didn't implement. Given that git-annex is working with large files and needs to support various filesystems and OS's that lack hardlinks and softlinks, it's hard to do this inexpensively.
Also, it turned out to make sense for the compute program to request the input files it needs, since this lets git-annex learn what the input files are, so it can make them available when regenerating a computed file later. And so the protocol just has git-annex respond with the path to the content of the file.
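The request/response step described here can be sketched as below. This assumes a line-based exchange in which the program writes an INPUT line and reads back one line holding the path; `request_input` is a hypothetical name, the path in the simulation is a made-up placeholder, and the exact framing should be checked against the compute special remote documentation.

```python
import io

def request_input(name, to_annex, from_annex):
    """Ask for one input file and return the path that comes back.

    Assumption: the program writes 'INPUT <name>' and then reads a single
    line containing the path to the file's content.
    """
    to_annex.write(f"INPUT {name}\n")
    to_annex.flush()
    reply = from_annex.readline().rstrip("\n")
    return reply or None  # an empty reply means no content was provided

# Simulate both ends of the exchange with in-memory streams.
sent = io.StringIO()
replies = io.StringIO("/path/to/content/of/foo.jpeg\n")
path = request_input("foo.jpeg", sent, replies)
print(sent.getvalue().strip(), "->", path)
```

In a real program the two streams would be `sys.stdout` and `sys.stdin`. Since every input is requested by name, the list of inputs is known without any worktree provisioning.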
Request one key, receive many
This is supported. (So is using multiple inputs to produce one (or more) outputs.)
Instruction deposition
git-annex addcomputed
Storage redundancy tests
It did make sense to have it automatically git-annex get the inputs. Well, I think it makes sense in most cases, this may become a tunable setting of the compute special remote.
Trust
Handled by requiring the user install a git-annex-compute-foo command in PATH, and provide the name of the command to initremote. And for later enableremote or autoenable=true, it will only allow programs that are listed in the annex.security.allowed-compute-programs git config.
Thanks for explaining the design points of datalad-remake. Some different design choices than I have made, but mostly they strike me as implementing what is easier/possible from outside git-annex.
Eg, storing the compute inputs under .datalad in the branch is fine -- and might even be useful if you want to make a branch that changes something in there -- but of course in the git-annex implementation it stores the equivalent thing in the git-annex branch.
I do hope I'm not closing off the design space from such differences by dropping a compute special remote right into git-annex. But I also expect that having a standard and easy way for at least simple computations will lead to a lot of contributions as others use it.
Your fMRI case seems like one that my compute remote could handle well and easily.
I think it could make sense, when --incremental/--more are not passed, to initialize a new fsck database if there is not already one, and add each fscked key to the fsck database.
That way, the user could run any combination of fscks, interrupted or not, and then use --more to fsck only new files. When the user wants to start a new fsck pass, they would use --incremental.
It would need to avoid recording an incremental fsck pass start time, to avoid interfering with --incremental-schedule.
The only problem I see with this is, someone might have a long-term incremental fsck they're running that is doing full checksumming. If they then do a quick fsck --fast for other reasons, it would record that every key has been fscked, and so lose their place. So it seems --fast should disable this new behavior. (Also incremental --fast fsck is not likely to be very useful anyway.)
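The proposed rules can be written down as a small decision table. This is an illustrative model only, not git-annex code: the function names are hypothetical, and the handling of --incremental and --more reflects an assumption that their current behavior stays unchanged.

```python
def records_fscked_keys(incremental=False, more=False, fast=False):
    """Would this fsck run add the keys it checks to the fsck database?"""
    if incremental or more:
        return True   # assumed: existing incremental behavior, unchanged
    return not fast   # proposed: a plain fsck records keys, unless --fast

def records_pass_start_time(incremental=False, more=False, fast=False):
    """Would this run record the start time of a new incremental pass?"""
    # Only an explicit --incremental starts a pass, so a plain fsck
    # cannot interfere with --incremental-schedule.
    return incremental

for flags in ({}, {"fast": True}, {"more": True}, {"incremental": True}):
    print(flags, records_fscked_keys(**flags), records_pass_start_time(**flags))
```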
That's a hard judgement call for a program to make... someone might think 10 minutes is really old, and someone else that a month is.
As to figuring out whether a fsck was interrupted before, surely what matters is you remembering that? All git-annex has is a timestamp when the last fsck pass started, which is available in .git/annex/fsck/*/state, and a list of the keys that were fscked, which is not very useful as far as determining the progress of that fsck.