Recent comments posted to this site:

I think it could make sense, when --incremental/--more are not passed, to initialize a new fsck database if there is not already one, and add each fscked key to the fsck database.

That way, the user could run any combination of fscks, interrupted or not, and then use --more to fsck only new files. When the user wants to start a new fsck pass, they would use --incremental.

It would need to avoid recording an incremental fsck pass start time, to avoid interfering with --incremental-schedule.

The only problem I see with this is, someone might have a long-term incremental fsck they're running that is doing full checksumming. If they then do a quick fsck --fast for other reasons, it would record that every key has been fscked, and so lose their place. So it seems --fast should disable this new behavior. (Also incremental --fast fsck is not likely to be very useful anyway.)
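To make the proposal above concrete, here is a rough Python sketch of what each flag combination would record. It is only one reading of the idea, not git-annex code: `FsckPlan` and `plan_fsck` are hypothetical names, and the `--incremental` and `--more` branches assume those flags keep meaning "start a new pass" and "continue the current pass".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FsckPlan:
    record_keys: bool        # add each fscked key to the fsck database
    record_start_time: bool  # note the start time of an incremental pass


def plan_fsck(incremental: bool = False, more: bool = False,
              fast: bool = False) -> FsckPlan:
    """Hypothetical model of the proposed recording behavior."""
    if incremental:
        # Assumed: a new pass records its start time and every key.
        return FsckPlan(record_keys=True, record_start_time=True)
    if more:
        # Assumed: continuing a pass records keys, keeps the old start time.
        return FsckPlan(record_keys=True, record_start_time=False)
    if fast:
        # A quick --fast fsck must not lose the place of a long-running
        # full-checksum pass, so it records nothing.
        return FsckPlan(record_keys=False, record_start_time=False)
    # Plain fsck: record keys (creating the database if needed), but leave
    # the start time alone so --incremental-schedule is not disturbed.
    return FsckPlan(record_keys=True, record_start_time=False)
```

With this model, a plain fsck followed by `--more` only checks new files, while a stray `fsck --fast` leaves the database untouched.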

I actually don't see much reason to not make use of an incremental fsck either unless it's really old

That's a hard judgement call for a program to make... someone might think 10 minutes is really old, and someone else that a month is.

As to figuring out whether a fsck was interrupted before, surely what matters is you remembering that? All git-annex has is a timestamp when the last fsck pass started, which is available in .git/annex/fsck/*/state, and a list of the keys that were fscked, which is not very useful as far as determining the progress of that fsck.

Comment by joey Mon Mar 17 18:34:20 2025

@joey:

I do hope I'm not closing off the design space from such differences by dropping a compute special remote right into git-annex. But I also expect that having a standard and easy way for at least simple computations will lead to a lot of contributions as others use it.

I think it's excellent to have something like this in git-annex. I didn't have the opportunity to try it out yet, but I am definitely looking forward to seeing how things can work in practice and comparing the implementations.

Comment by msz Wed Mar 12 19:44:23 2025

And there could be some generic "helper" (or a number of them) which would then provide desired CLI interfacing over arbitrary command

Absolutely!

You do need to use "--" before your own custom dashed options.

And bear in mind that "field=value" parameters passed to initremote will be passed on to the program. So you can have a generic helper that is instantiated with a parameter like --command=, which then gets used automatically when running addcomputed:

git-annex initremote foo type=compute program=git-annex-compute-generic-helper -- --command='convert {inputs} {outputs}'
git-annex addcomputed --to=foo -- -i foo.jpeg -o foo.gif
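
As a rough illustration, the option handling inside such a helper could look like the Python sketch below. Everything here is hypothetical: `git-annex-compute-generic-helper` does not exist as far as this thread says, `parse_helper_args` and `expand_command` are made-up names, and the sketch assumes the helper ends up seeing `--command=`, `-i` and `-o` together in one argument list. How the options actually reach the program should be checked against the compute special remote interface.

```python
import argparse
import shlex


def parse_helper_args(argv):
    """Split the helper's options into (template, inputs, outputs)."""
    parser = argparse.ArgumentParser(prog="git-annex-compute-generic-helper")
    parser.add_argument("--command", required=True)
    parser.add_argument("-i", dest="inputs", action="append", default=[])
    parser.add_argument("-o", dest="outputs", action="append", default=[])
    opts = parser.parse_args(argv)
    return opts.command, opts.inputs, opts.outputs


def expand_command(template, input_paths, outputs):
    """Replace the {inputs} and {outputs} placeholders, token by token."""
    argv = []
    for token in shlex.split(template):
        if token == "{inputs}":
            argv.extend(input_paths)
        elif token == "{outputs}":
            argv.extend(outputs)
        else:
            argv.append(token)
    return argv
```

In a real helper, `input_paths` would be whatever paths git-annex hands back for the requested inputs, not the file names from the command line, and the expanded list would then be run with its stdout kept away from git-annex.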
Comment by joey Tue Mar 11 16:42:46 2025

it was more flexible to have a more freeform command line, which the compute program parses

Agree. And there could be some generic "helper" (or a number of them) which would then provide the desired CLI interface over an arbitrary command, something like (mimicking the datalad-run interface here):

git-annex addcomputed --to=runcmd -i foo.jpeg -o foo.gif 

as long as we can pass options like that or after --, e.g.

git-annex addcomputed --to=runcmd -- -i foo.jpeg -o foo.gif -- convert {inputs} {outputs}

which would then

- ensure no stdout from convert
- follow the compute special remote interface to let git-annex know what the inputs/outputs were

Comment by yarikoptic Tue Mar 11 15:15:15 2025

Thank you for the clarification -- I had missed that there is an "entire" compute special remote interface. Cool!

Comment by yarikoptic Tue Mar 11 15:09:20 2025

git-annex does know what both the input and the output files are. It learns this by running the compute program and seeing what INPUT and OUTPUT lines it emits.

I considered having some --input= option, but decided that it was more flexible to have a more freeform command line, which the compute program parses.
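Seen from the git-annex side, learning the files amounts to scanning what the compute program prints. A minimal Python sketch of that idea follows; `learn_files` is a hypothetical name, and the exact line format ("INPUT" or "OUTPUT", a space, then the file name) is an assumption that should be checked against the compute special remote interface.

```python
def learn_files(lines):
    """Collect the input and output file names a compute program announced."""
    inputs, outputs = [], []
    for line in lines:
        # Assumed format: a keyword, one space, then the file name.
        keyword, _, name = line.rstrip("\n").partition(" ")
        if keyword == "INPUT":
            inputs.append(name)
        elif keyword == "OUTPUT":
            outputs.append(name)
        # Any other line is ignored in this sketch.
    return inputs, outputs
```

This is why no --input= option is needed: whatever command line the program accepts, the files it announces are the ones that get recorded.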

Comment by joey Mon Mar 10 20:42:26 2025

Given the time that has passed, it might indeed not be as pressing an issue, but it would still be nice to have for ReproNim and beyond ;)

Comment by yarikoptic Sun Mar 9 01:02:54 2025

I don't see an option to specify which annexed files are input files, so that git-annex could get them before the computation that produces the output file happens. That's what we do in datalad run, and it is very handy since it means not having to worry about figuring out what to get first.

Comment by yarikoptic Sat Mar 8 14:51:20 2025

I've merged the compute special remote now. See compute, git-annex-addcomputed and git-annex-recompute.

I have opened compute special remote remaining todos, listing various ways that I want to improve it further. These notably include computing on inputs from submodules, which is not currently supported at all.


Here I'll go down mih's original and quite useful design criteria and see how the compute special remote applies to them:

Generate annex keys (that have never existed)

git-annex addcomputed --fast

Re-generate annex keys

git-annex addcomputed optionally with the --reproducible option, followed by a later git-annex get

Another thing that fits under this heading is when one of the original input files has gotten modified, and you want to compute a new version of the output file from it, using the same method as was used to compute it before. That's git-annex recompute $output_file

Worktree provisioning?

This is the main thing I didn't implement. Given that git-annex is working with large files and needs to support various filesystems and OS's that lack hardlinks and softlinks, it's hard to do this inexpensively.

Also, it turned out to make sense for the compute program to request the input files it needs, since this lets git-annex learn what the input files are, so it can make them available when regenerating a computed file later. And so the protocol just has git-annex respond with the path to the content of the file.
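From the compute program's side, that exchange could be sketched in Python as below. This is a hedged illustration, not the documented protocol: `request_input` is a hypothetical helper, the "INPUT name" line format is assumed, and the in-memory streams only stand in for git-annex in a dry run.

```python
import io
import sys


def request_input(name, to_annex=sys.stdout, from_annex=sys.stdin):
    """Ask for one input file and return the path that comes back."""
    to_annex.write(f"INPUT {name}\n")
    to_annex.flush()  # make sure the request is sent before waiting
    return from_annex.readline().rstrip("\n")


# Dry run with in-memory streams standing in for git-annex.
fake_stdout = io.StringIO()
fake_stdin = io.StringIO("/path/to/content\n")
path = request_input("foo.jpeg", fake_stdout, fake_stdin)
```

The program never has to know where the content lives ahead of time; it opens whatever path it is given.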

Request one key, receive many

This is supported. (So is using multiple inputs to produce one (or more) outputs.)

Instruction deposition

git-annex addcomputed

Storage redundancy tests

It did make sense to have it automatically git-annex get the inputs. Well, I think it makes sense in most cases; this may become a tunable setting of the compute special remote.

Trust

Handled by requiring the user install a git-annex-compute-foo command in PATH, and provide the name of the command to initremote.

And for later enableremote or autoenable=true, it will only allow programs that are listed in the annex.security.allowed-compute-programs git config.
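
The allow-list check can be pictured with the small Python sketch below. It is only a model of the rule described above: `may_enable` is a hypothetical name, the config value is assumed to be a space-separated list of program names, and the `git-annex-compute-` prefix test is a reading of the naming requirement rather than confirmed behavior.

```python
def may_enable(program, allowed_config):
    """Return True if `program` would pass the allow-list in this model."""
    # Assumed: only commands following the git-annex-compute-* naming count.
    if not program.startswith("git-annex-compute-"):
        return False
    # Assumed: the config value is a space-separated list of program names.
    return program in allowed_config.split()
```

Here `allowed_config` stands for the value of annex.security.allowed-compute-programs, so a remote cloned from elsewhere cannot make git-annex run a program the user has not listed.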

Comment by joey Thu Mar 6 17:54:50 2025

Thanks for explaining the design points of datalad-remake. Some different design choices than I have made, but mostly they strike me as implementing what is easier/possible from outside git-annex.

Eg, storing the compute inputs under .datalad in the branch is fine -- and might even be useful if you want to make a branch that changes something in there -- but of course in the git-annex implementation it stores the equivalent thing in the git-annex branch.

I do hope I'm not closing off the design space from such differences by dropping a compute special remote right into git-annex. But I also expect that having a standard and easy way for at least simple computations will lead to a lot of contributions as others use it.

Your fMRI case seems like one that my compute remote could handle well and easily.

Comment by joey Thu Mar 6 17:39:04 2025