Enable git-annex to provision file content by other means than download
This idea goes back many years and has been iterated on repeatedly since, most recently at Distribits 2024. The following is a summary of what role git-annex could play in this functionality.
The basic idea is to wrap a provision-by-compute process into the standard interface of a git-annex remote. A consumer would (theoretically) not need to worry about how an annex key is provided, they would simply git-annex-get it, whether this leads to a download or a computation. Moreover, access cost and redundancies could be managed/communicated using established patterns.
Use cases
Here are a few concrete use cases that illustrate why one would want functionality like this.
Generate annex keys (that have never existed)
This can be useful for leaving instructions for how, e.g., other data formats can be generated from a single format that is kept in storage. For example, a collection of CSV files is stored, but an XLSX variant can be generated upon request automatically. Or a single large live-stream video is stored, and a collection of shorter clips is generated from a cue sheet or cut list.
Re-generate annex keys
This can be useful when storing a key is expensive, but its exact identity is known/important. For example, the outcome of a scientific computation yields a large output that is expensive to compute and to store, yet needs to be tracked for repeated further processing -- the cost of recomputation may be reduced by storing (smaller) intermediate results, and leaving instructions for how to perform (a different) computation that yields the identical original output.
This second scenario, where annex keys are reproduced exactly, can be considered the general case. It generally requires exact inputs to the computation, whereas the first scenario can/should handle the application of a compute instruction to any compatible input data.
What is in scope for git-annex?
The execution of arbitrary code without any inherent trust is a problem -- a problem that git-annex may not want to get into. Moreover, there are many candidate environments for code execution -- a complexity that git-annex may not want to get into either.
External remote protocol sufficient?
From my point of view, pretty much all required functionality could be hidden behind the external remote protocol, and thereby inside one or more special remote implementations.
`STORE`
: somehow capture the computing instructions, likely linking some code to some (key-specific) parameters, like input files

`CHECKPRESENT`
: do compute instructions for a key exist?

`RETRIEVE`
: compute the key

`REMOVE`
: remove the instructions/parameter record

`WHEREIS`
: give information on computation/inputs

where `SETSTATE`/`GETSTATE` may implement the instruction deposit/retrieval.
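To make this concrete, a minimal external special remote along these lines could look roughly like the following sketch (error handling, `STORE`, and progress reporting are omitted; `lookup-instructions` and `run-instructions` stand in for whatever instruction store a real implementation would use):

```sh
#!/bin/sh
# Sketch of a compute special remote speaking the external special remote
# protocol on stdin/stdout. Whitespace in filenames is not handled; any
# request not covered below is answered with UNSUPPORTED-REQUEST.
echo VERSION 1
while read -r req a b c; do
	case "$req" in
		INITREMOTE) echo INITREMOTE-SUCCESS ;;
		PREPARE)    echo PREPARE-SUCCESS ;;
		CHECKPRESENT)
			# "present" means: compute instructions for key $a exist
			if lookup-instructions "$a" >/dev/null
			then echo CHECKPRESENT-SUCCESS "$a"
			else echo CHECKPRESENT-FAILURE "$a"
			fi ;;
		TRANSFER)
			# TRANSFER RETRIEVE <key> <file>: compute key $b into file $c
			if [ "$a" = RETRIEVE ] && run-instructions "$b" "$c"
			then echo TRANSFER-SUCCESS RETRIEVE "$b"
			else echo TRANSFER-FAILURE "$a" "$b" cannot-compute
			fi ;;
		REMOVE)  echo REMOVE-SUCCESS "$a" ;;
		WHEREIS) echo WHEREIS-SUCCESS "computed from recorded instructions" ;;
		*)       echo UNSUPPORTED-REQUEST ;;
	esac
done
```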
Worktree provisioning?
Such an external remote implementation would need a way to create suitable worktrees to (re)run a given piece of code. Git-annex support for providing a (separate) worktree for a repository at a specific commit, with efficient (re)use of the main repository's annex, would simplify such implementations.
Request one key, receive many
It is possible that a single computation yields multiple annex keys, even when git-annex only asked for a single one (which is what it would do, sequentially, when driving a special remote). It would be good to be able to capture that and avoid needless duplication of computations.
Instruction deposition
Using `STORE` (`git annex copy --to`) to record instructions is possible (say, particular environment variables are used that pass information to a special remote), but is more or less a hack. It would be useful to have a dedicated command to accept and deposit such a record in association with one or more annex keys (which may or may not be known at that time). This likely requires settling on a format for such records.
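To illustrate why this feels like a hack, an invocation might look like this (the environment variable, remote name, and command are invented for the example):

```sh
# hypothetical: smuggle the compute instruction to the special remote through
# the environment while "copying" the key to it
ANNEX_COMPUTE_INSTRUCTION='cdo -f nc copy data.grib data.nc' \
	git annex copy data.nc --to compute
```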
Storage redundancy tests
I believe that no particular handling is needed for annex keys that are declared inputs to computing instructions for other keys. Still listing it here to state that, and possibly be proven wrong.
Trust
We would need a way for users to indicate that they trust a particular compute instruction or the entity that provided it. Even if git-annex does not implement tooling for that, it would be good to settle on a concept that can be interpreted/implemented by such special remotes.
I just want to mention that I've implemented/tried to implement something like this in https://github.com/matrss/datalad-getexec. It basically just records a command line invocation to execute and all required input files as base64-encoded json in a URL with a custom scheme, which made it surprisingly simple to implement. I haven't touched it in a while and it was more of an experiment, but other than issues with dependencies on files in sub-datasets it worked pretty well. The main motivation to build it was the mentioned use-case of automatically converting between file formats. Of course it doesn't address all of your mentioned points. E.g. trust is something I haven't considered in my experiments, at all. But it shows that the current special remote protocol is sufficient for a basic implementation of this.
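For a rough idea of the mechanics, the recorded URL might be constructed like this (scheme name and JSON layout are illustrative, not necessarily exactly what datalad-getexec uses):

```sh
# encode the command and its declared inputs as base64 JSON inside a
# custom-scheme URL, and attach that URL to the output file; a special remote
# claiming the scheme can later decode and execute it to provide the content
spec='{"cmd": ["cdo", "-f", "nc", "copy", "data.grib", "data.nc"], "inputs": ["data.grib"]}'
git annex addurl --relaxed --file data.nc "getexec:v1-$(printf '%s' "$spec" | base64 -w0)"
```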
I like the proposed "request one key, receive many" extension to the special remote protocol and I think that could be useful in other "unusual" special remotes as well.
I don't quite understand the necessity for "Worktree provisioning". If I understand that right, I think it would just make things more complicated and unintuitive compared to always staying in HEAD.
"Instruction deposition" is essentially just adding a URL to a key in my implementation, which is pretty nice. Using the built-in relaxed option automatically gives the distinction between generating keys that have never existed and regenerating keys.
Thanks for the pointer, very useful!
Regarding the points you raised:
Datalad's `run` feature has been around for some years, and we have seen usage in the wild with command lines that are small programs and dozens, sometimes hundreds of inputs. It is true that anything could simply be URL-encoded. However, especially with command patterns (always the same, except for a parameter change) that may be needlessly heavy. Maybe it would compress well (likely), but it still poses a maintenance issue. Say the compute instructions need an update (software API change): updating one shared instruction set is a simpler task than sifting through annex keys and rewriting URLs.

We need a worktree different from `HEAD` whenever `HEAD` has changed from the original worktree used for setting up a compute instruction. Say a command needs two input files, but one has been moved to a different directory in the current `HEAD`. An implementation would now either say "no longer available" and force a maintenance update, or be able to provision the respective worktree. In the case of no provisioning capability we would need to replace the URL-encoded instructions (which would make the key uncomputable in earlier versions), or amend with an additional instruction set (and now we would start to accumulate cruft, where changes in the git-annex branch need to account for (unrelated) changes in any other branch).

An interesting benefit of using URL keys for this is the recently added VURL keys in today's release, which work just like URL keys, except a checksum gets calculated when the content is downloaded from the URL. This allows `git-annex fsck` to verify the checksums, as well as letting the checksum be verified when transferring the content between repositories. (See the `git-annex addurl --verifiable` documentation.)

And a nice thing about using URL or VURL keys for this is that it allows for both fully reproducible computations and computations that generate equivalent but not identical files. The latter corresponds to `git-annex addurl --relaxed`.

If you use a VURL key and give it a size, then the checksum is calculated on first download from your compute special remote, and subsequent downloads are required to have the same checksum. Without a size, it's relaxed and anything your compute special remote generates is treated as effectively the same key, so there can be several checksums that git-annex knows about, attached to the same VURL key.
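In command terms, the two behaviors could be requested along these lines (the compute: URL is made up; the flags themselves exist):

```sh
# VURL key: a checksum is recorded on first retrieval and verified afterwards
git annex addurl --verifiable --file data.nc "compute:data.grib-to-netcdf"
# relaxed URL key: no checksum is pinned, the generated content may vary
git annex addurl --relaxed --file data.nc "compute:data.grib-to-netcdf"
```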
About worktree provisioning, couldn't you record the sha1 of the tree containing the files needed to generate the object, and then use `git worktree` to make a temporary checkout of that tree? You could `git-annex get` whatever files are necessary within the temp worktree, which could result in recursive computations to get dependencies. I would be careful to avoid dependency cycles though..
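A minimal sketch of that flow, assuming the recorded state is available as a commit (git worktree needs a commit-ish) and that git-annex is usable in the linked worktree:

```sh
# check out the recorded state into a temporary worktree and get the input there
git worktree add --detach /tmp/annex-compute 1234abcd    # recorded sha1 (placeholder)
git -C /tmp/annex-compute annex get data.grib
# run the computation, writing the result outside the temporary worktree
cdo -f nc copy /tmp/annex-compute/data.grib "$output"    # $output: where the key content should land
git worktree remove /tmp/annex-compute
```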
On trust, it seems to me that if someone chooses to install a particular special remote, they are choosing to trust whatever kind of computations it supports.
Eg a special remote could choose to always run a computation inside a particular container system and then if you trust that container system is secure, you can choose to install it.
Enabling the special remote is not necessary, because a repository can be set to autoenable a special remote. In some sense this is surprising. I had originally talked about enabling here and then I remembered autoenable.
It may be that autoenable should only be allowed for special remote programs that the user explicitly whitelists, not only installs into PATH. That would break some existing workflows, though setting some git configs would not be too hard.
There seems scope for both compute special remotes that execute code that comes from the git repository, and ones that only have metadata about the computation recorded in the git repository, in a way that cannot let them execute arbitrary code under the control of the git repository.
A well-behaved compute special remote that does run code that comes from a git repository could require an additional git config to be set to allow it to do that.
About request one, receive many, it would be possible for a special remote to run eg `git-annex reinject --guesskeys` to move additional generated object files into .git/annex/objects/. (Doesn't datalad do something like that when it downloads and unpacks a tarball that contains several annexed files besides the one it was running the download to get? Or perhaps it only stores the tarball in the annex and unpacks it several times?)
(I forgot to tick "email replies to me", sorry for the late reply)
My reasoning for suggesting to always stay in HEAD is this: Let's assume we have a file "data.grib" that we want to convert into "data.nc" using this compute special remote. We use its facilities to make it do exactly that. Now, if there was a bug in "data.grib" that necessitates an update, we would replace the file. The special remote could do two things then: fail loudly, because the recorded input no longer matches what is in HEAD, or silently keep producing the old "data.nc" from the old "data.grib".

I think the first error is preferable over the second, because the second one is much more subtle and easy to miss.
This same reasoning extends to software as well, if it is somehow tracked in git: for the above mentioned conversion one could use "cdo" (climate data operators). One could pin a specific version of "cdo" with nix and its flake.lock file, meaning that there is an exact version of cdo associated with every commit sha of the git-annex/DataLad repository. If I update that lock file to get a new version of cdo, then as a user I would naively assume that re-converting "data.grib" to "data.nc" would now use this new version of cdo. With worktree provisioning it would silently use the old one instead.
IMO worktree provisioning would create an explosion of potential inputs to consider for the computation (the entire git history so far), which would create a lot of subtle pitfalls. Always using stuff from HEAD would be an easier implementation, easier to reason about, and make the user explicitly responsible for keeping the repository contents consistent.
Circling back to this, I think the fork in the road is whether this is about git-annex providing this and that feature to support external special remotes that compute, or whether git-annex gets a compute special remote of its own with some simpler/better extension interface than the external special remote protocol.
Of course, git-annex having its own compute special remote would not preclude other external special remotes that compute. And for that matter, a single external special remote could implement an extension interface.
Thinking about how a generic compute special remote in git-annex could work, multiple instances of it could be initremoted:
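For example, the initremote step might look something like this (the type and program parameters are part of the proposal, not an existing interface):

```sh
# one instance per kind of computation; "program" names the compute program to run
git annex initremote ffmpeg-cut type=compute encryption=none program=git-annex-compute-ffmpeg-cut
```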
Here the "program" parameter would cause a program like
git-annex-compute-ffmpeg-cut
to be run to get files from that instance of the compute special remote. The interface could be as simple as it being run with the key that it is requested to compute, and outputting the paths to the all keys it was able to compute. (So allowing for "request one key, receive many".) Perhaps also with some way to indicate progess of the computation.It would make sense to store the details of computations in git-annex metadata. And a compute program can use git-annex commands to get files it depends on. Eg,
git-annex-compute-ffmpeg-cut
could run:It might be worth formalizing that a given computed key can depend on other keys, and have git-annex always get/compute those keys first. And provide them to the program in a worktree?
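For instance, the compute program might boil down to something like this (file names and cut points are made up):

```sh
# fetch the input this computation depends on, then produce the requested clip
git annex get input.mov
ffmpeg -i input.mov -ss 00:15:00 -to 00:30:00 -c copy clip.mov
```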
When asked to store a key in the compute special remote, it would verify that the key can be generated by it. Using the same interface as used to get a key.
This all leaves a chicken and egg problem, how does the user add a computed file if they don't know the key yet?
The user could manually run the commands that generate the computed file, then `git-annex add` it, and set the metadata. Then `git-annex copy --to` the compute remote would verify if the file can be generated, and add it if so. This seems awkward, but also nice to be able to do manually.

Or, something like VURL keys could be used, with an interface something like this:
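A purely hypothetical sketch of such an invocation (the command name and options are made up, just to make the idea concrete):

```sh
# generate a VURL-like key for clip.mov, record the computation's metadata,
# and verify it by "storing" the key in the compute special remote
git annex addcomputed clip.mov --to ffmpeg-cut \
	--metadata source=input.mov \
	--metadata starttime=00:15:00 --metadata endtime=00:30:00
```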
All that would do is generate some arbitrary VURL key or similar, provisionally set the provided metadata (how?), and try to store the key in the compute special remote. If it succeeds, stage an annex pointer and commit the metadata. Since it's a VURL key, storing the key in the compute special remote would also record the hash of the generated file at that point.
Using metadata to store the inputs of computations, like I did in my example above, seems like it would allow the metadata to be changed later, which would change the output when a key gets recomputed. That feels surprising, because metadata could be changed for any reason, without the intention of affecting a compute special remote.
It might be possible for git-annex to pin down the current state of metadata (or the whole git-annex branch) and provide the same input to the computation when it's run again. (Unless `git-annex forget` has caused that old branch state to be lost..) But it can't fully isolate the program from all unpinned inputs without using some form of containerization, which feels out of scope for git-annex.

Instead of using metadata, the input values could be stored in the per-special-remote state of the generated key. Or the input values could be encoded in the key itself, but then two computations that generate the same output would have two different keys, rather than hashing to the same key.
Using a key with a regular hash backend also lets the user find out if the computation turns out to not be reproducible later for whatever reason; getting the file from the compute special remote will fail at hash verification time. Something like a VURL key could still alternatively be used in cases where reproducibility is not important.
To add a computed file, the interface would look close to the same, but now the --value options are setting fields in the compute special remote's state:
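Again a hypothetical sketch, mirroring the one above but with the values landing in the per-remote state of the generated key:

```sh
# the compute program might see these values as environment variables
git annex addcomputed clip.mov --to ffmpeg-cut \
	--input source=input.mov \
	--value starttime=00:15:00 --value endtime=00:30:00
```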
The values could be provided to the "git-annex-compute-" program with environment variables.
For `--input source=foo`, it could look up the git-annex key (or git sha1) of that file and store that in the state. So it would provide the compute program with the same data every time. But it could also store the filename. And that allows for a command like the one sketched below, which, when the input.mov file has been changed, would re-run the computation with the new content of the file and stage a new version of the computed file. It could even be used to recompute every file in a tree.
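The command name and options here are hypothetical, just to make the workflow concrete (it could equally be a re-run of the addcomputed command above):

```sh
# re-run the computation for one file whose recorded input has changed,
# staging the newly computed version
git annex recompute clip.mov
# or recompute every computed file under a tree
git annex recompute .
```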
Also, that command could let input values be adjusted later:
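For example (still hypothetical):

```sh
# re-run the computation with an adjusted input value, staging the new result
git annex recompute clip.mov --value starttime=00:10:00
```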
It would also be good to have a command that examines a computed key and displays the values and inputs. That could be `git-annex whereis` or perhaps a dedicated command with more structured output.

This all feels like it might allow for some useful workflows...
@m.risse in your example, the "data.nc" file gets new content when it is retrieved from the special remote after the source file has changed.

But if you already have the data.nc file present in a repository, it does not get updated immediately when you update the source "data.grib" file.
So, a drop and re-get of a file changes the version of the file you have available. For that matter, if the old version has been stored on other remotes, a get may retrieve either an old or a new version. That is not intuitive and it makes me wonder if using a special remote is really a good fit for what you're wanting to do.
In your "cdo" example, it's not clear to me if the new version of the software generates an identical file to the old, or if it has a bug fix that causes it to generate a significantly different output. If the two outputs are significantly different then treating them as the same git-annex key seems questionable to me.
My design so far does not fully support "Request one key, receive many".
My `git-annex addcomputed` command doesn't handle the case where a computation generates multiple output files. While the `git-annex-compute-` command's interface could let it return several computed files, addcomputed would only add one file, with the name that the user specifies. What is it supposed to do if the computation generates more than one? Maybe it needs a way to let a whole directory be populated with the files generated by a computation. Or a way to specify multiple files to add.

And here's another problem: Suppose I have one very expensive computation that generates files foo and bar. And a second, less expensive computation, that also generates foo (same content) as well as generating baz. Both computations are run on the same compute special remote. Now if the user runs `git-annex get foo`, they will be unhappy if it chooses to run the expensive computation, rather than the less expensive computation.

Since the per-special-remote state for a key is used as the computation input, only one input can be saved for foo's key. So it wouldn't really be picking between two alternatives, it would just use whatever the current state for that key is.
True, that can happen, and the user was explicit in that they either don't care about it (non-checksum backend, URL in my PoC), or do care (checksum backend) and git-annex would fail the checksum verification.
This I haven't entirely thought through. I'd say if the key uses a non-checksum backend, then it can only be assumed, and is the user's responsibility, that the resulting file is functionally, even if not bit-by-bit, identical. E.g. with netCDF, checksums can differ due to small details like chunking, but the data might be the same. With a checksum backend git-annex would just fail the next recompute, but the interactions with copies on other remotes could indeed get confusing.
Again, two possible cases depending on if the key uses a checksum or a non-checksum backend. With a checksum: if the new version produces the same output everything is fine; if the new version produces different output then git-annex would indicate this discrepancy on the next recompute and the user has to decide how to handle it (probably by checking that the output of the new version is either functionally the same or in some way "better" than the old one and updating the repository to record this new key as that file).
Without a checksum backend the user would again have been explicit in that they don't care if the data changes for whatever reason, the key is essentially just a placeholder for the computation without a guarantee about its content.
Something like VURL would be a compromise between the two: it would avoid the upfront cost of computing all files (which might be very expensive), but still instruct git-annex to error out if the checksum changes at some point after the first compute. A regular migration of the computed-files-so-far to a checksum backend could achieve the same.
Some thoughts regarding your ideas:
`git annex get` doesn't work recursively across submodules/subdatasets though, and `datalad get` does not understand keys, just paths (at least so far).