Enable git-annex to provision file content by other means than download
This idea goes back many years and has been iterated on repeatedly since, most recently at Distribits 2024. The following is a summary of the role git-annex could play in this functionality.
The basic idea is to wrap a provision-by-compute process in the standard interface of a git-annex remote. A consumer would (theoretically) not need to worry about how an annex key is provided; they would simply `git annex get` it, whether this leads to a download or a computation. Moreover, access cost and redundancies could be managed/communicated using established patterns.
Use cases
Here are a few concrete use cases that illustrate why one would want functionality like this.
Generate annex keys (that have never existed)
This can be useful for leaving instructions for how other data formats can be generated from a single format that is kept on storage. For example, a collection of CSV files is stored, but an XLSX variant can be generated automatically upon request. Or a single large live-stream video is stored, and a collection of shorter clips is generated from a cue sheet or cut list.
Re-generate annex keys
This can be useful when storing a key is expensive, but its exact identity is known/important. For example, a scientific computation yields a large output that is expensive to compute and to store, yet needs to be tracked for repeated further processing -- the cost of a recomputation may be reduced by storing (smaller) intermediate results and leaving instructions for how to perform (a different) computation that yields the identical original output.
This second scenario, where annex keys are reproduced exactly, can be considered the general case. It generally requires exact inputs to the computation, whereas the first scenario can/should handle an application of a compute instruction to any compatible input data.
What is in scope for git-annex?
The execution of arbitrary code without any inherent trust is a problem -- one that git-annex may not want to get into. Moreover, there are many candidate environments for code execution -- a complexity that git-annex may not want to take on either.
External remote protocol sufficient?
From my point of view, pretty much all required functionality could be hidden behind the external remote protocol, and thereby inside one or more special remote implementations.
STORE
: somehow capture the computing instructions, likely linking some code to some (key-specific) parameters, like input files

CHECKPRESENT
: do compute instructions for a key exist?

RETRIEVE
: compute the key

REMOVE
: remove the instructions/parameter record

WHEREIS
: give information on computation/inputs

where SET/GETSTATE may implement the instruction deposit/retrieval, as sketched below.
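For concreteness, here is a minimal sketch, in Python, of how a compute special remote could map these protocol messages onto an instruction store. The `run_instructions` helper and the use of per-key state as the place where instructions live are assumptions for illustration, not an existing implementation:

```python
#!/usr/bin/env python3
# Minimal sketch of a compute special remote speaking version 1 of the
# external special remote protocol on stdin/stdout. run_instructions()
# and the per-key state layout are illustrative assumptions.
import sys

def send(*words):
    print(" ".join(words), flush=True)

def get_instructions(key):
    # Per-key state (SETSTATE/GETSTATE) is one place the instruction
    # record could live; git-annex answers GETSTATE with "VALUE ...".
    send("GETSTATE", key)
    reply = sys.stdin.readline().rstrip("\n")
    return reply[len("VALUE "):] if reply.startswith("VALUE ") else ""

def run_instructions(instructions, destfile):
    raise NotImplementedError("execute the recorded computation")

def main():
    send("VERSION", "1")
    for line in sys.stdin:
        words = line.rstrip("\n").split(" ")
        cmd = words[0]
        if cmd == "INITREMOTE":
            send("INITREMOTE-SUCCESS")
        elif cmd == "PREPARE":
            send("PREPARE-SUCCESS")
        elif cmd == "CHECKPRESENT":
            # "Present" means: do compute instructions for the key exist?
            key = words[1]
            if get_instructions(key):
                send("CHECKPRESENT-SUCCESS", key)
            else:
                send("CHECKPRESENT-FAILURE", key)
        elif cmd == "TRANSFER" and words[1] == "RETRIEVE":
            key, destfile = words[2], " ".join(words[3:])
            try:
                run_instructions(get_instructions(key), destfile)
                send("TRANSFER-SUCCESS", "RETRIEVE", key)
            except Exception as exc:
                send("TRANSFER-FAILURE", "RETRIEVE", key, str(exc))
        elif cmd == "REMOVE":
            # Forget the instruction record (an empty value clears state).
            key = words[1]
            send("SETSTATE", key, "")
            send("REMOVE-SUCCESS", key)
        elif cmd == "WHEREIS":
            key = words[1]
            info = get_instructions(key)
            if info:
                send("WHEREIS-SUCCESS", "computed from: " + info)
            else:
                send("WHEREIS-FAILURE")
        else:
            send("UNSUPPORTED-REQUEST")

if __name__ == "__main__":
    main()
```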
Worktree provisioning?
Such an external remote implementation would need a way to create suitable worktrees to (re)run a given piece of code. Git-annex support for providing a (separate) worktree for a repository at a specific commit, with efficient (re)use of the main repository's annex, would simplify such implementations.
Request one key, receive many
It is possible that a single computation yields multiple annex keys, even when git-annex only asked for a single one (which is what it would do, sequentially, when driving a special remote). It would be good to be able to capture that and avoid needless duplication of computations.
Instruction deposition
Using STORE (`git annex copy --to`) to record instructions is possible (say, particular ENV variables are used to pass information to a special remote), but it is more or less a hack. It would be useful to have a dedicated command to accept and deposit such a record in association with one or more annex keys (which may or may not be known at that time). This likely requires settling on a format for such records.
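For illustration only, such a record could be a small JSON document; every field name below is made up, since no format has been settled on:

```python
# Hypothetical instruction record; all field names are illustrative.
import json

record = {
    "interpreter": "bash",                 # how to execute "command"
    "command": "csv2xlsx {inputs} -o {output}",
    "inputs": ["tables/2024.csv"],         # worktree-relative input files
    "outputs": {"data.xlsx": None},        # filled in once the key is known
    "tree": None,                          # optional: tree the record was made against
}
print(json.dumps(record, indent=2))
```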
Storage redundancy tests
I believe that no particular handling is needed for annex keys that are declared inputs to computing instructions for other keys. I am still listing it here to state that, and to possibly be proven wrong.
Trust
We would need a way for users to indicate that they trust a particular compute instruction or the entity that provided it. Even if git-annex does not implement tooling for that, it would be good to settle on a concept that can be interpreted/implemented by such special remotes.
I just want to mention that I've implemented/tried to implement something like this in https://github.com/matrss/datalad-getexec. It basically just records a command line invocation to execute, and all required input files, as base64-encoded JSON in a URL with a custom scheme, which made it surprisingly simple to implement. I haven't touched it in a while and it was more of an experiment, but other than issues with dependencies on files in sub-datasets it worked pretty well. The main motivation to build it was the mentioned use case of automatically converting between file formats. Of course it doesn't address all of your mentioned points; e.g. trust is something I haven't considered in my experiments at all. But it shows that the current special remote protocol is sufficient for a basic implementation of this.
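As a rough illustration of this encoding trick (the `getexec:` scheme name and payload layout below are guesses for the sake of the example, not necessarily datalad-getexec's actual format):

```python
# Encode a command invocation and its inputs into a URL; a special
# remote claiming this (hypothetical) scheme can decode and run it.
import base64
import json

spec = {
    "cmd": ["grib_to_netcdf", "-o", "{output}", "data.grib"],
    "inputs": ["data.grib"],
}
payload = base64.urlsafe_b64encode(json.dumps(spec).encode()).decode()
url = "getexec:v1-" + payload
print(url)  # could then be registered with: git annex addurl --file <file> <url>
```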
I like the proposed "request one key, receive many" extension to the special remote protocol and I think that could be useful in other "unusual" special remotes as well.
I don't quite understand the necessity for "Worktree provisioning". If I understand that right, I think it would just make things more complicated and unintuitive compared to always staying in HEAD.
"Instruction deposition" is essentially just adding a URL to a key in my implementation, which is pretty nice. Using the built-in relaxed option automatically gives the distinction between generating keys that have never existed and regenerating keys.
Thanks for the pointer, very useful!
Regarding the points you raised:
- Datalad's `run` feature has been around for some years, and we have seen usage in the wild with command lines that are small programs and dozens, sometimes hundreds, of inputs. It is true that anything could simply be URL-encoded. However, especially with command patterns (always the same, except for parameter changes) that may be needlessly heavy. Maybe it would compress well (likely), but it still poses a maintenance issue. Say the compute instructions need an update (software API change): updating one shared instruction set is a simpler task than sifting through annex keys and rewriting URLs.
- We need a worktree different from `HEAD` whenever `HEAD` has changed from the original worktree used for setting up a compute instruction. Say a command needs two input files, but one has been moved to a different directory in the current `HEAD`. An implementation would now either have to say "no longer available" and force a maintenance update, or be able to provision the respective worktree. Without a provisioning capability, we would need to replace the URL-encoded instructions (this would make the key uncomputable in earlier versions), or amend them with an additional instruction set (and now we would start to accumulate cruft, where changes in the git-annex branch need to account for (unrelated) changes in any other branch).

An interesting benefit of using URL keys for this is the recently added VURL keys in today's release, which work just like URL keys, except that a checksum gets calculated when the content is downloaded from the URL. This allows `git-annex fsck` to verify the checksums, as well as letting the checksum be verified when transferring the content between repositories. (See the `git-annex addurl --verifiable` documentation.)

And a nice thing about using URL or VURL keys for this is that it allows for both fully reproducible computations and computations that generate equivalent but not identical files. The latter corresponds to `git-annex addurl --relaxed`.

If you use a VURL key and give it a size, then the checksum is calculated on the first download from your compute special remote, and subsequent downloads are required to have the same checksum. Without a size, it's relaxed, and anything your compute special remote generates is treated as effectively the same key, so there can be several checksums that git-annex knows about, attached to the same VURL key.
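As a usage sketch of the two modes (the `compute:` URL scheme here is a placeholder for whatever scheme a compute special remote would claim):

```python
# Sketch: registering a computed file under a VURL key (checksummed on
# first retrieval) vs. a relaxed key (any generated content accepted).
import subprocess

def addurl(url, file, *flags):
    subprocess.run(["git", "annex", "addurl", *flags, "--file", file, url],
                   check=True)

# One or the other would be used for a given file, not both:

# VURL key: the first download fixes the checksum, later ones must match.
addurl("compute:grib2netcdf?input=data.grib", "data.nc", "--verifiable")

# Relaxed: equivalent-but-not-identical regenerations are all accepted.
addurl("compute:grib2netcdf?input=data.grib", "data.nc", "--relaxed")
```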
About worktree provisioning, couldn't you record the sha1 of the tree containing the files needed to generate the object, and then use `git worktree` to make a temporary checkout of that tree? You could `git-annex get` whatever files are necessary within the temp worktree, which could result in recursive computations to get dependencies. I would be careful to avoid dependency cycles though.
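A minimal sketch of that suggestion, assuming the instruction record stores the sha1 of the relevant tree (since `git worktree add` expects a commit-ish, a bare tree is first wrapped in a throwaway commit via `git commit-tree`):

```python
# Sketch only: provision the recorded tree in a temporary worktree and
# fetch the annexed inputs there; cycle detection is not handled.
import subprocess
import tempfile

def provision_worktree(repo: str, tree_sha1: str) -> str:
    # Wrap the bare tree in a throwaway commit for `git worktree add`.
    commit = subprocess.run(
        ["git", "-C", repo, "commit-tree", "-m", "compute input", tree_sha1],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    tmpdir = tempfile.mkdtemp(prefix="annex-compute-")
    subprocess.run(
        ["git", "-C", repo, "worktree", "add", "--detach", tmpdir, commit],
        check=True,
    )
    # Getting inputs here may recurse into further computations; the
    # worktree shares the main repository's annex.
    subprocess.run(["git", "-C", tmpdir, "annex", "get", "."], check=True)
    return tmpdir
```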
On trust, it seems to me that if someone chooses to enable a particular special remote, they are choosing to trust whatever kind of computations it supports.
E.g., a special remote could choose to always run computations inside a particular container system, and then, if you trust that container system to be secure, you can choose to use it.
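To illustrate that idea, a compute remote could funnel every command through a container runtime, so enabling the remote amounts to trusting that runtime and a pinned image (the image, flags, and helper below are illustrative):

```python
# Sketch: confine instruction execution to a container, so the trust
# decision is about the container runtime and image, not arbitrary code.
import subprocess

def run_confined(workdir, command, image="debian:stable"):
    subprocess.run(
        ["docker", "run", "--rm", "--network=none",   # no network access
         "-v", f"{workdir}:/work", "-w", "/work",     # expose only workdir
         image, "sh", "-c", command],
        check=True,
    )
```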
About request one, receive many: it would be possible for a special remote to run e.g. `git-annex reinject --guesskeys` to move additional generated object files into `.git/annex/objects/`.

(Doesn't datalad do something like that when it downloads and unpacks a tarball that contains several annexed files besides the one it was running the download to get? Or perhaps it only stores the tarball in the annex and unpacks it several times?)
(I forgot to tick "email replies to me", sorry for the late reply)
My reasoning for suggesting to always stay in HEAD is this: let's assume we have a file "data.grib" that we want to convert into "data.nc" using this compute special remote. We use its facilities to do exactly that. Now, if there was a bug in "data.grib" that necessitates an update, we would replace the file. The special remote could do two things then:

1. notice that the recorded input no longer matches what is in HEAD and fail, forcing an update of the compute instructions, or
2. silently provision the old worktree and regenerate "data.nc" from the old, buggy "data.grib".
I think the first error is preferable over the second, because the second one is much more subtle and easy to miss.
This same reasoning extends to software as well, if it is somehow tracked in git: for the above-mentioned conversion one could use "cdo" (climate data operators). One could pin a specific version of "cdo" with nix and its flake.lock file, meaning that there is an exact version of cdo associated with every commit sha of the git-annex/DataLad repository. If I update that lock file to get a new version of cdo, then as a user I would naively assume that re-converting "data.grib" to "data.nc" would now use this new version of cdo. With worktree provisioning it would silently use the old one instead.
IMO worktree provisioning would create an explosion of potential inputs to consider for the computation (the entire git history so far), which would open up a lot of subtle pitfalls. Always using stuff from HEAD would be an easier implementation, easier to reason about, and make the user explicitly responsible for keeping the repository contents consistent.