Enable git-annex to provision file content by other means than download
This idea goes back many years and has been iterated on repeatedly since, most recently at Distribits 2024. The following is a summary of what role git-annex could play in this functionality.
The basic idea is to wrap a provision-by-compute process into the standard interface of a git-annex remote. A consumer would (theoretically) not need to worry about how an annex key is provided, they would simply git-annex-get it, whether this leads to a download or a computation. Moreover, access cost and redundancies could be managed/communicated using established patterns.
Use cases
Here are a few concrete use cases that illustrate why one would want functionality like this.
Generate annex keys (that have never existed)
This can be useful for leaving instructions for how, e.g., other data formats can be generated from a single format that is kept in storage. For example, a collection of CSV files is stored, but an XLSX variant can be generated upon request automatically. Or a single large live-stream video is stored, and a collection of shorter clips is generated from a cue sheet or cut list.
Re-generate annex keys
This can be useful when storing a key is expensive, but its exact identity is known/important. For example, the outcome of a scientific computation yields a large output that is expensive to compute and to store, yet needs to be tracked for repeated further processing -- the cost of recomputation may be reduced by storing (smaller) intermediate results, and leaving instructions for how to perform (a different) computation that yields the identical original output.
This second scenario, where annex keys are reproduced exactly, can be considered the general case. It generally requires exact inputs to the computation, whereas the first scenario can/should handle the application of a compute instruction to any compatible input data.
What is in scope for git-annex?
The execution of arbitrary code without any inherent trust is a problem -- a problem that git-annex may not want to get into. Moreover, there are many candidate environments for code execution -- a complexity that git-annex may not want to get into either.
External remote protocol sufficient?
From my point of view, pretty much all required functionality could be hidden behind the external remote protocol, and thereby inside one or more special remote implementations.
`STORE`
: somehow capture the computing instructions, likely linking some code to some (key-specific) parameters, like input files

`CHECKPRESENT`
: do compute instructions for a key exist?

`RETRIEVE`
: compute the key

`REMOVE`
: remove the instructions/parameter record

`WHEREIS`
: give information on computation/inputs

where `SETSTATE`/`GETSTATE` may implement the instruction deposit/retrieval.
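To make this concrete, a minimal external special remote along these lines could look roughly like the following sketch (error handling, `STORE`, and progress reporting are omitted; `lookup-instructions` and `run-instructions` stand in for whatever instruction store a real implementation would use):

```sh
#!/bin/sh
# Sketch of a compute special remote speaking the external special remote
# protocol on stdin/stdout. Whitespace in filenames is not handled; any
# request not covered below is answered with UNSUPPORTED-REQUEST.
echo VERSION 1
while read -r req a b c; do
	case "$req" in
		INITREMOTE) echo INITREMOTE-SUCCESS ;;
		PREPARE)    echo PREPARE-SUCCESS ;;
		CHECKPRESENT)
			# "present" means: compute instructions for key $a exist
			if lookup-instructions "$a" >/dev/null
			then echo CHECKPRESENT-SUCCESS "$a"
			else echo CHECKPRESENT-FAILURE "$a"
			fi ;;
		TRANSFER)
			# TRANSFER RETRIEVE <key> <file>: compute key $b into file $c
			if [ "$a" = RETRIEVE ] && run-instructions "$b" "$c"
			then echo TRANSFER-SUCCESS RETRIEVE "$b"
			else echo TRANSFER-FAILURE "$a" "$b" cannot-compute
			fi ;;
		REMOVE)  echo REMOVE-SUCCESS "$a" ;;
		WHEREIS) echo WHEREIS-SUCCESS "computed from recorded instructions" ;;
		*)       echo UNSUPPORTED-REQUEST ;;
	esac
done
```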
Worktree provisioning?
Such an external remote implementation would need a way to create suitable worktrees to (re)run a given piece of code. Git-annex support for providing a (separate) worktree for a repository at a specific commit, with efficient (re)use of the main repository's annex, would simplify such implementations.
Request one key, receive many
It is possible that a single computation yields multiple annex keys, even when git-annex only asked for a single one (which is what it would do, sequentially, when driving a special remote). It would be good to be able to capture that and avoid needless duplication of computations.
Instruction deposition
Using `STORE` (`git annex copy --to`) to record instructions is possible (say, particular environment variables are used that pass information to a special remote), but is more or less a hack. It would be useful to have a dedicated command to accept and deposit such a record in association with one or more annex keys (which may or may not be known at that time). This likely requires settling on a format for such records.
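To illustrate why this feels like a hack, an invocation might look like this (the environment variable, remote name, and command are invented for the example):

```sh
# hypothetical: smuggle the compute instruction to the special remote through
# the environment while "copying" the key to it
ANNEX_COMPUTE_INSTRUCTION='cdo -f nc copy data.grib data.nc' \
	git annex copy data.nc --to compute
```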
Storage redundancy tests
I believe that no particular handling is needed for annex keys that are declared inputs to computing instructions for other keys. Still listing it here to state that, and possibly be proven wrong.
Trust
We would need a way for users to indicate that they trust a particular compute instruction or the entity that provided it. Even if git-annex does not implement tooling for that, it would be good to settle on a concept that can be interpreted/implemented by such special remotes.
I just want to mention that I've implemented/tried to implement something like this in https://github.com/matrss/datalad-getexec. It basically just records a command line invocation to execute and all required input files as base64-encoded json in a URL with a custom scheme, which made it surprisingly simple to implement. I haven't touched it in a while and it was more of an experiment, but other than issues with dependencies on files in sub-datasets it worked pretty well. The main motivation to build it was the mentioned use-case of automatically converting between file formats. Of course it doesn't address all of your mentioned points. E.g. trust is something I haven't considered in my experiments, at all. But it shows that the current special remote protocol is sufficient for a basic implementation of this.
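For a rough idea of the mechanics, the recorded URL might be constructed like this (scheme name and JSON layout are illustrative, not necessarily exactly what datalad-getexec uses):

```sh
# encode the command and its declared inputs as base64 JSON inside a
# custom-scheme URL, and attach that URL to the output file; a special remote
# claiming the scheme can later decode and execute it to provide the content
spec='{"cmd": ["cdo", "-f", "nc", "copy", "data.grib", "data.nc"], "inputs": ["data.grib"]}'
git annex addurl --relaxed --file data.nc "getexec:v1-$(printf '%s' "$spec" | base64 -w0)"
```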
I like the proposed "request one key, receive many" extension to the special remote protocol and I think that could be useful in other "unusual" special remotes as well.
I don't quite understand the necessity for "Worktree provisioning". If I understand that right, I think it would just make things more complicated and unintuitive compared to always staying in HEAD.
"Instruction deposition" is essentially just adding a URL to a key in my implementation, which is pretty nice. Using the built-in relaxed option automatically gives the distinction between generating keys that have never existed and regenerating keys.
Thanks for the pointer, very useful!
Regarding the points you raised:
Datalad's `run` feature has been around for some years, and we have seen usage in the wild with command lines that are small programs and dozens, sometimes hundreds of inputs. It is true that anything could simply be URL-encoded. However, especially with command patterns (always the same, except for a parameter change) that may be needlessly heavy. Maybe it would compress well (likely), but it still poses a maintenance issue. Say the compute instructions need an update (software API change): updating one shared instruction set is a simpler task than sifting through annex keys and rewriting URLs.

We need a worktree different from `HEAD` whenever `HEAD` has changed from the original worktree used for setting up a compute instruction. Say a command needs two input files, but one has been moved to a different directory in the current `HEAD`. An implementation would now either say "no longer available" and force a maintenance update, or be able to provision the respective worktree. In the case of no provisioning capability we would need to replace the URL-encoded instructions (which would make the key uncomputable in earlier versions), or amend with an additional instruction set (and now we would start to accumulate cruft, where changes in the git-annex branch need to account for (unrelated) changes in any other branch).

An interesting benefit of using URL keys for this is the recently added VURL keys in today's release, which work just like URL keys, except a checksum gets calculated when the content is downloaded from the URL. This allows `git-annex fsck` to verify the checksums, as well as letting the checksum be verified when transferring the content between repositories. (See the `git-annex addurl --verifiable` documentation.)

And a nice thing about using URL or VURL keys for this is that it allows for both fully reproducible computations and computations that generate equivalent but not identical files. The latter corresponds to `git-annex addurl --relaxed`.

If you use a VURL key and give it a size, then the checksum is calculated on first download from your compute special remote, and subsequent downloads are required to have the same checksum. Without a size, it's relaxed and anything your compute special remote generates is treated as effectively the same key, so there can be several checksums that git-annex knows about, attached to the same VURL key.
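In command terms, the two behaviors could be requested along these lines (the compute: URL is made up; the flags themselves exist):

```sh
# VURL key: a checksum is recorded on first retrieval and verified afterwards
git annex addurl --verifiable --file data.nc "compute:data.grib-to-netcdf"
# relaxed URL key: no checksum is pinned, the generated content may vary
git annex addurl --relaxed --file data.nc "compute:data.grib-to-netcdf"
```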
About worktree provisioning, couldn't you record the sha1 of the tree containing the files needed to generate the object, and then use `git worktree` to make a temporary checkout of that tree? You could `git-annex get` whatever files are necessary within the temp worktree, which could result in recursive computations to get dependencies. I would be careful to avoid dependency cycles though..
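A minimal sketch of that flow, assuming the recorded state is available as a commit (git worktree needs a commit-ish) and that git-annex is usable in the linked worktree:

```sh
# check out the recorded state into a temporary worktree and get the input there
git worktree add --detach /tmp/annex-compute 1234abcd    # recorded sha1 (placeholder)
git -C /tmp/annex-compute annex get data.grib
# run the computation, writing the result outside the temporary worktree
cdo -f nc copy /tmp/annex-compute/data.grib "$output"    # $output: where the key content should land
git worktree remove /tmp/annex-compute
```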
On trust, it seems to me that if someone chooses to install a particular special remote, they are choosing to trust whatever kind of computations it supports.
Eg a special remote could choose to always run a computation inside a particular container system and then if you trust that container system is secure, you can choose to install it.
Enabling the special remote is not necessary, because a repository can be set to autoenable a special remote. In some sense this is surprising. I had originally talked about enabling here and then I remembered autoenable.
It may be that autoenable should only be allowed for special remote programs that the user explicitly whitelists, not only installs into PATH. That would break some existing workflows, though setting some git configs would not be too hard.
There seems scope for both compute special remotes that execute code that comes from the git repository, and ones that only have metadata about the computation recorded in the git repository, in a way that cannot let them execute arbitrary code under the control of the git repository.
A well-behaved compute special remote that does run code that comes from a git repository could require an additional git config to be set to allow it to do that.
About request one, receive many, it would be possible for a special remote to run eg `git-annex reinject --guesskeys` to move additional generated object files into .git/annex/objects/. (Doesn't datalad do something like that when it downloads and unpacks a tarball that contains several annexed files besides the one it was running the download to get? Or perhaps it only stores the tarball in the annex and unpacks it several times?)
(I forgot to tick "email replies to me", sorry for the late reply)
My reasoning for suggesting to always stay in HEAD is this: Let's assume we have a file "data.grib" that we want to convert into "data.nc" using this compute special remote. We use its facilities to make it do exactly that. Now, if there was a bug in "data.grib" that necessitates an update, we would replace the file. The special remote could do two things then: fail loudly, because the recorded input no longer matches what is in HEAD, or silently keep producing the old "data.nc" from the old "data.grib".

I think the first error is preferable over the second, because the second one is much more subtle and easy to miss.
This same reasoning extends to software as well, if it is somehow tracked in git: for the above mentioned conversion one could use "cdo" (climate data operators). One could pin a specific version of "cdo" with nix and its flake.lock file, meaning that there is an exact version of cdo associated with every commit sha of the git-annex/DataLad repository. If I update that lock file to get a new version of cdo, then as a user I would naively assume that re-converting "data.grib" to "data.nc" would now use this new version of cdo. With worktree provisioning it would silently use the old one instead.
IMO worktree provisioning would create an explosion of potential inputs to consider for the computation (the entire git history so far), which would create a lot of subtle pitfalls. Always using stuff from HEAD would be an easier implementation, easier to reason about, and make the user explicitly responsible for keeping the repository contents consistent.
Circling back to this, I think the fork in the road is whether this is about git-annex providing this and that feature to support external special remotes that compute, or whether git-annex gets a compute special remote of its own with some simpler/better extension interface than the external special remote protocol.
Of course, git-annex having its own compute special remote would not preclude other external special remotes that compute. And for that matter, a single external special remote could implement an extension interface.
Thinking about how a generic compute special remote in git-annex could work, multiple instances of it could be initremoted:
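For example, the initremote step might look something like this (the type and program parameters are part of the proposal, not an existing interface):

```sh
# one instance per kind of computation; "program" names the compute program to run
git annex initremote ffmpeg-cut type=compute encryption=none program=git-annex-compute-ffmpeg-cut
```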
Here the "program" parameter would cause a program like
git-annex-compute-ffmpeg-cut
to be run to get files from that instance of the compute special remote. The interface could be as simple as it being run with the key that it is requested to compute, and outputting the paths to the all keys it was able to compute. (So allowing for "request one key, receive many".) Perhaps also with some way to indicate progess of the computation.It would make sense to store the details of computations in git-annex metadata. And a compute program can use git-annex commands to get files it depends on. Eg,
git-annex-compute-ffmpeg-cut
could run:It might be worth formalizing that a given computed key can depend on other keys, and have git-annex always get/compute those keys first. And provide them to the program in a worktree?
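For instance, the compute program might boil down to something like this (file names and cut points are made up):

```sh
# fetch the input this computation depends on, then produce the requested clip
git annex get input.mov
ffmpeg -i input.mov -ss 00:15:00 -to 00:30:00 -c copy clip.mov
```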
When asked to store a key in the compute special remote, it would verify that the key can be generated by it. Using the same interface as used to get a key.
This all leaves a chicken and egg problem, how does the user add a computed file if they don't know the key yet?
The user could manually run the commands that generate the computed file, then `git-annex add` it, and set the metadata. Then `git-annex copy --to` the compute remote would verify if the file can be generated, and add it if so. This seems awkward, but also nice to be able to do manually.

Or, something like VURL keys could be used, with an interface something like this:
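A purely hypothetical sketch of such an invocation (the command name and options are made up, just to make the idea concrete):

```sh
# generate a VURL-like key for clip.mov, record the computation's metadata,
# and verify it by "storing" the key in the compute special remote
git annex addcomputed clip.mov --to ffmpeg-cut \
	--metadata source=input.mov \
	--metadata starttime=00:15:00 --metadata endtime=00:30:00
```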
All that would do is generate some arbitrary VURL key or similar, provisionally set the provided metadata (how?), and try to store the key in the compute special remote. If it succeeds, stage an annex pointer and commit the metadata. Since it's a VURL key, storing the key in the compute special remote would also record the hash of the generated file at that point.
Using metadata to store the inputs of computations, like I did in my example above, seems like it would allow the metadata to be changed later, which would change the output when a key gets recomputed. That feels surprising, because metadata could be changed for any reason, without the intention of affecting a compute special remote.
It might be possible for git-annex to pin down the current state of metadata (or the whole git-annex branch) and provide the same input to the computation when it's run again. (Unless `git-annex forget` has caused that old branch state to be lost..) But it can't fully isolate the program from all unpinned inputs without using some form of containerization, which feels out of scope for git-annex.

Instead of using metadata, the input values could be stored in the per-special-remote state of the generated key. Or the input values could be encoded in the key itself, but then two computations that generate the same output would have two different keys, rather than hashing to the same key.
Using a key with a regular hash backend also lets the user find out if the computation turns out to not be reproducible later for whatever reason; getting the file from the compute special remote will fail at hash verification time. Something like a VURL key could still alternatively be used in cases where reproducibility is not important.
To add a computed file, the interface would look close to the same, but now the --value options are setting fields in the compute special remote's state:
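Again a hypothetical sketch, mirroring the one above but with the values landing in the per-remote state of the generated key:

```sh
# the compute program might see these values as environment variables
git annex addcomputed clip.mov --to ffmpeg-cut \
	--input source=input.mov \
	--value starttime=00:15:00 --value endtime=00:30:00
```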
The values could be provided to the "git-annex-compute-" program with environment variables.
For `--input source=foo`, it could look up the git-annex key (or git sha1) of that file and store that in the state. So it would provide the compute program with the same data every time. But it could also store the filename. And that allows for a command like the one sketched below, which, when the input.mov file has been changed, would re-run the computation with the new content of the file and stage a new version of the computed file. It could even be used to recompute every file in a tree.
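The command name and options here are hypothetical, just to make the workflow concrete (it could equally be a re-run of the addcomputed command above):

```sh
# re-run the computation for one file whose recorded input has changed,
# staging the newly computed version
git annex recompute clip.mov
# or recompute every computed file under a tree
git annex recompute .
```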
Also, that command could let input values be adjusted later:
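For example (still hypothetical):

```sh
# re-run the computation with an adjusted input value, staging the new result
git annex recompute clip.mov --value starttime=00:10:00
```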
It would also be good to have a command that examines a computed key and displays the values and inputs. That could be `git-annex whereis` or perhaps a dedicated command with more structured output.

This all feels like it might allow for some useful workflows...
@m.risse in your example, the "data.nc" file gets new content when it is retrieved from the special remote after the source file has changed.

But if you already have the data.nc file present in a repository, it does not get updated immediately when you update the source "data.grib" file.
So, a drop and re-get of a file changes the version of the file you have available. For that matter, if the old version has been stored on other remotes, a get may retrieve either an old or a new version. That is not intuitive and it makes me wonder if using a special remote is really a good fit for what you're wanting to do.
In your "cdo" example, it's not clear to me if the new version of the software generates an identical file to the old, or if it has a bug fix that causes it to generate a significantly different output. If the two outputs are significantly different then treating them as the same git-annex key seems questionable to me.
My design so far does not fully support "Request one key, receive many".
My `git-annex addcomputed` command doesn't handle the case where a computation generates multiple output files. While the `git-annex-compute-` command's interface could let it return several computed files, addcomputed would only add one file, with the name that the user specifies. What is it supposed to do if the computation generates more than one? Maybe it needs a way to let a whole directory be populated with the files generated by a computation. Or a way to specify multiple files to add.

And here's another problem: Suppose I have one very expensive computation that generates files foo and bar. And a second, less expensive computation, that also generates foo (same content) as well as generating baz. Both computations are run on the same compute special remote. Now if the user runs `git-annex get foo`, they will be unhappy if it chooses to run the expensive computation, rather than the less expensive computation.

Since the per-special-remote state for a key is used as the computation input, only one input can be saved for foo's key. So it wouldn't really be picking between two alternatives, it would just use whatever the current state for that key is.
True, that can happen, and the user was explicit in that they either don't care about it (non-checksum backend, URL in my PoC), or do care (checksum backend) and git-annex would fail the checksum verification.
This I haven't entirely thought through. I'd say if the key uses a non-checksum backend, then it can only be assumed, and is the user's responsibility, that the resulting file is functionally, even if not bit-by-bit, identical. E.g. with netCDF, checksums can differ due to small details like chunking, but the data might be the same. With a checksum backend git-annex would just fail the next recompute, but the interactions with copies on other remotes could indeed get confusing.
Again, two possible cases depending on if the key uses a checksum or a non-checksum backend. With a checksum: if the new version produces the same output everything is fine; if the new version produces different output then git-annex would indicate this discrepancy on the next recompute and the user has to decide how to handle it (probably by checking that the output of the new version is either functionally the same or in some way "better" than the old one and updating the repository to record this new key as that file).
Without a checksum backend the user would again have been explicit in that they don't care if the data changes for whatever reason, the key is essentially just a placeholder for the computation without a guarantee about its content.
Something like VURL would be a compromise between the two: it would avoid the upfront cost of computing all files (which might be very expensive), but still instruct git-annex to error out if the checksum changes at some point after the first compute. A regular migration of the computed-files-so-far to a checksum backend could achieve the same.
Some thoughts regarding your ideas:
`git annex get` doesn't work recursively across submodules/subdatasets though, and `datalad get` does not understand keys, just paths (at least so far).