Do you ever check in original versions of files to git-annex, but then convert them in some way? Maybe you check in original photos from a camera, but then change them to a more useful file format, or smaller resolution. Or you clip a video file. Or you crunch some data to a result.

If you check the computed file into git-annex too, and store it on your remotes along with the original, that's a waste of disk space. But it is so convenient to be able to git-annex get the computed file.

The compute special remote is the solution to this. It "stores" the computed file by remembering how to compute it from input files. When you git-annex get the computed file from it, it re-runs the computation on the original input file to produce the computed file.

using the compute special remote

There are many compute programs that each handle some type of computation, and it's pretty easy to write your own compute program too. In this tip, we'll use git-annex-compute-imageconvert, which uses imagemagick to convert between image formats.

To follow along, install that program in PATH (and remember to make it executable!) and make sure you have imagemagick installed.
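Before going further, it can help to confirm that everything is actually findable in PATH. This is a small sketch, not part of git-annex; check_in_path is a made-up helper name, and it only relies on the shell's standard command -v lookup.

```shell
# Report whether a command can be found in PATH.
# (check_in_path is a hypothetical helper, named here for illustration.)
check_in_path() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
    fi
}

# The three commands this tip relies on.
check_in_path git-annex
check_in_path git-annex-compute-imageconvert
check_in_path convert
```

Anything reported as missing needs to be installed (or made executable) before the commands below will work.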

First, initialize a compute remote:

# git-annex initremote imageconvert type=compute program=git-annex-compute-imageconvert

Now suppose you have a file foo.jpeg, and you want to add a computed foo.gif to the git-annex repository.

# git-annex addcomputed --to=imageconvert foo.jpeg foo.gif

(The syntax of the git-annex addcomputed command will vary depending on the program that a compute remote uses. Some may have multiple input files, or multiple output files, or other options to control the computation. See the documentation of each compute program for details.)

Now you have foo.gif and can use it as usual, including copying it to other remotes. But it's already "stored" in the imageconvert remote, as a computation. So to free up space, you can drop it:

# git-annex drop foo.gif
drop foo.gif ok

By the way, you can also add a computed file to the repository without bothering to compute it yet! Just use --fast:

# git-annex addcomputed --fast --to=imageconvert bar.jpeg bar.gif

Now suppose you're in another clone of this same repository, and you want these gifs.

# git-annex get foo.gif
get foo.gif (not available)
  Maybe enable some of these special remotes (git annex enableremote ...):
    8332f7ad-d54e-435e-803b-138c1cfa7b71 -- imageconvert
failed

With git-annex-compute-imageconvert and imagemagick installed, all you need to do is enable the special remote to get the computed files from it:

# git-annex enableremote imageconvert
# git-annex get foo.gif
get foo.gif (from imageconvert...)
(getting input foo.jpeg from origin...)
ok

Notice that, when the input file is not present in the repository, getting a file from a compute remote will first get the input file.

That's the basics of using the compute special remote.

recomputation

What happens if the input file foo.jpeg is changed to a new version? Will getting foo.gif from the compute remote base it on the new version too? No. foo.gif is stuck on the original version of the input file that was used to compute it.

But, it's easy to recompute the file with a new version of the input file. Just git-annex add the new version of the input file, and then:

# git-annex recompute foo.gif
recompute foo.gif (foo.jpeg changed)
ok

You can use commands like git diff and git status to see the change that was made to foo.gif.

# git status --short foo.gif
 M foo.gif

Now both the new and old versions of foo.gif are stored in the imageconvert remote, and it can compute either as needed.

reproducibility

You might be wondering what happens if a computed file, such as foo.gif, isn't exactly the same each time it's computed. For example, what if there's a timestamp in there?

The answer is that, by default, files computed by a compute special remote are neither required nor guaranteed to be bit-for-bit reproducible. One gif converted from a jpeg is much like any other converted from the same jpeg.

So git-annex usually treats all files computed in the same way from the same input as interchangeable. (Unless the compute program indicates that it produces reproducible files.)

Sometimes though, it's important that a file be bit-for-bit reproducible. And you can ask git-annex to enforce this for computed files. There is a --reproducible option for this, which you can pass to git-annex addcomputed or to git-annex recompute.

Let's switch the computed foo.gif to a reproducible file:

# git-annex recompute --original --reproducible foo.gif
recompute foo.gif
ok

You can git commit foo.gif to store this change.

But first, let's check if that computation actually is reproducible. This is easy: just drop it and get it from the compute remote again:

# git-annex drop foo.gif
drop foo.gif ok
# git-annex get foo.gif --from imageconvert
get foo.gif (from imageconvert...)
ok

If it turned out that the computation was not reproducible, getting the file would fail, like this:

# git-annex get foo.gif --from imageconvert
get foo.gif (from imageconvert...)
Verification of content failed

This is because a reproducible file uses a regular backend, which by default uses a hash to verify the content of the file.
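To see why a hash check catches this, here is a rough illustration that does not involve git-annex at all. fake_convert is a made-up stand-in for a conversion tool that can embed a creation time in its output, hash_file is a hypothetical helper, and SHA-256 is used only as an example of a content hash.

```shell
# Use whichever SHA-256 tool is installed (sha256sum on most Linux
# systems, shasum elsewhere). hash_file is a hypothetical helper.
hash_file() {
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum < "$1" | cut -d' ' -f1
    else
        shasum -a 256 < "$1" | cut -d' ' -f1
    fi
}

# A made-up "conversion": copies its input and, when given a second
# argument, appends it the way some tools embed a creation time.
fake_convert() {
    cat "$1"
    if [ -n "$2" ]; then
        echo "created: $2"
    fi
}

tmp=$(mktemp -d)
printf 'pretend image data\n' > "$tmp/input"

# Two runs with no embedded time, then two runs one second apart.
fake_convert "$tmp/input" > "$tmp/a"
fake_convert "$tmp/input" > "$tmp/b"
fake_convert "$tmp/input" "12:00:01" > "$tmp/c"
fake_convert "$tmp/input" "12:00:02" > "$tmp/d"

hash_a=$(hash_file "$tmp/a")
hash_b=$(hash_file "$tmp/b")
hash_c=$(hash_file "$tmp/c")
hash_d=$(hash_file "$tmp/d")
rm -rf "$tmp"

if [ "$hash_a" = "$hash_b" ]; then plain=same; else plain=different; fi
if [ "$hash_c" = "$hash_d" ]; then stamped=same; else stamped=different; fi

echo "no timestamp:   hashes are $plain"
echo "with timestamp: hashes are $stamped"
```

A single differing byte is enough to change the hash, which is why a file recorded as reproducible fails verification if the program's output drifts between runs.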

If it does turn out that a file that was expected to be reproducible isn't, you can always convert it to an unreproducible file:

# git-annex recompute --original --unreproducible foo.gif
recompute foo.gif
ok

writing your own compute programs

There is a whole little protocol that compute programs use to communicate with git-annex. It's all documented at compute special remote interface.

But it's really easy to write simple ones, and you don't need to dive into all the details to do it. Let's walk through the code to git-annex-compute-imageconvert, which at 14 lines, is about as simple as one can be.

#!/bin/sh

It's a shell script.

set -e

If it fails to read input from standard input, or if a command fails, it will exit nonzero.
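The effect of set -e on a failed read can be checked in isolation. This sketch is not part of git-annex-compute-imageconvert; it runs a tiny script in a child shell with empty standard input, so that read hits end-of-file and returns nonzero.

```shell
# Run a short script in a child shell whose stdin is empty.
# read fails at end-of-file, and set -e turns that failure into an exit.
status=0
out=$(sh -c 'set -e; read reply; echo "still running"' </dev/null) || status=$?

echo "with set -e, output: '$out'"
echo "with set -e, exit status: $status"

# Without set -e, the same script carries on with an empty reply.
out_noe=$(sh -c 'read reply; echo "still running"' </dev/null)
echo "without set -e, output: '$out_noe'"
```

So if git-annex goes away in the middle of the exchange, the program stops instead of computing something from an empty path.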

if [ -z "$1" ] || [ -z "$2" ]; then
    echo "Specify the input image file, followed by the output image file." >&2
    echo "Example: foo.jpg foo.gif" >&2
    exit 1
fi

It expects to be passed two parameters, which were "foo.jpeg" and "foo.gif" in the examples above. And it outputs some usage to stderr otherwise. That is displayed if the user runs git-annex addcomputed without the necessary filenames.

echo "INPUT $1"
read input

It tells git-annex that the first filename is the input file. And git-annex replies by telling it where the content of the input file is. This is the path to a git-annex object file.

echo "OUTPUT $2"
read output

It tells git-annex that the second filename is the output file. And git-annex replies by telling it the path it should write the output file to.

if [ -n "$input" ]; then

When git-annex addcomputed --fast is used, the program shouldn't actually read the input file or compute the output file. git-annex indicates this by not giving it a path to the input file. That's checked here.

    convert "$input" "$output" >&2

This uses convert from imagemagick, and just converts the input file to the output file.

Notice that stdout from convert is redirected to stderr. This is done because the compute program is speaking this protocol with git-annex over stdin and stdout, and we don't want random program output to mess that up.
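The difference the redirect makes can be seen with a chatty stand-in for convert. In this sketch, noisy_tool is a made-up shell function that prints a progress message to stdout; capturing stdout shows roughly what git-annex would read in each case.

```shell
# A made-up tool that, like some programs, reports progress on stdout.
noisy_tool() {
    echo "processing..."
}

# Without the redirect, the progress message lands in the protocol stream.
unredirected=$( { echo "OUTPUT foo.gif"; noisy_tool; } 2>/dev/null )

# With >&2, only the protocol line reaches stdout.
redirected=$( { echo "OUTPUT foo.gif"; noisy_tool >&2; } 2>/dev/null )

printf 'without redirect:\n%s\n\n' "$unredirected"
printf 'with redirect:\n%s\n' "$redirected"
```

In the first case the stray "processing..." line would be read as if it were part of the protocol.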

fi

Closing the if above.

And that's all!
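To try out this exchange without a repository, the two replies can be fed in on stdin by hand. The following is only a rough sketch: stand-in-compute is a made-up program that follows the same INPUT/OUTPUT steps as the script above but uses cp in place of convert, so imagemagick isn't needed, and the temporary paths play the part of what git-annex would send. The second run imitates --fast by replying to INPUT with an empty line; sending a path in reply to OUTPUT in that case is an assumption made for the sake of the test, not something this tip specifies.

```shell
# Scratch directory for the stand-in program and its files.
tmp=$(mktemp -d)

# A stand-in compute program: same exchange as the script walked through
# above, with cp taking the place of convert. (Hypothetical, testing only.)
cat > "$tmp/stand-in-compute" <<'EOF'
#!/bin/sh
set -e
echo "INPUT $1"
read input
echo "OUTPUT $2"
read output
if [ -n "$input" ]; then
    cp "$input" "$output" >&2
fi
EOF

printf 'pretend image data\n' > "$tmp/object"

# Play git-annex's part: reply to INPUT with the path of the input content,
# and to OUTPUT with the path the result should be written to.
protocol=$(printf '%s\n%s\n' "$tmp/object" "$tmp/result" |
    sh "$tmp/stand-in-compute" foo.jpeg foo.gif)
result=$(cat "$tmp/result")

# Imitate --fast: an empty reply to INPUT means nothing should be computed.
# (Replying to OUTPUT with a path here is an assumption for this sketch.)
printf '\n%s\n' "$tmp/fast-result" |
    sh "$tmp/stand-in-compute" bar.jpeg bar.gif >/dev/null
if [ -e "$tmp/fast-result" ]; then fast_written=yes; else fast_written=no; fi

rm -rf "$tmp"

echo "$protocol"
echo "result content: $result"
echo "output written in fast mode: $fast_written"
```

The captured stdout holds just the two protocol lines, which is all git-annex should ever see from the program on that stream.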

Now you know almost enough to write your own compute program. Editing this one will be a good start.

But first, a word about security.

A user who enables a compute special remote and runs git pull followed by git-annex get is running the compute program with inputs under the control of anyone who has commit access to the repository.

So, it's important that your compute program be secure. Please see the section on security in compute special remote interface for security considerations.

If you write a nice secure compute program, you can add it to the list in compute so other people can use it.