compute special remote interface

The compute special remote uses this interface to run compute programs.

When a compute special remote is initialized with initremote, a program is specified:

git-annex initremote myremote type=compute program=git-annex-compute-foo

To add an annexed file that is computed by the program, the user runs a command like one of these:

git-annex addcomputed --to=myremote -- convert file.raw file.jpeg passes=10
git-annex addcomputed --to=myremote -- compress in out --level=9
git-annex addcomputed --to=myremote -- clip foo 2:01-3:00 combine with bar to baz

security

Security is very important here, because a user who enables a compute special remote and runs git pull followed by git-annex get is running the compute program with inputs under the control of anyone who has commit access to the repository.

The contents of input files should be assumed to be untrusted, and so should the filenames of input and output files, as well as everything else passed to the program in ARGV and the environment.

The program should make sure that whatever user input is passed to it can result in only safe and expected behavior. The program should avoid exposing user input to the shell unprotected, or otherwise executing it, except when it is explicitly running user input in some form of sandbox.
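As a minimal sketch of this kind of checking, a shell program might validate a numeric parameter before using it. The helper names is_uint and validated_passes are hypothetical and not part of git-annex.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if $1 is a non-empty string of
# decimal digits.
is_uint() {
    case "$1" in
        '' | *[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Hypothetical helper: print the value if it is safe to use, otherwise
# complain on stderr and fail. The value is only ever compared and
# printed, never evaluated by the shell.
validated_passes() {
    if is_uint "$1"; then
        printf '%s\n' "$1"
    else
        echo "refusing unsafe value for passes" >&2
        return 1
    fi
}
```

A program could then run passes=$(validated_passes "${ANNEX_COMPUTE_passes:-1}") || exit 1 before doing anything else with the value.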

program parameters and environment

Whatever values the user passes to git-annex addcomputed are passed to the program in ARGV, followed by any values that the user provided to git-annex initremote.

To simplify the program's option parsing, any value that the user provides that is in the form "foo=bar" will also result in an environment variable being set, eg ANNEX_COMPUTE_passes=10 or ANNEX_COMPUTE_--level=9.
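Note that a name such as ANNEX_COMPUTE_--level is not a valid shell variable name, so a shell script cannot expand it with $. One way around that, sketched here, is to pick the value out of ARGV instead. The helper arg_value is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the first KEY=VALUE argument
# whose key is $1. Fails if no such argument was passed.
arg_value() {
    key=$1
    shift
    for arg in "$@"; do
        case "$arg" in
            "$key"=*)
                # Strip the "KEY=" prefix; the key is quoted so that it
                # is matched literally rather than as a pattern.
                printf '%s\n' "${arg#"$key"=}"
                return 0
                ;;
        esac
    done
    return 1
}
```

For example, level=$(arg_value --level "$@") || level=6 would yield 9 for the compress command shown earlier. Keys that are valid variable names, such as passes, can be read directly as "${ANNEX_COMPUTE_passes:-1}".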

The program is run in a temporary directory, which will be cleaned up after it exits. When git-annex addcomputed was run in a subdirectory of the git repository, the program is run in the corresponding subdirectory of the temporary directory.

Anything that the program outputs to stderr will be displayed to the user. Stderr should be used for error messages, and possibly for output of the computation, but not for progress displays.

If the program exits nonzero, nothing it computed will be stored in the git-annex repository.

input files

Before doing any computation, the program needs to communicate with git-annex about what input files it needs, and what output files it will generate.

The content of any file in the repository can be an input to the computation. The program requests an input by writing a line to stdout:

INPUT file.raw

Then it can read a line from stdin, which will be the path to the content (eg a .git/annex/objects/ path).

If the program needs multiple input files, it should output multiple INPUT lines first, and then read multiple paths from stdin. This allows retrieval of the inputs to potentially run in parallel.

If an input file is not available, the program's stdin will be closed without a path being written to it. So when reading from stdin fails, the program should exit.

When git-annex addcomputed --fast is being used to add a computation to the git-annex repository without actually performing it, the response to each INPUT will be an empty line rather than the path to an input file. This can also happen when an input file is not available for whatever reason. In this case, the program should proceed with the rest of its output to stdout (eg OUTPUT and REPRODUCIBLE), but should not perform any computation.
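The exchange described above could be sketched in shell as follows. The helper names request_inputs and read_reply and the variable reply are hypothetical; only the INPUT lines and the replies on stdin come from the interface.

```shell
#!/bin/sh
# Ask git-annex for every input up front, one INPUT line per file,
# so that git-annex is free to retrieve them in parallel.
request_inputs() {
    for file in "$@"; do
        printf 'INPUT %s\n' "$file"
    done
}

# Read one reply line from git-annex into $reply.
# Fails when stdin has been closed, in which case the caller should exit.
# An empty $reply means no content was provided, so the caller should
# skip the computation but still emit its remaining messages.
read_reply() {
    reply=
    IFS= read -r reply
}
```

A program needing two inputs would call request_inputs "$2" "$3", then call read_reply || exit 1 twice, saving $reply after each call.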

output files

For each output file that it will compute, the program should write a line to stdout, indicating the name of the file that will be added to the git-annex repository by git-annex addcomputed.

OUTPUT file.jpeg

Then it should read a line from stdin, which is the path, in the program's temporary directory, where it should write the output file. Often this will be the same filename, but it may also be a sanitized version. It's important to use the path that was read, rather than the requested name, to avoid path traversal attacks, as well as problems like filenames that look like dashed options. If git-annex detects a path traversal attack, the program's stdin will be closed without a path being written to it.
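A sketch of the OUTPUT exchange, assuming a hypothetical helper request_output that leaves the path chosen by git-annex in the variable reply:

```shell
#!/bin/sh
# Hypothetical helper: announce an output file, then read the path
# that git-annex wants it written to into $reply.
# Fails if stdin was closed without a path being written to it.
request_output() {
    printf 'OUTPUT %s\n' "$1"
    reply=
    IFS= read -r reply
}
```

The caller would run request_output "$3" || exit 1, copy $reply into a variable such as output, and from then on write only to "$output", never to "$3".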

Each output file that the program writes must be a regular file. Symlinks or other special files will not be accepted as output files.

If git-annex sees that an output file is growing, it will use its file size when displaying progress to the user. So if possible, the program should write the content to the file it is computing directly, rather than writing to somewhere else and renaming it at the end. But, if the program seeks around and writes out of order, it should write to a file somewhere else and rename it at the end.

other messages

As well as INPUT and OUTPUT described above, there are some other messages that the program can output. All of these are optional.

PROGRESS 50%

To indicate its current progress while performing the computation, the program can output lines like this. This is not needed if the program streams output to an output file.

REPRODUCIBLE

This indicates that the results of the computation are expected to be bit-for-bit reproducible. That makes git-annex addcomputed behave as if the --reproducible option is set.

SANDBOX

After outputting this line, the program can read a line from stdin that will be the path to the directory it should sandbox to (which corresponds to the top of the git repository, so may be above its working directory). Any INPUT lines that come after SANDBOX will have their input files provided via paths inside the sandbox directory. Usually that is done by making hard links, but git-annex will fall back to copying annexed files if the filesystem does not support hard links.

INPUT-REQUIRED

This works the same as INPUT, except when git-annex addcomputed --fast is being used to add a computation to the git-annex repository without actually performing it, the input file will be provided as a response to this, rather than the empty line provided as a response to INPUT.

If the input file is not available for some reason, an empty line will still be provided as a response to this.
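As an illustration of the PROGRESS message, a program that works in discrete steps could report after each one. This is only a sketch; report_progress is a hypothetical helper, and it assumes a whole-number percentage like the one in the example line above.

```shell
#!/bin/sh
# Hypothetical helper: report that $1 of $2 work units are finished,
# as a whole-number percentage.
# Both arguments must already be known to be plain integers, with $2
# greater than zero; never pass unvalidated user input into shell
# arithmetic.
report_progress() {
    printf 'PROGRESS %d%%\n' "$(( $1 * 100 / $2 ))"
}
```

For instance, a loop over passes could call report_progress "$i" "$passes" at the end of each iteration.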

example

An example git-annex-compute-foo shell script follows:

#!/bin/sh
set -e
if [ "$1" != "convert" ]; then
    echo "Usage: convert input output [passes=n]" >&2
    exit 1
fi
if [ -z "$ANNEX_COMPUTE_passes" ]; then
    ANNEX_COMPUTE_passes=1
fi
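# Request the input, and read back the path to its content.
# printf and read -r keep backslashes in filenames intact.
printf 'INPUT %s\n' "$2"
IFS= read -r input
# Request the output, and read back the path to write it to.
printf 'OUTPUT %s\n' "$3"
IFS= read -r output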
echo REPRODUCIBLE

if [ -n "$input" ]; then
    frobnicate --passes="$ANNEX_COMPUTE_passes" -i "$input" -o "$output" >&2
fi