The compute special remote uses this interface to run compute programs.
When a compute special remote is set up with initremote, a program is specified:
git-annex initremote myremote type=compute program=git-annex-compute-foo
The user adds an annexed file that is computed by the program by running a command like one of these:
git-annex addcomputed --to=myremote -- convert file.raw file.jpeg passes=10
git-annex addcomputed --to=myremote -- compress in out --level=9
git-annex addcomputed --to=myremote -- clip foo 2:01-3:00 combine with bar to baz
security
Security is very important here, because a user who enables a compute
special remote and runs git pull followed by git-annex get is running
the compute program with inputs under the control of anyone who has
commit access to the repository.
The contents of input files should be assumed to be untrusted, and so
should the filenames of input and output files, as well as everything
else passed to the program in ARGV and the environment.
The program should make sure that whatever user input is passed to it can result in only safe and expected behavior. The program should avoid exposing user input to the shell unprotected, or otherwise executing it. (Except when the program is explicitly running user input in some form of sandbox.)
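As one illustration of treating ARGV and the environment as untrusted, a program might validate a numeric parameter before passing it on. This is only a sketch: the helper name is_uint is hypothetical and not part of git-annex, and the passes parameter is borrowed from the addcomputed examples above.

```sh
# Hypothetical helper, not part of git-annex: succeed only if $1 is a
# non-empty string made up entirely of decimal digits.
is_uint() {
    case "$1" in
        '' | *[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Reject anything that is not a plain number before it reaches a command line.
is_uint "${ANNEX_COMPUTE_passes:-1}" || {
    echo "passes must be a non-negative integer" >&2
    exit 1
}
```

Checking against an allowlist of acceptable forms, rather than trying to strip out dangerous characters, keeps the check easy to audit.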
program parameters and environment
Whatever values the user passes to git-annex addcomputed are passed to
the program in ARGV, followed by any values that the user provided to
git-annex initremote.
To simplify the program's option parsing, any value that the user provides
that is in the form "foo=bar" will also result in an environment variable
being set, eg ANNEX_COMPUTE_passes=10 or ANNEX_COMPUTE_--level=9.
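A name such as ANNEX_COMPUTE_--level is not a valid shell variable name, so a shell script cannot read it with the usual $ syntax. A program written in shell could instead look the value up in ARGV. The following is a minimal sketch of that approach; get_param is a hypothetical helper name, not something git-annex provides.

```sh
# get_param NAME ARGS...: print the value of the first NAME=value argument
# found in ARGS. Returns 1 if there is no such argument.
# (Hypothetical helper for illustration only.)
get_param() {
    name=$1
    shift
    for arg in "$@"; do
        case "$arg" in
            "$name"=*)
                # Strip the leading NAME= and print what remains.
                printf '%s\n' "${arg#"$name"=}"
                return 0
                ;;
        esac
    done
    return 1
}
```

For example, `level=$(get_param --level "$@") || level=6` would pick up --level=9 when it is present and otherwise fall back to a default chosen by the program.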
The program is run in a temporary directory, which will be cleaned up after
it exits. It may be run in a subdirectory of the temporary directory. This
is done when git-annex addcomputed was run in a subdirectory of the git
repository.
Anything that the program outputs to stderr will be displayed to the user. Stderr should be used for error messages, and possibly computation output, but not for progress displays.
If the program exits nonzero, nothing it computed will be stored in the git-annex repository.
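The two rules above (errors go to stderr, failure is signalled by a nonzero exit) can be wrapped in a small helper. This is a sketch only; the name die and the message prefix are assumptions made for illustration.

```sh
# Hypothetical helper: print an error message on stderr, where git-annex
# will show it to the user, then exit nonzero so that nothing is stored.
die() {
    echo "git-annex-compute-foo: $*" >&2
    exit 1
}
```

A program could then write `[ -r "$input" ] || die "cannot read input"` at each point where continuing would produce a bad result.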
input files
Before doing any computation, the program needs to communicate with git-annex about what input files it needs, and what output files it will generate.
The content of any file in the repository can be an input to the computation. The program requests an input by writing a line to stdout:
INPUT file.raw
Then it can read a line from stdin, which will be the path to the content
(eg a .git/annex/objects/ path).
If the program needs multiple input files, it should output multiple
INPUT lines first, and then read multiple paths from stdin. This
allows retrieval of the inputs to potentially run in parallel.
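The ordering described above, all requests first and all replies afterwards, might look like this for a program with two inputs. The function name and the variables input1 and input2 are hypothetical.

```sh
# request_two_inputs FILE1 FILE2: ask for both inputs before reading any
# reply, so that git-annex is free to retrieve them in parallel.
# Sets input1 and input2; returns 1 if stdin is closed early.
request_two_inputs() {
    printf 'INPUT %s\n' "$1"
    printf 'INPUT %s\n' "$2"
    # Replies arrive in the same order as the requests.
    IFS= read -r input1 || return 1
    IFS= read -r input2 || return 1
}
```

Reading with `IFS= read -r` keeps backslashes and surrounding whitespace in the returned paths intact.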
If an input file is not available, the program's stdin will be closed without a path being written to it. So when reading from stdin fails, the program should exit.
When git-annex addcomputed --fast is being used to add a computation to
the git-annex repository without actually performing it, the response to
each INPUT will be an empty line rather than the path to an input file.
This can also happen when an input file is not available for whatever
reason. In this case, the program should proceed with the rest of its
output to stdout (eg OUTPUT and REPRODUCIBLE), but should not perform
any computation.
output files
For each output file that it will compute, the program should write a
line to stdout, indicating the name of the file that will be added to the
git-annex repository by git-annex addcomputed.
OUTPUT file.jpeg
Then it should read a line from stdin, which is the path, in the program's temporary directory, where it should write the output file. Often this will be the same filename, but it also may be a sanitized version. It's important to use that sanitized version to avoid path traversal attacks, as well as problems like filenames that look like dashed options. If there is a path traversal attack, the program's stdin will be closed without a path being written to it.
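A sketch of the OUTPUT exchange, under the assumption that the program simply gives up when stdin is closed. The function name request_output and the variable output_path are hypothetical.

```sh
# request_output NAME: announce an output file and read back the path,
# inside the temporary directory, that the program should write to.
# Sets output_path; returns 1 if stdin was closed instead of a path
# being provided, which is what happens when NAME is rejected.
request_output() {
    printf 'OUTPUT %s\n' "$1"
    IFS= read -r output_path || return 1
}
```

The program should then write to "$output_path" and never to the name it originally asked for, for example `request_output "$3" || exit 1` followed by a command that writes to "$output_path".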
The program must write a regular file to the output file. Symlinks or other special files will not be accepted as output files.
If git-annex sees that an output file is growing, it will use its file size when displaying progress to the user. So if possible, the program should write the content to the file it is computing directly, rather than writing to somewhere else and renaming it at the end. But, if the program seeks around and writes out of order, it should write to a file somewhere else and rename it at the end.
other messages
As well as INPUT and OUTPUT described above, there are some other
messages that the program can output. All of these are optional.
PROGRESS 50%
To indicate its current progress while performing the computation, the program can output lines like this. This is not needed if the program streams output to an output file.
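For a program that works through a known number of steps, the PROGRESS line could be produced by a helper like the one below. This is a sketch; report_progress is a hypothetical name, and whole-number percentages are an assumption based on the example line above.

```sh
# report_progress CURRENT TOTAL: print a PROGRESS line on stdout giving
# CURRENT out of TOTAL as a whole-number percentage (rounded down).
report_progress() {
    printf 'PROGRESS %d%%\n' $(( $1 * 100 / $2 ))
}
```

Calling `report_progress 5 10` prints `PROGRESS 50%`. The line goes to stdout along with the other protocol messages, not to stderr.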
REPRODUCIBLE
This indicates that the results of the computation are expected to be bit-for-bit reproducible. That makes git-annex addcomputed behave as if the --reproducible option is set.
SANDBOX
After outputting this line, the program can read a line from stdin that will be the path to the directory it should sandbox to (which corresponds to the top of the git repository, so may be above its working directory). Any INPUT lines that come after SANDBOX will have input files be provided via paths that are inside the sandbox directory. Usually that is done by making hard links, but it will fall back to copying annexed files if the filesystem does not support hard links.
INPUT-REQUIRED
This works the same as INPUT, except when git-annex addcomputed --fast is being used to add a computation to the git-annex repository without actually performing it, the input file will be provided as a response to this, rather than the empty line provided as a response to INPUT.
If the input file is not available for some reason, an empty line will still be provided as a response to this.
example
An example git-annex-compute-foo shell script follows:
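#!/bin/sh
set -e
# Only the "convert" subcommand is supported.
if [ "$1" != "convert" ]; then
    echo "Usage: convert input output [passes=n]" >&2
    exit 1
fi
# passes=n on the command line arrives as ANNEX_COMPUTE_passes; default to 1.
if [ -z "$ANNEX_COMPUTE_passes" ]; then
    ANNEX_COMPUTE_passes=1
fi
# Request the input, then read the path to its content from stdin.
echo "INPUT $2"
read -r input
# Declare the output, then read the path to write it to.
echo "OUTPUT $3"
read -r output
echo REPRODUCIBLE
# An empty input path means no computation should be performed.
if [ -n "$input" ]; then
    frobnicate --passes="$ANNEX_COMPUTE_passes" -i "$input" -o "$output" >&2
fi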