This special remote type lets you store content in a remote of your own devising, configured via some simple hooks.
It's not recommended to use this remote type when another like rsync or directory will do. If your hooks are not carefully written, data could be lost.
If you're building a special remote for others to use, instead consider building an external special remote.
example
Here's a simple example that stores content on clay tablets. If you implement this example in the real world, I'd appreciate a tour next Apert! --Joey
# git config annex.cuneiform-store-hook 'tocuneiform < "$ANNEX_FILE" | tablet-writer --implement=stylus --title="$ANNEX_KEY" | tablet-proofreader | librarian --shelve --floor=$ANNEX_HASH_1 --shelf=$ANNEX_HASH_2'
# git config annex.cuneiform-retrieve-hook 'librarian --get --floor=$ANNEX_HASH_1 --shelf=$ANNEX_HASH_2 --title="$ANNEX_KEY" | tablet-reader --implement=coffee --implement=glasses --force-monastic-dedication | fromcuneiform > "$ANNEX_FILE"'
# git config annex.cuneiform-remove-hook 'librarian --get --floor=$ANNEX_HASH_1 --shelf=$ANNEX_HASH_2 --title="$ANNEX_KEY" | goon --hit-with-hammer'
# git config annex.cuneiform-checkpresent-hook 'librarian --find --force-distrust-catalog --floor=$ANNEX_HASH_1 --shelf=$ANNEX_HASH_2 --title="$ANNEX_KEY" --shout-title'
# git annex initremote library type=hook hooktype=cuneiform encryption=none
# git annex describe library "the reborn Library of Alexandria (upgrade to bronze plates pending)"
Can you spot the potential data loss bugs in the above simple example? (Hint: What happens when the tablet-proofreader exits nonzero?)
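The hint comes down to how a shell pipeline reports failure: its exit status is that of the last command, so a failure earlier in the pipe is masked. The following is a minimal sketch of that behaviour, not part of the hook interface. Here `false` stands in for a failing tablet-proofreader and `cat` for the librarian. It also assumes that `bash` is installed and that wrapping a hook in `bash -c 'set -o pipefail; ...'` is acceptable in your setup.

```shell
# `false` is a stand-in for a failing tablet-proofreader,
# `cat` is a stand-in for the last command in the pipe (the librarian).

# Plain sh: the pipeline's status is the status of the LAST command only.
plain_status=0
sh -c 'false | cat >/dev/null' || plain_status=$?

# bash with pipefail: the pipeline fails if ANY command in it fails.
strict_status=0
bash -c 'set -o pipefail; false | cat >/dev/null' || strict_status=$?

echo "without pipefail: exit status $plain_status"
echo "with pipefail:    exit status $strict_status"
```

With the plain pipeline, the store hook would report success even though proofreading failed, which is exactly the kind of situation in which content can later be dropped locally on the strength of a copy that was never stored properly.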
configuration
These parameters can be passed to git annex initremote:
hooktype
- Required. This specifies a collection of hooks to use for this remote.
encryption
- One of "none", "hybrid", "shared", or "pubkey". See encryption.
keyid
- Specifies the gpg key to use for encryption.
chunk
- Enables chunking when storing large files.
hooks
Each type of hook remote is specified by a collection of hook commands. Each hook command is run as a shell command line, and should return nonzero on failure, and zero on success.
These environment variables are used to communicate with the hook commands:
ANNEX_KEY
- name of a key to store, retrieve, remove, or check.
ANNEX_FILE
- a file containing the key's content
ANNEX_HASH_1
- short stable value, based on the key, can be used for hashing into 1024 buckets.
ANNEX_HASH_2
- another hash value, can be used for a second level of hashing
The settings to use in git config for the hook commands are as follows:
annex.$hooktype-store-hook
- Command run to store a key in the special remote. ANNEX_FILE contains the content to be stored.
annex.$hooktype-retrieve-hook
- Command run to retrieve a key from the special remote. ANNEX_FILE is a file that the retrieved content should be written to. The file may already exist with a partial copy of the content (or possibly just garbage), to allow for resuming of partial transfers.
annex.$hooktype-remove-hook
- Command to remove a key from the special remote.
annex.$hooktype-checkpresent-hook
- Command to check if a key is present in the special remote. Should output the key name to stdout, on its own line, if and only if the key has been actively verified to be present in the special remote (caching presence information is a very bad idea); all other output to stdout will be ignored.
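The checkpresent contract has three outcomes: present (print the key, exit zero), absent (print nothing, exit zero), and "could not check" (exit nonzero). As a rough sketch of that shape, the function below uses a local directory as a purely hypothetical backing store. The `STORE` variable and the `checkpresent` function name are illustrative assumptions, not anything git-annex defines.

```shell
# Hypothetical backing store: a local directory standing in for a real remote.
STORE=$(mktemp -d)

checkpresent() {
    # Exceptional condition: the store cannot be reached, so presence is unknown.
    [ -d "$STORE" ] || return 1
    # Look at the store itself every time; never answer from a cache.
    if [ -f "$STORE/$ANNEX_HASH_1/$ANNEX_HASH_2/$ANNEX_KEY" ]; then
        printf '%s\n' "$ANNEX_KEY"
    fi
    # Absent is not a failure: print nothing and still exit zero.
    return 0
}

# Made-up values for illustration; git-annex would set these itself.
ANNEX_KEY=demo-key
ANNEX_HASH_1=aa
ANNEX_HASH_2=bb

absent_out=$(checkpresent)                        # nothing stored yet
mkdir -p "$STORE/aa/bb" && : > "$STORE/aa/bb/demo-key"
present_out=$(checkpresent)                       # now the key is there
echo "absent: '$absent_out' present: '$present_out'"
```

Keeping "absent" and "could not check" apart matters because a nonzero exit tells git-annex the check itself failed, rather than that the key is missing.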
combined hook program
Rather than setting all of the above hooks, you can write a single program that handles everything, and set a single hook to make it be used.
# git config annex.demo-hook /usr/local/bin/annexdemo
# git annex initremote mydemorepo type=hook hooktype=demo encryption=none
But, doing that is deprecated -- it's better, and not much harder, to write an external special remote!
If you still want to do this, the program just needs to look at the ANNEX_ACTION environment variable to see what it's being asked to do.
For example:
#!/bin/sh
set -e
case "$ANNEX_ACTION" in
    store)
        demo-upload "$ANNEX_HASH_1/$ANNEX_HASH_2/$ANNEX_KEY" < "$ANNEX_FILE"
        ;;
    retrieve)
        demo-download "$ANNEX_HASH_1/$ANNEX_HASH_2/$ANNEX_KEY" > "$ANNEX_FILE"
        ;;
    remove)
        demo-nuke "$ANNEX_HASH_1/$ANNEX_HASH_2/$ANNEX_KEY"
        ;;
    checkpresent)
        if demo-exists "$ANNEX_HASH_1/$ANNEX_HASH_2/$ANNEX_KEY"; then
            echo "$ANNEX_KEY"
        fi
        ;;
    *)
        echo "unknown ANNEX_ACTION: $ANNEX_ACTION" >&2
        exit 1
        ;;
esac
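Before pointing git-annex at a combined hook, it can help to drive it by hand through a full store, checkpresent, retrieve, remove cycle. The sketch below does that with a variant of the dispatcher written as a shell function over a local directory. The directory store, the `annexdemo` function name, and the key and hash values are all stand-ins chosen for illustration; the `demo-*` commands above are replaced by `cp`, `rm`, and a file test.

```shell
STORE=$(mktemp -d)   # hypothetical backing store (a local directory)
WORK=$(mktemp -d)    # scratch space for the files being passed as ANNEX_FILE

# Same dispatch as the script above; runs in a subshell so `exit` and
# `set -e` do not affect the calling shell.
annexdemo() (
    set -e
    dest="$STORE/$ANNEX_HASH_1/$ANNEX_HASH_2/$ANNEX_KEY"
    case "$ANNEX_ACTION" in
        store)
            mkdir -p "$(dirname "$dest")"
            # Copy to a temporary name first, so a half-written file
            # is never mistaken for stored content.
            cp "$ANNEX_FILE" "$dest.tmp"
            mv "$dest.tmp" "$dest"
            ;;
        retrieve)
            cp "$dest" "$ANNEX_FILE"
            ;;
        remove)
            rm -f "$dest"
            ;;
        checkpresent)
            if [ -f "$dest" ]; then
                printf '%s\n' "$ANNEX_KEY"
            fi
            ;;
        *)
            echo "unknown ANNEX_ACTION: $ANNEX_ACTION" >&2
            exit 1
            ;;
    esac
)

# Made-up values; git-annex would set these itself.
ANNEX_KEY=demo-key; ANNEX_HASH_1=aa; ANNEX_HASH_2=bb
printf 'hello\n' > "$WORK/in"

ANNEX_ACTION=store;        ANNEX_FILE="$WORK/in";  annexdemo
ANNEX_ACTION=checkpresent; present=$(annexdemo)
ANNEX_ACTION=retrieve;     ANNEX_FILE="$WORK/out"; annexdemo
roundtrip=$(cat "$WORK/out")
ANNEX_ACTION=remove;       annexdemo
ANNEX_ACTION=checkpresent; gone=$(annexdemo)

echo "present='$present' roundtrip='$roundtrip' gone='$gone'"
```

A real hook would replace the directory operations with calls to the actual storage service, but the same hand-run cycle remains a cheap way to check each action's exit status and output before trusting it with data.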
Is there a way to use asynchronous remotes? Interaction with git annex would have to split the part of initiating some action from completing it.
I imagine I could git annex copy a file to an asynchronous remote and the command would almost immediately complete. Later I would learn that the transfer is completed, so the hook must be able to record that information in the git-annex branch. An additional plumbing command seems required here as well as a way to indicate that even though the store-hook completed, the file is not transferred.
Similarly git annex get would immediately return without actually fetching the file. This should already be possible by returning non-zero from the retrieve-hook. Later the hook could use plumbing level commands to actually stick the received file into the repository.
The remove-hook should need no changes, but the checkpresent-hook would be more like a trigger without any actual result. The extension of the plumbing required for the extension to the retrieve-hook could update the location log. A downside here is that you never know when a fsck has completed.
My proposal does not include a way to track the completion of actions, but relies on the hook to always complete them reliably. It is not clear that this is the best road for asynchronous hooks.
One use case for this would be a remote that is only accessible via uucp. Are there other use cases? Is the drafted interface useful?
Could you please include the original filename+path in the environment variables (next to ANNEX_KEY & ANNEX_FILE)? Like ANNEX_FILENAME and ANNEX_PATH.
Having this information in a hook would help, e.g., a flickr backend to be more useful. Tags would contain the ANNEX_KEY and the image title could be the original filename (ANNEX_FILENAME). Also, given the directory path (ANNEX_PATH) for the file, the uploading process could put images into the proper sets/collections. Voila, you have a "filesystem based" flickr image gallery (almost like flickrfs).
Other backends, like AmazonS3 having meta data also could benefit from this.
To build the Death Star further, an annex.$hooktype-sync-hook would instruct the backend to sync data, eg. place or move images/files in the proper image-sets/directories after they are moved/repositioned in git-annex, but that would be the backend's job, not git-annex's. Maybe the sync-hook would be called when git annex sync is called. This is just an idea.
While writing this, a new hook for sharing came into my mind: annex.$hooktype-share-hook. Calling this on a file/directory (git annex share my_image_to_share.jpg) would return a publicly shareable (short)url pointing to the file/directory. This would work for web-backends like AmazonS3, flickr, DropBox, Google Drive, ...
I have requested this before. But it doesn't seem to be entirely doable because some items may have multiple equally correct filenames/paths. And some items may have zero filenames/paths.
That said I hope a solution can be found because I really want this feature too. And would implement it in all my hooks.
And for some of the cases I don't really see it as an issue. If you have a public flickr repo with clean (unencrypted) files, it is because you want to access existing files. If an object has no filename/path, the hook could/would/should just ignore the file. Sure, this means no backup of old versions of files, but you can have other backends for those versions.
The bigger issue is with the same file existing in multiple places in the annex: which filename/path should be used? And the filename/path can change between syncs (if it is deleted from one of the positions). I personally still see this as being entirely doable. The key for downloading would always be the same, so the worst-case scenario is that the image may be duplicated on flickr, or that the picture is in only one of the multiple folders it should be in on flickr. Still, I see these issues as being minor, and usability would increase if this was implemented, even with these caveats.
And there probably are some issues I haven't realized or don't know about.
What value should be returned in the "checkpresent-hook" to signal that the given file does not exist in the given backend?
Should the called hook process return an exit code less or greater than zero? In this case the following is displayed:
This tells that the process failed (no internet connection or something that prevents the process from doing its job) and not that the result is false, which would mean the file/entry does not exist in the given backend. If the return code is zero the file is treated as an existing file/entry (no matter what I write to stderr).
Also I think, the "checkpresent" block misses the ending ;; in the example.
Here is my work-in-progress hook: https://gist.github.com/parhuzamos/31bf4516eea434e0d248
The checkpresent hook should always exit 0 unless there was an exceptional condition (eg, perhaps it cannot check if the file exists one way or the other). Like the documentation for it says, the important thing is what it outputs to stdout, which should contain the key name if it's present, and should not contain the key name if it's not present.
I hope you post to this website about your special remote when you get it fully working!
Roger that.
If this is acceptable: terminal output screenshot, then I'm almost done and will publish soon. (Of course a client using a REST API would be much better, but this is just the start.)
If I were you I'd suppress that "File not found" error.
Hook special remotes can output messages to stderr, and it's also fine to output, e.g., progress bars to stdout when sending/receiving files. But unnecessary cluttered output should be avoided.
In the current HEAD the "checkpresent" method in Hook.hs is missing a "return" while other hooks have a return value eg. Directory.hs, Rsync.hs, ...
I noticed that if my checkpresent hook does not output anything (to be exact: I commented out all the lines in the block) it behaves strangely. I copied the test file to "copyrepo", removed it by hand (so git-annex does not know about the change) and executed a fsck.
running the check again
and running the check again without --fast
It thinks the file is in the repo, but it is not.
The behavior you show with fsck --from is that the first time it's run against the damaged remote it notices the file is not present using the checkpresent hook. It then updates the location log. The subsequent times it's run, it sees that the location log says the file is not present in the remote. It verifies this is the case by calling the checkpresent hook. Since the two data sources agree, and numcopies is still satisfied, it prints "ok". There does not seem to be a bug here.
(return in Haskell does not do what you would expect to happen in a traditional imperative language. It does not alter control flow, and any function using return can be mechanically converted to one that does not use return.)