It would be good if one could define custom external backends, the way one can define custom external remotes. This would address the "consider meow backend" todo and have other uses as well. For instance, sometimes files contain details irrelevant to the file's semantics (e.g. comments) that nonetheless change the file's checksum; with a custom backend, one could "canonicalize" a file before computing the checksum.
@joey pointed out a potential problem: "needing to deal with the backend being missing or failing to work could have wide repercussions in the code base." I wonder if there are ways around that. Suppose you specified a default backend to use in case a custom one was unavailable? Then you could always compute a key from a file, even if it's not in the right backend. And once a key is stored in git-annex, most of git-annex treats the key as just a string. If the custom backend supports checksum verification, then without the backend's implementation, keys from that backend would be treated like WORM/URL keys, which do not support checksum checking.
Thoughts?
The main operations on keys are:

* generating a key for a file
* verifying that a file's content matches a key
* checking whether a key is stable
* checking whether a key uses a cryptographically secure hash

Note that the last two of those are not currently part of the Backend object but would need to be moved to it to implement external backends.
Now, if an external backend is not installed or is broken, how would those operations' failures be handled?
fsck and reinject would also fail, so if a repo contained content using an external backend and the program were broken, fsck could move all the content to .git/annex/bad/
So I think that all these operations except for generate would need their type changed from Bool to Maybe Bool, and all code that uses them would need to handle the case where the operation could not be performed.
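A sketch of what that type change means for callers (illustrative Python, not git-annex's actual Haskell code; function names here are hypothetical): a result of None stands in for the Maybe Bool "Nothing" case, where the check could not be performed at all.

```python
from typing import Callable, Optional

# An operation whose Bool result becomes Optional[bool]: None means the
# external backend was unavailable, so no verification happened at all.
def verify_key_content(verifier: Optional[Callable[[], bool]]) -> Optional[bool]:
    if verifier is None:          # external backend missing or broken
        return None
    return verifier()             # backend present: a real True/False answer

# Every caller now has to handle the third case explicitly:
def fsck_action(result: Optional[bool]) -> str:
    if result is None:
        return "skip"             # can't verify; don't move content to bad/
    return "ok" if result else "move-to-bad"
```

The point of the three-way result is exactly the fsck problem above: "could not check" must not be conflated with "checked and failed".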
For fsck, it might need to start checking files it had previously moved to bad, and move them back into the annex if they now verify due to the external backend having gotten fixed.
For downloads, if it can't verify, it might move to some holding location, and avoid re-downloading objects that are already located there. This would add more problems though; what if all the objects in that holding location use too much disk space? Should drop also drop them?
Or the external backend could be tested somehow before starting a download (or a fsck) and skip its key if it's not installed or broken. But such a test can't catch every possible breakage. It's hard to see how such a test can do more than checking that it can start the program and that the program seems to be speaking the right protocol version.
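Such a pre-flight test might look roughly like this. This is a hedged Python sketch: the GETVERSION/"VERSION 1" message pair is an assumption modeled on the external special remote protocol, not the actual draft backend protocol. It illustrates the limit described above: the probe can catch "not installed" and "wrong protocol", but nothing that breaks later.

```python
import subprocess

def backend_seems_usable(program: str) -> bool:
    """Probe whether an external backend program starts and speaks a
    known protocol version (handshake messages are assumed, see above)."""
    try:
        proc = subprocess.Popen([program], stdin=subprocess.PIPE,
                                stdout=subprocess.PIPE,
                                stderr=subprocess.DEVNULL, text=True)
    except OSError:
        return False                        # not installed / not executable
    try:
        out, _ = proc.communicate("GETVERSION\n", timeout=5)
    except subprocess.TimeoutExpired:
        proc.kill()
        return False                        # hung instead of answering
    first = out.splitlines()[0].strip() if out else ""
    return first == "VERSION 1"             # speaks a version we understand
```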
I suppose we could say that, if an external backend works to that extent but then breaks, the resulting bad behavior is its fault and not really the concern of git-annex. Or I could rationalize that an external special remote can break in ways that, eg, prevent content being moved to it, so it piles up in the local repo and uses too much disk space, and that this new potential breakage is not much worse -- though the scenarios where it would have an adverse effect seem likely to be more common.
Occurs to me that the same problems discussed above can also happen when a new hash is added to git-annex and used in a repo that then gets used by an older version of git-annex. The blake2 hashes are one such case. So, arguably, external backends shouldn't really be blocked by that bad behavior.
Thanks a lot @joeyh for looking again at supporting external backends.
I think the issues with a missing backend implementation are similar to the issues with a missing external special remote implementation: something that the repo owner/maintainer needs to deal with. I periodically forget to put mine into the PATH, and then git-annex tells me the special remote is not available. Fixing these issues has been manageable in practice.

For reference, these are some of the todos that support for external backends would obviate: "key checksum from chunk checksums", "MD5E keys without file size", "preserve file extensions in WORM and URL keys", "option to add user-specified string to key", "add xxHash backend".
P.S. It seems that in the case of a missing backend implementation, the handling should revert to exactly what currently happens for WORM/URL keys? Except with a different warning message, since custom backends are potentially verifiable.
If git-annex could support alternate keys for same content, then for a download from an external backend key you could also compute a standard checksum-based key (e.g. MD5 or SHA256), and record its presence in the remote you got the contents from, and the correspondence to the external backend key. Then, at least, if this contents gets uploaded somewhere else, it would be verifiable even without the external backend implementation.
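A minimal sketch of that idea (helper names are hypothetical; the SHA256-s&lt;size&gt;--&lt;digest&gt; layout matches git-annex's checksum-based keys):

```python
import hashlib

def sha256_key(content: bytes) -> str:
    """Standard checksum-based alternate key for some content."""
    digest = hashlib.sha256(content).hexdigest()
    return f"SHA256-s{len(content)}--{digest}"

# A mapping from external-backend keys to standard alternate keys could
# then be recorded (e.g. in the git-annex branch) at download time:
alternate_keys = {}

def record_alternate(external_key: str, content: bytes) -> None:
    alternate_keys[external_key] = sha256_key(content)
```

With that correspondence recorded, content uploaded elsewhere stays verifiable via the SHA256 key even when the external backend program is unavailable.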
Perhaps, analogously to annex.security.allow-unverified-downloads, the user could enable verification by an external backend implementation which the user trusts.

It seems reasonable to assume the user trusts the backend program as much as they do the git-annex program, when it comes to whether a hash is cryptographically secure. They're both programs the user has decided to use, and either could do far more mischief than pretending that md5 is secure.
The suggestion that this could be used for the "option to add user-specified string to key" todo raises its own security concerns. (Although git's sha1 collision hardening will probably survive until git moves to sha256, so git-annex's attempts to prevent sha1 collisions via user-supplied data in the content of keys are probably unnecessary.)
Seems that this has a naming problem. Each backend needs a unique name, and the name has to be short enough not to make the key length excessive. The longest backend name in git-annex is "BLAKE2SP256E" (12 characters). A UUID seems too long, and a domain name is probably too long as well.
If two external backends picked the same name and the wrong one got installed, bad things could happen, like failing to verify content because it used an unexpected hash.
Maybe just have a name registry on this site, first come first served, and if you choose to overlap, you get to keep all 100 pieces?
(Note that external special remote programs must have unique names too, which does not seem to have been a problem in practice.)
"the user trusts the backend program as much as they do the git-annex program, when it comes to whether a hash is cryptographically secure" -- I'd trust git-annex more because it has many more users than any one niche backend program, so gets more testing and scrutiny.
"just have a name registry on this site, first come first served" -- seems fair; there may be more complex/robust solutions but doubt they're needed in practice.
Wrote a draft external backend protocol.
I wonder if it makes sense to require the programs to format and parse their own keys; git-annex could instead break up the key and send the pieces in. The advantage of the current approach, though, is that it lets a program decide whether or not to include information like the size and mtime fields in the key. And if more fields ever got added, the protocol would not need changes. Formatting and parsing are simple enough anyway, as shown by the example shell program that does it.
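As a rough illustration of how little work that is (Python sketch covering only the size and mtime fields, following git-annex's BACKEND-sSIZE-mMTIME--NAME key layout):

```python
def format_key(backend, name, size=None, mtime=None):
    """Assemble a key string from its fields; size/mtime are optional."""
    fields = [backend]
    if size is not None:
        fields.append(f"s{size}")
    if mtime is not None:
        fields.append(f"m{mtime}")
    return "-".join(fields) + "--" + name

def parse_key(key):
    """Split a key back into backend name, optional fields, and key name."""
    fields, name = key.split("--", 1)
    backend, *rest = fields.split("-")
    info = {"backend": backend, "name": name}
    for f in rest:
        if f.startswith("s"):
            info["size"] = int(f[1:])
        elif f.startswith("m"):
            info["mtime"] = int(f[1:])
    return info
```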
Moved isCryptographicallySecure into the Backend data structure. And it looks at whether verifyKeyContent is Nothing to decide if a given key is verifiable.
If an external backend is not installed at all (or fails to start up correctly or speaks an unknown protocol version), what can be done is make a Backend data structure where genKey is Nothing, verifyKeyContent is Nothing, isCryptographicallySecure is False, and isStableKey is False. When annex.verify=true, git-annex will refuse to download such keys, but that can be changed if necessary. (annex.securehashesonly too)
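In rough Python terms (illustrative only; git-annex's Backend is a Haskell record), that fallback structure looks like:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Backend:
    name: str
    gen_key: Optional[Callable]             # None: can't generate keys
    verify_key_content: Optional[Callable]  # None: keys aren't verifiable
    is_cryptographically_secure: bool
    is_stable_key: bool

def missing_external_backend(name: str) -> Backend:
    """Fallback Backend for an uninstalled or broken external backend."""
    return Backend(name=name, gen_key=None, verify_key_content=None,
                   is_cryptographically_secure=False, is_stable_key=False)
```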
The isStableKey False will prevent chunking the key when storing on special remotes, but it can still be stored on them in unchunked form, the same as is done for URL keys. So, this seems like a reasonable enough fallback mode, although something will need to be done to warn the user about it. (Alternatively, could require that external backends generate stable keys, but that seems like it might get in the way of some things people might want to do with them.)
If an external backend is broken and replies to VERIFYKEYCONTENT with ERROR, or crashes, downloaded content would get thrown away when it fails to verify, as I discussed above.
Some questions about the draft protocol:

* Why a separate VERIFYCONTENT request, vs calling GENKEY and comparing the result?
* Could the file passed to GENKEY be a named pipe? Or, add a CANPIPE request where the external backend program tells git-annex that it can take pipes; if the program can't, git-annex can always drain the pipe to a tempfile before passing it to the program.
* DEBUG and INFO requests, as in the external special remote protocol?

On verification vs GENKEY: comparing keys gains nothing for backends using a hash. However, if the backend is using something other than a hash, or a hash combined with something else, it might not be able to regenerate the same key. It may still be able to detect corrupted content, eg using the hash part of the key.
I can't think of any situation where git-annex would GENKEY before it has the full content of a file available.
For INFO I'd rather wait for a use case. None of the current backends ever need to display any messages, except in the case of an exceptional error, eg a hardware failure while hashing. And ERROR would be fine for that.
DEBUG sure.
Pipe support could also enable things like git fsck --from REMOTE without the data ever touching the disk.
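The drain-to-a-tempfile fallback for programs that can't take pipes could be sketched like this (hypothetical function name, Python for illustration):

```python
import os
import shutil
import stat
import tempfile

def path_for_backend(path: str, backend_can_pipe: bool) -> str:
    """Return a path the backend program can read: the original path if it
    is a regular file or the program accepts pipes, otherwise a tempfile
    holding the fully drained pipe contents."""
    mode = os.stat(path).st_mode
    if backend_can_pipe or not stat.S_ISFIFO(mode):
        return path                          # usable as-is
    fd, tmp = tempfile.mkstemp(prefix="annex-drain-")
    with os.fdopen(fd, "wb") as out, open(path, "rb") as pipe:
        shutil.copyfileobj(pipe, out)        # drain the pipe completely
    return tmp
```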