It would be good if one could define custom external backends, the way one can define custom external remotes. This would address the "consider meow backend" todo and have other uses as well. For instance, sometimes files contain details irrelevant to the file's semantics (e.g. comments) that nonetheless change the file's checksum; with a custom backend, one could "canonicalize" a file before computing the checksum.
@joey pointed out a potential problem: "needing to deal with the backend being missing or failing to work could have wide repercussions in the code base." I wonder if there are ways around that. Suppose you specified a default backend to use in case a custom one was unavailable? Then you could always compute a key from a file, even if it's not in the right backend. And once a key is stored in git-annex, most of git-annex treats the key as just a string. If the custom backend supports checksum verification, then without the backend's implementation, keys from that backend would be treated like WORM/URL keys, which do not support checksum checking.
Thoughts?
The main operations on keys are:

* generating a key for a file
* verifying that a file's content matches a key
* checking whether a key is stable
* checking whether a key uses a cryptographically secure hash

Note that the last two of those are not currently part of the Backend object but would need to be moved to it to implement external backends.
Now, if an external backend is not installed or is broken, how would those operations' failures be handled?
fsck and reinject would also fail, so if a repo contained content using an external backend and the program were broken, fsck could move all the content to .git/annex/bad/
So I think that all these operations except for generate would need their type changed from Bool to Maybe Bool, and all code that uses them would need to handle the case where the operation could not be performed.
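A sketch of what that type change means for callers (illustrative Python, not git-annex's actual Haskell code; function names here are hypothetical): a result of None stands in for the Maybe Bool "Nothing" case, where the check could not be performed at all.

```python
from typing import Callable, Optional

# An operation whose Bool result becomes Optional[bool]: None means the
# external backend was unavailable, so no verification happened at all.
def verify_key_content(verifier: Optional[Callable[[], bool]]) -> Optional[bool]:
    if verifier is None:          # external backend missing or broken
        return None
    return verifier()             # backend present: a real True/False answer

# Every caller now has to handle the third case explicitly:
def fsck_action(result: Optional[bool]) -> str:
    if result is None:
        return "skip"             # can't verify; don't move content to bad/
    return "ok" if result else "move-to-bad"
```

The point of the three-way result is exactly the fsck problem above: "could not check" must not be conflated with "checked and failed".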
For fsck, it might need to start checking files it had previously moved to bad, and move them back into the annex if they now verify due to the external backend having gotten fixed.
For downloads, if it can't verify, it might move to some holding location, and avoid re-downloading objects that are already located there. This would add more problems though; what if all the objects in that holding location use too much disk space? Should drop also drop them?
Or the external backend could be tested somehow before starting a download (or a fsck) and skip its key if it's not installed or broken. But such a test can't catch every possible breakage. It's hard to see how such a test can do more than checking that it can start the program and that the program seems to be speaking the right protocol version.
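Such a pre-flight test might look roughly like this. This is a hedged Python sketch: the GETVERSION/"VERSION 1" message pair is an assumption modeled on the external special remote protocol, not the actual draft backend protocol. It illustrates the limit described above: the probe can catch "not installed" and "wrong protocol", but nothing that breaks later.

```python
import subprocess

def backend_seems_usable(program: str) -> bool:
    """Probe whether an external backend program starts and speaks a
    known protocol version (handshake messages are assumed, see above)."""
    try:
        proc = subprocess.Popen([program], stdin=subprocess.PIPE,
                                stdout=subprocess.PIPE,
                                stderr=subprocess.DEVNULL, text=True)
    except OSError:
        return False                        # not installed / not executable
    try:
        out, _ = proc.communicate("GETVERSION\n", timeout=5)
    except subprocess.TimeoutExpired:
        proc.kill()
        return False                        # hung instead of answering
    first = out.splitlines()[0].strip() if out else ""
    return first == "VERSION 1"             # speaks a version we understand
```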
I suppose we could say that, if an external backend works to that extent but then breaks, the resulting bad behavior is its fault and not really the concern of git-annex. Or I could rationalize that an external special remote can break in ways that, eg, prevent content being moved to it, so it piles up in the local repo and uses too much disk space, and that this new potential breakage is not much worse -- though the scenarios where it would have an adverse effect seem likely to be more common.
Occurs to me that the same problems discussed above can also happen when a new hash is added to git-annex and used in a repo that then gets used by an older version of git-annex. The blake2 hashes are one such case. So, arguably, external backends shouldn't really be blocked by that bad behavior.
Thanks a lot @joeyh for looking again at supporting external backends.
I think the issues with a missing backend implementation are similar to the issues with a missing external special remote implementation: something that the repo owner/maintainer needs to deal with. I periodically forget to put mine into the PATH, and then git-annex tells me the special remote is not available. Fixing these issues has been manageable in practice.

For reference, these are some of the todos that support for external backends would obviate: "key checksum from chunk checksums", "MD5E keys without file size", "preserve file extensions in WORM and URL keys", "option to add user-specified string to key", "add xxHash backend".
P.S. It seems that in the case of a missing backend implementation, the handling should revert to exactly what currently happens for WORM/URL keys? Except with a different warning message, since custom backends are potentially verifiable.
If git-annex could support alternate keys for same content, then for a download from an external backend key you could also compute a standard checksum-based key (e.g. MD5 or SHA256), and record its presence in the remote you got the contents from, and the correspondence to the external backend key. Then, at least, if this contents gets uploaded somewhere else, it would be verifiable even without the external backend implementation.
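A minimal sketch of that idea (helper names are hypothetical; the SHA256-s&lt;size&gt;--&lt;digest&gt; layout matches git-annex's checksum-based keys):

```python
import hashlib

def sha256_key(content: bytes) -> str:
    """Standard checksum-based alternate key for some content."""
    digest = hashlib.sha256(content).hexdigest()
    return f"SHA256-s{len(content)}--{digest}"

# A mapping from external-backend keys to standard alternate keys could
# then be recorded (e.g. in the git-annex branch) at download time:
alternate_keys = {}

def record_alternate(external_key: str, content: bytes) -> None:
    alternate_keys[external_key] = sha256_key(content)
```

With that correspondence recorded, content uploaded elsewhere stays verifiable via the SHA256 key even when the external backend program is unavailable.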
Perhaps, analogously to annex.security.allow-unverified-downloads, the user could enable verification by an external backend implementation which the user trusts.

It seems reasonable to assume the user trusts the backend program as much as they do the git-annex program, when it comes to whether a hash is cryptographically secure. They're both programs the user has decided to use, and either could do far more mischief than pretending that md5 is secure.
The suggestion that this could be used for the "option to add user-specified string to key" todo raises its own security concerns. (Although git's sha1 collision hardening will probably survive until git moves to sha256, so git-annex's attempts to prevent sha1 collisions via user-supplied data in the content of keys are probably unnecessary.)
Seems that this has a naming problem. Each backend needs a unique name, and the name has to be short enough not to make the key length excessive. The longest backend name in git-annex is "BLAKE2SP256E" (12 characters). A UUID seems too long, and a domain name is probably too long as well.
If two external backends picked the same name and the wrong one got installed, bad things could happen, like failing to verify content because it used an unexpected hash.
Maybe just have a name registry on this site, first come first served, and if you choose to overlap, you get to keep all 100 pieces?
(Note that external special remote programs must have unique names too, which does not seem to have been a problem in practice.)
"the user trusts the backend program as much as they do the git-annex program, when it comes to whether a hash is cryptographically secure" -- I'd trust git-annex more because it has many more users than any one niche backend program, so gets more testing and scrutiny.
"just have a name registry on this site, first come first served" -- seems fair; there may be more complex/robust solutions but doubt they're needed in practice.
Wrote a draft external backend protocol.
I wonder if it makes sense to require the programs to format and parse their own keys; git-annex could instead break up the key and send the pieces in. The advantage of the current approach, though, is that it lets a program decide whether or not to include information like the size and mtime fields in the key. And if more fields ever got added, the protocol would not need changes. Formatting and parsing are simple enough anyway, as shown by the example shell program that does it.
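As a rough illustration of how little work that is (Python sketch covering only the size and mtime fields, following git-annex's BACKEND-sSIZE-mMTIME--NAME key layout):

```python
def format_key(backend, name, size=None, mtime=None):
    """Assemble a key string from its fields; size/mtime are optional."""
    fields = [backend]
    if size is not None:
        fields.append(f"s{size}")
    if mtime is not None:
        fields.append(f"m{mtime}")
    return "-".join(fields) + "--" + name

def parse_key(key):
    """Split a key back into backend name, optional fields, and key name."""
    fields, name = key.split("--", 1)
    backend, *rest = fields.split("-")
    info = {"backend": backend, "name": name}
    for f in rest:
        if f.startswith("s"):
            info["size"] = int(f[1:])
        elif f.startswith("m"):
            info["mtime"] = int(f[1:])
    return info
```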
Moved isCryptographicallySecure into the Backend data structure. And it looks at whether verifyKeyContent is Nothing to decide if a given key is verifiable.
If an external backend is not installed at all (or fails to start up correctly or speaks an unknown protocol version), what can be done is make a Backend data structure where genKey is Nothing, verifyKeyContent is Nothing, isCryptographicallySecure is False, and isStableKey is False. When annex.verify=true, git-annex will refuse to download such keys, but that can be changed if necessary. (annex.securehashesonly too)
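In rough Python terms (illustrative only; git-annex's Backend is a Haskell record), that fallback structure looks like:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Backend:
    name: str
    gen_key: Optional[Callable]             # None: can't generate keys
    verify_key_content: Optional[Callable]  # None: keys aren't verifiable
    is_cryptographically_secure: bool
    is_stable_key: bool

def missing_external_backend(name: str) -> Backend:
    """Fallback Backend for an uninstalled or broken external backend."""
    return Backend(name=name, gen_key=None, verify_key_content=None,
                   is_cryptographically_secure=False, is_stable_key=False)
```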
The isStableKey False will prevent chunking the key when storing on special remotes, but it can still be stored on them in unchunked form, the same as is done for URL keys. So, this seems like a reasonable enough fallback mode, although something will need to be done to warn the user about it. (Alternatively, could require that external backends generate stable keys, but that seems like it might get in the way of some things people might want to do with them.)
If an external backend is broken and replies to VERIFYKEYCONTENT with ERROR, or crashes, downloaded content would get thrown away when it fails to verify, as I discussed above.
Some questions about the draft protocol:

* Why a separate VERIFYCONTENT request, vs calling GENKEY and comparing the result?
* Could the file passed to GENKEY be a named pipe? Or, add a CANPIPE request where the external backend program tells git-annex that it can take pipes; if the program can't, git-annex can always drain the pipe to a tempfile before passing it to the program.
* DEBUG and INFO requests, as in the external special remote protocol?

On verification vs GENKEY: comparing keys gains nothing for backends using a hash. However, if the backend is using something other than a hash, or a hash combined with something else, it might not be able to regenerate the same key. It may still be able to detect corrupted content, eg using the hash part of the key.
I can't think of any situation where git-annex would GENKEY before it has the full content of a file available.
For INFO I'd rather wait for a use case. None of the current backends ever need to display any messages, except in the case of an exceptional error, eg a hardware failure while hashing. And ERROR would be fine for that.
DEBUG sure.
Pipe support could also enable things like git fsck --from REMOTE without the data ever touching the disk.
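The drain-to-a-tempfile fallback for programs that can't take pipes could be sketched like this (hypothetical function name, Python for illustration):

```python
import os
import shutil
import stat
import tempfile

def path_for_backend(path: str, backend_can_pipe: bool) -> str:
    """Return a path the backend program can read: the original path if it
    is a regular file or the program accepts pipes, otherwise a tempfile
    holding the fully drained pipe contents."""
    mode = os.stat(path).st_mode
    if backend_can_pipe or not stat.S_ISFIFO(mode):
        return path                          # usable as-is
    fd, tmp = tempfile.mkstemp(prefix="annex-drain-")
    with os.fdopen(fd, "wb") as out, open(path, "rb") as pipe:
        shutil.copyfileobj(pipe, out)        # drain the pipe completely
    return tmp
```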