Currently, when fsck'ing a remote, files are first downloaded to a temporary file locally, decrypted if needed, and finally digested; the temporary file is then either thrown away or quarantined, depending on the value of that digest.
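Roughly, the current behavior amounts to something like this (a sketch only, not the actual git-annex internals; the paths are made up, and $key and $expected_digest stand for values derived from the annex key):

    tmp=$(mktemp)
    rsync "remote:annex/objects/$key" "$tmp"            # download to a temp file
    gpg --batch --decrypt "$tmp" > "$tmp.plain"         # decrypt if needed
    echo "$expected_digest  $tmp.plain" | sha256sum -c  # digest and compare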
While this approach works with any kind of remote, in the particular case where the user is allowed to execute a digest command on the remote, one could avoid cluttering the network and digest the file remotely. I propose adding a per-remote git option, annex-remote-fsck, to switch between the two behaviors.
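The remote variant would boil down to something like the following (again just a sketch: it assumes shell access to the remote and a digest command there, and the option's exact name and semantics are only my proposal):

    # digest the object where it lives, instead of downloading it
    remote_digest=$(ssh remote "sha256sum annex/objects/$key" | cut -d' ' -f1)
    [ "$remote_digest" = "$expected_digest" ] && echo ok || echo corrupt

    # the proposed per-remote switch
    git config remote.myremote.annex-remote-fsck true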
There is an issue with encrypted special remotes, though. As hinted at here, since the digest of a ciphertext can't in general be deduced from that of the plaintext, one would need, before sending an encrypted file to such a remote, to digest it and store that digest somewhere (together with the ciphertext's size and perhaps other meta-information).
The usual directory structure (.../.../{backend}-s{size}--{digest}.log) seems perfectly suitable for storing this information. Lines there would look like {timestamp}s {numcopy} {UUID} {remote digest}. Of course, this implies that remote digest commands are trustworthy (are doing the right thing), and that the digest output is not tampered with by others who have access to the git repo. But that's outside the current threat model, I guess.
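For illustration, an entry recording one copy on a remote might look like this (all values made up):

    1317929100.012345s 1 e605dca6-446a-11e0-8b2a-002170d25c55 5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03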
Actually, since git-annex always includes an MDC in the ciphertexts, we could do something clever and even avoid running a digest algorithm. According to the OpenPGP standard, the MDC is essentially a SHA-1 hash of the plaintext. I'm still investigating whether it's even possible, but in theory it would be enough (with non-chained ciphers at least) to download a few bytes from the encrypted remote, decrypt those bytes to retrieve the hash, and compare that hash with the known value. Of course there is a downside here, namely that files tampered with anywhere but in the MDC packet would not be detected by fsck (but gpg will warn when decrypting the file).
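For what it's worth, gpg's own MDC verification already catches tampering when the whole ciphertext is decrypted; the partial-download trick above would only be an optimization of this, if it works at all:

    # decrypt and discard the plaintext; gpg exits non-zero on a bad MDC
    if gpg --batch --decrypt ciphertext.gpg > /dev/null 2>&1; then
        echo "integrity ok"
    else
        echo "gpg reported an error (possibly a manipulated message)"
    fi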
My 2 cents. Is there something I missed? I suppose there was a reason to perform fsck locally in the first place...
The only reason fsck is done locally for remotes is ease of implementation and it being a generic operation that supports any kind of special remote.
It seems that the only types of remotes where a remote fsck is a possibility are some rsync remotes and git remotes. Git remotes already have git-annex installed, so the fsck could be run locally on the remote system using it.
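For example, with shell access to a git remote, something like this would run the check where the data actually lives (path made up):

    ssh remote "cd /srv/repo && git annex fsck"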
I don't know if I see a benefit with the MDC check. Any non-malicious data corruption on the remote is likely to affect the body of the file and not the small portion that holds the MDC. So checking the MDC does not seem much better than the current existence check done by git annex fsck --fast --from remote.

As for storing the remote digest on the git-annex branch, my initial reaction was just that it's potentially a lot of bloat. Thinking about it some more, when using non-shared encryption, there is currently no way, given just a clone of a git repository, to match up files in git with encrypted objects stored on a special remote. So storing the remote digest might be considered to weaken the security.
Oh yeah, the MDC paragraph was pretty much pointless indeed. Oops.
I agree that this would potentially add some noise to the index, and weaken the security, but depending on the threat model and people's preferences it's an option worth considering IMHO.
git annex fsck does not use it; only the assistant does.