git-annex could use linux's fsverify feature as an alternative to hashing and verifying hashes of files itself.
Benefits would include:
- Any read of an annexed file that uses fsverify would check the blocks that are read, and the read would fail if the file had gotten corrupted.
- Avoiding any theoretical cases where
git-annex add
is hashing a file and something modifies it, causing the file to be added with the wrong hash (whichgit-annex fsck
will later detect). TheFS_IOC_ENABLE_VERITY
ioctl prevents anything else from possibly modifying the file while it's hashing it. - Slightly faster git-annex fsck, because it would not need to hash verified files. It would suffice to read the file, and if it all read successfully, it's valid!
Since fsverify uses a merkle tree, its hashes are not the same as simply using SHA on the whole file. So for git-annex to use the fsverify hash as the key for the file, it would need to be a separate type of key. That's a bit problimatic because then git-annex would need a way to verify that merkle hash itself on systems that do not support fsverify. Also, for large files, the merkle tree can get relatively large (1/127th the size of the file the docs say). So with a terabyte of annexed files, that's gigabytes of merkle hashes, which seems too large to want to stote them in git.
Alternatively, git-annex could hash as usual for the key. This would mean
that git-annex add
would hash a file twice, once for the git-annex key
and the second time calling the FS_IOC_ENABLE_VERITY
ioctl. Slower, but
perhaps these could parallelize and only use 2x the CPU or so.
Since fsverified files are readonly, this would only be useful for locked files. Unlocking a file would need to either remove the fsverify from it (if possible?) or copy it.
Using fsverify in this way would not work if the sysctl
fs.verity.require_signatures
is set, because the annexed files would
not have signatures.
Putting all this together, fsverify is not too compelling for use by
git-annex. A user who wants the verification on all reads of a file can
just call FS_IOC_ENABLE_VERITY
on it themselves after git-annex add.
The annex.freezecontent-command hook could be used to to that.
Then the only benefit of supporting it in git-annex is that perhaps git-annex
add
could parallize enabling verification with checksumming, or avoid its
own checksumming, and so run faster than if a hook were used to enable
fsverify. And fsck would use less CPU. Is that worth complicating git-annex for?
--Joey
After investigating that, I currently don't think it's compelling, so I'm gonna close this. done --Joey