Speed up syncing of modified versions of existing files.
One simple way is to find the key of the old version of a file that's being transferred, so that the old content can be used as the basis for rsync, or any other similar delta-transfer protocol.
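A rough sketch of that lookup, assuming the file is checked into git as an annexed symlink (the `previous_key` helper below is hypothetical, not part of git-annex):

```python
# Sketch: look up the key of the previous version of an annexed file, so its
# content can serve as the rsync basis. Helper names here are hypothetical.
import os
import subprocess

def previous_key(path, revision="HEAD^"):
    """Return the annex key `path` pointed to at `revision`, or None."""
    try:
        # For an annexed symlink, the git blob is the symlink target, whose
        # basename is the key (e.g. SHA256E-s1048576--abcd...).
        target = subprocess.check_output(
            ["git", "cat-file", "blob", f"{revision}:{path}"],
            text=True,
        ).strip()
    except subprocess.CalledProcessError:
        return None
    key = os.path.basename(target)
    return key if "--" in key else None
```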
For remotes that don't use rsync, use a rolling-checksum-based chunker, such as BuzHash. This will produce variable-sized chunks, which can be stored on the remote as regular Keys -- where, unlike fixed-size chunk keys, the SHA256 part of each key is the checksum of the chunk it contains.
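As a minimal sketch of what such a chunker could look like (the window size, boundary mask, and random table below are arbitrary illustrative choices, and the key format just mimics git-annex's `SHA256-s<size>--<hash>` shape):

```python
# Sketch: content-defined chunking with a BuzHash-style rolling hash.
# WINDOW, MASK and TABLE are illustrative parameters, not git-annex's choices.
import hashlib
import random

WINDOW = 48            # bytes covered by the rolling hash
MASK = (1 << 13) - 1   # cut a chunk when (hash & MASK) == MASK (~8 KiB average)

random.seed(0)         # a real implementation would ship a fixed table
TABLE = [random.getrandbits(32) for _ in range(256)]

def _rotl32(x, n):
    n %= 32
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def chunk_boundaries(data):
    """Yield (start, end) offsets of content-defined chunks of data."""
    h = 0
    start = 0
    for i, byte in enumerate(data):
        h = _rotl32(h, 1) ^ TABLE[byte]
        if i >= WINDOW:
            # Drop the contribution of the byte that slid out of the window.
            h ^= _rotl32(TABLE[data[i - WINDOW]], WINDOW)
        if i + 1 - start >= WINDOW and (h & MASK) == MASK:
            yield (start, i + 1)
            start = i + 1
    if start < len(data):
        yield (start, len(data))

def chunk_keys(data):
    """Pair each chunk with a key whose SHA256 part is the chunk's checksum."""
    return [("SHA256-s%d--%s" % (e - s, hashlib.sha256(data[s:e]).hexdigest()),
             data[s:e])
            for s, e in chunk_boundaries(data)]
```

Because boundaries depend only on the bytes inside the rolling window, an insertion or deletion in the middle of a file only changes the chunks around the edit; the rest keep their keys.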
Once that's done, it's easy to avoid uploading chunks that have been sent to the remote before.
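Roughly, with the chunker above (the `remote` object and its `has_key`/`store`/`store_manifest` methods are hypothetical stand-ins for whatever interface the remote provides):

```python
# Sketch: only upload chunks the remote doesn't already have. The `remote`
# interface used here is hypothetical.
def upload_version(remote, file_key, data):
    chunks = chunk_keys(data)              # from the chunker sketch above
    for key, blob in chunks:
        if not remote.has_key(key):        # already stored earlier? skip it
            remote.store(key, blob)
    # Also record which chunk keys constitute this version of the file,
    # so a later download knows what to fetch.
    remote.store_manifest(file_key, [key for key, _ in chunks])
```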
When retrieving a new version of a file, there needs to be a way to get the list of chunk keys that constitute the new version; it's probably best to store this list on the remote. Then there needs to be a way to find which of those chunks are available in locally present files, so that the locally available chunks can be extracted and combined with the chunks that need to be downloaded, to reconstitute the file.
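A sketch of that reconstruction, reusing the hypothetical `remote` interface above and a `local_chunks` mapping from chunk key to locally extracted bytes:

```python
# Sketch: rebuild a new version from local chunks plus the ones that must be
# downloaded. `remote.get_manifest`/`remote.retrieve` are hypothetical.
def download_version(remote, file_key, local_chunks):
    manifest = remote.get_manifest(file_key)    # ordered list of chunk keys
    parts = []
    for key in manifest:
        if key in local_chunks:
            parts.append(local_chunks[key])     # reuse a locally present chunk
        else:
            parts.append(remote.retrieve(key))  # download only what's missing
    return b"".join(parts)
```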
To find which chunks are locally available, here are two ideas:
- Use a single basis file, e.g. an old version of the file. Re-chunk it, and use its chunks. Slow, but simple.
- Some kind of database of locally available chunks (see the sketch after this list). Would need to be kept up-to-date as files are added, and as files are downloaded.
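A minimal sketch of the second idea, keeping a small SQLite index of where each chunk can currently be found locally (the schema and helper names are illustrative only):

```python
# Sketch: a local index mapping chunk key -> (file, offset, length), updated
# whenever an annexed file becomes locally present. Schema is illustrative.
import hashlib
import sqlite3

def open_chunk_index(path="chunks.db"):
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS chunks "
               "(key TEXT PRIMARY KEY, file TEXT, offset INTEGER, length INTEGER)")
    return db

def index_file(db, path):
    """Re-chunk a locally present file and record where each chunk lives."""
    with open(path, "rb") as f:
        data = f.read()
    for start, end in chunk_boundaries(data):   # chunker sketch above
        key = "SHA256-s%d--%s" % (end - start,
                                  hashlib.sha256(data[start:end]).hexdigest())
        db.execute("INSERT OR REPLACE INTO chunks VALUES (?, ?, ?, ?)",
                   (key, path, start, end - start))
    db.commit()

def extract_chunk(db, key):
    """Return the chunk's bytes if some indexed local file still contains it."""
    row = db.execute("SELECT file, offset, length FROM chunks WHERE key = ?",
                     (key,)).fetchone()
    if row is None:
        return None
    path, offset, length = row
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)
```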
bup splits files into chunks based on rolling sums (https://github.com/apenwarr/bup/blob/master/DESIGN). For small changes to big files, this would improve not only transfer speed but also repository size. And git-annex already supports bup remotes. What's missing?
are there plans to have chunks stored in the regular backend storage?
i'm curious because one of my use cases is archiving websites, where we end up with lots of WARC files. those files are basically a bunch of files from the website gzipped together in a stream, which means that multiple crawls of the same website (or actually, of different websites) have lots of redundant data (e.g. jQuery.js). storing those files in git-annex is not very efficient, because that data is duplicated all over the place.
if the storage backend were chunked, there could be massive deduplication across those files... this is why i looked at the borg special remote: I figured that i could at least deduplicate on the remote side, but it would still be nice to have this built-in! -- anarcat
The bup special remote does exist, so if you use it, you can get efficient storage and transfer of related versions of files. It would probably be possible to make bup use the same git repo as git-annex, just storing its data in a separate branch, but I have not tried it.