Hey, I'm not sure whether todo is the right place for feature requests.
When breaking files into chunks, it would be nice to have an option to hash the chunks based on their own content, rather than the content they are a part of. This would provide for deduplicating data.
I'm using git-annex with an append-only log, where this would avoid the need to break the log into separate files just to avoid reuploading content. Many separate files eventually tax git's tree size.
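To make the request concrete, here is a minimal sketch of what I mean by hashing chunks on their own content. It is not git-annex code: `chunk_keys` and `new_chunks` are hypothetical helper names, and plain SHA-256 hex digests stand in for real keys.

```python
import hashlib


def chunk_keys(data: bytes, chunk_size: int) -> list[str]:
    """Split data into fixed-size chunks and name each chunk by the
    SHA-256 of its own bytes (not by the key of the whole file)."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def new_chunks(old: bytes, new: bytes, chunk_size: int) -> list[str]:
    """Chunk keys of `new` that are not already present for `old`,
    i.e. the only chunks that would need uploading."""
    seen = set(chunk_keys(old, chunk_size))
    return [k for k in chunk_keys(new, chunk_size) if k not in seen]
```

With an append-only log, every full chunk of the old version keeps the same key in the new version, so only the last partial chunk and the appended data would be uploaded.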
If alternate keys for same content were implemented, git-annex could store chunk content using regular keys (providing deduplication) and mark the chunk keys as alternate keys for the same content.
I've also wanted this option, for a separate reason: to be able to quickly import files from a special remote that stores files in separate parts, and lets you read the checksums of the parts without downloading the files.
I had thought of implementing a custom special remote that would store, for a given key A, the key B of a manifest file that listed the keys C1, ..., Cn of chunks of the content of key A.
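Roughly, the manifest idea would look like the sketch below. Everything here is an assumption for illustration: `ManifestStore` is a hypothetical in-memory stand-in for the remote, keys are plain SHA-256 hex digests, and the manifest is just a newline-separated list of chunk keys.

```python
import hashlib


class ManifestStore:
    """Hypothetical in-memory stand-in for a special remote."""

    def __init__(self, chunk_size: int):
        self.chunk_size = chunk_size
        self.objects = {}    # content-addressed store: key -> bytes
        self.manifests = {}  # key A -> key B of its manifest

    def _put(self, content: bytes) -> str:
        key = hashlib.sha256(content).hexdigest()
        self.objects.setdefault(key, content)  # identical content is stored once
        return key

    def store(self, content: bytes) -> str:
        # Store chunks C1..Cn under keys derived from their own content.
        chunk_keys = [
            self._put(content[i:i + self.chunk_size])
            for i in range(0, len(content), self.chunk_size)
        ]
        # Store the manifest (key B) listing the chunk keys in order.
        manifest_key = self._put("\n".join(chunk_keys).encode())
        key_a = hashlib.sha256(content).hexdigest()
        self.manifests[key_a] = manifest_key
        return key_a

    def retrieve(self, key_a: str) -> bytes:
        manifest = self.objects[self.manifests[key_a]].decode()
        return b"".join(self.objects[k] for k in manifest.splitlines())
```

Storing `b"abcdabcdxy"` with a chunk size of 4 keeps only three objects: the chunk `abcd` once, the chunk `xy`, and the manifest.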
Hashing fixed-size chunks does not get very far toward deduplication; it really needs a rolling hash.
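The problem with fixed-size chunks is that inserting a single byte near the start of a file shifts every later chunk boundary, so no later chunk matches anymore. A rolling hash instead picks boundaries from the content itself. The following is only a toy sketch of content-defined chunking with a polynomial rolling hash; the constants (`WINDOW`, `BASE`, `MOD`, `MASK`, `MIN_CHUNK`) and the function name `cdc_chunks` are arbitrary choices for illustration, not anything git-annex uses.

```python
WINDOW = 16            # bytes covered by the rolling hash
BASE = 257
MOD = (1 << 31) - 1
MASK = (1 << 6) - 1    # cut when the low 6 bits are all set (about 1 position in 64)
MIN_CHUNK = 32         # never cut before this many bytes; must be >= WINDOW


def cdc_chunks(data: bytes) -> list[bytes]:
    """Split data at positions chosen by a rolling hash of the last WINDOW bytes."""
    out_pow = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        if i - start >= WINDOW:
            # Slide the window: remove the oldest byte's contribution.
            h = (h - data[i - WINDOW] * out_pow) % MOD
        h = (h * BASE + b) % MOD
        if i - start + 1 >= MIN_CHUNK and (h & MASK) == MASK:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk without a natural boundary
    return chunks
```

Because each cut decision depends only on the preceding `WINDOW` bytes (plus the minimum length), boundaries tend to realign shortly after an insertion or deletion, and the chunks after that point hash to the same keys as before. For pure appends, every chunk except possibly the last one is unchanged.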
The chunk design has supported this kind of thing from the beginning; see chunks. Each chunk has its own Key, and the chunk log only needs to somehow store enough data to get from an object's Key to a list of chunk Keys. Currently that is just a chunk size and a number of chunks, but it could be a list of Keys, or a pointer to where to find the list.
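As a rough illustration of the two kinds of chunk log entry, here is a sketch. The types `FixedLayout` and `KeyListLayout`, the function `chunk_keys_for`, and the `#chunk` naming are all made up for this example; they do not reflect the real chunk log format or real chunk Key syntax.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class FixedLayout:
    """Entry holding only a chunk size and a chunk count."""
    chunk_size: int
    num_chunks: int


@dataclass
class KeyListLayout:
    """Possible extension: an explicit list of content-derived chunk keys."""
    keys: list[str]


Layout = Union[FixedLayout, KeyListLayout]


def chunk_keys_for(object_key: str, layout: Layout) -> list[str]:
    """Resolve an object's key to the keys of its chunks."""
    if isinstance(layout, FixedLayout):
        # Chunk keys are derived from the object's key, so chunks can
        # never be shared between different objects.
        return [f"{object_key}#chunk{n}" for n in range(1, layout.num_chunks + 1)]
    # Chunk keys are independent of the object, so they can be shared.
    return list(layout.keys)
```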
So it's doable, but I have never quite found the tuits to attempt it.
Dropping does present a problem though: if a single chunk is used by multiple keys, then dropping one key would drop the chunk and make the other key no longer be stored on the remote. So it would somehow need to check whether the chunk is used by other keys still stored on the remote, and the chunk logs don't let that be done efficiently.
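What a safe drop would need is essentially a reverse index from chunk to the keys that use it. A minimal in-memory sketch of that bookkeeping follows; `ChunkRefs` is a hypothetical name, and how such an index could be kept efficiently alongside the chunk logs is exactly the open question.

```python
from collections import defaultdict


class ChunkRefs:
    """Tracks which object keys use which chunks on one remote."""

    def __init__(self):
        self.chunks_of = {}               # object key -> list of chunk keys
        self.users_of = defaultdict(set)  # chunk key -> object keys using it

    def store(self, key: str, chunk_keys: list[str]) -> None:
        self.chunks_of[key] = list(chunk_keys)
        for c in chunk_keys:
            self.users_of[c].add(key)

    def drop(self, key: str) -> list[str]:
        """Forget `key` and return the chunks that no other key still uses,
        i.e. the only chunks that are safe to delete from the remote."""
        removable = []
        # dict.fromkeys de-duplicates while keeping order, in case a chunk
        # occurs more than once within the same object.
        for c in dict.fromkeys(self.chunks_of.pop(key)):
            users = self.users_of[c]
            users.discard(key)
            if not users:
                removable.append(c)
                del self.users_of[c]
        return removable
```

For example, if key A uses chunks `c1, c2` and key B uses `c2, c3`, dropping A may only delete `c1`; `c2` has to stay until B is dropped as well.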