option to individually hash chunks

Hey, I'm not sure whether todo is the right place for feature requests.

When breaking files into chunks, it would be nice to have an option to hash the chunks based on their own content, rather than the content they are a part of. This would provide for deduplicating data.

I'm using git-annex with an append-only log, where this option would avoid the need to break the log into separate files just to keep from re-uploading content. Many separate files eventually tax git's tree size.
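To make the request concrete, here is a minimal sketch of the idea in Python, not git-annex code. It assumes fixed-size chunks and uses a plain SHA-256 digest as a stand-in for a chunk key; the names `chunk_keys` and `chunks_to_upload` are hypothetical.

```python
import hashlib

CHUNK_SIZE = 4  # deliberately tiny, for illustration only


def chunk_keys(data: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    """Return one SHA-256 hex digest per fixed-size chunk of data."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def chunks_to_upload(data: bytes, already_stored: set,
                     chunk_size: int = CHUNK_SIZE) -> list:
    """Return the keys of chunks that the (hypothetical) remote lacks."""
    return [k for k in chunk_keys(data, chunk_size) if k not in already_stored]


old_log = b"aaaabbbb"
new_log = old_log + b"cccc"          # the log only ever grows at the end
stored = set(chunk_keys(old_log))    # what a previous upload left behind
print(len(chunks_to_upload(new_log, stored)))
```

Only the appended chunk is new here. If the old log did not end on a chunk boundary, its last partial chunk would also change and need uploading again.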

re: individually hash chunks

If alternate keys for the same content were implemented, git-annex could store chunk content using regular keys (providing deduplication) and mark the chunk keys as alternate keys for the same content.

I've also wanted this option, for a separate reason: to be able to quickly import files from a special remote that stores files in separate parts, and lets you read the checksums of the parts without downloading the files.

I had thought of implementing a custom special remote that would store, for a given key A, the key B of a manifest file that listed the keys C1, ..., Cn of chunks of the content of key A.
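A rough sketch of that manifest layout, in Python with an in-memory dict standing in for the remote. Everything here is an assumption for illustration: the `ManifestStore` class, the JSON manifest format, and the use of a bare SHA-256 digest as a key are hypothetical, and none of this is the external special remote protocol.

```python
import hashlib
import json


def digest(data: bytes) -> str:
    """Stand-in for a content-derived key (hypothetical format)."""
    return hashlib.sha256(data).hexdigest()


class ManifestStore:
    def __init__(self):
        self.objects = {}      # key -> bytes held by the pretend remote
        self.manifest_of = {}  # key A -> key B of its manifest

    def put(self, data: bytes, chunk_size: int) -> str:
        key_a = digest(data)
        chunk_keys = []
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            key_c = digest(chunk)
            self.objects[key_c] = chunk   # identical chunks collapse here
            chunk_keys.append(key_c)
        manifest = json.dumps(chunk_keys).encode()
        key_b = digest(manifest)
        self.objects[key_b] = manifest
        self.manifest_of[key_a] = key_b
        return key_a

    def get(self, key_a: str) -> bytes:
        chunk_keys = json.loads(self.objects[self.manifest_of[key_a]])
        data = b"".join(self.objects[k] for k in chunk_keys)
        if digest(data) != key_a:
            raise ValueError("reassembled content does not match key")
        return data


store = ManifestStore()
key = store.put(b"abcdabcd", chunk_size=4)
print(store.get(key), len(store.objects))
```

Since the manifest lists the chunk keys, an import could in principle compare them against checksums the remote already reports, without fetching the chunk contents.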

Comment by Ilya_Shlyakhter — Mon Mar 15 15:23:08 2021

another deduplication option

Comment by Ilya_Shlyakhter — Mon Mar 22 13:51:33 2021

comment 3

Hashing fixed-size chunks does not get very far toward deduplication; it really needs a rolling hash.
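As an illustration of why a rolling hash helps, here is a small sketch of content-defined chunking in Python. It is an assumption-laden toy, not anything git-annex implements: it uses a simple polynomial (Rabin-Karp style) rolling hash, and the constants `WINDOW`, `BASE`, `MOD` and `MASK` are arbitrary choices.

```python
import hashlib

WINDOW = 16              # bytes covered by the rolling hash
BASE = 257
MOD = (1 << 61) - 1
MASK = (1 << 6) - 1      # cut when the low 6 bits are zero (~64-byte chunks)
TOP = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window


def split(data: bytes) -> list:
    """Split data wherever the hash of the last WINDOW bytes matches MASK."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            # Slide the window: remove the oldest byte's contribution.
            h = (h - data[i - WINDOW] * TOP) % MOD
        h = (h * BASE + byte) % MOD
        if i + 1 >= WINDOW and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


# Deterministic pseudo-random test data (4096 bytes).
data = b"".join(hashlib.sha256(str(i).encode()).digest() for i in range(128))
original = split(data)
shifted = split(b"X" + data)   # one byte inserted at the front
shared = set(original) & set(shifted)
print(len(original), len(shared))
```

Because each cut point depends only on the preceding `WINDOW` bytes, inserting a byte at the front changes the first chunk and leaves the later ones intact. With fixed-size chunks the same insertion would shift every boundary, so no chunk would be expected to match.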

The chunk design was intended to support this kind of thing from the beginning; see chunks. Each chunk has its own Key, and the chunk log only needs to somehow store data to get from an object's Key to a list of chunk Keys. Currently that data is just a chunk size and number of chunks, but it could be a list of Keys, or a pointer to where to find the list.

So it's doable, but I have never quite found the tuits to attempt it.

Comment by joey — Tue Jul 13 17:09:11 2021

comment 4

Dropping does present a problem though: if a single chunk is used by multiple keys, then dropping one key would drop the chunk, and the other key would no longer be stored on the remote. So git-annex would somehow need to check whether the chunk is used by other keys still stored on the remote, and the chunk logs don't let that be done efficiently.
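For illustration only, the check could look like the Python sketch below if a reverse index from chunk to using keys were available. That index is exactly what the chunk logs do not provide, so the `ChunkIndex` class here is hypothetical, and the sketch sidesteps the question of where such an index would live or how clones would keep it consistent.

```python
from collections import defaultdict


class ChunkIndex:
    """Hypothetical per-remote index of which keys use which chunks."""

    def __init__(self):
        self.chunks_of = {}                # object key -> list of chunk keys
        self.users_of = defaultdict(set)   # chunk key -> object keys using it

    def store(self, key, chunk_keys):
        self.chunks_of[key] = list(chunk_keys)
        for chunk in chunk_keys:
            self.users_of[chunk].add(key)

    def drop(self, key):
        """Forget key; return the chunks no remaining key still uses."""
        removable = []
        for chunk in set(self.chunks_of.pop(key)):
            users = self.users_of[chunk]
            users.discard(key)
            if not users:
                del self.users_of[chunk]
                removable.append(chunk)
        return sorted(removable)


index = ChunkIndex()
index.store("A", ["c1", "c2"])
index.store("B", ["c2", "c3"])
print(index.drop("A"))   # c2 is still needed by B, so it must stay
print(index.drop("B"))
```

Dropping "A" frees only the chunk that "B" does not share; the shared chunk becomes removable once "B" is dropped as well.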

Comment by joey — Tue Jul 13 17:18:49 2021

Last edited Mon Mar 15 07:08:05 2021