http://git-annex.branchable.com/walkthrough/ says "Git wants to first stage the entire contents of the file in its index. That can be slow for big files (sorta why git-annex exists in the first place)."
What is git doing that git-annex isn't, other than copying the file to .git/objects rather than just moving it to .git/annex/objects, prepending it with "blob"+length, and compressing it? Suppose git were changed to store the "blob"+length as part of the object filename rather than as part of the object file content, gained a config option to use uncompressed objects for large files (and to not try to pack them when creating pack files), and were used on a filesystem such as zfs or btrfs that does COW, so the copy would be as fast as a move. What speed advantage would git-annex still have over git? I realize git-annex has more features than just big-file handling, and has the WORM backend for even faster handling, but I'm only talking about the case with the default SHA backend.
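To make the cost concrete, here is a rough sketch (Python, purely illustrative; git's real code streams the data and uses temporary files) of what staging a blob involves: a full read of the file, the "blob"+length header, a SHA-1 pass over the whole thing, and zlib compression into .git/objects:

    import hashlib, os, zlib

    def write_loose_blob(repo, path):
        data = open(path, "rb").read()              # full read of the file
        obj = b"blob %d\x00" % len(data) + data     # git's object header
        sha = hashlib.sha1(obj).hexdigest()         # content-addressed name
        out = os.path.join(repo, ".git", "objects", sha[:2], sha[2:])
        os.makedirs(os.path.dirname(out), exist_ok=True)
        if not os.path.exists(out):
            with open(out, "wb") as f:
                f.write(zlib.compress(obj))         # a second, compressed copy on disk
        return sha

git-annex with the SHA backend, by contrast, hashes the file once and then essentially just moves it into .git/annex/objects.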
Have such changes been proposed for git? It seems that for anybody already familiar with the git codebase, adding the config option for uncompressed objects and moving the "blob"+length into the filename would be easy changes to make, and I see no downside to them. It wouldn't break backward compatibility, because an object filename of hash."blob".length rather than just hash would indicate that the new object format is in use, and a ".raw" filename extension could mark uncompressed objects (or, more sensibly, the new format could use no extra extension for uncompressed objects and ".compressed" for compressed ones).
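A minimal sketch of that naming scheme (hypothetical, obviously; nothing like this exists in git today):

    def loose_object_name(sha, size, compressed):
        # hash."blob".length, with ".compressed" only when zlib is used;
        # an uncompressed (raw) big file gets no extra extension
        return "%s.blob.%d%s" % (sha, size, ".compressed" if compressed else "")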
This would also eliminate the need for a git-annex object store separate from the git object store (and the complexity of keeping them separate), as well as the need for symlinks and the complexities they cause. I don't think that relying on COW for speed is unreasonable once btrfs becomes the default in major Linux distros (the BSDs already have zfs and HAMMER); right now part of what git-annex is doing is just working around the functional deficiency of non-COW filesystems.
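On a filesystem that supports reflinks (btrfs, for example), the copy into the object store could be a single ioctl that shares data blocks instead of duplicating them. A Linux-specific, illustrative sketch:

    import fcntl

    FICLONE = 0x40049409  # _IOW(0x94, 9, int) from <linux/fs.h>; what cp --reflink uses

    def reflink_copy(src, dst):
        # ask the kernel to make dst share src's data blocks: a COW "copy"
        # that costs about as much as a rename
        with open(src, "rb") as s, open(dst, "wb") as d:
            fcntl.ioctl(d.fileno(), FICLONE, s.fileno())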
P.S. I recommend a "plain" option for the page type when submitting comments on your wiki, so I don't have to put HTML line break markup at the end of my lines.
git's code base makes lots of assumptions hardcoding the size of the hash, etc. (grep its source for the magic numbers 40 and 42...) I'd like to see git get parameterised hashes. SHA1 insecurity may eventually push it in that direction. However, when I asked the git developers about this at the GitTogether last year, several ideas were floated that would avoid parameterisation, along with a lot of good thoughts about the problems parameterised hashes would cause.
Moving the data into git proper would still leave the problem unique to large data: not being able to store it all on every clone. Which means a git-annex-like thing is still needed to track where the data resides and move it around.
(BTW, in markdown, you separate paragraphs with blank lines. Like in email.)
What were the ideas to avoid parameterisation? What were the problems of parameterisation, other than just the current hardcoded assumptions?
Speaking of hash insecurity, http://static.usenix.org/events/hotos03/tech/full_papers/henson/henson_html/node8.html says compare-by-hash is a bad idea. As I understand it, git doesn't have an option to verify that the content actually matches when the hash matches while adding data to the object store (like zfs's "dedup=verify" option, which you can use even when using sha256), because the assumption is that the risk of collision (or at least of accidental collision) is negligible. Would it be worthwhile to add such an option to git-annex?
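To illustrate, a hypothetical add-with-verify (not something git or git-annex implements; the names are made up): when an object with the same hash is already present, byte-compare the new file against the stored copy before treating it as a duplicate.

    import filecmp, hashlib, os, shutil

    def add_with_verify(store, path):
        h = hashlib.sha256(open(path, "rb").read()).hexdigest()
        existing = os.path.join(store, h)
        if os.path.exists(existing):
            # don't trust the hash alone: compare the actual contents
            if not filecmp.cmp(path, existing, shallow=False):
                raise RuntimeError("hash collision: contents differ for " + h)
            return h                      # genuine duplicate, nothing to store
        shutil.copy2(path, existing)      # first copy of this content
        return h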
from my layman's standpoint, i think it would be feasible. i've suggested this previously, but not pushed it too much. quoting from my user page:
concerning hash sizes or parameterized hashes: the problems with hash sizes could be avoided if, instead of putting the objects in the "normal" object dir, bare files were managed in a similar way as packs are. when a new file gets added, it would be cow-copied to

    .git/objects/bare/${HA}/${SH}

and

    .git/objects/bareprefix/${HA}/${SH}

would contain the "blob ${SIZE}\0" prefix that gets concatenated to the object body to form the object itself. (maybe it'd even be sufficient to just store the size in the bareprefix, as all those objects would be blobs, but then again, some flexibility won't hurt.)
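to make that concrete, here's a rough sketch of the add/read paths under such a layout (python, purely illustrative; the bare/bareprefix names come from above, and the two-character fan-out is just an assumption borrowed from git's usual object layout):

    import os, shutil

    def add_bare_object(gitdir, sha, path):
        # store the body uncompressed (ideally via a cow/reflink copy) and
        # keep the usual "blob <size>\0" header in a sibling prefix file
        size = os.path.getsize(path)
        bare = os.path.join(gitdir, "objects", "bare", sha[:2], sha[2:])
        prefix = os.path.join(gitdir, "objects", "bareprefix", sha[:2], sha[2:])
        for p in (bare, prefix):
            os.makedirs(os.path.dirname(p), exist_ok=True)
        shutil.copy(path, bare)
        with open(prefix, "wb") as f:
            f.write(b"blob %d\x00" % size)

    def read_bare_object(gitdir, sha):
        # the full git object is simply prefix + body
        bare = os.path.join(gitdir, "objects", "bare", sha[:2], sha[2:])
        prefix = os.path.join(gitdir, "objects", "bareprefix", sha[:2], sha[2:])
        return open(prefix, "rb").read() + open(bare, "rb").read()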
if the pack file format is flexible enough, the bareprefix files can get packed too. for the adventurous user who modifies big files, the pack file mechanisms should be made aware of their presence, and be able to store deltas between them. the operations for applying those deltas would be difficult to optimize, and could be added at a later stage. a typical example could be storing a pdf file -- the pdf file format is designed for appending, so chances are the new version is just the old version plus several k at the end.
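the append-only special case is easy to sketch (illustrative python; a real implementation wouldn't read whole files into memory):

    def append_delta(old, new):
        # if the new version starts with the old one byte-for-byte, the
        # "delta" is just the appended tail
        if new.startswith(old):
            return new[len(old):]
        return None   # otherwise fall back to a general delta algorithm

    def apply_append_delta(old, tail):
        return old + tail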
none of that would affect git's wire protocol, so no compatibility problems. (it would be advisable to find a reasonable way to do sparse checkouts, though; something like "server, pack and send your master, but make it sparse and don't include blobs >1mb").