Hi,
this is a minor issue and there is probably no better solution, but I would still like to point it out and maybe discuss it a little.
Given that the symlinks generated by annex are pretty large in size (they point to a file named by a large hash number), ext4 is using an entire block (4K) of storage instead of embedding the symlink into the inode itself. For the "archivist use case" of annex, this might lead to tens or hundreds of MBs of disk occupied by symlinks which actually don't add up to more than a few MBs.
Here is a real world example:
(ins)carlos@carlos home$ du -hs music/
56M music/
(ins)carlos@carlos home$ du -bhs music/
3.3M music/
(ins)carlos@carlos home$ ln -s /tmp/x x
(ins)carlos@carlos home$ du x
0 x
(ins)carlos@carlos home$ ln -s /tmp/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xx
(ins)carlos@carlos home$ du xx
4 xx
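For anyone who wants to reproduce this across a range of target lengths, here is a rough Python sketch. The function name `probe` and the chosen lengths are only illustrative. It compares the apparent size of each symlink with the space actually allocated for it, assuming Linux, where `st_blocks` is reported in 512-byte units.

```python
import os
import tempfile

def probe(directory, lengths=(1, 40, 59, 60, 200)):
    """Create symlinks with targets of the given lengths inside `directory`
    and return (target_length, apparent_bytes, allocated_bytes) for each."""
    rows = []
    # Scratch directory is created on the filesystem under test and removed afterwards.
    with tempfile.TemporaryDirectory(dir=directory) as scratch:
        for n in lengths:
            link = os.path.join(scratch, f"link{n}")
            os.symlink("x" * n, link)  # a dangling target is fine here
            st = os.lstat(link)        # lstat inspects the link itself, not its target
            # Assumption: st_blocks is counted in 512-byte units (true on Linux).
            rows.append((n, st.st_size, st.st_blocks * 512))
    return rows
```

Calling `probe(".")` from a directory on the ext4 filesystem in question should show at which target length the allocated size jumps from 0 to a full block. Other filesystems (tmpfs, btrfs, ...) may behave differently.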
Cheers, Carlos
Just a bit :P (yes, that is 2.7G of symlinks so far)
You also get better seek speed with packed inodes.
With default 256 byte inodes, there seem to be 59 bytes to play with. (Determined experimentally.)
Note that disks over 4 TB default to 32 kilobyte inodes, so probably most spinning hard disks these days do pack regular git-annex symlinks efficiently. (I don't have a 4 TB disk online to check this. And I doubt CandyAngel was counting only the sizes of symlinks and not git repos or at least directory inodes to hold all the symlinks.)
With a prefix like ".git/annex/objects/zX/Wx/S-s1000000000-" that leaves 20 bytes out of the 59 for the hash.
That's not enough data to be cryptographically secure, but if we use SHA1 or MD5 as the base hash, it wouldn't be anyway. 15 bytes of hash state will base64 encode to 20 bytes. SHA1 is a 20 byte hash; MD5 is a 16 byte hash. So even MD5 would need to be truncated a little bit. Chances of (non-malicious) collision would still be small, only 256 times as likely as a (non-malicious) MD5 collision. It could easily be made harder than MD5/SHA1 to maliciously collide by using truncated SHA2.
(Files larger than 9.3 GB would still have symlinks that are too long, due to the size field. The size field could also be omitted or encoded more efficiently, but omitting it would reduce git-annex's ability to not overfill disk and I don't think re-encoding buys enough to bother.)
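As a sanity check on the byte budget, here is a small Python sketch. It is not how git-annex generates keys: `short_key` is a hypothetical helper, and truncated SHA-256 with unpadded URL-safe base64 is just one assumed way to fill the 20 remaining characters. The 59-byte limit is the experimentally determined figure from above.

```python
import base64
import hashlib

INLINE_LIMIT = 59  # bytes that appear to fit in a 256-byte inode (experimental figure)
PREFIX = ".git/annex/objects/zX/Wx/S-s1000000000-"  # example prefix from the text

def short_key(data: bytes, hash_bytes: int = 15) -> str:
    """Hypothetical short key: SHA-256 truncated to `hash_bytes`,
    encoded as URL-safe base64 without padding."""
    digest = hashlib.sha256(data).digest()[:hash_bytes]
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")

room = INLINE_LIMIT - len(PREFIX)        # characters left for the hash
key = short_key(b"example file contents")
target = PREFIX + key

# Largest size that fits in a 10-digit size field, in GiB (about 9.3).
max_size_gib = (10**10 - 1) / 2**30

print(len(PREFIX), room, len(key), len(target))  # prints: 39 20 20 59
```

URL-safe base64 is used here only because plain base64 can emit "/", which cannot appear in a file name.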
My analysis above assumes no subdirectories.
To leave space for even a single "../" would need to drop to 12 bytes of hash state. 1/79228162514264337593543950336 chance of 2 files colliding. Not comfortable with something so much worse than MD5, and that still doesn't help when files are 2 directories deep. Dropping to 10 bytes for that, a 1/1208925819614629174706176 chance is starting to get into could-really-happen territory.
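The subdirectory arithmetic can be rechecked with a few lines of Python. This is only a sketch. It assumes the 59-byte limit and the 39-character prefix from above, that each directory level costs three characters for "../", and that the hash is base64-encoded at 6 bits per character. The two odds quoted above are 1/2^96 and 1/2^80, which correspond to 12 and 10 whole bytes of hash.

```python
INLINE_LIMIT = 59  # experimental figure from above
PREFIX_LEN = 39    # len(".git/annex/objects/zX/Wx/S-s1000000000-")

def hash_budget(depth: int):
    """Return (base64_chars, whole_hash_bytes, hash_bits) available
    for a symlink that sits `depth` directories below the repository root."""
    chars = INLINE_LIMIT - PREFIX_LEN - 3 * depth  # each level adds "../"
    whole_bytes = (chars * 6) // 8                 # base64 carries 6 bits per character
    return chars, whole_bytes, whole_bytes * 8

for depth in range(3):
    chars, nbytes, bits = hash_budget(depth)
    print(f"depth {depth}: {chars} chars, {nbytes} bytes, 1/{2**bits} per pair of files")
```

Note these are odds for one specific pair of files. With many files in a repository the chance that some pair collides is higher (the birthday effect).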
In that repository, it is only top level directories (no sub directories) and each directory in it only has symlinks (up to 8000 of them). Directories are mkdir $(uuidgen -r), hence the wildcard for du.
It would be including the directory size to hold all the inodes, but it definitely isn't counting .git as this annex spans 3 drives with 6TB of content so far. Well, 6 drives because of "numcopies 2" :P
I will calculate this a different way and only count symlinks, when I have access to it again.
Just over a million symlinks.. very convenient
And in comparison to my earlier comment 2 weeks ago:
So directory inode sizes are dwarfed by the symlinks' 4K disk usage versus ~198 bytes of actual usage (~96% wasted space?).
Oops,
should have been
That'll teach me to prematurely copy it :P
Note that the analysis in my earlier comment assumes that the .git/annex/objects/xx/yy/key/ directory is removed. As long as those per-key directories are used, the symlinks cannot possibly be made short enough to pack.
There have been some other requests for that (datalad requested it because all those per-key directories use disk space, add to the size of the git repo, and slow down traversal). However, git-annex relies on those directories to prevent an accidental rm -rf from deleting the annexed objects and to prevent some symlink-following programs from editing/corrupting the annexed objects (the per-key directories are left mode 400 most of the time). So it would be fairly complicated to add a tuning that eliminated those while locking down the permissions some other way (eg, making the yy directories mode 400 except when one or more threads/processes need to write to them), and since it would have to be a tuning, it would introduce a lot of conditional complexity into the code.