git-annex now has support for tuning a repository for different work loads.
For example, a repository with a very large number of files in it may work
better if git-annex uses some nonstandard hash format, for either the
`.git/annex/objects/` directory, or for the log files in the git-annex branch.

A repository can currently only be tuned when it is first created; this is
done by passing `-c name=value` parameters to `git annex init`.
For example, this will make git-annex use only 1 level for hash directories
in `.git/annex/objects`:

    git -c annex.tune.objecthash1=true annex init
It's very important to keep in mind that this makes a nonstandard format git-annex repository. In general, this cannot safely be used with git-annex older than version 5.20150128. Older versions of git-annex will not understand it, will get confused, and may do bad things.
Also, it's not safe to merge two separate git repositories that have been
tuned differently (or one tuned and the other one not). git-annex will
prevent merging their git-annex branches together, but it cannot prevent
`git merge remote/master` from merging two such branches, and the result will
be ugly at best (`git annex fix` can fix up the mess somewhat).
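
If such a merge has already slipped through, the cleanup the page alludes to is simply re-running that command in the tuned repository, for example:

    # rewrite annex symlinks so they point at this repository's
    # (tuned) object layout; this only repairs the symlinks, not
    # any other fallout of the merge
    git annex fix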
The following tuning parameters are available:

* `annex.tune.objecthash1=true`

  Use just one level of hash directories in `.git/annex/objects/`,
  instead of the default two levels.

* `annex.tune.objecthashlower=true`

  Make the hash directories in `.git/annex/objects/` use all lower-case,
  instead of the default mixed-case.

* `annex.tune.branchhash1=true`

  Use just one level of hash directories in the git-annex branch,
  instead of the default two levels.
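
These parameters can be combined when the repository is first created. For example, to enable all three at once (repository name and path here are just illustrative):

    # create and tune a new repository; all annex.tune.* settings
    # must be passed to the initial `git annex init`
    git init myrepo
    cd myrepo
    git -c annex.tune.objecthash1=true \
        -c annex.tune.objecthashlower=true \
        -c annex.tune.branchhash1=true \
        annex init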
Note that git-annex will automatically propagate these settings to
`.git/config` for tuned repositories. You should never directly change
these settings in `.git/config`, and should never set them in global
gitconfig.
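
A quick way to verify which tunings a clone ended up with is to read what git-annex wrote into `.git/config` (read it only; don't edit these by hand, as noted above):

    # list the annex.tune.* settings git-annex propagated to .git/config
    git config --local --get-regexp '^annex\.tune\.'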
It starts to use 2 levels of hash directories (even if annex.tune.objecthash1=true), with 3 characters in the directory name at each level. So it is not just "take the existing hash directories (1 or 2 levels) and use their lower-case version"; it is a different way to create the hash directories:

e.g. one with objecthas1=true:

    1 -> .git/annex/objects/qj/SHA256E-s6--ecdc5536f73bdae8816f0ea40726ef5e9b810d914493075903bb90623d97b1d8/SHA256E-s6--ecdc5536f73bdae8816f0ea40726ef5e9b810d914493075903bb90623d97b1d8

and if I provide all three options at once:

    1 -> .git/annex/objects/ccf/a40/SHA256E-s6--ecdc5536f73bdae8816f0ea40726ef5e9b810d914493075903bb90623d97b1d8/SHA256E-s6--ecdc5536f73bdae8816f0ea40726ef5e9b810d914493075903bb90623d97b1d8
Right, it's not simply lower-casing but a different hash strategy, as described in hashing.

Combining annex.tune.objecthashlower and annex.tune.objecthash1 will result in one level of hash directories. If you get two levels then you probably typoed "objecthas1" ...
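
One way to check which layout a repository is actually using (assuming the content for the key is present locally) is the `contentlocation` plumbing command, which prints the object path with whatever tuning is in effect:

    # print where this key's content lives under .git/annex/objects,
    # reflecting the repository's tuning (key is the example key above)
    git annex contentlocation SHA256E-s6--ecdc5536f73bdae8816f0ea40726ef5e9b810d914493075903bb90623d97b1d8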
My main use repo is 1.7TB large and holds 172,000+ annexed files. Variation in filename case has led to a number of file duplications that are still not resolved (I have basic scripts that can be used to flatten filename case and fix references in other files, but it will probably mean handling some corner cases, and there are more urgent matters for now).

For these reasons I'm highly interested in the lowercase option, and I'm probably not the only one in a similar situation.
Does migrating to a tuned repository mean unannexing everything and reimporting into a newly created annex, replica by replica, then syncing again? That's a high price in some setups. Or is there a way to somehow `git annex sync` between a newly created repo and an old, untuned one?

It should be possible to write a `git-filter-branch` that converts a repository from one tuning to another, but it would not be trivial, and no one has done it yet. You'd still have to run it in every clone of the repository. Tuned and non-tuned repositories can't interoperate.

`annex.tune.objecthash1=true` and `annex.tune.branchhash1=true` seem like they could be helpful in reducing git-annex's inode usage, but the disclaimer about this feature being experimental is a little worrying. Since it is over 10 years old though, is it still considered experimental, or has it graduated to being a stable feature? I.e. will using this meaningfully increase the chance of losing data?
Also, what is the (potential) benefit of using lowercase for the hashes?
Naively, I put myself in a position where my rather large, untuned git-annex had to be recovered due to not appreciating the effect of case-insensitive filesystems.
Specifically, NTFS-3G is deadly in this case: whilst Windows has advanced, and with WSL added the ability to make a folder case-sensitive (which is also inheritable by folders under it), NTFS-3G does not do this.
So beware if you try to work in an "interoperable" way. NTFS-3G will do mixed case, but will create child folders that are not case-sensitive.
To that end, I want to migrate this rather large git-annex to be tuned with annex.tune.objecthashlower. I already have a good strategy around this: I'll just create a completely new stream of git-annexes originating from a newly formed one. I will also be able to create new type=directory special remotes for my "tape-out" existing git-annex, and will just use `git annex fsck --fast --from $remote` to rebuild the location data for it.
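
A rough sketch of that directory-remote rebuild, with hypothetical repository and remote names (the tuned repo is new; the directory is assumed to already hold the annexed objects from the old setup):

    # create a new, tuned repository (names/paths are illustrative)
    git init tuned-repo && cd tuned-repo
    git -c annex.tune.objecthashlower=true annex init

    # point a directory special remote at the existing "tape-out" store
    git annex initremote tapeout type=directory directory=/mnt/tapeout encryption=none

    # rebuild location tracking for content already in that directory
    # (assumes the annexed files have already been added/committed here)
    git annex fsck --fast --from tapeout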
I've also tested this with an S3 git-annex as a proof of concept. So in the new git-annex, I ran `git-annex initremote cloud type=S3 ...` to create a new bucket, copied over a file from the old bucket, and rebuilt the location data for that file.
But I really, really would like to be able to avoid creating a new bucket. I am happy to lose the file presence/location data for the old bucket, but I'd like to graft back in, or initremote, the cloud bucket with matching parameters. So too, I guess, with an encrypted special remote, i.e. import over the encryption keys, etc.

Are there "plumbing" commands that can do this? Or does it require knowing about the low-level storage of this metadata to achieve it? That seems to just send me back to the earlier comment about using a filter-branch... which I am hoping to avoid (because of all the potential pitfalls).
I have found one way to graft in the S3 bucket. It involves performing `git-annex initremote cloud type=S3`, which unavoidably creates a new dummy bucket (you can use bucket=dummy to identify it). Then performing `git-annex enableremote cloud bucket=cloud-` utilises the original bucket without having to copy/move over all the files.

I did try it in one shot with `git-annex initremote cloud type=S3 bucket=cloud-`, but unfortunately it fails because the bucket-creation step appears to be mandatory, and the S3 API errors out with an "already created bucket" type of error.
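
Put together, the workaround described above looks roughly like this (the remote name and bucket names are placeholders; encryption and other parameters would be whatever the original remote used):

    # step 1: initremote with a throwaway bucket name, since bucket
    # creation cannot be skipped during initremote
    git annex initremote cloud type=S3 encryption=none bucket=throwaway-dummy-bucket

    # step 2: repoint the remote at the pre-existing bucket without
    # copying any content
    git annex enableremote cloud bucket=existing-bucket-name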
However, if there is general guidance somewhere for... I guess, importing/exporting the special remote metadata (including stored encryption keys), that would be very much appreciated.
Sorry, I should just clarify. Trying to do this via sync from the old, non-tuned git-annex repo fails with:
Which I understand, given the wider branch-data implications... but I don't know enough to understand why just the special remote data can't be merged in.