Here's how to set up a local cache of annexed files that can be used to avoid repeated downloads.
An example use case: Your CI system is operating on a git-annex repository, so every time it runs, it makes a fresh clone of the repository and uses git-annex get to download a lot of data into it.
We'll create a cache repository, set it as a remote of the other git-annex repositories, and configure git-annex to check the cache first before other more expensive ways of retrieving content. The cache can be cleaned out whenever you like with simple unix commands.
Some other nice properties -- When used on a filesystem with COW support, like BTRFS, content from the cache can populate multiple other repositories without using any additional disk space. And git-annex repositories that are otherwise unrelated can share use of the cache if they happen to contain a common file.
You'll need git-annex 6.20180802 or newer to follow these instructions.
creating the cache
First let's create a new, empty git-annex repository. It will be put in ~/.annex-cache in the example, but for best results, put it in the same filesystem as your other git-annex repositories.
git init --bare ~/.annex-cache
cd ~/.annex-cache
git annex init
git config annex.hardlink true
git annex untrust here
The cache does not need to be a git annex repository; any kind of special remote can be used as a cache too. But, using a git repository lets annex.hardlink be used to make hard links between the cache and repositories using it.
The cache is made untrusted, because its contents can be cleaned at any time; other repositories should not trust it to retain content.
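As a quick sanity check (optional), running git annex info inside the cache should list the cache repository under "untrusted repositories":

cd ~/.annex-cache
git annex info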
making repositories use the cache
Now in each git-annex repository that you want to use the cache, add it as a remote, and configure it as follows:
cd my-repository
git remote add cache ~/.annex-cache
git config remote.cache.annex-speculate-present true
git config remote.cache.annex-cost 10
git config remote.cache.annex-pull false
git config remote.cache.annex-push false
git config remote.cache.fetch do-not-fetch-from-this-remote:
The annex-speculate-present setting is the essential part. It tells git-annex that the cache repository may contain the content of any annexed file. So, when getting a file, git-annex will try the cache repository first.
The low annex-cost makes git-annex try to get content from the cache remote before any other remotes.
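With that configuration in place, an ordinary get will consult the cache before any costlier remote. For example (somefile is just a placeholder name):

cd my-repository
git annex get somefile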
The annex-pull and annex-push settings prevent git-annex sync from pulling and pushing to the remote, and the remote.cache.fetch setting further prevents git commands from fetching from it or pushing to it. The cache repository will remain an empty git repository (except for the content of annexed files). This means that the same cache can be used with multiple different git-annex repositories, without intermingling their git data.
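If a fetch is attempted anyway, that bogus refspec makes git error out (with something along the lines of "couldn't find remote ref do-not-fetch-from-this-remote") instead of pulling the cache's git data into your repository:

git fetch cache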
populating the cache
For the cache to be used, you need to get file contents into it somehow. A simple way to do that is, in a git-annex repository that already contains the content of files:
git annex copy --to cache
You could run that anytime after you get content. There are also ways to automate it, but getting some files into the cache manually is a good enough start.
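For example, something like this could be run periodically (say, from a cron job) to top up the cache with any content the repository has that the cache lacks; a simple sketch, assuming the remote is named cache as above:

cd my-repository
git annex copy --to cache --not --in=cache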
cleaning the cache
You can safely remove content from the cache at any time to free up disk space.
To remove everything:
cd ~/.annex-cache
git annex drop --force
To remove files that have not been requested from the cache for the past day:
cd ~/.annex-cache
git annex drop --force --not --accessedwithin=1d
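The same matching options can express other cleaning policies. For instance, to drop only large objects from the cache (the size threshold here is arbitrary):

cd ~/.annex-cache
git annex drop --force --largerthan=1gb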
automatically populating the cache
The assistant can be used to automatically populate the cache with files that git-annex downloads into a repository.
more caches
The example above used a local cache on the same system. However, it's also possible to have a cache repository shared among computers on a LAN.
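A sketch of such a setup, assuming the cache directory lives on a mount that all the machines share (the path is hypothetical); the rest of the configuration is the same as above:

git remote add cache /mnt/shared/annex-cache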
Sweet! Thank you Joey
The main issue detected so far is that a parallel download (which we have as the default in DataLad) doesn't consider the cache:
I am still digesting whether having cache operations/state reflected in the git-annex branch is an OK or not-so-OK thing (when the number of files is large, etc).
Hi Joey,
Would it be safe to init that repo with tunables such as -c annex.tune.objecthash1=true -c annex.tune.branchhash1=true to save some inodes etc? Any other tunable which might be of benefit (I still hope that I will see the time when the "KEY/" directory is gone ;-))? I've tried with those two above (although annex.tune.branchhash1=true is probably irrelevant here) and it seems to do the right thing (at least for the objecthash1), but I just wanted to make sure I am not shooting myself in the foot.
I have two cache repos -- cache is just a regular one and cache2 has those tuned-up parameters: and it didn't merge any of those, which is good -- we do not want a possibly monstrous history of the cache to be merged into every repo using it.
But then when I remove that cache2, git-annex does merge it: which IMHO shouldn't happen -- annex shouldn't merge "cache" histories into this repository's git-annex history. I guess there should be one more config option to set for those remotes?
Parallel downloads will use the cache repository for everything if it has a lower cost than other repositories. That's why the cost is set to 10 in the example. If it has the same cost as another repository, parallel downloads will spread the load between them. (This also means you can have multiple caches with the same cost and distribute load among them.)
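For instance, two caches could be given equal costs so that parallel downloads distribute between them (a sketch with hypothetical remote names):

git config remote.cache1.annex-cost 10
git config remote.cache2.annex-cost 10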
You should never be pulling from the cache repo, so there should be nothing to merge from it. That's what remote.cache.annex-pull is there to prevent git-annex sync from doing, but you also have to avoid pulling from it yourself.
Using tunables with the cache does seem to work. Since all remotes usually have the same tunables as the local repo, there could potentially be bugs (or optimisations?) where it applies the local tunables to the remote, but in a little testing it seemed to work.
Sorry -- I am still missing something.
I followed your example, so the cost for cache is 10, while for web it is the default 200:
but it does download from the web in the parallel download case -- so what am I missing?
Nothing in the --debug output hints at the costs:
I think we do call out to annex merge from time to time to update information about annex object availability from any remote. Since sync does more, we avoid using it for those cases. git annex merge doesn't even care about any argument given to it, so we cannot simply avoid calling it on cache remotes by specifying all other remotes. Would it be possible to get some option, --only-pullable or alike, to make it avoid merging "caches"?
Well, git-annex merge does not fetch, it only merges refs it sees. With the configuration I gave in the tip, you will not have a cache/git-annex branch for it to merge.
That is correct! My alias to fetch all remotes (useful to quickly catch up on the current state of development in others' feature branches) fetched the cache as well. Despite the viral nature of git tags, I consider it a good general approach. But fetching is not merging -- I can remove any of those remotes at any moment, should some remote become too heavy or something like that (tags are trickier).
IMHO annex merge should also not merge those remotes which are not "pullable" by default. Maybe it could take remote name(s) as its argument(s) to merge only the specified ones (ATM arguments seem to be silently ignored), should someone really need to merge any of those somehow. That would prevent accidentally blowing up the git-annex branch in case a cache remote gets fetched.
The -J2 web bug was not related to caching remotes at all, but was an accidental sort by remote uuid rather than cost. I've fixed it.
I fear that preventing merging of branches fetched from the cache remote in git-annex would be a game of whack-a-mole. There are just too many ways the user could bypass such protections. Including, for example, configuring git to fetch from cache to origin/ tracking branches.
I remember at some point discussing isolating repos from one-another so that data from one repo can't leak across a boundary to another repo, while still having it be a remote, and it was similarly just not tractable. Can't seem to find the thread, but it's basically the same problem.
If you do accidentally merge the git-annex branch from a cache remote, you can always make it dead and use git-annex forget --drop-dead.
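A sketch of that recovery, assuming the remote is named cache as in the tip:

cd my-repository
git annex dead cache
git annex forget --drop-dead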
If you really want to avoid any possibility of git fetching from the caching remote, make it a directory special remote! But, there is not currently any way to make annex.hardlink work for directory special remotes, so it will be less efficient.
My concern is not really about making it impossible, but about making it unlikely or avoidable. It is similar to how you cannot completely prevent someone from merging the git-annex branch "manually" using a regular git merge with some -s ours to "avoid" the conflicts: unavoidable, but very unlikely. ATM my problem is "likely" (as likely as me, the first user of the feature, running into it right away) and "unavoidable" (annex merge has no option/mode to avoid merging those). If we could avoid it somehow (e.g. by providing some option to annex merge) in those situations, that would be great. My concern is that we cannot avoid it at all.
Yep, we will add that information to some FAQ etc, very useful. But it might be a bit too late if we share that blown-up git-annex branch publicly and people merge it into their git-annexes. If someone is as advanced as configuring git with alternative fetch settings, they could indeed resort to this.
Found a remote.cache.fetch setting that will prevent most accidents, though of course a determined footgun script may find a way.
This would be super useful for my sorting annex! Being able to just drop the symlinks into other git-annex repositories and fetch the file from any of the "storage" ones (acting as caches) resolves "transfer between git-annexes" in a very neat and safe way!
I see that git-annex won't merge the cache's git-annex branch with this setup (which is great, because some of them are huge) but when a repo fetches a file from the cache, it will note that the cache repo had the file in the location log, right? Can this be avoided so the cache is entirely transient (once you remove the remote, there is no trace of it at all)?
It doesn't work well if the source of the copy is in a btrfs subvolume and the cache is in another subvolume of the same filesystem.
With that setup every file is really copied instead of using reflink=always.
I currently solved it by copying .git/annex/objects manually into the cache (cp -a --reflink=always .git/annex/objects ~/.cache/annex/) and afterwards doing the git annex copy, which recognizes the existence of the objects.
Hi Mowgli, could you please elaborate for a slow me -- are you saying that --reflink=auto is not causing CoW between different subvolumes of the same filesystem while --reflink=always does?
P.S. Glad to see more BTRFS & git-annex tandem users around.
Well, I did not check reflink=auto. I just checked first with git annex copy --to=cache, which simply duplicated every file. So there are two possibilities:
I think that annex always uses cp --reflink=auto for local paths (the cache remote was on a local path, right?). I guess running with --debug could have helped to resolve the mystery.
BTW -- checked locally: reflink=auto seems to work nicely across subvolumes of the same BTRFS filesystem. "Copying" gigabytes takes half a second or so (without reflink=auto it takes considerably longer).
git-annex looks at the file's stat() and only if the device id is the same as the stat of the destination directory does it use cp. If you see it running rsync instead, it's under the perhaps mistaken impression that it's a cross-device copy.
They are indeed not the same across subvolumes of the same BTRFS filesystem. cp seems to just attempt a cheap clone, and if that one fails, assumes that a full copy is required:
BTW, why rsync instead of a regular cp for a local filesystem if it is across devices?
Hi Joey. What would be the preferred way you would advise (ideally with minimal manual configuration) to make it happen? I.e. whenever user(s) get some load, it gets automagically annex copied to the cache?
I tried following the recipe above using git-annex v8 and successfully made a cache to which I can write efficient hardlinks from my working repos, but I am unable to read them back the same way, as hardlinks.
This means that on a 10GB dataset, where annex.thin lets me use only those 10GB, adding the cache doubles it to 20GB. This is not really a feasible amount of overhead for my use case.
I've done a full report with test cases comparing different solutions (check the branches!) at https://github.com/kousu/test-git-annex-hardlinks.
There seem to be several tangled issues: annex.hardlink in the cache overrides annex.thin in the working repo (despite the manpage claiming annex.thin overrides annex.hardlink), and annex get and annex copy both want to do the equivalent of a one-step fetch and checkout, where the checkout does a copy despite annex.thin being set.
Either a hardlink happens between ~/.annex-cache/.git/annex/objects <-> dataset/.git/annex/objects but a copy happens between dataset/.git/annex/objects <-> dataset/, or, with annex.hardlink in the cache, vice-versa: a copy happens between ~/.annex-cache/.git/annex/objects <-> dataset/.git/annex/objects and a hardlink happens between dataset/.git/annex/objects <-> dataset/.
In either case there's an extra full copy of my dataset, and I would rather not spend the time and space it takes to construct that every time I want to use my dataset somewhere.
I also tried mv ~/.annex-cache/.git/annex dataset/.git/ but that just confused git-annex fiercely.
I also tried git annex fix but it just seemed to do nothing. And anyway it isn't much help, since I need to run it after copy, which has already done a wasteful copy. I thought maybe fix could at least recognize that annex.thin is set and undo the wasted copy, but it doesn't.
I managed to work around it by side-stepping git-annex with find .git/annex/objects | ... | ln -f directly. This seems to work, and not to confuse git-annex too much -- it just makes an extra hardlink for some reason, but I can live with that.
What's the most supported way to cache and directly use the data in the cache? That's one of the main features I want in a cache and I can't figure out how to do it with git-annex.
Thanks for any pointers or clues towards getting this to work.