Here's how to set up a local cache of annexed files that can be used to avoid repeated downloads.
An example use case: Your CI system is operating on a git-annex repository, so every time it runs, it makes a fresh clone of the repository and uses git-annex get to download a lot of data into it.
We'll create a cache repository, set it as a remote of the other git-annex repositories, and configure git-annex to check the cache first before other more expensive ways of retrieving content. The cache can be cleaned out whenever you like with simple unix commands.
Some other nice properties -- When used on a filesystem with COW support, like BTRFS, content from the cache can populate multiple other repositories without using any additional disk space. And git-annex repositories that are otherwise unrelated can share use of the cache if they happen to contain a common file.
You'll need git-annex 6.20180802 or newer to follow these instructions.
creating the cache
First let's create a new, empty git-annex repository. It will be put in ~/.annex-cache in the example, but for best results, put it in the same filesystem as your other git-annex repositories.
git init --bare ~/.annex-cache
cd ~/.annex-cache
git annex init
git config annex.hardlink true
git annex untrust here
The cache does not need to be a git annex repository; any kind of special remote can be used as a cache too. But, using a git repository lets annex.hardlink be used to make hard links between the cache and repositories using it.
The cache is made untrusted, because its contents can be cleaned at any time; other repositories should not trust it to retain content.
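As a quick sanity check (optional), running git annex info inside the cache should list the cache repository under "untrusted repositories":

cd ~/.annex-cache
git annex info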
making repositories use the cache
Now in each git-annex repository that you want to use the cache, add it as a remote, and configure it as follows:
cd my-repository
git remote add cache ~/.annex-cache
git config remote.cache.annex-speculate-present true
git config remote.cache.annex-cost 10
git config remote.cache.annex-pull false
git config remote.cache.annex-push false
git config remote.cache.fetch do-not-fetch-from-this-remote:
The annex-speculate-present setting is the essential part. It tells git-annex that the cache repository may contain the content of any annexed file. So, when getting a file, git-annex will try the cache repository first.
The low annex-cost makes git-annex try to get content from the cache remote before any other remotes.
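With that configuration in place, an ordinary get will consult the cache before any costlier remote. For example (somefile is just a placeholder name):

cd my-repository
git annex get somefile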
The annex-pull and annex-push settings prevent git-annex sync from pulling and pushing to the remote, and the remote.cache.fetch setting further prevents git commands from fetching from it or pushing to it. The cache repository will remain an empty git repository (except for the content of annexed files). This means that the same cache can be used with multiple different git-annex repositories, without intermingling their git data.
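If a fetch is attempted anyway, that bogus refspec makes git error out (with something along the lines of "couldn't find remote ref do-not-fetch-from-this-remote") instead of pulling the cache's git data into your repository:

git fetch cache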
populating the cache
For the cache to be used, you need to get file contents into it somehow. A simple way to do that is, in a git-annex repository that already contains the content of files:
git annex copy --to cache
You could run that anytime after you get content. There are also ways to automate it, but getting some files into the cache manually is a good enough start.
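For example, something like this could be run periodically (say, from a cron job) to top up the cache with any content the repository has that the cache lacks; a simple sketch, assuming the remote is named cache as above:

cd my-repository
git annex copy --to cache --not --in=cache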
cleaning the cache
You can safely remove content from the cache at any time to free up disk space.
To remove everything:
cd ~/.annex-cache
git annex drop --force
To remove files that have not been requested from the cache for the past day:
cd ~/.annex-cache
git annex drop --force --not --accessedwithin=1d
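The same matching options can express other cleaning policies. For instance, to drop only large objects from the cache (the size threshold here is arbitrary):

cd ~/.annex-cache
git annex drop --force --largerthan=1gb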
automatically populating the cache
The assistant can be used to automatically populate the cache with files that git-annex downloads into a repository.
more caches
The example above used a local cache on the same system. However, it's also possible to have a cache repository shared among computers on a LAN.
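A sketch of such a setup, assuming the cache directory lives on a mount that all the machines share (the path is hypothetical); the rest of the configuration is the same as above:

git remote add cache /mnt/shared/annex-cache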
Sweet! Thank you Joey
The main issue detected so far is that a parallel download (which we have as the default in DataLad) doesn't consider the cache:
I am still digesting whether having cache operations/state reflected in the git-annex branch is an OK or not-so-OK thing (when the number of files is large, etc).
Hi Joey,
Would it be safe to init that repo with tunables such as -c annex.tune.objecthash1=true -c annex.tune.branchhash1=true to save some inodes etc? Any other tunable which might be of benefit (I still hope that I will see the time when the "KEY/" directory is gone ;-))? I've tried with those two above (although annex.tune.branchhash1=true is probably irrelevant here) and it seems to do the right thing (at least for the objecthash1), but I just wanted to make sure I am not shooting myself in the foot.
I have two cache repos -- cache is just a regular one and cache2 has those tuned-up parameters: and it didn't merge any of those, which is good -- we do not want a possibly monstrous history of the cache to be merged into every repo using it.
But then when I remove that cache2, git-annex does merge it: which IMHO shouldn't happen -- annex shouldn't merge "cache" histories into this repository's git-annex history. I guess there should be one more config option to set for those remotes?
Parallel downloads will use the cache repository for everything if it has a lower cost than other repositories. That's why the cost is set to 10 in the example. If it has the same cost as another repository, parallel downloads will spread the load between them. (This also means you can have multiple caches with the same cost and distribute load among them.)
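For instance, two caches could be given equal costs so that parallel downloads distribute between them (a sketch with hypothetical remote names):

git config remote.cache1.annex-cost 10
git config remote.cache2.annex-cost 10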
You should never be pulling from the cache repo, so there should be nothing to merge from it. That's what remote.cache.annex-pull is there to prevent git-annex sync from doing, but you also have to avoid pulling from it yourself.
Using tunables with the cache does seem to work. Since all remotes usually have the same tunables as the local repo, there could potentially be bugs (or optimisations?) where it applies the local tunables to the remote, but in a little testing it seemed to work.
Sorry -- I am still missing something.
I followed your example, so the cost for cache is 10, while for web it is the default 200:
but it does download from the web in the parallel download case -- so what am I missing?
Nothing in the --debug output hints at the costs:
I think we do call out to annex merge from time to time to update information about annex object availability from any remote. Since sync does more, we avoid using it for those cases. git annex merge doesn't even care about any argument given to it, so we cannot simply avoid calling it on cache remotes by specifying all other remotes. Would it be possible to get some option, --only-pullable or alike, to make it avoid merging "caches"?
Well, git-annex merge does not fetch, it only merges refs it sees. With the configuration I gave in the tip, you will not have a cache/git-annex branch for it to merge.
That is correct! My alias to fetch all remotes (useful to quickly catch up on the current state of development in others' feature branches) fetched the cache as well. Despite the viral nature of git tags, I consider it a good general approach. But fetching is not merging -- I can remove any of those remotes at any moment, should some remote become too heavy or something like that (tags are trickier).
IMHO annex merge should also not merge those remotes which are not "pullable" by default. Maybe it could take remote name(s) as its argument(s) to merge only the specified ones (ATM arguments seem to be silently ignored), should someone really need to merge any of those somehow. That would prevent accidentally blowing up the git-annex branch in case a cache remote gets fetched.
The -J2 web bug was not related to caching remotes at all, but was an accidental sort by remote uuid rather than cost. I've fixed it.
I fear that preventing merging of branches fetched from the cache remote in git-annex would be a game of whack-a-mole. There are just too many ways the user could bypass such protections. Including, for example, configuring git to fetch from cache to origin/ tracking branches.
I remember at some point discussing isolating repos from one-another so that data from one repo can't leak across a boundary to another repo, while still having it be a remote, and it was similarly just not tractable. Can't seem to find the thread, but it's basically the same problem.
If you do accidentally merge the git-annex branch from a cache remote, you can always make it dead and use git-annex forget --drop-dead.
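A sketch of that recovery, assuming the remote is named cache as in the tip:

cd my-repository
git annex dead cache
git annex forget --drop-dead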
If you really want to avoid any possibility of git fetching from the caching remote, make it a directory special remote! But, there is not currently any way to make annex.hardlink work for directory special remotes, so it will be less efficient.
My concern is not really about making it impossible, but about making it unlikely or avoidable. It is similar to how you cannot completely prevent someone from merging the git-annex branch "manually" using a regular git merge with some -s ours to "avoid" the conflicts: unavoidable, but very unlikely. ATM my problem is "likely" (as likely as me, the first user of the feature, running into it right away) and "unavoidable" (annex merge has no option/mode to avoid merging those). If we could avoid it somehow (e.g. by providing some option to annex merge) in those situations, that would be great. My concern is that we cannot avoid it at all.
Yep, we will add that information to some FAQ etc, very useful. But it might be a bit too late if we share that blown-up git-annex branch publicly and people merge it into their git-annexes. If someone is as advanced as configuring git with alternative fetch settings, they could indeed resort to this.
Found a remote.cache.fetch setting that will prevent most accidents, though of course a determined footgun script may find a way.
This would be super useful for my sorting annex! Being able to just drop the symlinks into other git-annex repositories and fetch the file from any of the "storage" ones (acting as caches) resolves "transfer between git-annexes" in a very neat and safe way!
I see that git-annex won't merge the cache's git-annex branch with this setup (which is great, because some of them are huge) but when a repo fetches a file from the cache, it will note that the cache repo had the file in the location log, right? Can this be avoided so the cache is entirely transient (once you remove the remote, there is no trace of it at all)?
It doesn't work well if the source of the copy is in a btrfs subvolume and the cache is in another subvolume of the same filesystem.
With that setup every file is really copied instead of using reflink=always.
I currently solved it by copying .git/annex/objects manually into the cache (cp -a --reflink=always .git/annex/objects ~/.cache/annex/) and afterwards doing the git annex copy, which recognizes the existence of the objects.
Hi Mowgli, could you please elaborate for a slow me -- are you saying that --reflink=auto is not causing CoW between different subvolumes of the same filesystem while --reflink=always does?
P.S. Glad to see more BTRFS & git-annex tandem users around.
Well, I did not check reflink=auto. I just checked first with git annex copy --to=cache, which simply duplicated every file. So there are two possibilities:
I think that annex always uses cp --reflink=auto for local paths (the cache remote was on a local path, right?). I guess running with --debug could have helped to resolve the mystery.
BTW -- checked locally: reflink=auto seems to work nicely across subvolumes of the same BTRFS filesystem. "Copying" gigabytes takes half a second or so (without reflink=auto it takes considerably longer).
git-annex looks at the file's stat() and only if the device id is the same as the stat of the destination directory does it use cp. If you see it running rsync instead, it's under the perhaps mistaken impression that it's a cross-device copy.
They are indeed not the same across subvolumes of the same BTRFS filesystem. cp seems to just attempt a cheap clone, and if that one fails, assumes that a full copy is required:
BTW, why rsync instead of a regular cp for a local filesystem if it is across devices?
Hi Joey. What would be the preferred way you would advise (ideally with minimal manual configuration) to make it happen? I.e. whenever user(s) get some load, it gets automagically annex copied to the cache?
I tried following the recipe above using git-annex v8 and successfully made a cache to which I can write efficient hardlinks from my working repos, but I am unable to read them back the same way, as hardlinks.
This means that on a 10GB dataset, where annex.thin lets me use only those 10GB, adding the cache doubles it to 20GB. This is not really a feasible amount of overhead for my use case.
I've done a full report with test cases comparing different solutions (check the branches!) at https://github.com/kousu/test-git-annex-hardlinks.
There seem to be several tangled issues: annex.hardlink in the cache overrides annex.thin in the working repo (despite the manpage claiming annex.thin overrides annex.hardlink), and annex get and annex copy both want to do the equivalent of a one-step fetch and checkout, where the checkout does a copy despite annex.thin being set.
Either a hardlink happens between ~/.annex-cache/.git/annex/objects <-> dataset/.git/annex/objects but a copy happens between dataset/.git/annex/objects <-> dataset/, or, with annex.hardlink in the cache, vice-versa: a copy happens between ~/.annex-cache/.git/annex/objects <-> dataset/.git/annex/objects and a hardlink happens between dataset/.git/annex/objects <-> dataset/.
In either case there's an extra full copy of my dataset, and I would rather not spend the time and space it takes to construct that every time I want to use my dataset somewhere.
I also tried mv ~/.annex-cache/.git/annex dataset/.git/ but that just confused git-annex fiercely.
I also tried git annex fix but it just seemed to do nothing. And anyway it isn't much help, since I need to run it after copy, which has already done a wasteful copy. I thought maybe fix could at least recognize that annex.thin is set and undo the wasted copy, but it doesn't.
I managed to work around it by side-stepping git-annex with find .git/annex/objects | ... | ln -f directly. This seems to work, and not to confuse git-annex too much -- it just makes an extra hardlink for some reason, but I can live with that.
What's the most supported way to cache and directly use the data in the cache? That's one of the main features I want in a cache and I can't figure out how to do it with git-annex.
Thanks for any pointers or clues towards getting this to work.