I first mentioned this issue in a thread about 4 years ago (https://git-annex.branchable.com/forum/git-annex_across_two_filesystems/), and at the time was encouraged to instead open a new thread. Priorities changed, and I'm only now returning to the issue.
The situation we have is as follows: We have a large collection of boundary condition data used in our weather/climate model. Individual "experiments" are run against specific versions of this data and we would like to minimize the total storage footprint as well as time spent copying data at the beginning of an experiment. The clones for the experiments would always be used in a read-only manner. New files would never be added through these repos.
At first glance, using git-annex with git clone --shared would seem to be a good solution. Unfortunately, these experiments span a large number (~10) of separate cross-mounted filesystems, which would result in ~90% of the experiments still duplicating data rather than sharing it via hardlinks.
A couple of partial solutions suggest themselves. (1) Put all of the clones on the same filesystem as the primary repo, and then create a symlink within each experiment back to the corresponding clone. (2) Maintain a secondary (fully populated) clone on each filesystem and ensure that the experiment setup script clones from the proper secondary.
Option (1) is viable, but would require some negotiations with the computing center to ensure that there is a single filesystem that gives appropriate privileges to all of our users. Tedious, but probably not a showstopper.
Option (2) sounds like an improvement over having 90% of the experiments duplicate data locally, except that the secondary clones would need to support any recent model configuration, so the 10x duplication of "all" data could end up much larger than the hundreds of copies of the smaller subsets needed by individual experiments.
Perhaps the ideal solution would be some sort of special "clone" that uses symlinks back to the primary repository. These special clones would be read-only, and could even disable "dangerous" git actions that would allow adding/modifying files. git-new-workdir hints that something like this might be possible, but it does not appear to play nicely with git-annex in any event.
Have I understood you correctly: you have a "primary" repository (with all data/keys present), accessible by the clients via NFS/CIFS/whatever? And the clients (the "experiments") want to check out a specific version/branch from that repo?
I think you have two alternatives to cloning it everywhere including all keys:
a) Every client clones the git repo (and removes the "origin" remote to ensure that nothing flows back), creates a symlink from .git/annex/objects to /path/to/primary/.git/annex/objects, and checks out whatever version/branch it wants. Easy.

b) Every client uses the primary repo, but via its own worktree (see git-worktree). git-annex supports external worktrees, but I'm not sure what problems could arise in this particular setup.

Symlinking .git/annex/objects back to the primary repo might just work. I seem to remember that parts of git-annex get slow or may even not work if .git/annex/objects is on a different filesystem than .git/annex/. So I'd try symlinking .git/annex back to the primary repo instead. Note that this limits how the clone can be used if the user lacks permissions to write to the primary repo.
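Option a) can be sketched roughly as follows. All paths here are hypothetical, and the "primary" repo is a throwaway stand-in built on the spot so the commands run end-to-end; in a real setup you would clone the actual shared repo, whose .git/annex is created by git annex init.

```shell
set -e
ROOT=$(mktemp -d)

# Stand-in for the primary repo on the shared filesystem; in practice
# .git/annex is created there by 'git annex init'.
PRIMARY="$ROOT/primary"
git init -q "$PRIMARY"
git -C "$PRIMARY" -c user.email=a@b -c user.name=a commit -q --allow-empty -m init
mkdir -p "$PRIMARY/.git/annex/objects"

# The experiment clone, used read-only.
EXP="$ROOT/exp1"
git clone -q "$PRIMARY" "$EXP"
git -C "$EXP" remote remove origin        # ensure nothing flows back

# Point the clone's annex object store at the primary's.
mkdir -p "$EXP/.git/annex"
ln -s "$PRIMARY/.git/annex/objects" "$EXP/.git/annex/objects"

# Finally, check out whatever version/branch the experiment needs, e.g.:
#   git -C "$EXP" checkout <version-tag>
```

The checkout step is left as a comment since the tag name depends on your data versioning scheme.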
The problem with git-worktree in your situation is that it actually needs to be run in the primary repo, and it modifies that repo, e.g. by creating .git/worktrees/. So users who do not have write access to the primary repo would not be able to do that, and you seemed to indicate it would be a problem getting all users write access to it.
I think maybe you could have one replica of the primary repo per drive, and then individual experiments run on that drive are set up with git-worktree in the drive's replica. This way you would avoid permissions problems with the primary repo, and the experiment also runs on a presumably more local and faster drive. When setting up each experiment, you can git annex get the files it needs, which will pull them into the replica.

Now, that does run the risk that all the replicas end up caching a lot of data from old experiments that is not being used by new experiments. So you would want to find a way to clear out the unused data.
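A sketch of this per-drive replica setup, again with a throwaway stand-in for the replica so it runs end-to-end; the exp42 branch name and paths are made up, and the git annex get step requires git-annex, so it is shown only as a comment.

```shell
set -e
ROOT=$(mktemp -d)

# Stand-in for the drive-local replica of the primary repo.
REPLICA="$ROOT/replica"
git init -q "$REPLICA"
git -C "$REPLICA" -c user.email=a@b -c user.name=a commit -q --allow-empty -m init

# Each experiment gets its own worktree on its own branch. git-worktree
# records the worktree under $REPLICA/.git/worktrees/, so only the
# replica (not the primary repo) needs to be writable by the user.
EXP="$ROOT/exp42"
git -C "$REPLICA" worktree add -q -b exp42 "$EXP"

# Inside the worktree, pull only what this experiment needs into the
# replica's annex (requires git-annex):
#   git -C "$EXP" annex get <paths-this-experiment-needs>
```

Because each experiment lives on its own branch, the branch doubles as the "used by this experiment" marker for the cleanup step described below.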
It seems helpful to that end that git-worktree manages a branch for each worktree that's currently checked out. If each experiment used its own branch that contained only the files used in that experiment, and not a lot of other files, then you could use git annex unused --used-refspec=+refs/heads/* to find data that was not used by any experiment that had a worktree using that replica.

Alternatively, you could check, when removing an experiment's worktree from the replica, whether that was the last experiment using that replica. When it was, simply use git annex drop --all
to clean out the replica.

Sorry for the delay in responding. I had a short vacation and jury duty that kept me from responding immediately.
The symlink approach suggested by @Lukey appears to work quite well in my limited testing thus far. The git-worktree approach unfortunately requires additional write permissions in the primary repository (as noted by @Joey), and is thus not our preferred approach.

As for the performance consequences of the symlink option, my tests thus far did not show any obvious slowdown related to multi-filesystem involvement. Certainly my test repo (~100 files of 50 MB each) seemed to perform quite well, so a real test may effectively require deploying the full data set in this manner.
I had previously tried @Joey's suggestion of symlinking just .git/annex rather than .git/annex/objects. My recollection is that I got error messages, but I will test again shortly.
I just wanted to confirm that the suggestion of linking .git/annex (as opposed to .git/annex/objects) appears to work as well. I'm not sure what I did wrong on my earlier attempt.
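Concretely, the .git/annex variant amounts to something like the following; paths are hypothetical, and the primary repo is again a throwaway stand-in (in a real setup its .git/annex comes from git annex init on the shared repo).

```shell
set -e
ROOT=$(mktemp -d)

# Stand-in for the shared primary repo.
PRIMARY="$ROOT/primary"
git init -q "$PRIMARY"
git -C "$PRIMARY" -c user.email=a@b -c user.name=a commit -q --allow-empty -m init
mkdir -p "$PRIMARY/.git/annex/objects"

# Read-only experiment clone whose entire .git/annex is a symlink,
# rather than just .git/annex/objects.
EXP="$ROOT/exp"
git clone -q "$PRIMARY" "$EXP"
git -C "$EXP" remote remove origin
ln -s "$PRIMARY/.git/annex" "$EXP/.git/annex"
```

Note that a fresh clone has no .git/annex of its own (that directory is not tracked by git), so the symlink can be created directly without deleting anything first.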
Performance measurements will not be available for quite some time, but anecdotally both approaches seem extremely fast.