forum/avoiding copies across multiple file systems (git-annex)
http://git-annex.branchable.com/forum/avoiding_copies_across_mutiple_file_systems/
updated 2021-05-17T14:53:19Z

comment 1 by Lukey, 2021-05-05T18:08:49Z
http://git-annex.branchable.com/forum/avoiding_copies_across_mutiple_file_systems/comment_1_2ff3e4d2af575b27b14a0a936ad8c52f/
<p>Have I understood you correctly: you have a "primary" repository (with all data/keys present), accessible by the clients via NFS/cifs/whatever, and the clients (or "experiments") want to check out a specific version/branch from that repo?</p>
<p>I think you have two alternatives to cloning it everywhere including all keys:</p>
<p>a) Every client clones the git repo (and removes the "origin" remote to ensure that nothing flows back), creates a symlink at <code>.git/annex/objects</code> pointing to <code>/path/to/primary/.git/annex/objects</code>, and checks out whatever version/branch it wants. Easy.</p>
<p>b) Every client uses the primary repo, but via its own worktree (see <code>git-worktree</code>). git-annex supports external worktrees, but I'm not sure what problems could arise in this particular setup.</p>
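<p>Option (a) could be sketched roughly as follows. The function names and paths are hypothetical, and this is only one way to arrange the steps. One assumption worth flagging: the branch is picked at clone time, because removing the "origin" remote also deletes its remote-tracking branches, so a later checkout of a branch that only existed on origin would fail.</p>

```shell
#!/bin/sh
# Sketch only: helper names and all paths are placeholders, not git-annex features.

# Point a clone's annex object store at the primary repo's object store.
link_annex_objects() {
    primary=$1   # path to the primary repo
    clone=$2     # path to the client's clone
    mkdir -p "$clone/.git/annex"
    # Refuse to clobber an object store (or link) the clone already has.
    if [ -e "$clone/.git/annex/objects" ] || [ -L "$clone/.git/annex/objects" ]; then
        echo "$clone/.git/annex/objects already exists" >&2
        return 1
    fi
    ln -s "$primary/.git/annex/objects" "$clone/.git/annex/objects"
}

# Clone the primary repo on the wanted branch, detach it from origin,
# then share the primary's annexed objects through the symlink.
setup_client_clone() {
    primary=$1
    clone=$2
    branch=$3    # branch (or tag) to check out
    git clone --branch "$branch" "$primary" "$clone" || return 1
    # Removing origin ensures nothing flows back to the primary repo.
    git -C "$clone" remote remove origin || return 1
    link_annex_objects "$primary" "$clone"
}
```

<p>Usage would look like <code>setup_client_clone /path/to/primary /scratch/exp1 some-branch</code>, where all three arguments are made up for illustration.</p>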
comment 2 by joey, 2021-05-12T15:44:23Z (updated 2021-05-12T16:22:48Z)
http://git-annex.branchable.com/forum/avoiding_copies_across_mutiple_file_systems/comment_2_96e72de1dc7cefdaa3beedc837639f57/
<p>Symlinking .git/annex/objects back to the primary repo might just work. I
seem to remember that parts of git-annex get slow or may not even work if
.git/annex/objects is on a different filesystem than .git/annex/. So I'd
try symlinking .git/annex back to the primary repo instead. Note that this
limits how the clone can be used, if the user lacks permissions to write to
the primary repo.</p>
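<p>A minimal sketch of that variant, linking the whole <code>.git/annex</code> directory instead of only the object store. The helper name and paths are hypothetical, and the write-access check is just an illustration of the permissions caveat above, not something git-annex does itself.</p>

```shell
#!/bin/sh
# Sketch only: share the primary repo's entire .git/annex directory.
link_annex_dir() {
    primary=$1   # path to the primary repo (placeholder)
    clone=$2     # path to the client's clone (placeholder)
    target="$clone/.git/annex"

    # A clone that already has its own .git/annex is left untouched.
    if [ -e "$target" ] || [ -L "$target" ]; then
        echo "$target already exists; not replacing it" >&2
        return 1
    fi

    # Warn when the user cannot write to the primary's annex directory,
    # since that limits how the clone can be used.
    if [ ! -w "$primary/.git/annex" ]; then
        echo "warning: no write access to $primary/.git/annex" >&2
    fi

    ln -s "$primary/.git/annex" "$target"
}
```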
<p>The problem with git-worktree in your situation is that it actually
needs to be run in the primary repo, and it modifies that repo,
e.g. by creating .git/worktrees/. So users who do not have write access to
the primary repo would not be able to do that, and you seemed to indicate
it would be a problem getting all users write access to it.</p>
<p>I think maybe you could have one replica of the primary repo per drive, and
then individual experiments run on that drive are set up with git-worktree
in the drive's replica. This way you would avoid permissions problems
with the primary repo, and also the experiment runs on a presumably more
local and faster drive. When setting up each experiment, you can <code>git annex
get</code> the files it needs, which will pull them into the replica.</p>
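<p>Setting up one experiment in a per-drive replica might look something like this. It is a sketch under assumptions: the function name, the paths, and the idea that the experiment's branch already exists in the replica are all made up for illustration.</p>

```shell
#!/bin/sh
# Sketch only: names and paths are placeholders.
setup_experiment() {
    replica=$1    # replica of the primary repo on this drive
    worktree=$2   # absolute path where the experiment is checked out
    branch=$3     # existing branch holding the experiment's files

    # Runs in the replica and records the worktree under its .git/worktrees/.
    # A relative worktree path would be resolved relative to the replica here,
    # so an absolute path is assumed.
    git -C "$replica" worktree add "$worktree" "$branch" || return 1

    # Fetch content for the annexed files in the worktree; since the worktree
    # belongs to the replica, the content lands in the replica's annex.
    git -C "$worktree" annex get .
}
```

<p>For example, <code>setup_experiment /drive1/replica /drive1/exp42 exp42</code>, with all three names hypothetical.</p>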
<p>Now, that does run the risk that all the replicas end up caching a lot of
data from old experiments that is not being used by new experiments. So
you would want to find a way to clear out the unused data.</p>
<p>It seems helpful to that end that git worktree manages a branch for each
worktree that's currently checked out.
If each experiment used its own branch that contained only the files used
in that experiment and not a lot of other files, then you could use
<code>git annex unused --used-refspec=+refs/heads/*</code> to find data that was not
used by any experiment that had a worktree using that replica.</p>
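<p>As a rough sketch, that cleanup could be wrapped up like this. The function name and replica path are hypothetical. The refspec is quoted so the shell does not try to expand the <code>*</code>, and <code>git annex dropunused all</code> is my addition for actually removing what <code>unused</code> found; it should still refuse to drop content it cannot verify exists elsewhere.</p>

```shell
#!/bin/sh
# Sketch only: the helper name and path are placeholders.
prune_replica() {
    replica=$1   # replica repo whose cached data should be trimmed

    # Treat only local branches (including those checked out in worktrees)
    # as "used"; everything else is listed as unused.
    git -C "$replica" annex unused --used-refspec='+refs/heads/*' || return 1

    # Drop everything listed by the preceding "unused" run.
    git -C "$replica" annex dropunused all
}
```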
<p>Alternatively, you could check, when removing an experiment's worktree from
the replica, if that was the last experiment using that replica. When it
was, simply use <code>git annex drop --all</code> to clean out the replica.</p>
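<p>One possible way to do that check, as a sketch: count the entries in <code>git worktree list --porcelain</code> after removing the experiment's worktree. The function name and paths are hypothetical, and it assumes the replica's own main working tree is the only entry left when no experiments remain.</p>

```shell
#!/bin/sh
# Sketch only: names and paths are placeholders.
remove_experiment() {
    replica=$1    # replica repo the experiment was attached to
    worktree=$2   # the experiment's worktree path

    git -C "$replica" worktree remove "$worktree" || return 1

    # Each worktree shows up as one "worktree <path>" line; the replica's
    # main working tree is always listed, so 1 means no experiments are left.
    remaining=$(git -C "$replica" worktree list --porcelain | grep -c '^worktree ')

    if [ "$remaining" -le 1 ]; then
        # Last experiment gone: clear all cached content from the replica.
        git -C "$replica" annex drop --all
    fi
}
```

<p>Worktree directories that were deleted by hand still count until <code>git worktree prune</code> is run in the replica, so that may be worth doing first.</p>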
comment 3 by tom_clune, 2021-05-12T16:41:07Z
http://git-annex.branchable.com/forum/avoiding_copies_across_mutiple_file_systems/comment_3_43db664a00fb9956ad392b5dc23565ca/
<p>Sorry for the delay in responding. I had a short vacation and jury duty that kept me from responding immediately.</p>
<p>The symlink approach suggested by @Lukey appears to work quite well in my limited testing thus far. The git-worktree approach unfortunately requires additional write permissions (as noted by @Joey) in the primary repository, and is thus not our preferred approach.</p>
<p>As for the performance of the symlink option, my tests thus far did not show any obvious slowdown from spanning multiple filesystems. My test repo (~100 files of 50 MB each) performed quite well, so a meaningful test may effectively require deploying the full data set in this manner.</p>
<p>I had previously tried @Joey's suggestion of symlinking just .git/annex rather than .git/annex/objects. My recollection is that I got error messages, but I will test again shortly.</p>
update by tom_clune, 2021-05-17T14:53:19Z
http://git-annex.branchable.com/forum/avoiding_copies_across_mutiple_file_systems/comment_4_6584f98910d91746203405d57d17b8b0/
<p>I just wanted to confirm that the suggestion of linking .git/annex (as opposed to .git/annex/objects) appears to work as well. I'm not sure what I did wrong on my earlier attempt.</p>
<p>Performance measurements will not be available for quite some time, but anecdotally both approaches seem extremely fast.</p>