Need to import ~10TB of data synced from Google Storage. I have a BTRFS file system (so CoW is available), and initialized the remote with

    git annex initremote buckets type=directory importtree=yes encryption=none directory=../buckets/

`../buckets` resides on the same mount point, so `cp --reflink=always ../buckets/...` works out nicely.
But when I ran `git annex import --from=buckets -J 10 master`, I saw that:

- no `cp` is invoked by git-annex (according to `pstree`)
- the output IO equals the input IO (in `dstat`), suggesting that the data is actually being copied
I have not looked in detail at the copied keys to see whether they share storage extents, so my observation could still be wrong.
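For reference, a hedged way to check extent sharing, assuming `filefrag` from e2fsprogs; `$src` and `$dst` below are placeholders, not actual paths from my setup:

```shell
# Sketch: if the physical_offset columns printed by filefrag -v match
# for both files, the copy is a reflink (shares extents with the source).
src=../buckets/path/to/file    # placeholder: original file
dst=path/to/copied/file        # placeholder: the copied key
diff <(filefrag -v "$src" | awk 'NR>3 {print $4}') \
     <(filefrag -v "$dst" | awk 'NR>3 {print $4}') \
  && echo "extents shared (reflink)" \
  || echo "extents differ (full copy)"
```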
Joey, is it expected to take advantage of CoW with git-annex 8.20210223-1~ndall+1 in such a case, or is it still a TODO? Any "workaround"? (E.g., maybe I could just `cp --reflink=always`, `git annex add`, and then do the import, and annex would just reuse the already-CoWed keys?)
It currently does not make reflinks. That should be relatively easy to fix.
Of course, if you can get the content into the git-annex repository by other means, like a manual `cp --reflink`, git-annex will never have any reason to transfer the imported file from the remote, so that workaround would work.
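A minimal sketch of that workaround, assuming the repo and `../buckets` are on the same btrfs mount; the repo path is illustrative:

```shell
# Reflink the content into the repo first, then add and import;
# git-annex then has no reason to transfer anything from the remote.
cd /path/to/annex-repo                         # placeholder repo path
cp -r --reflink=always ../buckets ./buckets    # CoW copy, near-zero extra storage
git annex add buckets
git annex import --from=buckets master         # reuses the already-present keys
```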
Also, `git-annex import` from a directory special remote changed fairly recently to only hash the files and not get their contents. But an older version would get them. Anyway, when the contents do eventually need to be copied from it, the above still applies.
Er, actually the directory special remote never uses `cp --reflink`, even when it's a key/value store. That's only implemented for gets from git remotes currently. So some refactoring will be needed.
FWIW: doing that now. A minor pain will come later when I need to update it (only once more) with newly synced data (some files will be added, some removed). To simplify my own life I will just `rm -rf` in the target git-annex repo, redo `cp --reflink`, and redo `datalad save` (which will take care of removing the gone files and running `git-annex add` on all the others) -- but that will again take many hours to complete even though only a handful of files will have changed.

Worth noting that `git annex import /dir` also does not use CoW. However, since that part of the import interface is desired to be replaced with importing from a special remote, I'm inclined not to go the extra mile for CoW there.

I think I'll be able to implement the CoW before you have to work around it again, yarik.
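The re-sync cycle described above can be sketched as follows, assuming DataLad is available (`datalad save` both removes gone files and adds new ones); paths are illustrative:

```shell
# Periodic re-sync into the annex repo after ../buckets has been updated.
cd /path/to/annex-repo                  # placeholder repo path
rm -rf buckets                          # drop the old tree in the repo
cp -r --reflink=always ../buckets .     # CoW copy of the freshly synced data
datalad save -m "re-sync buckets"       # removes gone files, git-annex adds the rest
```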
Correction: import from a directory special remote still copies the content by default. But with `--no-content` it does not. You might have been able to use that, if you did not want to load the content into your repo and were ok leaving it on the remote.
woohoo -- I will give it a shot! (might as well just interrupt the ongoing "process")
It is good that it copies content -- that is the point of using CoW here: to gain a full copy of the content at virtually no (storage) cost, so that if the original directory changes, its copy remains nicely versioned etc. in git-annex land.