Ensure all versions are on remotes

I thought I was getting the hang of annex, but have run into a bit of a problem. I could use some help ensuring that everything ends up in the right place.

Specifically, it seems like sync does a lot more than push and pull files: it might actually try to drop things from remotes (at least with the -a option).

I have two machines I work on, and should only have active content on them. I have two special remotes (S3/wasabi) that should have everything that's ever been annexed, including old versions of files.

If I git annex sync -a -A then it will pull all versions locally as well. So I think I may have to separate the get and copy commands?

Here's what I'm doing so far:

git annex config --set annex.synccontent false
git annex config --set annex.synconlyannex true
git annex config --set annex.autocommit false

git annex group wasabi-east wasabi
git annex group wasabi-west wasabi
git annex groupwanted wasabi anything
git annex required wasabi-east groupwanted
git annex required wasabi-west groupwanted

git annex group machine1 active
git annex group machine2 active
git annex groupwanted active anything

# from machine1
git annex sync -a origin machine2 wasabi-east wasabi-west
git annex get -a
# list the remotes unquoted so the loop runs once per remote
for remote in wasabi-east wasabi-west
do
  git annex copy -A --not --in "$remote" -t "$remote"
done

I think that's what I need to do? I don't think I can use git annex sync -a -A wasabi-east wasabi-west because I don't want to pull old versions to my local machine.

comment 1

The unused preferred content expression is probably what you're looking for.

As for your second problem, add your client repositories to a group as well and make them only want present, approxlackingcopies=1 or something along those lines. That will stop sync --content from trying to pull down everything.

Comment by Atemu — Tue Sep 13 09:53:45 2022

comment 2

sync --content will certainly remove files from a repository when the preferred content settings for that repository indicate it should not contain that content.

When you use sync --all, a preferred content setting like "include=*" or "exclude=*" will only ever match files in the current working tree, not past versions of files.

So, if the remote has such a preferred content expression, sync --all --content will remove the past versions of files from it.

The way to avoid this behavior is to use a preferred content expression that does not match on the filename. Eg, "anything". Or don't set a preferred content expression in the first place.

Comment by joey — Tue Sep 13 18:40:02 2022

comment 3

hrm… as you can see in my post, I AM using “anything” as the wanted content. So I would expect all of the remotes (wasabi and machines) to get all of the file versions. But that’s not happening. It’s behaving more like “used” would.

I will try “anything or unused” despite the fact that it seems like “or unused” should be unnecessary.

Comment by pat — Tue Sep 13 21:14:02 2022

comment 4

I think I've set up what you wanted, and I think en route I started to understand the problem you were having.

Both of the wasabis want all content whether it's unused or not. So I left their preferred and required content settings unchanged, since wanting everything is the default. ("anything" would have the same effect). To make the local repository not want to hang onto unused content, I used:

git-annex wanted here 'include=*'

With that, git-annex sync --content --all wasabi-east would copy an unused key to wasabi-east. But then it would drop it from the local repository. So a subsequent sync with wasabi-west was not able to send a copy there, because the key had already been removed from the local repo. I think perhaps that is the problem you were having?

A workaround is to sync with both at once, like you have been doing:

joey@darkstar:~/tmp/bench/foo>git-annex sync --content --all wasabi-west wasabi-east
commit 
On branch master
nothing to commit, working tree clean
ok
copy SHA256E-s30--4b03bc898384c2cf2861327108e988ba56d839b831e743d87b066cc4a5a7f487 (to wasabi-west...) 
ok                                
copy SHA256E-s30--4b03bc898384c2cf2861327108e988ba56d839b831e743d87b066cc4a5a7f487 (to wasabi-east...) 
ok                                
drop SHA256E-s30--4b03bc898384c2cf2861327108e988ba56d839b831e743d87b066cc4a5a7f487 ok
(recording state in git...)

But if you forget to sync with both at the same time, or if one of them is unreachable, you can end up with only one of them having a copy. Not great.

To avoid that problem, I set numcopies to force there to be 2 copies at all times:

joey@darkstar:~/tmp/bench/foo>git-annex numcopies 2
numcopies 2 ok

Now syncing with one of the wasabi remotes keeps the unused content locally present:

joey@darkstar:~/tmp/bench/foo>git-annex sync --content wasabi-east --all
commit 
On branch master
nothing to commit, working tree clean
ok
copy SHA256E-s30--906a7a36a37e2ffcee8164da8d7275a7fc1d775b3a9c040771725c9f1e4a9222 (to wasabi-east...) 
ok

Once the content reaches the other wasabi remote too, it can drop the local copy:

joey@darkstar:~/tmp/bench/foo>git-annex sync --content wasabi-west --all
commit 
On branch master
nothing to commit, working tree clean
ok
copy SHA256E-s30--906a7a36a37e2ffcee8164da8d7275a7fc1d775b3a9c040771725c9f1e4a9222 (to wasabi-west...) 
ok                                
drop SHA256E-s30--906a7a36a37e2ffcee8164da8d7275a7fc1d775b3a9c040771725c9f1e4a9222 ok
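
The flow shown in the transcripts above could be wrapped in a small script. This is only a sketch: run(), set_numcopies() and sync_archives() are hypothetical helper names, and DRY_RUN is a made-up switch so the commands are printed rather than executed by default. The git-annex commands themselves are the ones used above.

```shell
#!/bin/sh
# Sketch only. run(), set_numcopies() and sync_archives() are hypothetical helpers.
# Commands are printed, not executed, unless DRY_RUN=0 is set.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"    # dry run: show the command
    else
        "$@"                  # real run: execute it
    fi
}

set_numcopies() {
    # Require two copies of every key before a drop is allowed.
    run git-annex numcopies 2
}

sync_archives() {
    # Sync with each archive remote in turn; a failure on one remote
    # is reported but does not stop the others from being tried.
    for remote in "$@"; do
        run git-annex sync --content --all "$remote" ||
            printf 'sync with %s failed, retry later\n' "$remote" >&2
    done
}

set_numcopies
sync_archives wasabi-east wasabi-west
```

Running it with DRY_RUN=0 in the environment would execute the commands for real; leaving it unset just shows what would be run.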

But you also have two repositories that you work in. That complicates things a bit. Let's bring repository bar into the picture:

joey@darkstar:~/tmp/bench/bar>git-annex wanted here 'include=*'
wanted here ok

Now, when there's a file that is on foo and bar and wasabi-east, syncing with wasabi-west will copy it to there. But what if the file gets deleted, and we sync with wasabi-east before syncing with wasabi-west?

joey@darkstar:~/tmp/bench/bar>git-annex get myfile
get myfile (from wasabi-east...) 
ok                                
(recording state in git...)
joey@darkstar:~/tmp/bench/bar>git-annex whereis myfile
whereis myfile (3 copies) 
    948c3bc7-91ce-4a2e-880d-a5e614c8f0e1 -- [foo]
    db04e144-c10a-4ed2-a557-f3fd6d5410c3 -- bar [here]
    fc4dbe84-ddd5-4bc3-b9c5-f5262df528a3 -- [wasabi-east]
joey@darkstar:~/tmp/bench/bar>git rm myfile 
rm 'myfile'
joey@darkstar:~/tmp/bench/bar>git commit -m removed
[master e8cb8fe] removed
joey@darkstar:~/tmp/bench/bar>git-annex sync --content wasabi-east --all
commit 
On branch master
Your branch is ahead of 'origin/master' by 1 commit.
  (use "git push" to publish your local commits)

nothing to commit, working tree clean
ok
drop SHA256E-s30--a8ae034d9371c80258e9d2e0fd436c6974a839dd42c220792f4cac8f597de3d1 ok

It dropped it because there are enough copies that it could. Now a sync with wasabi-west from bar won't send the content to it, since bar no longer has it. Since foo still has a copy, syncing with wasabi-west on foo will move it to there still. But this is perhaps suboptimal.

A better configuration that avoids that problem, though it is more complicated, follows:

git-annex group wasabi-east collector
git-annex group wasabi-west collector
git-annex wanted foo 'include=* or not inallgroup=collector'
git-annex wanted bar 'include=* or not inallgroup=collector'

This forces the work repositories to hang onto unused keys until they reach all the collector repositories.
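
If there are more work repositories or collectors, the same configuration can be generated in a loop. A rough sketch, assuming the repository names from this thread; run(), configure() and the COLLECTORS, WORK_REPOS, GROUP and DRY_RUN variables are hypothetical names, and by default the commands are only printed:

```shell
#!/bin/sh
# Sketch only. run() and configure() are hypothetical helpers.
# Commands are printed, not executed, unless DRY_RUN=0 is set.

COLLECTORS="wasabi-east wasabi-west"   # remotes that must end up with every key
WORK_REPOS="foo bar"                   # repositories you actually work in
GROUP="collector"

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        line=
        for arg in "$@"; do
            case $arg in
                # quote arguments containing spaces, globs, etc. for display
                # (assumes the argument itself contains no single quote)
                *[!A-Za-z0-9_=./-]*) arg="'$arg'" ;;
            esac
            line="${line:+$line }$arg"
        done
        printf '%s\n' "$line"
    else
        "$@"
    fi
}

configure() {
    # Put every collector remote into the group.
    for remote in $COLLECTORS; do
        run git-annex group "$remote" "$GROUP"
    done
    # Work repos keep a key until all collectors have it.
    for repo in $WORK_REPOS; do
        run git-annex wanted "$repo" "include=* or not inallgroup=$GROUP"
    done
}

configure
```

The preferred content expression is passed as a single argument, so the shell never expands the * in include=*.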

Comment by joey — Thu Sep 15 18:31:04 2022

comment 5

Thanks for looking into this, and explaining. Your final configuration makes sense to me... I can't say I fully understand why, but I need -a; otherwise the local repo will get all versions, leaving me with a bunch of unused keys. So my command is git annex sync -a -A wasabi-east wasabi-west. I took the other machine out of the sync because sometimes it's offline and I don't want to wait around for an SSH timeout.

I have tested that if I force drop a key from wasabi-east, that sync command will get the key from wasabi-west and then copy it to wasabi-east. It doesn't automatically drop it locally; I have to run git annex drop --unused, but that's not a big deal.

I would just like to be confident that the wasabi remotes are a lockbox for any keys added to my annex. So... I think it works? I'll keep using it and report back if I run into any weirdness.
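
One way to spot-check the lockbox might be to list keys that a wasabi remote is missing. This is a sketch under assumptions: it reuses the --all --not --in matching from the copy command in the original post and assumes whereis accepts the same combination, which is worth confirming against the man pages. run(), audit_remote() and DRY_RUN are hypothetical names, and the commands are only printed by default.

```shell
#!/bin/sh
# Sketch only. run() and audit_remote() are hypothetical helpers.
# Commands are printed, not executed, unless DRY_RUN=0 is set.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"    # dry run: show the command
    else
        "$@"                  # real run: execute it
    fi
}

audit_remote() {
    # Assumption: lists every known key (old versions included)
    # that is not recorded as present on the given remote.
    run git annex whereis --all --not --in "$1"
}

for remote in wasabi-east wasabi-west; do
    audit_remote "$remote"
done
```

If the assumption holds, no output from a real run would suggest that the remote has every key git-annex knows about.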

Comment by pat — Wed Sep 28 08:40:11 2022
