semi-synchronized remotes

In general, git-annex repositories that are "synchronized" (e.g. with the git-annex-sync command, whatever the backend) have a global namespace. Repositories will eventually converge to have very exactly the same content, generally using git's push/pull/merge mechanisms.

What if we do not wish to exactly have the same content across all repositories, but still want to share some objects?

An example use case here is content (e.g. .git/annex/objects blobs) sharing, without having to deliberately collaborate over a globally consistent set of objects in the master branch. Think of a decentralized conference proceedings repository where each conference could add their own content to a conference-specific repository, while at the same time allowing a unified view in another, more centralized repository, or allowing users to pick and choose which conference they would want content from.

While each repository could have its own distinct branch, all repositories will see all those branches and this may affect content retention, as git-annex may consider files to be "in use" because they are on some remote branch, for example. Furthermore, I consider git branching to be a rather advanced topic in git usage. While git-annex uses those mechanisms (e.g. the git-annex and sync/* branches), those are generally hidden from the user until something goes wrong. Therefore I looked into providing a more straightforward approach to this problem for my users and myself.

In my use case, I have the following repositories:

repoA: my own curated media collection
repoB: a third-party media collection

I do not wish for my local curated collection (repoA) to be completely synchronized with the third-party collection (repoB). This is because we may have different tastes and retention policies: while I archive everything, there are certain media I am not interested in. On the other hand repoB might keep only (say) the last month of media and disard older content but have a more varied collection, which only a subset is interesting to me. Yet I still want to access some of that content!

So I did the following to add the third party repository:

git remote add repoB example.net:repoB
git annex sync --no-push repoB
git annex get --from=repoB

This works well: I get the files from repoB locally. Of course, if repoB expires some files, this will be impacted locally, but I can always revert those choices without conflict, because I do not push those back.

The downside of the --no-push option in git-annex-sync is that it needs to be made explicit at each invocation of the command. Furthermore, this option is not supported by the assistant, which will happily sync the master branch to all remotes by default.

An alternative is to manually fetch and merge content:

git fetch repoB
git annex merge repoB
git reset HEAD^
# revert any possible changes upstream we don't want
git commit

Needless to say this quickly becomes quite messy, but it's the amazing level of control git and git-annex provides, which obviously comes with its price in complexity. Such a method will also be ignored by the assistant and further sync commands.

To make sure those principles are respected in the assistant or a plain git annex sync that may mistakenly be ran in that repository, I need some special setting. There are the options I considered, in .gitconfig or git-annex's config options:

remote.<name>.annex-ignore=true: sync and assistant will not sync content to the repository, but explicit get --from=repoB will still work.
remote.<name>.annex-sync=false: sync (and assistant?) will not sync the git repository with the remote
remote.<name>.push=nothing: git won't push by default, unless branches are explicitly given, which may actually be the case for git-annex, so unlikely to work.
remote.<name>.pushurl=/dev/null: will completely disable any push functionality to that remote. any sync will yield the following error:
```
fatal: '/dev/null' does not appear to be a git repository
[...]
git-annex: sync: 1 failed
```
remote.<name>.pushurl=.: will push to the local repo instead. crude hack and may confuse the hell out of git-annex, but at least doesn't yield errors.

A similar approach to hacking the pushurl is to make repoB read-only to the user. This however, may trigger the activation of annex-ignore by git-annex and will otherwise yield the same warnings as the pushurl=/dev/null hack.

Right now, I am using annex-sync = false in .git/config. I have also configured the repository to be in the "manual" standard group which will avoid copying files into that repository:

$ git annex group repoB manual
group repoB ok
(recording state in git...)
$ git annex wanted repoB standard
wanted repoB ok
(recording state in git...)

This is roughly equivalent to setting annex-ignore = true, yet it allows for more flexibility. I could, for example, create custom content expressions to sync certain folders automatically.

A disadvantage of the annex-sync settings is that it affects both ways (push and pull), not just push, which is what I am interested in. Although it could be argued that restricting both is fine here because we want to manually review changes when we pull changes from those remotes anyways.

The best approach may be to have git-annex respect the remote.<name>.push=nothing setting. Another approach would be to add remote.<name>.annex-push and remote.<name>.annex-pull settings that would match the sync --[no-]push --[no-]pull flags.

Note that this is similar in concept to Bittorrent-like features, although here we assumes you already have some transport to share anything you need, yet still have to address the question of semi-synchronized git repositories in some way.

I would obviously welcome additional comments and questions on this approach. -- anarcat

RSS Atom

comment 1

Setting remote.<name>.annex-readonly=true prevents git-annex sync from pushing changes to the remote. It also prevents any git-annex command from copying annexed file contents to the remote, or deleting annexed file contents. So I think it's ideal for this kind of situation.

There does seem to be room for configs to prevent sync from pulling/pushing without making the remote fully readonly. For example, the remote might be a source of content, that only knows about the files it added and not other files in the local repository, so dropping files from it should be allowed but not pushing to it.

So, I've added remote.<name>.annex-push and remote.<name>.annex-pull.

Comment by joey — Wed Apr 5 16:11:57 2017

Remove comment