What is a star-topology
Basically having only one remote with the private = false setting (the default) and having all other machines with private = true.
In this setup, all users and clones have to pull from the central private = false remote, and they can't get/copy/sync directly between each other.
In exchange, the situation is easy to understand, easy to explain to non-technical people, and easy to automate in a team where people who are not interested in git also have to participate. Also, the content of the git-annex branch stays very simple, which makes debugging and hacking easier.
In some sense, you can call this git-annex stupidified back to git-lfs levels, but if you think about it, it's still a lot better: e.g. you can manage partial clones easily (by just not downloading the files that you are not interested in with git annex get), and you also get the symlinks way of life without any git filters, which is honestly simply better than git-lfs.
Difficulties implementing a star-topology
The only problem with this is that it's hard to enforce, because to keep your git-annex branch completely clean even in the face of novice users, you have to ensure that EVERYBODY, ALWAYS, ON ALL THEIR MACHINES issues a git config annex.private true command first, before starting to play around with git-annex based on tutorials/forums/email-threads/etc. In practice, this is not possible.
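As a rough illustration of the step every user would have to remember, here is a sketch of preparing a fresh repository so that annex.private is set before git-annex is initialized. The temporary directory stands in for a real clone, and the comment marks where git annex init would be run; both are assumptions made for illustration.

```shell
# Create a throwaway repository standing in for a user's fresh clone.
repo=$(mktemp -d)
git init -q "$repo"

# Set annex.private BEFORE git-annex is initialized, so that this clone's
# UUID is kept out of the shared git-annex branch.
git -C "$repo" config annex.private true

# Only now would the user run: git -C "$repo" annex init
private=$(git -C "$repo" config --get annex.private)
echo "annex.private=$private"
```

Forgetting the config line, or running it after initialization, is exactly the mistake that is hard to rule out across a whole team.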
Feature request
I would like to have a uuid-allowlist.log file in the root folder of the git-annex branch. If it exists, it is always read during startup of git-annex for any operation that operates on the branch, and every line contains exactly one UUID.
If any output file is written anywhere into the git-annex branch (trusted.log, uuid.log, remote.log, and also every file, e.g. xxx/yyy/SHA256E-...), this list is always consulted, and if during writing the file git-annex wants to write a non-allowlisted UUID for any reason, then it immediately stops with an error message, without committing to the branch. Of course, if we can make the check sooner, e.g. before adding it to the index of annex, that is even better.
This of course should work for all SSH git remotes, but also for all special remotes, if the UUID is allowlisted.
If there is no uuid-allowlist.log file found, then nothing should change compared to the current implementation.
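To make the requested behavior concrete, here is a small sketch of the check in plain shell. It is not something git-annex does; check_uuids is a hypothetical helper, and it assumes the UUIDs about to be written are supplied one per line on standard input.

```shell
# check_uuids ALLOWLIST
# Reads candidate UUIDs from stdin, one per line. Returns 1 and reports
# the first UUID that is not a line of ALLOWLIST. If ALLOWLIST does not
# exist, nothing is checked, matching the "no file, no change" rule.
check_uuids() {
    allowlist=$1
    [ -f "$allowlist" ] || return 0
    while IFS= read -r uuid; do
        [ -z "$uuid" ] && continue
        # -F: fixed string, -x: whole line must match, -q: quiet
        if ! grep -Fxq -- "$uuid" "$allowlist"; then
            echo "uuid not allowlisted: $uuid" >&2
            return 1
        fi
    done
    return 0
}
```

A caller would stop before committing whenever check_uuids returns non-zero.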
UI
Regarding the UI, I don't care too much. For me it's even good enough if it's implemented as an expert feature, and when I start the repository, I have to create the git-annex branch by hand the first time and add this file.
It seems to me that making annex.private a global configuration that can be set with git-annex config and overridden locally with .git/config on the repository you want to record to the git-annex branch would have the same effects.
While a user could override it in their clone with .git/config, they could also use a version of git-annex that ignores your uuid-allowlist.log.
Also, uuid-allowlist.log would imply that merging two git-annex branches could fail, if one of them referred to uuids that are not allowed in another one. Since git-annex does such merges in the backend, that would mean that git-annex could just start failing without any apparent reason why or anything for the user to do to fix it. I would not want to support the bug reports that would result from that.
Thank you for reading and replying to my feature request.
I agree with most of what you have said, and I agree that making annex.private git-annex config-urable is 99% of my feature request.
I wrote the rest of the text because I actually had a case, while testing this star topology thing, where a uuid crept into the git-annex branch; it came from testing a special remote with rclone and not being careful enough with it. So, I think the global annex.private variable somehow also has to support special remotes, and then there has to be a way to opt out the one special remote that is the real center of the star topology.
git-annex initremote --private is the equivalent of annex.private for a special remote, and sets remote.name.annex-private.
And remote.name.annex-private can also be set to avoid recording anything about a git-annex remote.
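For a git remote that already exists, setting that option might look like the sketch below. The remote name scratch and its URL are placeholders, not anything from this thread.

```shell
repo=$(mktemp -d)
git init -q "$repo"

# "scratch" and its URL are hypothetical; substitute a real remote.
git -C "$repo" remote add scratch ssh://example.invalid/scratch.git

# Ask git-annex not to record anything about this remote in the
# git-annex branch.
git -C "$repo" config remote.scratch.annex-private true

remote_private=$(git -C "$repo" config --get remote.scratch.annex-private)
echo "remote.scratch.annex-private=$remote_private"
```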
Setting annex.private with git config does not affect the remotes, which for many use cases of annex.private is a good thing.
git-annex config settings should have the same effect as if the corresponding git config were set. So git-annex config of annex.private is not quite the right thing, since that would not set remote.name.annex-private.
So, I think you were somewhat on the right track, that there needs to be a list of uuids of repositories that are not private. And then all other repositories would behave the same as if annex.private were set for them, when git-annex is running on them or using them as a remote.
Concretely, this could look like
git-annex config --set annex.privateexcept "uuid1 uuid2"
Implementation would just involve the two places that currently check annexPrivateRepos also checking that.
One small problem with this idea is, if you want to add a new non-private repository or special remote after setting that git-annex config, you would need to run git-annex config to add it to the list first. Otherwise, information about it won't get recorded publicly when it's initialized. So you would need to generate a uuid by hand, then update the list, then run git-annex reinit with that uuid, or git-annex initremote with the uuid= parameter.
What happens if someone sets this git-annex config, but the repo is also used by someone else, who does not want to honor that, and wants to have their own group of git-annex repos that work together as usual?
This is a reasonable difference of opinion to have, and this kind of disagreement needs to be considered when adding a git-annex config setting.
Usually, git-annex config settings can be overridden by git config. So there would need to be an annex.privateexcept git config as well.
An alternative way to get the same result would be for your centralized git repository to have a hook that filters unwanted uuids out of git-annex branch pushes.
To do the filtering, you could use git-annex filter-branch.
That reads the current git-annex branch and outputs the hash of a filtered commit.
For example, as a post-receive hook:
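A minimal sketch of such a hook follows. It is not taken from git-annex's documentation: the function name and the choice of filter options are assumptions, and the options should be checked against the git-annex-filter-branch man page for the installed version. The idea is to keep only information about one repository (the center of the star) plus the global configuration, and then point the git-annex branch at the filtered commit.

```shell
#!/bin/sh
# filter_annex_branch REPO
# Rewrites refs/heads/git-annex so that it only carries key information
# and repository configuration for REPO (a uuid or remote name), plus the
# global git-annex config. Hypothetical helper for a post-receive hook.
filter_annex_branch() {
    # filter-branch prints the hash of the filtered commit on stdout.
    filtered=$(git-annex filter-branch --all \
        --include-key-information-for="$1" \
        --include-repo-config-for="$1" \
        --include-global-config) || return 1
    [ -n "$filtered" ] || return 1
    git update-ref refs/heads/git-annex "$filtered"
}
```

In hooks/post-receive of the central repository, the function would then be called with the uuid of the central repository, for example filter_annex_branch "$CENTER_UUID", where CENTER_UUID is a placeholder.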
This post-receive hook is suboptimal because there is a period of time before it finishes filtering where a pull will see the unfiltered git-annex branch. Although maybe that would be ok, since a later push of that information back would get filtered again.
It would be better to use a git hook that lets the information be filtered before it becomes active. Looking at githooks(5), the proc-receive hook may be able to do that. Not sure. To be used by such a hook, git-annex filter-branch would still need to see the information in a git-annex branch, so it might need to be run in a lightweight clone of the repository. Or, it might be possible to improve git-annex filter-branch to be able to filter a ref other than the git-annex branch.
But using filter-branch like that seems like it would lead to a series of commits when no changes are really being made.
Consider a clone that has a git-annex branch with commit A. It pushes it to origin, which runs it through filter-branch, yielding B. Then the clone pulls B, and git-annex merges B into A, yielding A'. If A contained nothing that got filtered out, A' and A have the same tree, but in any case they will be different commits. Then A' gets pushed, yielding B', and the clone pulls B', resulting in A'', and so on.
A solution to that would be to check, after filtering, if the tree sha is the same as the local git-annex branch currently has, or had at a point in the recent past. If so, it can avoid updating the git-annex branch at all, since no new information was received.
I think that would work both when there was nothing that got filtered out, and when there was. The only problem with it might be that since origin/git-annex would not be updated after a push, a subsequent push would waste a little bandwidth re-sending the local git-annex branch again.
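The tree comparison proposed above can be illustrated with plain git. In this sketch, same_tree is a hypothetical helper, not part of git-annex; it only shows that two distinct commits can be recognized as carrying the same content by comparing their tree hashes.

```shell
# same_tree A B
# Succeeds if commits A and B point at the same tree, even when the
# commits themselves differ. Returns 2 if either ref cannot be resolved.
same_tree() {
    a=$(git rev-parse --verify -q "$1^{tree}") || return 2
    b=$(git rev-parse --verify -q "$2^{tree}") || return 2
    [ "$a" = "$b" ]
}
```

For instance, same_tree git-annex origin/git-annex would succeed when a pull brought no new information, in which case the local branch could be left untouched.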