What is a star-topology
Basically, it means having only one remote with the private = false setting (the default) and having all other machines set to private = true.
In this setup, all users and clones have to pull from the central private = false remote, and they can't get/copy/sync directly between each other.
In exchange, the situation is easy to understand, easy to explain to non-technical people, and easy to automate in a team where people who are not interested in git also have to participate. Also, the content of the git-annex branch stays very simple, which makes debugging and hacking easier.
In some sense, you can call this git-annex stupidified back to git-lfs levels, but if you think about it, it's still a lot better: for example, you can manage partial clones easily (by simply not downloading the files you are not interested in with git annex get), and you also get the symlinks way of life without any git filters, which is honestly simply better than git-lfs.
Difficulties implementing a star-topology
The only problem with this is that it's hard to enforce. To keep your git-annex branch completely clean even in the face of novice users, you have to ensure that EVERYBODY, ALWAYS, ON ALL THEIR MACHINES issues a git config annex.private true command first, before starting to play around with git-annex based on tutorials/forums/email-threads/etc. In practice, this is not possible.
Feature request
I would like to have a uuid-allowlist.log file in the root folder of the git-annex branch. If it exists, it is always read during startup of git-annex for any operation that operates on the branch. Every line contains exactly one UUID.
Whenever any file is written anywhere in the git-annex branch (trusted.log, uuid.log, remote.log, and also every file, e.g. xxx/yyy/SHA256E-...), this list is always consulted. If git-annex is about to write a non-allowlisted UUID for any reason, it immediately stops with an error message, without committing to the branch. Of course, if the check can be made sooner, e.g. before adding it to the index of annex, that is even better.
This of course should work for all SSH git remotes, but also for all special remotes, if the UUID is allowlisted.
If no uuid-allowlist.log file is found, then nothing should change compared to the current implementation.
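The check described above can be sketched roughly as follows. This only illustrates the proposed behaviour and is not git-annex code: the function names, the idea of finding UUIDs with a regular expression, and the sample log lines are all assumptions made for the example, and uuid-allowlist.log itself is the hypothetical file proposed here.

```python
import re

# Assumption: UUIDs appear in git-annex branch files in the canonical
# 8-4-4-4-12 hexadecimal form, so a regular expression can find them.
UUID_RE = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
    r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)


def parse_allowlist(text):
    """Return the set of UUIDs listed one per line in uuid-allowlist.log."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def disallowed_uuids(content, allowlist):
    """Return the UUIDs found in `content` that are not in `allowlist`."""
    found = {m.group(0).lower() for m in UUID_RE.finditer(content)}
    return found - allowlist


def check_write(path, content, allowlist_text):
    """Raise before a branch file is written if it mentions an unknown UUID.

    `allowlist_text` is None when uuid-allowlist.log does not exist, in
    which case nothing is checked and behaviour is unchanged.
    """
    if allowlist_text is None:
        return
    bad = disallowed_uuids(content, parse_allowlist(allowlist_text))
    if bad:
        raise PermissionError(
            f"refusing to write {path}: UUID(s) not in uuid-allowlist.log: "
            + ", ".join(sorted(bad))
        )
```

With an allowlist containing only the central repository's UUID, check_write passes for a uuid.log line that mentions that UUID and raises for a line that mentions any other one.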
UI
Regarding the UI, I don't care too much. For me it's even good enough if it's implemented as an expert feature, where the first time I set up the repository I have to create the git-annex branch by hand and add this file.
It seems to me that making annex.private a global configuration that can be set with git-annex config and overridden locally with .git/config on the repository you want to record to the git-annex branch would have the same effects.
While a user could override it in their clone with .git/config, they could also use a version of git-annex that ignores your uuid-allowlist.log.
Also, uuid-allowlist.log would imply that merging two git-annex branches could fail, if one of them referred to uuids that are not allowed in another one. Since git-annex does such merges in the backend, that would mean that git-annex could just start failing without any apparent reason why or anything for the user to do to fix it. I would not want to support the bug reports that would result from that.
Thank you for reading and replying to my feature request.
I agree with most of what you have said, and I agree that making annex.private configurable with git-annex config is 99% of my feature request.
I wrote the rest of the text because I actually had a case, while testing this star topology thing, where a UUID crept into the git-annex branch. It came from testing a special remote with rclone and not being careful enough with it. So I think the global annex.private variable somehow also has to support special remotes, and then there has to be a way to opt out the one special remote that is the real center of the star topology.
git-annex initremote --private is the equivalent of annex.private for a special remote, and sets remote.name.annex-private.
And remote.name.annex-private can also be set to avoid recording anything about a git-annex remote.
Setting annex.private with git config does not affect the remotes. Which, for many use cases of annex.private, is a good thing.
git-annex config settings should have the same effect as if the corresponding git config were set. So git-annex config of annex.private is not quite the right thing, since that would not set remote.name.annex-private.
So, I think you were somewhat on the right track, that there needs to be a list of uuids of repositories that are not private. And then all other repositories would behave the same as if annex.private were set for them, when git-annex is running on them or using them as a remote.
Concretely, this could look like
git-annex config --set annex.privateexcept "uuid1 uuid2"
Implementation would just involve the two places that currently check annexPrivateRepos also checking that.
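To make the intended semantics concrete, here is a small sketch of the decision those checks would make. It is not git-annex's implementation; annex.privateexcept is only the setting proposed above, and the function and parameter names are invented for illustration. It also assumes that an explicit private setting wins over the list, which the proposal does not spell out.

```python
def parse_privateexcept(value):
    """Split a value like "uuid1 uuid2" into a set of UUID strings."""
    return set(value.split())


def treated_as_private(uuid, explicitly_private, privateexcept):
    """Decide whether information about `uuid` stays out of the git-annex branch.

    explicitly_private: True if annex.private (for the local repository) or
        remote.<name>.annex-private (for a remote) is set for this UUID.
    privateexcept: value of the proposed annex.privateexcept setting, or
        None when it is unset.
    """
    if explicitly_private:
        # Assumption: an explicit private setting always takes precedence.
        return True
    if privateexcept is None:
        # Setting absent: behave exactly as without the proposal.
        return False
    # Setting present: every repository not listed is treated as private.
    return uuid not in parse_privateexcept(privateexcept)


print(treated_as_private("uuid3", False, "uuid1 uuid2"))  # True
print(treated_as_private("uuid1", False, "uuid1 uuid2"))  # False
print(treated_as_private("uuid3", False, None))           # False
```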
One small problem with this idea is, if you want to add a new non-private repository or special remote after setting that git-annex config, you would need to run git-annex config to add it to the list first. Otherwise, information about it won't get recorded publicly when it's initialized. So you would need to generate a uuid by hand, then update the list, then run git-annex reinit with that uuid, or git-annex initremote with the uuid= parameter.
What happens if someone sets this git-annex config, but the repo is also used by someone else, who does not want to honor that, and wants to have their own group of git-annex repos that work together as usual?
This is a reasonable difference of opinion to have, and this kind of disagreement needs to be considered when adding a git-annex config setting.
Usually, git-annex config settings can be overridden by git config. So there would need to be an annex.privateexcept git config as well.
An alternative way to get the same result would be for your centralized git repository to have a hook that filters unwanted uuids out of git-annex branch pushes.
To do the filtering, you could use git-annex filter-branch.
That reads the current git-annex branch and outputs the hash of a filtered commit.
For example, as a post-receive hook:
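What follows is not the hook from the original comment but a rough sketch of the idea, written in Python under several assumptions. The list of allowed UUIDs (ALLOWED_UUIDS) is hard-coded and hypothetical. The filter-branch options are the ones I understand the git-annex-filter-branch man page to document, and they should be checked against the installed version. The filtered commit is installed with git update-ref.

```python
#!/usr/bin/env python3
import subprocess
import sys

# Hypothetical allowlist: UUIDs of the repositories that may appear in the
# central git-annex branch (here, only the center of the star).
ALLOWED_UUIDS = ["11111111-1111-1111-1111-111111111111"]


def filter_branch_command(allowed_uuids):
    """Build a git-annex filter-branch command keeping only allowed UUIDs.

    Assumption: --all, --include-key-information-for,
    --include-repo-config-for and --include-global-config behave as
    described in the git-annex-filter-branch man page.
    """
    cmd = ["git-annex", "filter-branch", "--all", "--include-global-config"]
    for uuid in allowed_uuids:
        cmd.append(f"--include-key-information-for={uuid}")
        cmd.append(f"--include-repo-config-for={uuid}")
    return cmd


def git_annex_branch_was_pushed(hook_input):
    """post-receive gets '<old> <new> <ref>' lines on stdin; look for our ref."""
    return any(
        line.split()[2:] == ["refs/heads/git-annex"]
        for line in hook_input.splitlines()
        if line.strip()
    )


def main():
    if not git_annex_branch_was_pushed(sys.stdin.read()):
        return
    # filter-branch prints the hash of the filtered commit on stdout.
    filtered = subprocess.run(
        filter_branch_command(ALLOWED_UUIDS),
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    # Replace the pushed branch with the filtered commit.
    subprocess.run(
        ["git", "update-ref", "refs/heads/git-annex", filtered], check=True
    )


if __name__ == "__main__":
    main()
```

Saved as hooks/post-receive in the central repository and made executable, a script like this would run after every push and act only when the git-annex branch was among the updated refs.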
This post-receive hook is suboptimal because there is a period of time before it finishes filtering where a pull will see the unfiltered git-annex branch. Although maybe that would be ok, since a later push of that information back would get filtered again.
It would be better to use a git hook that let the information be filtered before it became active. Looking at githooks(5), the proc-receive hook may be able to do that. Not sure. To be used by such a hook, git-annex filter-branch would still need to see the information in a git-annex branch, so it might need to be run in a lightweight clone of the repository. Or, it might be possible to improve git-annex filter-branch to be able to filter a ref other than the git-annex branch.
But using filter-branch like that seems like it would lead to a series of commits when no changes are really being made.
Consider a clone that has a git-annex branch with commit A. It pushes it to origin, which runs it through filter-branch, yielding B. Then the clone pulls B, and git-annex merges B into A, yielding A'. If A contained nothing that got filtered out, A' and A have the same tree, but in any case they will be different commits. Then A' gets pushed, yielding B', and the clone pulls B', resulting in A'', and so on.
A solution to that would be to check, after filtering, if the tree sha is the same as the local git-annex branch currently has, or had at a point in the recent past. If so, it can avoid updating the git-annex branch at all, since no new information was received.
I think that would work both when there was nothing that got filtered out, and when there was. The only problem with it might be that since origin/git-annex would not be updated after a push, a subsequent push would waste a little bandwidth re-sending the local git-annex branch again.
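A minimal sketch of that check, with the caveat that it only models the decision and is not how git-annex is implemented. Tree hashes are passed in as plain strings (in a real repository they could be obtained with git rev-parse '<commit>^{tree}'), and the function name and the list of recent local trees are assumptions for illustration.

```python
def should_update_branch(filtered_tree, recent_local_trees):
    """Return True if the filtered commit carries new information.

    filtered_tree: tree hash of the commit produced by filtering.
    recent_local_trees: tree hashes that the local git-annex branch has
        now or had at some point in the recent past.
    """
    return filtered_tree not in recent_local_trees


# The clone pushed commit A (tree "t1") and origin filtered it into B.
# Nothing was filtered out, so B has the same tree and can be ignored.
print(should_update_branch("t1", ["t1", "t0"]))  # False

# Origin's filtered commit has a tree the clone has never had, for example
# because another clone pushed in the meantime: merge as usual.
print(should_update_branch("t2", ["t1", "t0"]))  # True
```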
Sorry for resurrecting this after 2 years, I somehow forgot this discussion was ongoing.
So, first of all, thank you so much for taking the time to write up a very cool server-side solution for the problem. Do I understand your proposal correctly, that basically on the server we would always store a rewritten git-annex branch as if it had been correctly written by the client, no matter what the clients do on their own in their own git-annex branches?
And since all the merging in git-annex is line based, this constant rewrite wouldn't confuse the clients when they git fetch --all + git annex merge? Wouldn't the merge commits in gitk git-annex be very weird to understand?
So what I don't understand is this: if we do this on the central server side, then yes, the rewrite on the server is good, but when the offending client does a git fetch + git annex merge, it will create a merge commit with 2 parents. Will we also straighten that out automatically and delete the "stupid" side on the next push? Doesn't this mean that debugging just becomes more confusing, and that this client will create longer and longer side branches in its graphical branch view of gitk git-annex?
Let me reflect back to your "comment 5", where you asked the very valid question of what to do in case of a difference of opinion. I think the correct solution is to implement the override feature (in .git/config, as you said), and let it completely happen. If the only way for unwanted UUIDs to appear in my central repo is for someone to use this extra feature, I'm OK with that. I want to prevent accidents, and I certainly don't want to prevent expert power users from achieving their goals when needed, so a local override (even if the end result is pushed back) is 100% fine.
Now that I'm thinking about this as a "reasonable difference of opinion to have", an interesting "solution" comes to mind, which of course opens up a very big discussion: why, in the design of git-annex, is there ONE AND ONLY ONE git-annex branch? Git has orphan branches, and it would be legitimate to say that different groups of people working in a repo have different opinions about their "view of the annex", e.g. they consider different repos (or special remotes) important or unimportant. I mention this question not really as a serious proposal to redesign, but I'm sure that you had this idea sometime in the past, and if you have some insight or revelations, I'd be happy to read them.