What is a star-topology

Basically, it means having exactly one remote with annex.private = false (the default) and setting annex.private = true on all other machines.

In this setup, all users and clones have to pull from the central private = false remote, and they can't get/copy/sync directly between each other.

In exchange, the situation is easy to understand, easy to explain to non-technical people and easy to automate in a team where people who are not interested in git also have to participate. Also, the content of the git-annex branch stays very simple and therefore gives way to easier debugging/hacking.

In some sense, you can call this git-annex stupidified back to git-lfs levels, but if you think about it, it's still a lot better, e.g. you can manage partial clones easily (by just not downloading the files that you are not interested in with git annex get), and you also get the symlinks way of life without any git filters, which is honestly simply better than git-lfs.

Difficulties implementing a star-topology

The only problem is that this is hard to enforce: to keep your git-annex branch completely clean even in the face of novice users, you have to ensure that EVERYBODY, ALWAYS, ON ALL THEIR MACHINES issues a git config annex.private true command first, before starting to play around with git-annex based on tutorials/forums/email-threads/etc. In practice, this is not possible.
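For a single clone, the manual step described above could be wrapped in a small helper, roughly like this. The function name private_clone and its arguments are hypothetical; the only point is the order of the commands, with annex.private set before git-annex is initialized in the clone.

```shell
# Hypothetical helper: clone a repository and mark the clone as private
# before git-annex is initialized in it.
# Usage: private_clone URL DIRECTORY
private_clone() {
    url=$1
    dir=$2
    git clone "$url" "$dir" || return 1
    # Set annex.private first, so the clone's UUID is not recorded
    # in the git-annex branch when the repository is initialized.
    git -C "$dir" config annex.private true || return 1
    git -C "$dir" annex init
}
```

A helper like this only covers people who actually use it, which is exactly the enforcement problem described above.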

Feature request

I would like to have a uuid-allowlist.log file in the root folder of the git-annex branch, containing exactly one UUID per line. If the file exists, git-annex always reads it during startup for any operation that works on the branch.

If any output file is written anywhere into the git-annex branch (trusted.log, uuid.log, remote.log, and also every file, e.g. xxx/yyy/SHA256E-...), this list is always consulted and if during writing the file git-annex wants to write a non-allowlisted UUID for any reason, then it immediately stops with an error message, without committing to the branch. Of course, if we can make the check sooner, e.g. before adding it to the index of annex, that is even better.

This of course should work for all SSH git remotes, but also for all special remotes, if the UUID is allowlisted.

If there is no uuid-allowlist.log file found, then nothing should change compared to the current implementation.
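To make the requested behaviour concrete, here is a minimal sketch of the check as a shell function. It is not how git-annex works today: uuid-allowlist.log is only the file proposed in this request, and the function name check_uuids_allowed and its two arguments are made up for illustration.

```shell
# Sketch of the proposed check; uuid-allowlist.log is the file proposed
# in this request, not something git-annex reads today.
# Usage: check_uuids_allowed ALLOWLIST FILE
# Prints every UUID found in FILE that is not listed in ALLOWLIST and
# returns non-zero if there is at least one.
check_uuids_allowed() {
    allowlist=$1
    file=$2
    # No allowlist file: behave as if the feature did not exist.
    [ -f "$allowlist" ] || return 0
    uuid_re='[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}'
    # Extract UUID-shaped strings, then keep only those that have no
    # exact whole-line match in the allowlist.
    bad=$(grep -Eo "$uuid_re" "$file" | sort -u | grep -Fxv -f "$allowlist")
    if [ -n "$bad" ]; then
        printf '%s\n' "$bad"
        return 1
    fi
    return 0
}
```

In the real feature, the check would run inside git-annex before anything is committed to the branch; the sketch only shows the allow/deny decision for one file.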

UI

Regarding the UI, I don't care too much, for me it's even good enough if it's implemented as an expert feature, and when I start the repository, I have to create the git-annex branch by hand the first time and add this file.

comment 1

It seems to me that making annex.private a global configuration that can be set with git-annex config and overridden locally with .git/config on the repository you want to record to the git-annex branch would have the same effects.

While a user could override it in their clone with .git/config, they could also use a version of git-annex that ignores your uuid-allowlist.log.

Also, uuid-allowlist.log would imply that merging two git-annex branches could fail, if one of them referred to uuids that are not allowed in another one. Since git-annex does such merges in the backend, that would mean that git-annex could just start failing without any apparent reason why or anything for the user to do to fix it. I would not want to support the bug reports that would result from that.

Comment by joey — Thu Mar 10 17:11:59 2022

comment 2

Thank you for reading and replying to my feature request.

I agree with most of what you have said, and I agree that making annex.private configurable with git-annex config covers 99% of my feature request.

I wrote the rest of the text because, while testing this star topology, I actually had a case where a UUID crept into the git-annex branch. It came from testing a special remote with rclone and not being careful enough with it. So I think the global annex.private setting somehow also has to cover special remotes, and then there has to be a way to opt out the one special remote that is the real center of the star topology.

Comment by ErrGe — Fri Mar 11 02:23:20 2022

comment 3

Just one addition to my previous comment: I agree with the notion that this is under no circumstances a security or a kindergarten feature. So if someone intentionally overrides it, then of course that's a fact of life, and I have to handle those team members in other, social ways, not via technological overengineering. So if the final proposal provides a way to protect against both accidental git clones and accidental special remotes, then I'm good. It only has to protect against accidents, not against malice.

Comment by ErrGe — Fri Mar 11 02:26:56 2022

comment 4

git-annex initremote --private is the equivalent of annex.private for a special remote, and sets remote.name.annex-private.

And remote.name.annex-private can also be set to avoid recording anything about a git-annex remote.

Setting annex.private with git config does not affect the remotes, which for many use cases of annex.private is a good thing.
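As an illustration of the --private option, a throwaway special remote for testing could be set up roughly like this. The helper name init_private_testremote is hypothetical, and a directory special remote is used here only because its parameters are simple; it stands in for whatever remote type is being tested.

```shell
# Hypothetical helper: set up a directory special remote for testing
# without recording anything about it in the git-annex branch.
# Usage: init_private_testremote NAME PATH
init_private_testremote() {
    name=$1
    path=$2
    # --private makes git-annex set remote.NAME.annex-private instead of
    # recording the remote in the git-annex branch.
    git annex initremote --private "$name" \
        type=directory directory="$path" encryption=none
}
```

Afterwards, git config "remote.NAME.annex-private" should show the setting for that remote.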

git-annex config settings should have the same effect as if the corresponding git config were set. So git-annex config of annex.private is not quite the right thing, since that would not set remote.name.annex-private.

So, I think you were somewhat on the right track, that there needs to be a list of uuids of repositories that are not private. And then all other repositories would behave the same as if annex.private were set for them, when git-annex is running on them or using them as a remote.

Concretely, this could look like git-annex config --set annex.privateexcept "uuid1 uuid2"

Implementation would just involve the two places that currently check annexPrivateRepos also checking that.

One small problem with this idea is that if you want to add a new non-private repository or special remote after setting that git-annex config, you would need to run git-annex config to add it to the list first. Otherwise, information about it won't get recorded publicly when it's initialized. So you would need to generate a uuid by hand, then update the list, then run git-annex reinit with that uuid, or git-annex initremote with the uuid= parameter.
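The three steps for an ordinary repository might look like the sketch below. Note that annex.privateexcept is only the setting proposed in this comment and does not exist in git-annex, so this cannot work as-is; the function name add_public_repo is hypothetical, and uuidgen is assumed to be installed.

```shell
# Sketch only: annex.privateexcept is the setting proposed in this
# comment and does not exist in git-annex.
add_public_repo() {
    # 1. Generate the UUID by hand.
    new_uuid=$(uuidgen) || return 1
    # 2. Add it to the shared list of non-private repositories,
    #    keeping whatever UUIDs are already listed.
    current=$(git annex config --get annex.privateexcept)
    git annex config --set annex.privateexcept \
        "${current:+$current }$new_uuid" || return 1
    # 3. Re-initialize this repository with that UUID.
    git annex reinit "$new_uuid"
}
```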

Comment by joey — Tue Mar 29 18:06:51 2022

comment 5

What happens if someone sets this git-annex config, but the repo is also used by someone else, who does not want to honor that, and wants to have their own group of git-annex repos that work together as usual?

This is a reasonable difference of opinion to have, and this kind of disagreement needs to be considered when adding a git-annex config setting.

Usually, git-annex config settings can be overridden by git config. So there would need to be an annex.privateexcept git config as well.

Comment by joey — Tue Mar 29 18:43:11 2022

comment 6

An alternative way to get the same result would be for your centralized git repository to have a hook that filters unwanted uuids out of git-annex branch pushes.

To do the filtering, you could use git-annex filter-branch.

git-annex filter-branch --all --include-key-information-for=$uuid \
    --include-global-config --include-repo-config-for=$uuid

That reads the current git-annex branch and outputs the hash of a filtered commit.

For example, as a post-receive hook:

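#!/bin/sh
uuid=06ba602a-afa8-11ec-a6b9-87c2c2ae9296
# Build a filtered copy of the git-annex branch that keeps only the
# information about $uuid and the global configuration.
ref=$(git-annex filter-branch --all --include-key-information-for="$uuid" \
    --include-global-config --include-repo-config-for="$uuid") || exit 1
git update-ref refs/heads/git-annex "$ref"
# Necessary since the git-annex branch has been changed.
# Receive hooks run inside $GIT_DIR, so resolve the index path from there.
rm -f "$(git rev-parse --git-dir)/annex/index"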

This post-receive hook is suboptimal because there is a period of time before it finishes filtering where a pull will see the unfiltered git-annex branch. Although maybe that would be ok, since a later push of that information back would get filtered again.

It would be better to use a git hook that let the information be filtered before it became active. Looking at githooks(5), the proc-receive hook may be able to do that. Not sure. To be used by such a hook, git-annex filter-branch would still need to see the information in a git-annex branch, so it might need to be run in a lightweight clone of the repository. Or, it might be possible to improve git-annex filter-branch to be able to filter a ref other than the git-annex branch.

Comment by joey — Tue Mar 29 21:23:09 2022

comment 7

But.. Using filter-branch like that seems like it would lead to a series of commits when no changes are really being made.

Consider a clone that has a git-annex branch with commit A. It pushes it to origin, which runs it through filter-branch, yielding B. Then the clone pulls B, and git-annex merges B into A, yielding A'. If A contained nothing that got filtered out, A' and A have the same tree, but in any case they will be different commits. Then A' gets pushed, yielding B', and the clone pulls B', resulting in A'', and so on.

A solution to that would be to check, after filtering, if the tree sha is the same as the local git-annex branch currently has, or had at a point in the recent past. If so, it can avoid updating the git-annex branch at all, since no new information was received.

I think that would work both when there was nothing that got filtered out, and when there was. The only problem with it might be that since origin/git-annex would not be updated after a push, a subsequent push would waste a little bandwidth re-sending the local git-annex branch again.
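The tree comparison itself is cheap to express. A minimal sketch, where the helper name same_tree is hypothetical and the caller is assumed to know both commit ids (a post-receive hook reads old and new values for each ref on stdin):

```shell
# Returns 0 if the two commits point at the same tree, i.e. the second
# one carries no information the first one did not already have.
# Usage: same_tree COMMIT_A COMMIT_B
same_tree() {
    [ "$(git rev-parse "$1^{tree}")" = "$(git rev-parse "$2^{tree}")" ]
}
```

A hook could call this with the previous git-annex branch commit and the filtered commit, and leave the branch at the previous commit when the trees match. Checking against trees from "the recent past" would need more than this, for example walking the reflog.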

Comment by joey — Tue Mar 29 21:49:26 2022

hook idea implementation is cool, but usage is not so simple for the enduser

Sorry for resurrecting this after 2 years, I somehow forgot this discussion was ongoing.

So, first of all, thank you so much for taking the time to write up a very cool server-side solution for the problem. Do I understand your proposal correctly: on the server we would always store a rewritten git-annex branch, as if it had been written correctly by the client, no matter what the clients do in their own git-annex branches?

And since all the merging in git-annex is line based, this constant rewrite wouldn't confuse the clients when they git fetch --all + git annex merge? Wouldn't the merge commits in gitk git-annex be very weird to understand?

What I don't understand is this: if we do the rewrite on the central server, the result on the server is good, but when the offending client does a git fetch + git annex merge, it will create a merge commit with 2 parents. Will we also straighten that out automatically and delete the "stupid" side on the next push? Doesn't this mean that debugging just becomes more confusing, and that this client will create longer and longer side branches in its gitk git-annex view?

Let me come back to your "comment 5", where you asked the very valid question of what to do when opinions differ. I think the correct solution is to implement the override feature (in .git/config, as you said), and simply let it happen. If the only way for unwanted UUIDs to appear in my central repo is for someone to use this extra feature, I'm OK with that. I want to prevent accidents, and I certainly don't want to prevent expert power users from achieving their goals when needed, so a local override (even if the end result is pushed back) is 100% fine.

Now that I'm thinking about this as a "reasonable difference of opinion to have", an interesting "solution" comes to mind, which of course opens up a very big discussion: why, in the design of git-annex, is there ONE AND ONLY ONE git-annex branch? Git has orphan branches, and it would be legitimate to say that different groups of people working in a repo have different opinions about their "view of the annex", e.g. they consider different repos (or special remotes) important or unimportant. I don't mention this seriously as a proposal to redesign, but I'm sure you had this idea at some point in the past, and if you have some insight or revelations, I'd be happy to read them.

Comment by ErrGe — Thu Apr 18 01:17:02 2024
