Hey Joey,
If I understand correctly, the default content expression (when it's empty, e.g. after a git annex init or git clone ...; git annex sync) is currently, apparently, anything. This means that a git annex sync --content (or just git annex sync if git config annex.synccontent true is set) will fetch all files.
It would be very handy if there was something like:
git annex config --set annex.defaultwanted ...
git annex config --set annex.defaultgroup ...
git annex config --set annex.defaultgroupwanted ...
git annex config --set annex.defaultrequired ...
# and the corresponding git variant for user-overriding
git config [--global|--system] annex.defaultwanted ...
git config [--global|--system] annex.defaultgroup ...
git config [--global|--system] annex.defaultgroupwanted ...
git config [--global|--system] annex.defaultrequired ...
These defaults would be applied when git annex initializes a repository (i.e. gives it an annex.uuid, e.g. on git annex init or on git annex sync in a fresh clone of a repo with an annex).
I like my annexed/datalad repos (mostly research data next to analysis code for collaboration) to have annex.synccontent = true so people can just do (datalad save/git annex add) git annex sync and be sure afterwards everything is in order and safe. However, as the default wanted is anything (apparently), they also get all files, which they probably don't want, unless they run git annex wanted . present manually (and manual boilerplate config and extra steps are always nice to automate). Something like git annex config --set annex.defaultwanted present would solve this.
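For illustration, the per-clone boilerplate this request wants to automate might look roughly like the sketch below. configure_clone is a hypothetical helper name, the demo repository is a throwaway temp directory, and the git-annex step is skipped when git-annex is not installed.

```shell
#!/bin/sh
# Sketch of the manual setup described above.
# configure_clone is a hypothetical helper, not part of git-annex.
configure_clone() {
    repo=$1
    # Let a plain `git annex sync` also transfer file content.
    git -C "$repo" config annex.synccontent true
    # Keep only content that is already present instead of wanting everything.
    # This step is skipped when git-annex is not installed.
    if git annex version >/dev/null 2>&1; then
        (cd "$repo" && git annex init >/dev/null && git annex wanted . present)
    fi
}

# Throwaway repository standing in for a fresh clone.
demo=$(mktemp -d)
git init -q "$demo"
configure_clone "$demo"
git -C "$demo" config annex.synccontent   # prints: true
```

Every collaborator would have to repeat these steps in every clone, which is the repetition a configurable default would remove.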
Thanks again very much for git-annex, I love it! 💛
Yann
This came up again at the distribits meeting.
DataLad itself is designed to work like git annex wanted . present (i.e. content is supposed to be fetched manually. It is assumed that the user does not generally want all content of a DataLad dataset / git annex repo). DataLad could itself run git annex wanted . present as part of its setup (talked about that with @mih), but I still think a setting in the git-annex branch that auto-sets the above settings in fresh clones (even when using plain git annex, not DataLad) is useful. It enhances the user experience of sparse checkouts (a git annex assist in a freshly cloned annex repo can then be configured to only pull specific or no files).
I also discussed it with people in the context of handling confidential patient data that should not necessarily be copied everywhere. The default of just wanting all worktree content increases the delicacy of the matter a bit. Were there a way to have fresh clones (or even freshly created remotes that were not yet given a preferred content manually) have a preconfigured default wanted content, it would reduce the possibility of confidential data accidentally being copied all over the place.
I agree, after discussions at distribits it's clear there is use for this in datalad, and in git-annex generally.
I am trying to work around this, but can't really find a solution. Any of the following for one-off overriding of the preferred content would help, but apparently something like this doesn't exist, right? One has to go through git annex wanted . present?
git -c annex.wanted=present annex assist
GIT_ANNEX_WANTED=present git annex assist
git annex assist|sync --wanted=present
preferred content and git-annex-common-options don't list anything obvious in that direction.
An environment variable would be particularly great as that's easy to deploy globally.
Seems I really dropped the ball on following up to this one. On the other hand, it seems a lot of things need to be thought through still..
I suppose there are two ways a default preferred content config could work:
1. The default is applied at git-annex init (or autoinit) time, when the repository does not already have a preferred content setting. Also at git-annex initremote time for special remotes.
2. The default is looked up from git config whenever git-annex runs and finds no preferred content setting for a repository.
With option #1, it gets baked into the repo, while with option #2 you can change a single git config later and it affects whatever repos.
Pretty sure people have been wanting option #1.
And option #2 seems to have a problem, that git-annex could see different preferred content settings for the same repository when run in different places. Which could result in a churn of content being added to a repository, and later dropped from it.
So option #1 seems like the right one.
Looking back at the original request, there was the idea that git annex config could set the default.
Every git annex config setting needs to be considered for security and unwanted behavior.
As far as security goes, if someone can set git-annex config, they can just go in and change the preferred content settings of any repository. So no difference?
Well, there is a small one. If I have made a clone of a repository, I may be hiding the existence of that repository from others. So nobody knows its uuid, and so they cannot change its preferred content setting. But with git-annex config allowing overriding the default, a clone I made yesterday may behave differently than a clone I make today.
Which, since the default is to want all files, must make the clone want fewer files.
So for this to be an actual security problem, I would need to be relying on my clones getting all files for some security reason. Which could be auditing the content of annexed files. I want the auditing clones to get every file that passes through origin. And by foolishly relying on the current default preferred content (which after all joey seems like he's never gonna get around to changing!), I open myself up to an attacker breaking my auditing process.
That's a bit tortured, but it does seem to argue against making this a git-annex config setting.
The original request also included annex.defaultgroupwanted ... I don't see how that would work. groupwanted varies by group, it does not make sense to have a default that works across groups.
It does seem to make sense to allow annex.defaultgroup to set the default group(s) of a new repository.
Implemented git configs annex.defaultwanted, annex.defaultrequired, and annex.defaultgroups.
Hi joey, thank you for picking this up. IIUC, what you implemented (git config annex.default{wanted,required,group}) allows you to set these configs locally and then spare yourself the initial git annex wanted . present (etc.) setup calls. This is cool, thanks!
The problem I was trying to express here is however that git annex assist (the very convenient do-it-all command you can tell non-techy people to use to 'do the syncing stuff') will by default pull in all files, resulting in a terrible user experience: it's slow (of course nobody sets annex.jobs=cpus or uses -j4), it takes up a ridiculous amount of space, people will say 'I don't need that 3GB file, why does it download it?' (of course nobody remembers or understands to set git annex wanted . present or anything complex), etc. Sure, this is a question of user education, but good defaults can make for a much easier onboarding experience. (I know you are not so fond of such a do-it-all command, but this git annex assist single-stepping command really has been a good git annex selling point in the discussions and talks I had.)
So if there was a global setting like git annex config --set annex.defaultwanted 'present or include=*.pdf' that would set the default wanted expression for any clone, one could define what the most important files are and tell everyone to git annex get the others if necessary. git annex assist will be fast, only pull in the most important files (or none!), people can modify or add new stuff, and run git annex assist quickly again.
I would say git annex config --set annex.defaultwanted <whatever> should not execute git annex wanted . <whatever> and as such hard-code it in the git-annex branch for every repo (because then again, when would that even be executed? Would it be re-set after another git annex config --set annex.defaultwanted <whatever2>? When?). Instead, git annex config --set annex.defaultwanted <whatever> should cause the default (i.e. fallback) value of git annex wanted . to be <whatever>, which is currently just "", which I guess means something like include=* IIRC.
Re: your security concerns
I understand your hesitation to add more git annex config ... global repo configs. But here I would argue:
git annex pull|assist in them to check if it still works? In that case the only negative thing git annex config --set annex.defaultwanted could do is indeed leaving you with fewer downloaded files. If one needs all files, git annex get --all has always been the way to go, hasn't it? 🤔 Or what kind of external repos from bad actors maliciously setting a default wanted expression do you 'audit'? And how is not having all files after git annex assist bad in this case?
Should you consider implementing git annex config --set annex.defaultwanted, it would conflict with the freshly introduced git config annex.defaultwanted local settings. We could rename those to git config annex.initdefaultwanted (or just annex.initwanted), to emphasize that those only happen on git annex init. Then git annex config --set annex.defaultwanted would also sound very sensible to me in contrast, as it really configures the default, and does not modify individual repos.
Cheers, Yann
I'm on the fence about whether the kind of security impact I discussed earlier is really something that should prevent a global setting, or not.
git-annex config of annex.securehashesonly is another example of something where my hypothetical "auditing repos" would be vulnerable to a behavior change that might be security significant. Since that gets copied from the git-annex config to git config at init time, behavior in a new clone might be different than behavior in an existing clone.
Does that mean it's ok for there to be more cases where there can be such a potential security impact? I don't know.
Note that you can set annex.defaultwanted to "standard", and annex.defaultgroups to some group, and then changing git-annex groupwanted will affect all repositories that copied that defaultwanted into their config.
So that's a way to be able to make changes that will affect other people's clones. But only ones that they have opted into.
If annex.defaultwanted were able to be changed for all repositories with git-annex config, then here's a really ugly security problem:
git-annex config annex.defaultwanted nothing
Now, the same can be done by convincing people to add their repository to some group and set preferred content to "standard", and later changing the groupwanted. But that only works on people you were able to social engineer into doing that, not everyone who cloned a repository with the default settings.
And beyond the ransom problem, there's the problem that once this is set, any change to it is going to affect most every other user of the repository. With groupwanted there's a communicated intent in the name of the group, and there can be different groups with different versions of the preferred content expression. This lacks that, it encourages flag day events.
Yes, but the same is already possible for anyone with write access to a repo. I can git annex wanted JOEYS-UUID nothing, wait for your assistant or manual sync to auto-drop all files (would also need to set {num,min}copies to 1 for that, and even then it might not auto-drop it depending on the remotes). Anyone with write access to a repo can already freely change any group, groupwanted or wanted for any involved clone - if it's present in the git-annex branch (i.e. not made with git config annex.private=true). So your concerns only apply to private repos that don't record their activity in the git-annex branch by using annex.private=true. Making a git-annex repo private is a conscious, active choice. One does not need to do it if one only consumes files and does not have push access anyway. So that'll be people who actively change repo content, probably consume it, but don't want their repo to show up in git annex info. Maybe for a publicly-pushable git-annex repo where everyone can add new files (who would host that anyway...). In this case, yes, users of that repo can't trust each other and there setting something like git annex config --set annex.defaultwanted nothing at some point can lead to people's git annex sync|assist|assistant suddenly dropping their files - and probably also on the central remote. But I'd argue that this kind of publicly writable setup has so many other obvious problems that annex.defaultwanted is one of the minor ones.
Other situations I can imagine involve groups of people (or just single users) who trust each other when using a git-annex repo. git-annex is not designed to solve such permission problems - neither is git itself.
In your publicly readable (not writable) git-annex-builds repo on the other hand, if you were to set git annex config --set annex.defaultwanted nothing, then people who just run git annex sync|assist|assistant in their clones would have their downloaded builds dropped, okay.
git-annex usage scenarios
git annex config --set annex.defaultwanted nothing at some point and other's clones would have files dropped on sync.
git annex pull and git annex get instead of sync|assist|assistant (which arguably makes more sense in this case anyway) or explicitly stating their git annex wanted here ....
git annex config annex.defaultwanted can be set as an established "repo policy" for everyone's convenience, that anyone can overwrite locally with git annex wanted here ....
git annex assist|sync|assistant|satisfy, you accept the repo's policy, as with your securehashesonly example. If you're paranoid, don't use these sync commands, but do only exactly what you want such as git annex pull -g, git annex get <thatfile>, git annex wanted ..., etc.
A good point certainly.
Well also repos that lack permission to push or are simply not pushed to origin.
It's probably somewhat common to want to get files from origin, but not let origin make config changes that drop all the files they have previously shared.
Fair enough.
So I guess one can encourage users to include git config --global annex.jobs 4 and git config annex.defaultwanted present in their setup. Thanks for implementing that.
Hmm, if the default always had "or present" added to it, at least the surprise drop would not be a concern.
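The "or present" idea can be pictured with a small sketch. The helper name with_present is hypothetical, and this is only string composition to show the shape of the resulting expression, not how git-annex does or would implement it.

```shell
#!/bin/sh
# with_present is a hypothetical helper, not a git-annex command.
# It wraps a preferred content expression so that content which is
# already present always matches, and so is never wanted to be dropped.
with_present() {
    printf '(%s) or present\n' "$1"
}

with_present 'include=*.pdf'   # prints: (include=*.pdf) or present
with_present 'nothing'         # prints: (nothing) or present
```

With such a wrapper, even a default of nothing would leave already-downloaded files alone, since "present" still matches them.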
I am going to change the names to "initwanted" etc as you suggested, to avoid closing off the possibility of adding a global default later.
That is a very funny idea, I like it!
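As an illustration of the git config route discussed in this thread, here is a sketch of seeding a fresh clone with settings before git-annex initializes it. It assumes the renamed setting is called annex.initwanted (following the "initwanted" rename mentioned above; the final names should be checked against the git-annex documentation), and it uses throwaway temp directories in place of a real origin.

```shell
#!/bin/sh
# Assumption: annex.initwanted is the name of the former annex.defaultwanted
# git config after the rename discussed in this thread.
origin=$(mktemp -d)          # stands in for a real remote repository
clone=$(mktemp -d)/work      # target directory, created by git clone
git init -q "$origin"

# `git clone -c key=value` writes the setting into the new clone's
# .git/config right after the repository is created, so a later
# `git annex init` (or autoinit) in that clone can pick it up.
git clone -q -c annex.initwanted=present -c annex.jobs=4 \
    "$origin" "$clone" 2>/dev/null

git -C "$clone" config annex.initwanted   # prints: present
git -C "$clone" config annex.jobs         # prints: 4
```

Setting the same keys with git config --global instead would apply them to every repository the user initializes later, which is the setup suggested earlier in the thread.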