Recent comments posted to this site:
If annex.defaultwanted were able to be changed for all repositories with
git-annex config, then here's a really ugly security problem:
- First, I make sure to get a copy of every annexed file.
- Then I run git-annex config annex.defaultwanted nothing
- Then I wait for git-annex to drop every file from your repository.
- Finally, I demand $ to get your files back.
Now, the same can be done by convincing people to add their repository to some group and set preferred content to "standard", and later changing the groupwanted. But that only works on people you were able to social engineer into doing that, not everyone who cloned a repository with the default settings.
And beyond the ransom problem, there's the problem that once this is set, any change to it is going to affect most every other user of the repository. With groupwanted there's a communicated intent in the name of the group, and there can be different groups with different versions of the preferred content expression. A global default lacks that, so it encourages flag day events.
Note that you can set annex.defaultwanted to "standard", and
annex.defaultgroups to some group, and then changing
git-annex groupwanted will affect all repositories that copied that
defaultwanted into their config.
So that's a way to be able to make changes that will affect other people's clones. But only ones that they have opted into.
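A rough sketch of that opt-in setup, using only the settings named above. The group name "laptops" and the preferred content expression are made-up examples, and this assumes the git config settings are in place before git-annex initializes the clone:

```shell
# In a clone that opts in, before git-annex init runs there:
git config annex.defaultwanted standard
git config annex.defaultgroups laptops    # "laptops" is a hypothetical group name

# Later, in any repository that can push the git-annex branch.
# This changes what every opted-in repository in that group wants:
git-annex groupwanted laptops "present or include=*.pdf"
```

Clones that never set those two configs are not affected by the groupwanted change.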
I'm on the fence about whether the kind of security impact I discussed earlier is really something that should prevent a global setting, or not.
git-annex config of annex.securehashesonly is another example of
something where my hypothetical "auditing repos" would be vulnerable to a
behavior change that might be security significant. Since that gets copied
from the git-annex config to git config at init time, behavior in a
new clone might be different than behavior in an existing clone.
Does that mean it's ok for there to be more cases where there can be such a potential security impact? I don't know.
The annex-ignore config can be manually set by the user to prevent using an otherwise usable remote. The man page gives the example of a network connection that is too slow to use normally.
It may be that no users are actually using annex-ignore like this. Using annex-sync seems more likely. But, it's hard to rule out.
That presents a problem, since this would need to unset annex-ignore once the repository was created.
Checking before push if the repository exists, and only unsetting annex-ignore if it did not exist before sync, but does afterwards, would be one way around this problem. It does mean that, if 2 people are making a repository at the same location at the same time, the loser may be left with annex-ignore set due to the other person having created the repository.
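The check described above could be sketched like this. The function name and its yes/no arguments are hypothetical, not anything git-annex provides, and how the existence of the repository gets probed is left out:

```shell
#!/bin/sh
# Hypothetical helper: decide whether annex-ignore should be unset after a sync.
#   $1 = "yes" if the remote repository existed before the push, else "no"
#   $2 = "yes" if the remote repository exists after the sync, else "no"
should_unset_annex_ignore() {
    existed_before=$1
    exists_after=$2
    if [ "$existed_before" = "no" ] && [ "$exists_after" = "yes" ]; then
        return 0    # the repository appeared during this sync
    fi
    return 1        # leave annex-ignore alone; it may be a manual user setting
}

if should_unset_annex_ignore no yes; then
    echo "would unset annex-ignore"
fi
```

When the function succeeds, the caller would run something like git config --unset remote.<name>.annex-ignore for the remote in question.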
Or, a new config could be added that is like annex-ignore, but is only set by git-annex, and not by the user. That would keep annex-ignore's behavior, while git-annex sets and unsets the new config as needed.
Hi joey, thank you for picking this up. IIUC, what you implemented (git config annex.default{wanted,required,group}) allows you to set these configs locally and then spare yourself the initial git annex wanted . present (etc.) setup calls. This is cool, thanks!
The problem I was trying to express here is however that git annex assist (the very convenient do-it-all command you can tell non-techy people to use to 'do the syncing stuff') will by default pull in all files, resulting in a terrible user experience: it's slow (of course nobody sets annex.jobs=cpus or uses -j4), it takes up a ridiculous amount of space, people will say 'I don't need that 3GB file, why does it download it?' (of course nobody remembers or understands to set git annex wanted . present or anything complex), etc. Sure, this is a question of user education, but good defaults can make for a much easier onboarding experience. (I know you are not so fond of such a do-it-all command, but this git annex assist single-stepping command really has been a good git annex selling point in the discussions and talks I had.)
So if there was a global setting like git annex config --set annex.defaultwanted 'present or include=*.pdf' that would set the default wanted expression for any clone, one could define what the most important files are and tell everyone to git annex get the others if necessary. git annex assist will be fast, only pull in the most important files (or none!), people can modify or add new stuff, and run git annex assist quickly again.
I would say git annex config --set annex.defaultwanted <whatever> should not execute git annex wanted . <whatever> and as such hard-code it in the git-annex branch for every repo (because then again, when would that even be executed? Would it be re-set after another git annex config --set annex.defaultwanted <whatever2>? When?). Instead, git annex config --set annex.defaultwanted <whatever> should cause the default (i.e. fallback) value of git annex wanted . to be <whatever>, which is currently just "", which I guess means something like include=* IIRC.
Re: your security concerns
I understand your hesitation to add more git annex config ... global repo configs. But here I would argue:
- git annex does not have a permissions model anyway. Anyone with push access to a repo can change any policy, any wanted expression for any repo, etc. If that is a problem, then git annex might not be the right tool. I guess one can implement some level of permission control with post-receive hooks on the remote side, but that is outside git annex's scope. git annex assumes everyone writing to the repo is nice.
- I don't really understand your 'auditing' repo situation. Does it mean you regularly clone some repos, run git annex pull|assist in them to check if it still works? In that case the only negative thing git annex config --set annex.defaultwanted could do is indeed leave you with fewer downloaded files. If one needs all files, git annex get --all has always been the way to go, hasn't it? 🤔 Or what kind of external repos from bad actors maliciously setting a default wanted expression do you 'audit'? And how is not having all files after git annex assist bad in this case?
Should you consider implementing git annex config --set annex.defaultwanted, it would conflict with the freshly introduced git config annex.defaultwanted local settings. We could rename those to git config annex.initdefaultwanted (or just annex.initwanted), to emphasize that those only happen on git annex init. Then git annex config --set annex.defaultwanted would also sound very sensible to me in contrast, as it really configures the default, and does not modify individual repos.
Cheers, Yann
The automatic init that git-annex does in a clone does enter the adjusted branch. I think I was not considering that because you were talking about having an existing repository and git-annex entering the adjusted branch later.
We can reopen this if you want, unsure.
Oh good question!
This gets a tiny bit into internals, but .git/annex/journal-private/ is
where the private information is stored. If you move the files from there
into .git/annex/journal/, they will be committed on the next run of
git-annex.
You would need to take care to avoid overwriting any existing files in the journal; usually there won't be any, though.
Also unset annex.private of course.
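A minimal sketch of that move, which skips any file that already exists in the journal instead of overwriting it. The function name is made up, it takes the path to the .git directory as its argument, and it assumes no git-annex process is running in the repository at the same time:

```shell
#!/bin/sh
# Hypothetical helper: move private journal files into the regular journal.
#   $1 = path to the repository's .git directory
publish_private_journal() {
    gitdir=$1
    src="$gitdir/annex/journal-private"
    dst="$gitdir/annex/journal"
    [ -d "$src" ] || return 0          # nothing private was recorded
    mkdir -p "$dst"
    for f in "$src"/*; do
        [ -e "$f" ] || continue        # the glob matched nothing
        name=$(basename "$f")
        if [ -e "$dst/$name" ]; then
            # Never overwrite an existing journal file; leave it for manual review.
            echo "skipping $name: already present in journal" >&2
            continue
        fi
        mv "$f" "$dst/$name"
    done
}
```

After running publish_private_journal .git, the setting can be removed with git config --unset annex.private, and the moved files get committed on the next run of git-annex.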
I'm inclined to agree with you, it's probably a problem with https://hackage.haskell.org/package/disk-free-space
I am not going to be able to reproduce this!
Could you take a look at disk-free-space in ghci and see if it reproduces there?
ghci> import System.DiskSpace
ghci> getAvailSpace "/"
283744563200
ghci> getDiskUsage "/"
DiskUsage {diskTotal = 501386043392, diskFree = 283761369088, diskAvail = 283744591872, blockSize = 4096}
Looking at the code, it assumes bsize and frsize are CULong. I guess it's that or FsBlkCnt is somehow wrong.
The assistant only sends files to repositories that want them. This is not guaranteed to make as many copies of the files as whatever you have numcopies configured to. (Numcopies will prevent the assistant from dropping a file from a repository if there are not enough copies.)
All of your archive repositories only want 1 copy of a file across all of them, so you would need 2 backup repositories (which want all files) in order to get to 3 copies.
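The arithmetic behind that can be sketched as follows. This is a simplification that only counts archive and backup repositories and ignores any other repositories that may hold the file; the function name is made up:

```shell
#!/bin/sh
# Hypothetical helper: how many copies the assistant ends up making,
# counting only archive and backup repositories.
#   $1 = number of archive repositories
#   $2 = number of backup repositories
expected_copies() {
    archives=$1
    backups=$2
    if [ "$archives" -gt 0 ]; then
        archive_copies=1    # all archive repositories together want one copy
    else
        archive_copies=0
    fi
    # every backup repository wants its own copy
    echo $((archive_copies + backups))
}

expected_copies 3 0    # prints 1
expected_copies 3 2    # prints 3
```

Adding more archive repositories does not change the result; only backup repositories raise the count.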
There are two possibilities:
- Transfer repositories want files that have not yet reached all clients, so maybe you had a second client repository that doesn't have the file yet.
- When there is only a single client repository, transfer repositories want to contain all content, even once it's reached that client. The assumption is that, since the purpose of a transfer repo is to transfer between clients, there will be a second client repository added at some point, and then the transfer repository will have the content to send to it.
This is documented in standard groups.