Recent comments posted to this site:

Thanks! Maybe still consider a repo-wide setting for default wanted content?

Hi joey, thank you for picking this up. IIUC, what you implemented (git config annex.default{wanted,required,group}) allows you to set these configs locally and then spare yourself the initial git annex wanted . present (etc.) setup calls. This is cool, thanks!

The problem I was trying to express here is however that git annex assist (the very convenient do-it-all command you can tell non-techy people to use to 'do the syncing stuff') will by default pull in all files, resulting in a terrible user experience: it's slow (of course nobody sets annex.jobs=cpus or uses -j4), it takes up a ridiculous amount of space, people will say 'I don't need that 3GB file, why does it download it?' (of course nobody remembers or understands to set git annex wanted . present or anything complex), etc. Sure, this is a question of user education, but good defaults can make for a much easier onboarding experience. (I know you are not so fond of such a do-it-all command, but this git annex assist single-stepping command really has been a good git annex selling point in the discussions and talks I had.)

So if there was a global setting like git annex config --set annex.defaultwanted 'present or include=*.pdf' that would set the default wanted expression for any clone, one could define what the most important files are and tell everyone to git annex get the others if necessary. git annex assist will be fast, only pull in the most important files (or none!), people can modify or add new stuff, and run git annex assist quickly again.

I would say git annex config --set annex.defaultwanted <whatever> should not execute git annex wanted . <whatever> and as such hard-code it in the git-annex branch for every repo (because then again, when would that even be executed? Would it be re-set after another git annex config --set annex.defaultwanted <whatever2>? When?). Instead, git annex config --set annex.defaultwanted <whatever> should cause the default (i.e. fallback) value of git annex wanted . to be <whatever>, which is currently just "", which I guess means something like include=* IIRC.
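The fallback order suggested above can be sketched as follows. This only models the proposal, not how git-annex behaves today; effective_wanted and its parameters are hypothetical names, not git-annex APIs.

```python
from typing import Optional


def effective_wanted(per_repo: Optional[str],
                     repo_wide_default: Optional[str]) -> str:
    """Return the wanted expression a repository would use under the proposal.

    Hypothetical model: an explicit per-repository setting wins, then the
    proposed repo-wide default, then the empty expression used today.
    """
    if per_repo:
        # Set explicitly, e.g. with `git annex wanted . <expr>`.
        return per_repo
    if repo_wide_default:
        # Proposed: `git annex config --set annex.defaultwanted <expr>`.
        return repo_wide_default
    # Current fallback: no expression at all.
    return ""


if __name__ == "__main__":
    print(repr(effective_wanted(None, "present or include=*.pdf")))
    print(repr(effective_wanted("present", "present or include=*.pdf")))
    print(repr(effective_wanted(None, None)))
```

Because the default is only consulted at lookup time, changing it later would affect every repository that never set its own expression, which is the behaviour asked for here.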

Re: your security concerns

I understand your hesitation to add more git annex config ... global repo configs. But here I would argue:

  • git annex does not have a permissions model anyway. Anyone with push access to a repo can change any policy, any wanted expression for any repo, etc. If that is a problem, then git annex might not be the right tool. I guess one can implement some level of permission control with post-receive hooks on the remote side, but that is outside git annex's scope. git annex assumes everyone writing to the repo is nice.
  • I don't really understand your 'auditing' repo situation. Does it mean you regularly clone some repos and run git annex pull|assist in them to check if it still works? In that case the only negative thing git annex config --set annex.defaultwanted could do is indeed to leave you with fewer downloaded files. If one needs all files, git annex get --all has always been the way to go, hasn't it? 🤔 Or what kind of external repos from bad actors maliciously setting a default wanted expression do you 'audit'? And how is not having all files after git annex assist bad in this case?

Should you consider implementing git annex config --set annex.defaultwanted, it would conflict with the freshly introduced git config annex.defaultwanted local settings. We could rename those to git config annex.initdefaultwanted (or just annex.initwanted), to emphasize that those only happen on git annex init. Then git annex config --set annex.defaultwanted would also sound very sensible to me in contrast, as it really configures the default, and does not modify individual repos.

Cheers, Yann

Comment by nobodyinperson
comment 7

The automatic init that git-annex does in a clone does enter the adjusted branch. I think I was not considering that because you were talking about having an existing repository and git-annex entering the adjusted branch later.

We can reopen this if you want, unsure.

Comment by joey
Re: buyer's remorse

Oh good question!

This gets a tiny bit into internals, but .git/annex/journal-private/ is where the private information is stored. If you move the files from there into .git/annex/journal/, they will be committed on the next run of git-annex.

You would need to take care to avoid overwriting any existing files in the journal, though usually there won't be any.

Also unset annex.private of course.
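A minimal sketch of the move described above, assuming both journal directories are flat lists of files. The helper name publish_private_journal is hypothetical; only the two paths come from the comment itself. Files whose names already exist in the journal are left in place and reported rather than overwritten.

```python
import shutil
from pathlib import Path
from typing import List


def publish_private_journal(repo: Path) -> List[Path]:
    """Move files from journal-private into journal without overwriting.

    Returns the private files that were skipped because a file with the
    same name already exists in the journal.
    """
    src = repo / ".git" / "annex" / "journal-private"
    dst = repo / ".git" / "annex" / "journal"
    if not src.is_dir():
        return []  # nothing private to publish
    dst.mkdir(parents=True, exist_ok=True)

    skipped = []
    for entry in sorted(src.iterdir()):
        target = dst / entry.name
        if target.exists():
            # Never clobber an existing journal file; inspect it by hand.
            skipped.append(entry)
            continue
        shutil.move(str(entry), str(target))
    return skipped


if __name__ == "__main__":
    for path in publish_private_journal(Path(".")):
        print(f"left in place (name already in journal): {path}")
```

Run it from the top of the repository, then unset annex.private; the moved files are committed on the next run of git-annex.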

Comment by joey
comment 2

I'm inclined to agree with you, it's probably a problem with https://hackage.haskell.org/package/disk-free-space

I am not going to be able to reproduce this!

Could you take a look at disk-free-space in ghci and see if it reproduces there?

ghci> import System.DiskSpace
ghci> getAvailSpace "/"
283744563200
ghci> getDiskUsage "/"
DiskUsage {diskTotal = 501386043392, diskFree = 283761369088, diskAvail = 283744591872, blockSize = 4096}

Looking at the code, it assumes bsize and frsize are CULong. I guess it's that or FsBlkCnt is somehow wrong.
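For a cross-check that does not go through the Haskell library, the same figures can be read with Python's os.statvfs on a Unix-like system. This sketch assumes the library's totals correspond to block counts multiplied by f_frsize, which I have not verified against its source; disk_usage is a hypothetical helper name.

```python
import os


def disk_usage(path: str = "/") -> dict:
    """Return statvfs-derived sizes in bytes for the filesystem at path."""
    st = os.statvfs(path)
    return {
        "f_bsize": st.f_bsize,    # preferred I/O block size
        "f_frsize": st.f_frsize,  # fragment size, the unit of the block counts
        "total": st.f_blocks * st.f_frsize,
        "free": st.f_bfree * st.f_frsize,    # free, including reserved blocks
        "avail": st.f_bavail * st.f_frsize,  # free for unprivileged users
    }


if __name__ == "__main__":
    for key, value in disk_usage("/").items():
        print(f"{key}: {value}")
```

If these numbers look right while getDiskUsage does not, that points at how the library decodes the struct rather than at the filesystem.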

Comment by joey
comment 1

The assistant only sends files to repositories that want them. This is not guaranteed to make as many copies of the files as whatever you have numcopies configured to. (Numcopies will prevent the assistant from dropping a file from a repository if there are not enough copies.)

All of your archive repositories only want 1 copy of a file across all of them, so you would need 2 backup repositories (which want all files) in order to get to 3 copies.
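The arithmetic can be written out as a tiny model. It takes the statement above at face value (all archive repositories together hold one copy, each backup repository holds its own) and ignores every other group; expected_copies is a hypothetical name.

```python
def expected_copies(archive_repos: int, backup_repos: int) -> int:
    """Copies of a file the assistant ends up making, under the model above."""
    # However many archive repositories exist, together they want one copy.
    archive_copies = 1 if archive_repos > 0 else 0
    # Each backup repository wants every file.
    return archive_copies + backup_repos


if __name__ == "__main__":
    print(expected_copies(archive_repos=3, backup_repos=0))
    print(expected_copies(archive_repos=3, backup_repos=2))
```

Adding more archive repositories never raises the count in this model; only backup repositories do.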

Comment by joey
comment 1

There are two possibilities:

  1. Transfer repositories want files that have not yet reached all clients, so maybe you had a second client repository that doesn't have the file yet.

  2. When there is only a single client repository, transfer repositories want to contain all content, even once it's reached that client. The assumption is that, since the purpose of a transfer repo is to transfer between clients, there will be a second client repository added at some point, and then the transfer repository will have the content to send to it.

This is documented in standard groups.
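Both possibilities fall out of one rule, sketched here as a simplified model: a transfer repository stops wanting a file only once every client has it and at least two clients hold a copy. The function name and parameters are hypothetical, and the sketch leaves out any other conditions in the real expression; see standard groups for the exact wording.

```python
def transfer_wants(clients_with_file: int, total_clients: int) -> bool:
    """Simplified model of whether a transfer repository wants a file."""
    all_clients_have_it = clients_with_file == total_clients
    at_least_two_client_copies = clients_with_file >= 2
    # Only drop interest once the file has reached every client AND there
    # is more than one client to have reached.
    return not (all_clients_have_it and at_least_two_client_copies)


if __name__ == "__main__":
    # Possibility 1: a second client has not received the file yet.
    print(transfer_wants(clients_with_file=1, total_clients=2))
    # Possibility 2: a single client, which already has the file.
    print(transfer_wants(clients_with_file=1, total_clients=1))
    # Both of two clients have it: no longer wanted.
    print(transfer_wants(clients_with_file=2, total_clients=2))
```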

Comment by joey
Re: comment 6

git-annex findcomputed --inputs is documented to output one line per input file. If it doesn't behave that way, file a bug.

It would be possible to run git-annex commands in the compute script if you were able to determine where the git repository was. I don't think git-annex sets anything in the environment that will help with that currently.

If the compute program set metadata though, it would re-set the same metadata when it's used to recompute the files. That might be undesirable behavior if the user has edited the metadata in the meantime.

Comment by joey
comment 2

I tend to agree, this adds a lot of potential for foot shooting.

It might make sense as an option that enables acting on non-annexed files?

Comment by joey
comment 1

I think that will work!

Since moving content between the archive drives is probably reasonably fast, it might make sense to use fullybalanced or fullysizebalanced.

In any case, when using "balanced" things, you will need to use git-annex-maxsize to tell it how large each repository is.

Comment by joey
comment 6

Implemented git configs annex.defaultwanted, annex.defaultrequired, and annex.defaultgroups.

Comment by joey