proving preferred content behavior

motivating examples
balanced preferred content
proof
static analysis
simulation

motivating examples

Preferred content expressions can be complicated to write and reason about. A complex expression can involve lots of repositories that can get into different states, and needs to be written to avoid unwanted behavior.

It would be very handy to provide some way to prove things about behavior of preferred content expressions, or a way to simulate the behavior of a network of git-annex repositories with a given preferred content configuration

The worst case of this is not present, where the file gets dropped and transferred over and over again. The docs warn against using that one. But they can't warn about every bad preferred content expression.

Mostly, git-annex manages to keep things stable that seem like they would not be. Consider repo A that is not in group foo, and B is in group foo. A has preferred content "onlyingroup=foo". This will make A want a file that is in B. And once it has it, it will not want to drop it. That's because when dropping, it considers if it would be preferred content after the drop. In this case it would, so it doesn't drop it.

balanced preferred content

When balanced preferred content is added, a whole new level of complexity will exist in preferred content expressions, because now an expression does not make a file be wanted by a single repository, but shards the files amoung repositories in a group.

And up until this point preferred content expressions have behaved the same no matter the sizes of the underlying repositories, but balanced preferred content does take repository fullness into account, which further complicates fully understanding the behavior.

Notice that fullybalanced() is not stable when used on its own, and so balanced() adds an "or present" to stabilize it. And so not balanced() includes not present, which is bad!

proof

What could be proved about a preferred content expression?

No idea really. Would be interesting to consider what formal methods can do here. Could a SAT solver be used somehow for example?

static analysis

Clearly not present is an problematic preferred content expression. It would be good if git-annex warned and/or refused to set such an expression if it could detect it. Similarly not groupwanted could be detected as a problem when the group's preferred content expression contains present.

This is now detected and such an unstable expression never matches. --debug explains why too.

Note that the detection will not be trigged by "not (not present)", but it will by "include=* or (not present)" even though that is always stable, because "include=*" always matches and so what it's ORed with doesn't matter. Probably noone will set something like that in real life though.

It's problimatic to make git-annex wanted warn about it. Consider if in one repository, groupwanted is set to "present". In another repository, which is disconnected, wanted is set to "not groupwanted". Both operations are ok, but upon merging the two repositories, the combined effect is that "not present" has been set.

So while it could warn sometimes on setting "not present", it would sometimes not be able to. Better to not warn inconsistently. --Joey

simulation

Simulation seems fairly straightforward, just simulate the network of git-annex repositories with random files with different sizes and metadata. Or use the current files and metadata. Be sure to enforce invariants like numcopies the same as git-annex does.

Since users can write preferred content expressions, this should be targeted at being used by end users.

The sim could be run in a clone of a repository, and update location logs as it runs. This would let the user query with whereis and maxsize etc to see what happens.

Such a repository's location tracking would no longer match reality, so it would need to be clearly marked as a simulation result, and be prevented from merging back into another repository. This can be done by adding a new Difference to the repository.

The sim would need a map of repositories with connections between them. Perhaps git-annex map could be used?

For each step of the sim, it would pick a repository from the map (excluding special remotes), and simulate an operation being run in that repository, affecting it and its remotes.

Split brain needs to be simulated, so the operations in the sim should include pushing and fetching the git-annex branch. The ref of each git-annex branch of each repository would be stored, with refs/heads/git-annex being set to the git-annex branch of the repository it is currently simulating an operation in.

The other operations would include get, drop, copy, move, sync, all with preferred content respected.

May want to also simulate adding files to a repository, which would be generated (without any actual content) according to simulation parameters. Also file moves and deletions. git-annex fuzztest has some prior art.

The location log history could be examined at the end of the simulation to find problems like instability.

done, see git-annex sim command. --Joey