proving preferred content behavior

Preferred content expressions can be complicated to write and reason about. A complex expression can involve lots of repositories that can get into different states, and needs to be written to avoid unwanted behavior.

It would be very handy to provide some way to prove things about behavior of preferred content expressions, or a way to simulate the behavior of a network of git-annex repositories with a given preferred content configuration

motivating examples

For example, consider two reposities A and B. A is in group M and B is in group N. A has preferred content not inallgroup=N and B has not inallgroup=M.

If A contains a file, then B will want to also get a copy. And things stabilize there. But if the file is removed from A, then B also wants to remove it. And once B has removed it, A wants a copy of it. And then B also wants a copy of it. So the result is that the file got transferred twice, to end up right back where we started.

The worst case of this is not present, where the file gets dropped and transferred over and over again. The docs warn against using that one. But they can't warn about every bad preferred content expression.

balanced preferred content

When balanced preferred content is added, a whole new level of complexity will exist in preferred content expressions, because now an expression does not make a file be wanted by a single repository, but shards the files amoung repositories in a group.

And up until this point preferred content expressions have behaved the same no matter the sizes of the underlying repositories, but balanced preferred content does take repository fullness into account, which further complicates fully understanding the behavior.

Notice that balanced() (in the current design) is not stable when used on its own, and has to be used as part of a larger expression to make it stable, eg:

((balanced(backup) and not (copies=backup:1)) or present

So perhaps balanced() should include the other checks in it, to avoid the user shooting themselves in the foot. On the other hand, if balanced() implicitly contains present, then not balanced() would include not present, which is bad!

(For that matter, what does not balanced() even do currently?)

proof

What could be proved about a preferred content expression?

No idea really. Would be interesting to consider what formal methods can do here. Could a SAT solver be used somehow for example?

static analysis

Clearly not present is an problematic preferred content expression. It would be good if git-annex warned and/or refused to set such an expression if it could detect it. Similarly not groupwanted could be detected as a problem when the group's preferred content expression contains present.

Is there is a more general purpose and not expensive way to detect such problematic expressions, that can find problems such as the not inallgroup=N example above?

simulation

Simulation seems fairly straightforward, just simulate the network of git-annex repositories with random files with different sizes and metadata. Be sure to enforce invariants like numcopies the same as git-annex does.

Since users can write preferred content expressions, this should be targeted at being used by end users.