motivating examples
Preferred content expressions can be complicated to write and reason about. A complex expression can involve lots of repositories that can get into different states, and needs to be written to avoid unwanted behavior.
It would be very handy to provide some way to prove things about behavior of preferred content expressions, or a way to simulate the behavior of a network of git-annex repositories with a given preferred content configuration
For example, consider two reposities A and B. A is in group M and B is in
group N. A has preferred content not inallgroup=N
and B has not inallgroup=M
.
If A contains a file, then B will want to also get a copy. And things stabilize there. But if the file is removed from A, then B also wants to remove it. And once B has removed it, A wants a copy of it. And then B also wants a copy of it. So the result is that the file got transferred twice, to end up right back where we started.
The worst case of this is not present
, where the file gets dropped and
transferred over and over again. The docs warn against using that one. But
they can't warn about every bad preferred content expression.
balanced preferred content
When balanced preferred content is added, a whole new level of complexity will exist in preferred content expressions, because now an expression does not make a file be wanted by a single repository, but shards the files amoung repositories in a group.
And up until this point preferred content expressions have behaved the same no matter the sizes of the underlying repositories, but balanced preferred content does take repository fullness into account, which further complicates fully understanding the behavior.
Notice that balanced()
(in the current design) is not stable when used
on its own, and has to be used as part of a larger expression to make it
stable, eg:
((balanced(backup) and not (copies=backup:1)) or present
So perhaps balanced()
should include the other checks in it,
to avoid the user shooting themselves in the foot. On the other
hand, if balanced()
implicitly contains present
, then not balanced()
would include not present
, which is bad!
(For that matter, what does not balanced()
even do currently?)
proof
What could be proved about a preferred content expression?
No idea really. Would be interesting to consider what formal methods can do here. Could a SAT solver be used somehow for example?
static analysis
Clearly not present
is an problematic preferred content expression. It
would be good if git-annex warned and/or refused to set such an expression
if it could detect it. Similarly not groupwanted
could be detected as a
problem when the group's preferred content expression contains present
.
Is there is a more general purpose and not expensive way to detect such
problematic expressions, that can find problems such as the
not inallgroup=N
example above?
simulation
Simulation seems fairly straightforward, just simulate the network of git-annex repositories with random files with different sizes and metadata. Be sure to enforce invariants like numcopies the same as git-annex does.
Since users can write preferred content expressions, this should be targeted at being used by end users.