motivations

Say we have 2 drives and want to fill them both evenly with files, different files in each drive. Currently, preferred content cannot express that entirely:

  • One way is to use a-m and n-z, but that's unlikely to split filenames evenly.
  • Or, can let both repos take whatever files, perhaps at random, that the other repo is not know to contain, but then repos will race and both get the same file, or similarly if they are not communicating frequently. Existing preferred content expressions such as the one for archive group have this problem.

So, let's add a new expression: balanced=group

how it works

To decide which repository wants key K:

A is the list of UUIDs of all the repositories in the group, in ascending order.

B is A filtered to repositories that have enough free space to store key K. (Needs track free space in repos via git-annex branch to be implemented.)

S is the concacenation of each UUID in A.

N is the HMAC-SHA256 of K and S, with S being the "secret key" and K being the message.

M is the number of repositories in B.

Then N mod M is the index of the repository in B that wants key K.

The purpose of using HMAC-SHA256 here is mostly to evenly distribute amoung the repositories, since git-annex keys can be pretty long and do not always contain hashe. Also, including the concacenation of all the UUIDs of reposotories in the group makes it harder to generate a combination of key and repository UUID that makes that repository want to contain the key.

size balancing

Beyond filtering balanced repos to those that have enough free space to store a key, the sizes of the repos can be used in other ways.

If the maximum size of all the balanced repos is known, they can be filled with data proportional to their sizes.

If the maximum size is not known of any balanced repos, a reasonable thing to do would be to assume they all have the same maximum size, and fill them proportionally also.

If the maximum size of some but not others is known, what then?

Balancing this way would fall back to the method above when several repos are equally good candidates to hold a key.

The problem with size balancing is that in a split brain situation, the known sizes are not accurate, and so one repository will end up more full than others. Consider, for example, a group of 2 repositories of the same size, where one repository is 50% full and the other is 75%. Sending files to that group will put them all in the 50% repository until it gets to 75%. But if another clone is doing the same thing and sending different files, the 50% full repository will end up 100% full.

Rebalancing could fix that, but it seems better generally to use N mod M balancing amoung the repositories known/believed to have enough free space.

stability

Note that this preferred content expression will not be stable. A change in the members of the group will change which repository is selected. And changes in how full repositories are will also change which repo is selected.

Without stability, when another repo is added to the group, or a repository becomes full, all data will be rebalanced. Which could be desirable in some situations, but the problem is that it's likely that adding repo3 will make repo1 and repo2 want to swap some files between them,

So, we'll want to add some precautions to avoid a lot of data moving around in such a case:

(balanced=backup and not (copies=backup:1)) or present

So once file lands on a backup drive, it stays there, even if more backup drives change the balancing.

use case: 3 of 5

What if we have 5 backup repos and want each key to be stored in 3 of them? There's a simple change that can support that: balanced=group:3

This works the same as before, but rather than just N mod M, take N+I mod M where I is [0..2] to get the list of 3 repositories that want a key.

However, once 3 of those 5 repos get full, new keys will only be able to be stored on 2 of them. At that point one or more new repos will need to be added to reach the goal of each key being stored in 3 of them.

It would be possible to rebalance the 3 full repos by moving some keys from them to the other 2 repos, and eke out more storage before needing to add new repositories. A separate rebalancing pass, that does not use preferred content alone, could be implemented to handle this (see below).

use case: geographically distinct datacenters

Of course this is not limited to backup drives. A more complicated example: There are 4 geographically distributed datacenters, each of which has some number of drives. Each file should have 1 copy stored in each datacenter, on some drive there.

This can be implemented by making a group for each datacenter, which all of its drives are in, and using balanced to pick the drive that holds the copy of the file. The preferred content expression would be eg:

(balanced=datacenterA and not copies=datacenterA:1) or present

In such a situation, to avoid a N^2 remote interconnect, there might be a transfer repository in each datacenter, that is in front of its drives. The transfer repository should want files that have not yet reached the destination drive. How to write a preferred content expression for that? It might be sufficient to use copies=datacenterA:1, so long as the file reaching any drive in the datacenter is enough. But may want to add something analagous to inallgroup= that checks if a file is in the place that balanced picks for a group. Eg, balancedgroup=datacenterA for 1 copy and balancedgroup=group:datacenterA:2 for N copies.

The passthrough proxy idea is an alternate way to put a repository in front of such a cluster, that does not need additional extensions to preferred content.

split brain situations

Of course, in the time after the git-annex branch was updated and before it reaches the local repo, a repo can be full without us knowing about it. Stores to it would fail, and perhaps be retried, until the updated git-annex branch was synced.

In the worst case, a split brain situation can make the balanced preferred content expression pick a different repository to hold two independent stores of the same key. Eg, when one side thinks one repo is full, and the other side thinks the other repo is full.

If present is used in the preferred content, both of them will then want to contain it. (Is present really needed like shown in the examples above?)

If it's not, one of them will drop it and the other will usually maintain its copy. It would perhaps be possible for both of them to drop it, leading to a re-upload cycle. This needs some research to see if it's a real problem. See proving preferred content behavior.

rebalancing

In both the 3 of 5 use case and a split brain situation, it's possible for content to end up not optimally balanced between repositories.

(There are also situations where a cluster node ends up without a copy of a file that is preferred content, or where adding a copy to a node would satisfy numcopies. This can happen eg, when a client sends a file to a single node rather than to the cluster. Rebalancing also will deal with those.)

git-annex can be made to operate in a mode where it does additional work to rebalance repositories.

This can be an option like --rebalance, that changes how the preferred content expression is evaluated. The user can choose where and when to run that. Eg, it might be run on a node inside a cluster after adding more storage to the cluster.

In several examples above, we have preferred content expressions in this form:

(balanced=group:N and not copies=group:N) or present

In order to rebalance, that needs to be changed to:

balanced=group:N

What could be done is make balanced() usually expand to the former, but when --rebalance is used, it only expands to the latter.

(Might make the fully balanced behavior available as fullybalanced for users who want it, then balanced=group:N == (fullybalanced=group:N and not copies=group:N) or present usually and when --rebalance is used, balanced=group:N == fullybalanced=group:N)

In the balanced=group:3 example above, some content needs to be moved from the 3 full repos to the 2 less full repos. To handle this, fullybalanced=group:N needs to look at how full the repositories in the group are. What could be done is make it use size based balancing when rebalancing `group:N (>1)

While size based balancing generally has problems as described above with split brain, rebalancing is probably run in a single repository, so split brain won't be an issue.

Note that size based rebalancing will need to take into account the size if the content is moved from one of the repositories that contains it to the candidate repository. For example, if one repository is 75% full and the other is 60% full, and the annex object in the 75% full repo is 20% of the size of the repositories, then it doesn't make sense to make the repo that currently contains it not want it any more, because the other repo would end up more full.