This is a fairly detailed design proposal for using git-annex to build http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK
end-user view
What the end user sees is a directory, with a .git subdirectory, and 100 thousand little files (actually, they're broken symlinks, on Linux/OSX). Over time, some of the symlinks start filling in with "random" content from the IA.
The user can look at that content, or even delete files they don't want to host.
The user can control how much total disk space the directory takes up. (It will use around 100 mb when empty.)
sharding to scale
The IA contains some 14 million Items. Inside these Items are 271 million files. Around 177 million of those are available for download.
git repositories do not scale well in the 1-10 million file range, and very badly above that; storing all of those files in a single git repository is not feasible.
Solution: Create multiple git repositories, and split the files among them.
If each git repository holds 100 thousand files, that is 1770 repositories, which is not an unmanageable number. (For comparison, git.debian.org has 18500 repositories.)
The IA is ~20 petabytes large. Each shard would thus hold around 11 terabytes (20 PB / 1770 shards), although this will vary considerably.
Clients are assigned one or more shards, and clone those repositories.
A client decides which files in its shard to back up, and does so by running "git annex get" on them. This downloads the files over http from the IA.
A client will typically not back up its entire shard, but maybe only 500 gb or less of it. Also, we want redundancy (LOCKSS) -- say at least 3 copies of each file. So 3 copies of an ~11 terabyte shard is ~33 terabytes of client storage, and a given shard will probably need dozens of clients handling it (around 66 at 500 gb each).
Add new shards as the IA continues to grow.
Problem: Need to get the checksums for the files, for git-annex to use. The census published by the IA only has md5sums in it. While git-annex can use md5sums, this allows bad actors to find md5 collisions with files from the archive, and upload bogus files that checksum ok when restoring.
creating a shard
This is a simple matter of making a git repository and telling git-annex the filenames and urls that belong in it.
A script can do this using the git annex fromkey and git annex registerurl commands. Time to make such a repository with 100k files is in the 10 minute range (faster on SSD or ramdisk).
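A minimal sketch of such a script, assuming a hypothetical census.txt with one "md5sum size path url" line per file (the shard name and census format are illustrative):

    #!/bin/sh
    # Sketch: populate a new shard repository from a census listing.
    # Assumed census.txt format (hypothetical): "md5sum size path url"
    git init shard1 && cd shard1
    git annex init 'shard1 server repo'
    while read -r md5 size path url; do
        # The key format git-annex uses for its MD5 backend.
        key="MD5-s${size}--${md5}"
        mkdir -p "$(dirname "$path")"
        git annex fromkey --force "$key" "$path"  # --force: content not present here yet
        git annex registerurl "$key" "$url"       # record where to download it from
    done <../census.txt
    git commit -m 'populate shard'

(For 100k files, fromkey's stdin batch mode would be faster than invoking it once per file.)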
adding a client
When a client registers to participate:
- Generate a UUID, send it to the client, and assign that UUID to a particular shard.
- Send the client an appropriate auth token (eg, a locked down ssh private key) to let them access the shard's git repository (or all the shards).
- Client clones its assigned shard git repository, and runs git annex reinit $UUID to take on its assigned UUID, as sketched below.
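For example, the client-side steps might look like this (the server address and shard name are illustrative):

    # Sketch of client-side setup; $uuid is the UUID the server assigned.
    git clone ssh://iabak@shards.example.org/~/shard1 shard1
    cd shard1
    git annex reinit "$uuid"   # adopt the assigned UUID rather than generating one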
Note that a client could be assigned to multiple shards, rather than just one. Probably good to keep a pool of empty shards that have clients waiting for new files to be added.
Note that we may want to enable direct mode in the client's clone,
because it lets the user easily delete files to free up space.
OTOH, direct mode is slow and less safe, so we might prefer to use indirect
mode, and then the client would need to use git annex drop
if they
decided to remove content.
distributing files
- Client runs git annex sync --content, which downloads as many files from the IA as will fit in the disk's free space (keeping a configurable amount free in reserve, via annex.diskreserve).
- Note that numcopies and preferred content settings can be used to make clients only want to download a file if it has not yet reached the desired number of copies. Lots of flexibility here in git-annex; see the sketch after this list.
- git-annex will push back to the server an updated git-annex branch, which records when it has successfully stored a file.
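For instance, a client could be configured along these lines before its first sync (the numbers are illustrative):

    # Sketch: client-side settings for distributing files.
    git config annex.diskreserve "50 gb"     # always leave 50 gb of disk free
    git annex numcopies 3                    # aim for 3 copies of each file
    git annex wanted here "lackingcopies=1"  # only want under-replicated files
    git annex sync --content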
bad actors
Clients can misbehave in probably many ways. The best defense for many misbehaviors is to distribute files to enough different clients that we can trust some of them.
The main git-annex specific misbehavior is that a client could try to push garbage information back to the origin repository on the server.
To guard against this, the server will reject all pushes of branches other than the git-annex branch, which is the only one clients need to modify.
Check pushes of the git-annex branch. There are only a few files that clients can legitimately modify, and the modifications will always involve that client's UUID, not some other client's UUID. Reject anything shady.
These checks can be done in a git update
hook. Rough estimate is that
such a hook would be a couple hundred lines of code.
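A minimal sketch of the branch-filtering half (note that git annex sync pushes the git-annex branch to synced/git-annex on the remote, so that ref has to be allowed too; the per-UUID checks would go where indicated):

    #!/bin/sh
    # Sketch of .git/hooks/update on the shard server.
    refname="$1" oldrev="$2" newrev="$3"
    case "$refname" in
    refs/heads/git-annex|refs/heads/synced/git-annex)
        # TODO: diff $oldrev..$newrev and reject changes that touch
        # location logs of any UUID other than the pushing client's.
        exit 0
        ;;
    *)
        echo "pushes to $refname are rejected" >&2
        exit 1
        ;;
    esac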
verification
We want a lightweight verification process, to verify that a client still has the data. This can be done using git annex fsck, which can be configured to, eg, check each file only once per month.
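For example, each client could run something like this from cron, so a full check of its files is spread out over the month (the path and times are illustrative):

    # Sketch: incremental fsck, a couple of hours per day.
    cd ~/iabak/shard1 &&
    git annex fsck --incremental-schedule 30d --time-limit 2h &&
    git annex sync   # push the updated git-annex branch to the server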
git-annex will need a modification here. Currently, a successful fsck does not leave any trace in the git-annex branch that it happened. But we want the server to track when a client is not fscking (the user probably dropped out).
The modification is simple; just have a successful fsck update the timestamp in the fscked file's location log. It will probably take just a few hours to code.
With that change, the server can check for files that too few clients have recently verified possessing, and distribute them to more clients.
(This is now implemented.)
Note that bad actors can lie about this verification; it's not proof that they still have the file. But then, a bad actor could prove they have a file and still refuse to give it back if the IA needed to restore the backup.
fire drill
If we really want to test how well the system is working, we need a fire drill.
- Pick some files that we'll assume the IA has lost in some disaster.
- Look up the shard each file belongs to.
- Get the git-annex key of the file, and tell git-annex it's been lost from the IA, by running in its shard: git annex setpresentkey $key $iauuid 0
- The next time a client runs git annex sync --content, it will notice that the IA repo doesn't have the file anymore. The client will then send the file back to the origin repo.
- To guard against bad actors, that restored file should be checked with git annex fsck. If its checksum is good, it can be re-injected back into the IA. (Or, the fire drill was successful.) (Remember to turn off the fire alarm by running git annex setpresentkey $key $iauuid 1.)
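Put together, a drill might look like this, run in the shard's repository (the file name and $iauuid are illustrative):

    # Sketch of a fire drill.
    key=$(git annex lookupkey path/to/lost/file)  # map the file to its key
    git annex setpresentkey "$key" "$iauuid" 0    # pretend the IA lost it
    # ... wait for clients' next sync --content to send the file back ...
    git annex fsck path/to/lost/file              # verify the restored content
    git annex setpresentkey "$key" "$iauuid" 1    # turn off the fire alarm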
shard servers
A server at the IA (or otherwise with a fast pipe) is needed to serve the shards. One server can probably manage them all. Let's consider what this server needs to have on it:
- git and git-annex
- ssh server
- The git repository for each shard. A few hundred mb per shard.
- The git update hook to filter out bad pushes.
- Some way to learn when a new user has registered to access a shard, so their ssh key is given access.
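One way to wire up that last item, assuming all the shard repositories live under a single account: when a user registers, append a locked-down entry for their ssh key to that account's ~/.ssh/authorized_keys, so the key can only run git-annex-shell, and only against their shard:

    # Sketch of one client's authorized_keys entry (key data elided).
    command="GIT_ANNEX_SHELL_DIRECTORY=shard1 GIT_ANNEX_SHELL_LIMITED=true git-annex-shell -c",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty ssh-ed25519 AAAA... registered-client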
other optional nice stuff
The user running a client can delete some or all of their files at any time, to free up disk space. The next time git annex sync runs on the client, it'll notice and let the server know, and other clients will then take over storing those files. (Or, if the git-annex assistant is run on the client, it would inform the server immediately.)
The user is also free to move files around (within the git repository directory), modify files, view them, etc. This doesn't affect anyone else.
Offline storage is supported, as long as the user can spin it up from time to time to run git annex fsck.
More advanced users might have multiple repositories on different disks. Each has their own UUID, and they could move files around between them as desired; this would be communicated back to the origin repository automatically.
Shards could have themes, and users could request to be part of the shard that includes Software, or Grateful Dead, etc. This might encourage users to devote more resources.
Or, rather than doing a lucky dip and getting one or a couple shards, a user could clone em all, and pick just which files to get.
The contents of files sometimes change. This can be reflected by updating the file in the git repository. Clients will then download the new version of the file. (They will also tend to retain the old version, although this can be dealt with by using git annex unused.)
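For example (a sketch; dropunused also accepts individual numbers and ranges):

    git annex unused          # scan for content no current file points to
    git annex dropunused all  # reclaim the disk space it uses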
Items sometimes go dark; this could be reflected by deleting the Item's files from the repository. It's up to the clients what they do with the content of such Items.
Clients' repos could be put into groups to classify them. For example, there could be groups per continent, or for trust levels, or whatever. These can be used by preferred content expressions to fine tune how files are spread out among the available clients.
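For example, using git-annex's group and preferred content support (the group name and expression are illustrative):

    # Sketch: put this client's repo in a group, and prefer files that
    # are under-replicated overall or not yet spread across that group.
    git annex group here europe
    git annex wanted here "lackingcopies=1 or not copies=europe:2"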
other potential gotchas
If any single file is very large (eg, 10 terabytes), there may not be any clients that can handle it. This could be dealt with by splitting up the file into smaller chunks. Word is there is a single 2 tb item, and a few more around 100 gb, so this is probably not a concern.
A client could add other files to its local repo, and git-annex branch pushes would include junk data about those files. Such junk should probably be filtered out by the git update hook (rejecting the whole push over it seems excessive).
There may be a thundering herd problem, where many clients end up downloading the same file at the same time, and more copies than necessary result. The next git annex sync --content in some of the redundant clients will notice this and drop that file, and presumably download some other file. It would be good to avoid this problem, perhaps by having a new client initially download a random set of the files in their shard that don't yet have enough copies.
With clients all fscking their part of a shard once a month, the new distributed fsck updates will increase the size of the git repository. I have run some tests, and this fsck overhead delta compresses well. With a 10 thousand file repo and 100 clients all updating the location log, the monthly fsck only added 1 mb to the repository size (after git gc --aggressive). This should scale linearly with the number of files in the repo. Note that git annex forget could be used to forget old historical data if the repo grew too large from fsck updates.
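A sketch of that cleanup, run in a shard repository (all clones learn of the forgetting via their next sync):

    git annex forget              # rewrite the git-annex branch, dropping old history
    git annex forget --drop-dead  # additionally prune logs of repositories marked dead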
And I would still maintain my view that removing the intermediate directory within .git/annex/objects, whose current role is simply to provide read-only protection, might halve the burden on the underlying file system, whether there are many annex repos or a single one [1]. A lean view [2] could also be of good use. Similar exercises with simulated annexes of >5M files also "helped" to identify problems with ZOL (ZFS on Linux) caching, suggesting that even the mere handling of such vast arrays of tiny files (as dead symlinks) gives filesystems a good workout -- so the leaner the impact, the better.
[1] e.g. https://github.com/datalad/datalad/issues/32#issuecomment-70523036
[2] https://github.com/datalad/datalad/issues/25
If I understand it correctly: 20PB in 2400 shards of 8TB each, with 3 copies, means 24TB of client storage per shard; at 1TB per client, that is 2400*24 = ~60K clients, assuming no churn. So it would probably need ~100K clients to cover the churn and have a good chance that each shard had 3 copies at all times. That's 1/3 the size of BOINC's active population.
It would take time to scale to that population. And it would take time to get three copies out of the Archive. During that time, the Archive is growing. The back of my envelope says that doing this in 2.5yrs roughly doubles the Archive's outbound bandwidth if you average it across the 2.5 years. But the population would grow slowly to start with, then faster, so that the bandwidth impact would be back-loaded. And at the end of the 2.5 years, you would need a lot more than the 100K users.
A design that used erasure coding or entanglement would reduce the storage and bandwidth demand considerably while providing adequate reliability.
The good news is that web-archive (.ARC) items are not publicly browsable, and that's about half of the archive's content, so you're only looking at about 10PB to backup.
The bad news is that unless you can work something out with archive.org (which seems unlikely; web-archive items are restricted to protect them legally), or use the old waybackup interface (which I don't think works anymore), or use the wayback machine (which last I heard only supported a few hundred connections per second) you'll only be able to back up half their data.
Still, non-web items seem like a nice place to start.