I'm looking for a way to count/track annex data copied to LTO tapes.
In my use case I'm splitting many terabytes of data to separate repositories on several hard drives, backing them up on tapes, and keeping track of all this with remotes in a central repository (with lots of dropped content). All good with the HDs, but I'm at a loss about how to handle the copies on tapes.
Three alternatives I've come up with so far:
- Simply tar the repositories from the HDs to the tapes. Problem: no way to notify git annex of the existence of these manual copies. Or is there?
- Remote (special or normal) on LTFS (linear POSIX-compatible file system on top of tape). Problems:
  - `git annex get`ing a dropped directory from there would cause files to be accessed in random order, right? Or is the retrieval guaranteed to happen in the same order as the files in the directory were written by `git annex copy`?
  - LTFS has a big block size (512 KB) => wasted space when there are lots of small files. (Not a major problem, though.)
- Write a special (read-only) remote hook for `tar`. Problem: `git annex get` would make one hook RETRIEVE request per file, leading to random access again, while the only effective way would be to get a list of all files to be retrieved, and then return them in the order they turn up in the tar archive (or even ingest the whole tar file into .git/annex/).
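To make the third alternative concrete, here's a minimal sketch of the retrieve side as I imagine it (the archive path, the key name, and the exact hook wiring are my assumptions; git-annex's hook special remote passes the key and destination in `ANNEX_KEY` and `ANNEX_FILE`):

```shell
#!/bin/sh
# Hypothetical RETRIEVE hook for a hook-type special remote backed by a
# tar archive on tape. git-annex would run this with ANNEX_KEY and
# ANNEX_FILE set for each request; TAPE_ARCHIVE stands in for the tape
# device (e.g. /dev/nst0).
retrieve() {
    # Extract the single member named after the key to the requested
    # path. This is exactly the random-access problem: every request
    # scans the archive from the start for just one file.
    tar -xf "${TAPE_ARCHIVE:-/dev/nst0}" -O "$ANNEX_KEY" > "$ANNEX_FILE"
}

# Only act when invoked as a hook with a key to fetch.
if [ -n "${ANNEX_KEY:-}" ]; then
    retrieve
fi
```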
Thoughts?
All your solutions are reasonable, but you're right that none of them avoids random access. And git-annex is generally not built in a way that makes it easy to avoid random access.
But, I think it's tractable with a special remote.
Consider a special remote that has two retrieval modes. In one mode, it always fails to retrieve keys, but keeps a list. In the second mode, it starts up by going through its list, retrieves everything in order to a temporary directory, and then when asked to retrieve a key, just moves it from the temp dir into place. This is somewhat similar to how Amazon Glacier works, and like git-annex's glacier support, it would result in the first `git annex get` failing, and a second `git annex get` being needed to finish the retrieval.

That could be improved: make the special remote fail to retrieve keys, and keep a list. On shutdown, it then sorts the list, retrieves the keys in order, and runs
`git annex setkey`
to move the content into the annex. Still a little bit weird, because `git annex get` would seem to fail and then pause at the end for a long time, after which the files would actually end up being present.

(Also, I er, removed
`git annex setkey` in 2011, because it didn't seem very useful, but this is in fact a use case for it, so I've added it back now.)

You could make a special remote that streams the whole tar file from the tape, and uses
`git annex setkey` to add each file from the tarball to the annex.

Done this way, the first file that
`git annex get` processed would actually cause every file to be gotten from the tape. As it continued on to subsequent files, the `git annex get` would see their content was already present and skip them.

Of course, the downside is that it works on a whole tape at a time, so if you don't want to load the whole tape into the filesystem, you wouldn't want to use this approach.
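That whole-tape variant could be sketched roughly like this (the paths, key-named tar members, and the dry-run `echo` of setkey are assumptions for illustration, not git-annex API): one sequential pass unpacks the entire archive, then each file is handed to `git annex setkey`, so the first `git annex get` pays the full cost and every later file is found already present.

```shell
#!/bin/sh
# Sketch of streaming the whole tape into the annex. Assumes the tar
# members are named after annex keys. The `git annex setkey` call is
# echoed rather than run, as a dry run.
ingest_tape() {
    archive=$1
    workdir=$(mktemp -d)
    # Single streaming read of the archive; no per-file seeking.
    tar -xf "$archive" -C "$workdir"
    for f in "$workdir"/*; do
        # Assumed usage: `git annex setkey <key> <file>` moves the
        # file's content into the annex under that key.
        echo git annex setkey "$(basename "$f")" "$f"
    done
}
```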