I am currently working on a new special remote for storing git-annex objects on tape media.
In my setup every tape cartridge is tracked by git-annex as a dedicated special remote.
All these remotes are part of a new tape group.
I would like to use a preferred content expression similar to the archive standard group: (not copies=tape:1) or approxlackingcopies=1.
However, with many tapes (remotes) matching this expression, I would like to choose only one of them as the target (and always the same one) until it is full.
This is necessary because I need to avoid frequently swapping cartridges in the tape drive, to minimize wear.
It sounds like maybe the new sizebalanced=tape:1 expression could help here? 🤔 But if I understand correctly, it would try to fill the tapes up equally, which is not what you want. There would need to be something like sequential=tape:1, which doesn't try to balance the annexes in terms of size, but just fills them in order. But what order? 🤔 Ordered by descending filled annex size? That would be what you need, I think.
git-annex-preferred-content
I think this is the same system that will be the subject of a talk at Distribits 2025? I have been looking forward to that talk.
@nobodyinperson seems on the right track with the sequential=tape:1 idea. And it seems fairly easy to implement using the same building blocks as sizebalanced.
Relatedly, I wonder about sequential reading when a big git-annex get is run. Do you have some solution for that in mind? I could imagine doing something similar to Amazon Glacier, where the first get of a file fails, but is queued for later retrieval from tape, allowing multiple requests to be ordered more efficiently.
As to the ordering, at first I thought it would make sense for it to pick the most full repository that still has space for a file.
But: Suppose that the files being processed alternate between large, and small. The fullest tape is too full for any of the large files, but it can hold all the small files. The second fullest tape has plenty of room. In this case, it would constantly switch back and forth between the two tapes.
sizebalanced picks the least full repository. That's not what we want either, clearly, since it alternates between repositories frequently when they're near the same size.
The optimal solution is for git-annex to remember which repository was used to store the last file, and just use that repository again. Unless it's full, in which case it can pick any repository that still has space. And then it will continue to use that new repository for subsequent files.
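To make the difference concrete, here is a minimal sketch in Python (not git-annex code) that counts how often the target changes under "fullest repository with room" versus "remember the last one". The function names, tape names, and sizes are made up for illustration.

```python
def pick_fullest(free, size):
    """Pick the fullest repository (least free space) that still fits the file."""
    candidates = [name for name, space in free.items() if space >= size]
    return min(candidates, key=lambda name: free[name]) if candidates else None


def pick_sticky(free, size, last):
    """Reuse the last repository while the file fits; otherwise pick any with room."""
    if last is not None and free.get(last, 0) >= size:
        return last
    for name, space in free.items():
        if space >= size:
            return name
    return None


def count_switches(files, free, sticky):
    """Store the files in order and count how often the target repository changes."""
    free = dict(free)  # work on a copy so the caller's data is untouched
    last = None
    switches = 0
    for size in files:
        target = pick_sticky(free, size, last) if sticky else pick_fullest(free, size)
        if target is None:
            continue  # no repository has room for this file
        if last is not None and target != last:
            switches += 1
        free[target] -= size
        last = target
    return switches


# Hypothetical free space per tape, and files alternating between large and small.
free_space = {"tape1": 10, "tape2": 1000}
file_sizes = [50, 1, 50, 1, 50, 1]
```

With these numbers, count_switches(file_sizes, free_space, sticky=False) swaps tapes on every file after the first, while sticky=True never swaps.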
That memory would necessarily be local to a repository in front of these tape remotes. (Eg, a cluster gateway). If there were multiple repositories that were all writing to the same tape remotes, they would each have their own memory, and chaos would ensue.
Needing a memory makes me a bit dubious about putting this in a preferred content expression. But in your specific case, I guess it would work.
Another approach would be to configure remote.<name>.annex-cost-command with a command that gives a low cost to the tape in the drive, and a high cost to other tapes.
But git-annex only checks the cost once at startup. It would need to check it again after each file, which could be a new configuration setting. You would need to make the cost command efficient enough that running it once per file is not too slow.
With this approach, the standard archive group preferred content would probably suffice.
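As a rough sketch of what such a cost command could look like, here is a small Python script that prints a low cost when the named tape is the one in the drive. It assumes some other tool writes the name of the loaded tape to a state file; that file, the script name, and the cost values 100 and 1000 are all hypothetical.

```python
import sys
from pathlib import Path

LOADED_COST = 100     # assumed cost for the tape currently in the drive
UNLOADED_COST = 1000  # assumed cost for every other tape


def read_loaded_tape(state_file):
    """Return the name of the loaded tape, or "" if the state file is unreadable."""
    try:
        return Path(state_file).read_text().strip()
    except OSError:
        return ""


def cost_for(remote_name, loaded_name):
    """Low cost for the loaded tape, high cost for all others."""
    return LOADED_COST if remote_name == loaded_name else UNLOADED_COST


def main(argv):
    if len(argv) != 3:
        print("usage: tape-cost REMOTE STATE_FILE", file=sys.stderr)
        return
    # git-annex reads the number printed on stdout as the remote's cost.
    print(cost_for(argv[1], read_loaded_tape(argv[2])))


if __name__ == "__main__":
    main(sys.argv)
```

It could then be wired up per remote with something like: git config remote.tape-0001.annex-cost-command "tape-cost tape-0001 /run/tape/loaded" (remote name and path are placeholders). As noted above, this only helps once the cost is re-checked after each file.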
Thanks a lot @joey & @nobodyinperson for your input.
Yes, exactly.
I am still working on the code. But having a deadline is sometimes helpful.
I am using the approach proposed by you in this post: https://git-annex.branchable.com/forum/Storing_copies_on_LTO_tapes63/ As you noted, this is quite similar to how Glacier is handled.
And yes, it would also allow batching together multiple git-annex get requests into a single sequential pass over the tape. I would like to also support batching together objects originating from multiple git-annex repos.
But this would make it pretty difficult to track the available capacity per tape cartridge, as multiple git-annex repos would contribute (or even other non git-annex files).
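The batching of get requests mentioned above could be sketched like this in Python. It assumes the remote keeps an index mapping each object key to a (tape, position) pair; that index and the function name are hypothetical, not something git-annex provides.

```python
from collections import defaultdict


def plan_reads(requests, index):
    """Group queued keys by tape and order each group by position on the tape.

    requests: keys queued by earlier get attempts, possibly with duplicates
              when several repositories ask for the same object.
    index:    hypothetical mapping of key -> (tape name, position on tape).
    """
    per_tape = defaultdict(list)
    for key in dict.fromkeys(requests):  # drop duplicates, keep first occurrence
        tape, position = index[key]
        per_tape[tape].append((position, key))
    # One forward pass per tape: read the keys in ascending position order.
    return {tape: [key for _, key in sorted(items)] for tape, items in per_tape.items()}
```

Each tape then needs to be loaded only once per batch, and its objects are read front to back.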
LTO tapes are a bit special, as they are append-only. The available capacity only decreases as new objects are added. The only option to regain capacity is to erase the tape. If this happens, I mark the git-annex remote as dead and initialize a fresh remote.
I have now realized that I can use this fact to detect the first EOT (end of tape) error for each tape and then update its preferred content expression.
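A minimal sketch of that EOT idea in Python, under two assumptions that are not from this thread: that the tape driver reports end of tape as ENOSPC on write, and that "present" is a suitable expression for a full tape (keep what is there, want nothing new). The function name is made up.

```python
import errno


def wanted_command_after_write(remote, error):
    """Return the git-annex command to run if a write hit end of tape, else None.

    remote: name of the special remote for the cartridge.
    error:  the OSError raised by the write, or None if the write succeeded.
    """
    # Assumption: end of tape surfaces as ENOSPC from the tape driver.
    if error is not None and error.errno == errno.ENOSPC:
        # Assumption: a full tape should keep its content but want nothing new.
        return ["git", "annex", "wanted", remote, "present"]
    return None
```

The returned list could be passed to subprocess.run once, on the first EOT for that cartridge.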
Oh that sounds really interesting. But how is this related to the GETCOST & GETAVAILABILITY messages of the external special remote protocol?
It seems like the remote's cost could be a way to define the order in which the remotes are filled?
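For reference, here is a rough Python sketch of how an external special remote might answer just those two messages. The cost values and the tape_loaded flag are assumptions, and every other protocol message is left out, so this is not a working remote. Note the point made earlier in the thread that git-annex only checks the cost once at startup.

```python
def reply(request, tape_loaded):
    """Answer GETCOST and GETAVAILABILITY; anything else is reported as unsupported."""
    if request == "GETCOST":
        # Assumed values: cheap when this cartridge is in the drive, expensive otherwise.
        return f"COST {100 if tape_loaded else 1000}"
    if request == "GETAVAILABILITY":
        # A tape drive attached to this machine is only reachable locally.
        return "AVAILABILITY LOCALLY"
    return "UNSUPPORTED-REQUEST"
```

In a real remote, the request would be read line by line from stdin and each reply written to stdout and flushed.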
It's a lot to digest. I will start testing and playing around with your ideas.
Thanks