Recent comments posted to this site:
As part of removing the webapp, I patched Alerts out of the assistant in 33cf88c8b8962a7f5d3b3caada95890d5f4d377e.
It did occur to me that logging the text of the Alert might make the assistant's log more useful. That commit would be an easy starting point for adding such logging.
I don't think it solves Recent remote activities, though, because it would only show activity by the assistant, not by other commands, and not activity that happened in other clones of the repository.
Another approach would be to configure `remote.<name>.annex-cost-command` with a command that gives a low cost to the tape in the drive, and a high cost to other tapes.
But git-annex only checks the cost once at startup. It would need to check it again after each file, which could be a new configuration setting. You would need to make the cost command efficient enough that running it once per file is not too slow.
With this approach, the standard archive group preferred content would probably suffice.
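To make that concrete, here is a sketch of such a cost command (the script name, tape labels, and the `LOADED_TAPE` environment variable are all illustrative assumptions; a real script would query the tape changer, e.g. by parsing `mtx status`):

```python
#!/usr/bin/env python3
# Hypothetical cost command for remote.<name>.annex-cost-command.
# Prints a low cost when this remote's cartridge is the one currently
# in the drive, a high cost otherwise, so git-annex prefers the loaded tape.
import os
import sys

def tape_cost(wanted: str, loaded: str) -> int:
    """Low cost if the wanted cartridge is already loaded."""
    return 100 if wanted == loaded else 1000

if __name__ == "__main__":
    wanted = sys.argv[1] if len(sys.argv) > 1 else ""
    # Illustration only: take the loaded cartridge from the environment.
    loaded = os.environ.get("LOADED_TAPE", "none")
    print(tape_cost(wanted, loaded))
```

It could then be wired up with something like `git config remote.tape1.annex-cost-command '/usr/local/bin/tape-cost.py TAPE01'` (path and label are illustrative).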
As to the ordering, at first I thought it would make sense for it to pick the most full repository that still has space for a file.
But: Suppose that the files being processed alternate between large, and small. The fullest tape is too full for any of the large files, but it can hold all the small files. The second fullest tape has plenty of room. In this case, it would constantly switch back and forth between the two tapes.
`sizebalanced` picks the least full repository. That's clearly not what we want either, since it alternates between repositories frequently when they're near the same size.
The optimal solution is for git-annex to remember which repository was used to store the last file, and just use that repository again. Unless it's full, in which case it can pick any repository that still has space, and then continue to use that new repository for subsequent files.
That memory would necessarily be local to a repository that sits in front of these tape remotes (eg, a cluster gateway). If there were multiple repositories all writing to the same tape remotes, each would have its own memory, and chaos would ensue.
Needing a memory makes me a bit dubious about putting this in a preferred content expression. But in your specific case, I guess it would work.
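The sticky selection described above can be sketched like this (a plain Python illustration, not actual git-annex code; all names are made up):

```python
# Sketch of "sticky" repository selection: keep using the repository that
# stored the last file until it cannot hold the next one, then switch to
# any repository with room and stick with that one.
def pick_repository(last, repos, free_space, file_size):
    """repos: list of repo names; free_space: name -> bytes free."""
    if last is not None and free_space.get(last, 0) >= file_size:
        return last  # stick with the previous tape: no swap needed
    # Previous repo is full (or there is none yet): pick any repo with room.
    for repo in repos:
        if free_space.get(repo, 0) >= file_size:
            return repo
    return None  # nowhere to put the file

def assign(files, repos, free_space):
    """Assign a sequence of (name, size) files, remembering the last repo."""
    last = None
    plan = []
    for name, size in files:
        repo = pick_repository(last, repos, free_space, size)
        if repo is not None:
            free_space[repo] -= size
            last = repo
        plan.append((name, repo))
    return plan
```

Note how this avoids the back-and-forth of the large/small file scenario: a switch only happens when the current repository genuinely runs out of room.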
I think this is the same system that there will be a talk about at Distribits 2025? I have been looking forward to that talk.
@nobodyinperson seems on the right track with the `sequential=tape:1` idea. And it seems fairly easy to implement using the same building blocks as `sizebalanced`.
Relatedly, I wonder about sequential reading when a big `git-annex get` is run. Do you have some solution for that in mind? I could imagine doing something similar to Amazon Glacier, where the first get of a file fails, but is queued for later retrieval from tape, allowing multiple requests to be ordered more efficiently.
It sounds like maybe the new `sizebalanced=tape:1` expression could help here? 🤔 But if I understand correctly, it would try to fill the tapes up equally, which is not what you want. There would need to be something like `sequential=tape:1`, which doesn't want to balance the annexes in terms of size, but just fills them in order. But what order? 🤔 Ordered by descending filled annex size? That would be what you need, I think.
These are all valid reasons to retire the webapp. The webapp lacks many features that it would need to be really useful. Also, creating new repos or adding existing repos in the webapp is not as straightforward as it should be to make it comparable in usability to e.g. Syncthing.
I do still use it for shared family folders on my and their machines. It's nice to have something to tell people to click on; then something happens and they can see whether syncing works or does anything. `git annex info` is not quite the same, though it shows active transfers.
What I would love to see as a replacement for the webapp is a command like `git annex assistant-status` that outputs, as JSON or human-readable text, what the assistant is currently doing (pulling, merging, pushing to which remote, downloading, uploading, etc.), all the stuff that was nicely visible in the webapp. (Does this exist already? 🤔)
Furthermore, a command like `git annex activity` that goes arbitrarily far back in time and statically (non-live) lists recent activities like:

- yesterday 23:32: remote1 downloaded 5 files (45MB)
- today 10:45: you modified file `document.txt` (10MB)
- today 10:46: you uploaded file `document.txt` (from today 10:45) to remote1, remote2 and remote3
- today 12:35: Fred McGitFace modified file `document.txt` (12MB) and uploaded to remote2
- ...
Basically a human-readable (or JSON), chronological log of things that happened in the repo. This is a superpower of git-annex: all this information is available as far back as one wants; we just don't have a way to access it nicely. `git log` and `git annex log` exist, but they are too specific, too broad, or a bit hard to parse on their own. For example:

- `git annex activity --since="2 weeks ago" --include='*.doc'` would list things (who committed, which remote received it, etc.) that happened in the last two weeks to *.doc files
- `git annex activity --only-annex --in=remote2` would list recent annex operations (in the `git-annex` branch only) of remote2
- `git annex activity --only-changes --largerthan=10MB` would list recent file changes (additions, modifications, deletions, etc., in `git log` only)
This `git annex assistant-status` and `git annex activity` would be a very nice feature to showcase git-annex's power (which other file syncing tool can do this? 🤔) and would also solve Recent remote activities.
I see, in my case I have no git lock files but rather a lock file for our process:

    reprostim@reproiner:/data/reprostim$ ls -ld .git/*.lock
    -rw-r--r-- 1 reprostim reprostim 0 May 8 13:51 .git/reprostim-videocapture.lock

which I guess `git-annex` treats as a git lock file. Is there a way to make the two play nicely without me coming up with some alternative location which is ignored by git but local to this repository? Maybe only lock files known to belong to `git` should be considered? Or maybe placing it under `.git/reprostim-videocapture/lock` would be satisfactory? (I do not want to interrupt it ATM; it's doing useful stuff.)
After all, both of them "are not git" (in that they both also use `.git/` space for their own needs).
In my case it was the `git_status` module in starship. I know this is not the best solution, but I was able to stop this issue by increasing the value of `command_timeout` in my starship config. Another potential solution might be to use `gitoxide` for checking the status of the repository with starship (my assumption is that `gitoxide` might be faster than regular git for this).
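For reference, that timeout lives in `starship.toml` and is given in milliseconds (the default is 500); raising it looks like this (the value 2000 is just an example):

```toml
# ~/.config/starship.toml
# Give slow commands (like git status in a huge annex) more time
# before starship gives up. Value is in milliseconds.
command_timeout = 2000
```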
Thanks a lot joey for your help.
I gave it another try without setting the metadata and by using v4 index.
Instead of directly adding all files to the gateway repository, I distributed the files equally across the 16 nodes to make use of their resources.
On each node I added its portion of the files to a git-annex repository, in order to merge them all later via the gateway repository.
Adding the files on each node worked very well using the `--jobs="cpus"` flag.
However, once I tried to merge all 16 repos using `git-annex sync --no-content --allow-unrelated-histories --jobs="cpus"`, all of the nodes crashed due to out-of-memory during this step:

    remote: (merging synced/git-annex bigserver/git-annex into git-annex...)
I assume that you are right and that I simply have too many files.
Unfortunately, I currently cannot spend more time on investigating the issues.
Thanks again for your help.
Thanks a lot @joey & @nobodyinperson for your input!
Yes, exactly.
I am still working on the code, but having a deadline is sometimes helpful.
I am using the approach proposed by you in this post: https://git-annex.branchable.com/forum/Storing_copies_on_LTO_tapes63/ As you noted, this is quite similar to how Glacier is handled.
And yes, it would also allow batching together multiple `git-annex get` calls into a single sequential pass over the tape. I would like to also support batching together objects originating from multiple git-annex repos. But this would make it pretty difficult to track the available capacity per tape cartridge, as multiple git-annex repos would contribute (or even other non-git-annex files).
LTO tapes are a bit special, as they are append-only. The available capacity only decreases as new objects are added. The only option to regain capacity is to erase the tape. If that happens, I mark the git-annex remote as dead and initialize a fresh new remote.
I now realize that I can use this fact to detect the first EOT (end of tape) error for each tape and then update its preferred content expression.
Oh, that sounds really interesting. But how is this related to the `GETCOST` & `GETAVAILABILITY` messages of the external special remote protocol? It seems like the remote's cost could be a way to define the order in which the remotes are filled?
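For reference, in the external special remote protocol these are simple request/reply lines: git-annex asks, the remote answers with a number (for cost) or a scope (for availability). A sketch of the exchange, with an illustrative cost value:

```
GETCOST
COST 150
GETAVAILABILITY
AVAILABILITY LOCALLY
```

The cost reported here feeds into the same ordering that `remote.<name>.annex-cost` configures, so a remote that answers `GETCOST` dynamically could indeed influence which remote git-annex tries first.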
It's a lot to digest. I will start testing and playing around with your ideas.
Thanks