In the name of protecting people from themselves, I'd like to have an option to configure repositories on a Forgejo-aneksajo instance (or rather in general) to not immediately obey a `git annex drop --from ... --force`.
I am thinking of having an `annex.delayeddrop` config option (names subject to bike-shedding of course) to set in each repo's git config. With it set to e.g. `30d`, `git annex drop` on that repository would, from the point of view of the user, do everything like always, including recording that the repo no longer has the data; but instead of deleting the files immediately, it would move them into e.g. `.git/annex/deleted-objects`. This directory would then be cleaned of files that have been there for more than 30 days at some point in the future, e.g. when an fsck is done, or maybe on other operations too.
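A rough sketch of how that could look, purely hypothetical since neither the option nor the directory exist today:

```sh
# Hypothetical option from this proposal; name subject to bike-shedding.
git config annex.delayeddrop 30d

# The drop looks exactly like today from the user's point of view,
# including the location log recording that the content is gone:
git annex drop --force somefile

# ...but instead of deleting the object file, git-annex would move it
# into .git/annex/deleted-objects, to be purged 30 days later.
```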
I don't think any tooling around `.git/annex/deleted-objects` would be necessary; rather, with the information that the data for some key was lost, one could manually dive into that directory, retrieve the data out of it, and reinject it into the repository.
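Recovery might then be as simple as the following sketch, where the directory is assumed (hypothetically) to keep the key as the file name; `git annex reinject --known` is an existing command:

```sh
# The key of the lost content, e.g. as reported by git annex fsck:
key="SHA256E-s1048576--..."

# reinject --known checks that the file's content matches a known key
# and moves it back into the annex:
git annex reinject --known ".git/annex/deleted-objects/$key"
```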
The point is to have a fast path to recovery from over-eager dropping that might otherwise lead to data loss, even though `--force` should be totally clear to everyone.
Or maybe something like this exists already...
The deletion could be handled by a cron job that the user is responsible for setting up, which avoids needing to configure a time limit in git-annex, and also avoids the question of what git-annex command(s) would handle the cleanup.
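For example, something like this crontab entry could do (a sketch; the repository path, directory layout, and 30-day window are all assumptions carried over from the proposal above):

```sh
# Purge soft-deleted objects that have sat in the (hypothetical)
# retention directory for more than 30 days, every night at 03:00:
0 3 * * * find /path/to/repo/.git/annex/deleted-objects -type f -mtime +30 -delete
```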
An alternative way to handle this would be to use the "appendonly" config of `git-annex p2phttp` (and `git-annex-shell` has something similar). Then the repository would refuse to drop, and instead you could have a cron job that uses `git annex unused` to drop old objects. This would need some way to only drop unused objects after some period of time.
I think there are some benefits to that path: it makes explicit to the user that the data they wanted to drop is not immediately going away from the server. Which might be important for legal reasons (although the prospect of backups of annexed files makes it hard to be sure if a server has really deleted something anyway). And if the repository had a disk quota, this would make explicit to the user why dropping content from it didn't free up quota.
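Sketched out, with the caveat that the age-based filtering in the last step does not exist yet:

```sh
# Server side: git-annex-shell can be made append-only via an
# environment variable, so it refuses to drop anything:
export GIT_ANNEX_SHELL_APPENDONLY=true

# Cron job on the server: find unused objects and drop them.
# What's missing is a way to say "only if unused for 30 days".
git annex unused
git annex dropunused --force all
```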
(I think it would also be possible to (ab)use `annex.secure-erase-command` to instead move objects to the directory. Probably not a good idea, especially because there's no guarantee that command is only run on complete annex objects.)
A third approach would be to have a config setting that makes dropped objects instead be moved to a remote. So the drop would succeed, but `whereis` would indicate that the object was being retained there. Then a cron job on the remote could finish the deletions.
This would not be significantly more heavyweight than just moving to a directory, if you used e.g. a directory special remote. And it's also a lot more flexible.
Of course, this would make dropping take longer than usual, depending on how fast the object could be moved to the remote. If it were slow, there would be no way to convey progress back to the user without a lot more complication than this feature warrants.
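As a sketch of this third approach, using an existing directory special remote plus a made-up config name for the drop destination:

```sh
# A directory special remote to receive dropped objects (existing feature):
git annex initremote retention type=directory directory=/srv/annex-retention encryption=none

# Entirely hypothetical setting; no such config exists today:
git config annex.movedroppedto retention

# A drop would then move the object to "retention" instead of deleting it,
# and git annex whereis would show it as retained there.
```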
Open to your thoughts on these alternatives...
Agreed, that makes sense.
While realistically most force drops probably would be of unused files, those two things aren't necessarily the same.
I think I would deliberately want this to be invisible to the user, since I wouldn't want anyone to actively start relying on it.
That's a tradeoff for sure, but the expectation should already be that a hosted service like a Forgejo-aneksajo instance will retain backups at least for disaster recovery purposes. But that's on the admin(s) to communicate, and within a personal setting it doesn't matter at all.
Actually for that reason I would not count this soft-deleted data towards quotas for my own purposes.
I like this! Considering that such a "trash bin" (special) remote could be initialized with `--private` (right?), it would be possible to make it fully invisible to the user too, while indeed being much more flexible. I suppose the cron job would then be something like `git annex drop --from trash-bin --all --not --accessedwithin=30d`, assuming that moving content there counts as "accessing" it and no background job on the server accesses it afterwards (maybe an additional matching option for mtime or ctime instead of atime would be useful here?). This feels very much git-annex'y 🙂
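Putting that together as a sketch (both `initremote --private` and the `--accessedwithin` matching option already exist; the remote name and path are placeholders):

```sh
# One-time setup on the server: a private directory special remote,
# not recorded in the git-annex branch, so invisible to users:
git annex initremote --private trash-bin type=directory directory=/srv/annex-trash encryption=none

# Cron job: drop whatever has sat in the trash bin untouched for 30 days.
git annex drop --from trash-bin --all --not --accessedwithin=30d
```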