In a perfect world, we could hoard infinite amounts of data but unfortunately, our world is a finite one and we need to delete stuff sometimes.
Current situation
You can drop files, but that seems to be intended for when you want to get rid of single copies of a file in some repos, while copies in other repos remain untouched.
Dropping files with --force works but only drops from the current repo and thus needs to be run for each file on every remote, which is cumbersome and, more importantly, error-prone.
You can also drop keys using dropunused, but the same downsides apply here and you might want to keep old versions of all the other files around (that's kinda the point of managing them with git).
Dropping files completely like this also makes fsck complain that they are missing. Fsck shouldn't care about intentionally deleted files.
You can mark unwanted files as dead to make fsck semi-ignore them, but that should be reserved for unintentionally lost files and is a bit cumbersome (git annex dead --key "$(git annex lookupkey path/to/file)" && git annex drop path/to/file && git rm path/to/file). It also still doesn't propagate deletion to remotes; they will (and should!) hold on to copies of seemingly dead keys.
Thus, there needs to be a new specific workflow for intentionally deleting files.
A user needs to be able to express: "These files, I don't need them anymore. Delete them from all my repos please."
Ideally, this should be a very explicit action to avoid accidental deletion.
What's important though is that this workflow doesn't involve manually running magic git annex commands for every deleted file on all possible remotes (especially not hard to reach ones). An explicit deletion should be final, irreversible.
Implementation
I'd personally like to see a recycle bin in the form of a special (gitignored?) directory inside the working tree where deleted files would land. One could then either delete the files from the bin (thus marking them for final deletion) or move them back into the repo, cancelling deletion. (Any copy present in the regular working tree should imply that the file is wanted.)
This would be quite a simple workflow from a user's perspective IMO and would mimic the behaviour of all major desktop environments (Windows, OS X, GNOME, KDE).
Since this was posted, fsck has stopped complaining about files dropped with dropunused.

If a repository is not accessible, this is difficult to implement.
It seems that the closest we can get to implementing it is something like dropunused, which can be run in that inaccessible repository at some later point when it's accessible, and catch up on dropping all the files that have become unwanted while it was inaccessible.

One way to do that without relying on the idea of "unused" would be to tag a file with metadata saying its content ought to be deleted from everywhere. That is possible to do now, eg:
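A minimal sketch of that tagging approach, assuming an ad-hoc tag named deletethis (the name is only a convention; git-annex gives it no special meaning). The helper names mark_unwanted and drop_unwanted and the DRY_RUN switch are hypothetical; the underlying commands are git annex metadata with --tag and git annex drop with the --metadata matching option:

```shell
#!/bin/sh
# Sketch only: "deletethis" is an ad-hoc tag name chosen for illustration.
TAG="${TAG:-deletethis}"

# Print the command instead of running it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Tag the given files as "content should be deleted from everywhere".
mark_unwanted() {
    run git annex metadata --tag "$TAG" "$@"
}

# Run inside a clone: drop the content of every file carrying the tag.
# --force bypasses the numcopies check, so this can remove the last copy.
drop_unwanted() {
    run git annex drop --force --metadata "tag=$TAG"
}
```

For example, mark_unwanted old/video.mkv followed by git annex sync records the tag on the git-annex branch, so other clones see it once they have synced. Setting DRY_RUN=1 prints the commands without running them, which may be worth doing first given that --force is involved.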
That drop can be run in every clone over time to delete all the tagged files.
I could imagine formalizing this ad-hoc tag into something standard in git-annex. Perhaps similar to how dead files are currently indicated. But one problem with it is it may not play well in multiuser environments where people have different ideas about what files they want to delete all copies of. If two users are using dropunused and have a disagreement, they will have 2 different branches, which are forked, and neither will step on the other's toes when they run dropunused against their branches and drop content that is still used on the other person's branch. But a tag like "deletethis" is repository global.
Thank you!
This is the best option I think; some sort of flag you can set on a key that marks it as unwanted and propagates via the git-annex branch. The actual deletions could then be carried out on the individual repos by using a dedicated command (dropdeleted?), by the assistant or perhaps even using sync --content.

The important bit is that it shouldn't be synchronous or depend on the repository being reachable directly though; it should be recorded and propagated asynchronously.
E.g. in any tree of repos with assistants running (so, no transitive connections), marking a key as deleted in any one of them should result in the key being deleted from all of them.
Ah, I didn't consider that. A recycle bin with tracked files is likely infeasible but an untracked one could still be valuable. That's a topic for another issue though.