Hello,
I'm not sure this is mature enough to be a wishlist item, so I'm posting it in the forum. Please move it if appropriate.
Context, general need
I'm considering using git-annex to store photos, yet what I'm needing might be useful in more general cases.
The need in one sentence: deal with files that change status from "precious, please keep n copies" to "junk, please delete it from everywhere you find it, now and forever".
More concrete need
I take many photographs and am interested in the comfort git-annex offers for ensuring that a given number of copies of each file exists at any time.
Sometimes I can sort photos and reject bad ones shortly after taking a roll, typically before I could `git annex add` them. Sometimes I do it much later, after they have been included in git-annex and stored in several places.
Rationale 1: releasing storage space
I'm worried about having a lot of space still used by photographs that I actually don't want to keep.
It's not marginal. For example, at 30MB per RAW photo, 300+ photos from an ordinary afternoon take about 10GB, of which up to one half could turn out to be rejected. The whole photo/video collection is already probably well over 1TB and growing.
So we're talking about 5GB per shooting day to be freed, and already probably 100+GB to free from remotes, so definitely something to consider.
Rationale 2: rejecting files once and for all, not having to repeat it
Once I have rejected a photograph, I'd like to never have to `rm`, `forget`, `drop` it or whatever again. Ideally it would just be dropped from any back-end (that supports removing information) at any time in the future, perhaps with just a generic command/script like `git annex purge_the_junk_files`.
Questions
- Q1: Is there a simple way to somehow "blacklist" a particular file so that from that moment on, any `sync`-like operation of git-annex involving a remote will remove the data of this file in that remote?
- Q2: UI considerations. In my dreams, files could just be moved to a "junk" sub-folder (using photo sorting tools like e.g. geeqie), then a batch command would assign them the "blacklisted" or "junk" status.
- Q3: I don't mind at this point if there are traces of where (in the filesystem tree) the now-rejected files were. Perhaps it's easier to perform a different operation, that is, completely forgetting the history of blacklisted files in the spirit of `git filter-branch`.
Perhaps most of this can be done with simple configuration and a helper script.
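To illustrate the Q2 idea, here is a minimal sketch of what such a helper script might do, assuming rejected files are moved into a hypothetical `junk/` sub-folder and a hypothetical `blacklist.txt` records the annex keys of blacklisted content:

```sh
#!/bin/sh
# Hypothetical batch step for the Q2 idea: record the annex key of every
# annexed file under junk/ in blacklist.txt, then remove the files and
# commit the deletion. Run from the top of the repository.
git annex find --include='*' junk/ | while read -r path; do
    git annex lookupkey "$path" >> blacklist.txt
done
rm -rf junk/
git add -A
git commit -m "blacklist rejected photos"
```

The blacklist itself is only local state; actually dropping the blacklisted content from remotes would be a separate step (this is what Q1 asks about).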
Additional information
- I'm wondering if some simple scheme like an actual `git filter-branch` in the local git-annex repo, then some cleanup command (`git annex fsck`), then a push to the remote would have the intended effect. Since this involves rewriting history I don't know how git-annex would like it. But that's just a thought experiment at this point.
- The number of files to blacklist can quickly go to a few hundred and later thousands. This might rule out some very naive implementations.
- I can hack a bash script to perform whatever command is appropriate, so that given a solution to Q1 I have a solution to Q2.
Search before you post
I've found other topics but they don't seem to even remotely deal with the topic here. E.g.:
- https://git-annex.branchable.com/forum/Backing_up_photos_to_the_cloud/
- https://git-annex.branchable.com/forum/best_practices_for_importing_photos63/
- https://git-annex.branchable.com/tips/automatically_adding_metadata/
https://git-annex.branchable.com/git-annex-forget/ seems to be orthogonal (it forgets the history of everything, but does not delete data).
This might be more or less on topic but is confusing to me at this point: https://git-annex.branchable.com/forum/How_to_get_git-annex_to_forget_a_commit63/
Thank you for your attention.
From the forum entry "git annex drop not freeing space on filesystem" I understand that:

- deleting a file (with `git rm`, maybe other ways) opens the possibility of freeing space locally
- `git annex dropunused --force` can free space

So, if we have a way to remove files by content (through git, git annex, etc.), e.g. starting from http://stackoverflow.com/questions/460331/git-finding-a-filename-from-a-sha1 , we can sketch a naive implementation:
- `git ls-tree` and filter the output for blacklisted hashes (does anyone know a way to efficiently filter lines matching a pattern from a potentially big collection?), then run `git rm` on the paths obtained
- `git filter-branch` with a filter spec that removes files by hash (is that possible?)
- `git annex dropunused --force`
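As a very rough sketch of the working-tree part of such a naive implementation (untested; it assumes a hypothetical `blacklist.txt` with one annex key per line, uses `git annex lookupkey` to map paths to keys instead of `git ls-tree`, and skips the `git filter-branch` step):

```sh
#!/bin/sh
# Remove from the current tree every file whose annex key is listed in
# blacklist.txt, then drop the now-unused content.
git annex find --include='*' | while read -r path; do
    key=$(git annex lookupkey "$path") || continue
    grep -qxF "$key" blacklist.txt && git rm --quiet -- "$path"
done
git commit -m "remove blacklisted files"
git annex unused
git annex dropunused --force all   # or pass the numbers printed by 'git annex unused'
```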
Does that make sense?
Thank you.
The answer to this question depends on how quickly you need to get rid of the files, and whether you want git-annex to generally keep the content of deleted and old versions of modified files or not.
If you only want git-annex to keep the content of files that are present in the current branches and tags, then simply deleting a file and committing will work. Later, use `git annex unused` to finish getting rid of the content of deleted files that are not used by any branches/tags.

If you want to keep the full history in general, but drop the content of specific files, then you need to use git-annex to drop the content before you delete the file. You can use `git annex whereis $file` to see everywhere that the content has gotten to, and then `git annex drop $file --from` each location (and from the local repository).

If you need to immediately get rid of the content of some file, you can use the same procedure to check where it is and drop it from those locations.
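For a single file, that procedure might look roughly like the following sketch (the path is a placeholder, the loop over `git remote` is only an approximation of "each location" reported by `whereis`, and `--force` is assumed to be needed where numcopies would otherwise block the drop):

```sh
file="2016/rejected-photo.raw"        # hypothetical path
git annex whereis "$file"             # list every repository that has the content
for remote in $(git remote); do       # drop it from each configured remote
    git annex drop "$file" --from "$remote" --force
done
git annex drop "$file" --force        # finally drop the local copy
```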
You don't need to filter old commits out of branches to use `git annex unused`; it only looks at the most recent commit in each branch, so once a file has been deleted from all branches it will be identified as unused.

TL;DR: Thanks for clarifying the possibilities. I chose to maintain an explicit blacklist of files that are definitely uninteresting, because it is safer and more convenient at several levels (it protects against mistakes, makes reviewing easier, and involves fewer assumptions). I share the proof-of-concept here in case anyone is interested in the future. I might share some scripts in a public repo, especially if anyone shows interest.
Discussion
Okay, this is important to know (perhaps something to add to the documentation there).
In a perfect world, that would be enough.
I realize that I need to keep the full history in general, because (it has already happened) although I almost always make small independent commits and review changes before committing, large changes are relatively common. For example, renaming a directory that has many sub-directories full of files. If some files are removed by mistake since the last commit, that information is drowned in a big change. As git does not always track renames, it sometimes shows many additions/removals. Also, changes are sometimes committed by a `git annex sync` (which I avoid for this reason).

So, a file reported by `git annex unused` is not proof that it should be deleted.

So all in all, yes, I "want to keep the full history in general, but drop the content of specific files".
Okay, this works locally. Yet I have started working another way, see below.
This somehow assumes connectivity to other repositories at cleanup time, which is not practical.
I might make a number of commits including some that locally prune hundreds of files, several times, before having access to any other repository.
This does not leave a clear "globally blacklisted" state for a file.
It seems to me that the situation requires maintaining some state that allows deferring the removal until the other repositories are available.
Work-in-progress solution
This has evolved into working another way:
Proof-of-concept implementation
Script `git_annex_drop_all_files_in_blacklist.sh`

Good for a small blacklist when `git annex unused` is slow.
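A minimal sketch of what such a script might do (assuming a hypothetical `blacklist.txt` with one annex key per line; `git annex dropkey` removes the local content of a key directly, and `--force` is needed because nothing in the tree references it anymore):

```sh
#!/bin/sh
# git_annex_drop_all_files_in_blacklist.sh (sketch)
# Drop the local content of every blacklisted key, if it is present.
while read -r key; do
    git annex dropkey --force "$key"
done < blacklist.txt
```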
Script to show all unused files

This presents all unused files in a directory. They can be presented ordered by date or EXIF date for review. Files can then be interactively moved to other directories for further processing: reuse with `git add`, blacklist, etc.
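One way such a script could gather the unused content for review might look like this sketch (the review directory name and the parsing of the `git annex unused` output are assumptions; `git annex contentlocation` prints the path of a key's content when it is present locally):

```sh
#!/bin/sh
# Collect unused content as symlinks in a review directory, so a photo
# browser (e.g. geeqie) can show it sorted by date or EXIF date.
# Run from the top of the repository.
mkdir -p ../unused-review
git annex unused | awk '$1 ~ /^[0-9]+$/ { print $2 }' |
while read -r key; do
    obj=$(git annex contentlocation "$key") || continue
    ln -sf "$PWD/$obj" "../unused-review/$key"
done
```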
Other

I have also written another experimental script that processes the output of `git annex unused`, automatically blacklists them in some specific cases, and assists in reviewing or `git annex drop`-ing them.

Additional benefit
Storing state in a blacklist has the advantage of being very explicit and global.
This has the advantage that the blacklist can potentially apply to content outside of any repository. For example, say I find one of my camera memory cards that I have not used for a long time. It has old copies of some photos. The blacklist allows me to immediately delete the uninteresting files; then git-annex can tell whether the remaining files are already known and can also be deleted. Then either the card becomes empty, or only the (fewer) remaining files need manual review. Quick and safe.
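A sketch of how that check could work for files on such a card (the card path and `blacklist.txt` are assumptions; `git annex calckey` computes the key git-annex would use for a file, and `git annex whereis --key` is assumed here to fail when no repository is known to hold the content):

```sh
#!/bin/sh
# Run from inside the git-annex repository. For each file on the card:
# delete it if its key is blacklisted, or if some repository is already
# known to hold the content; otherwise leave it for manual review.
for f in /media/card/DCIM/*/*; do
    key=$(git annex calckey "$f") || continue
    if grep -qxF "$key" blacklist.txt; then
        rm -- "$f"                                  # blacklisted: junk
    elif git annex whereis --key "$key" >/dev/null 2>&1; then
        rm -- "$f"                                  # already stored elsewhere
    fi
done
```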
I use `git-annex-metadata` to tag files I don't want, then drop the content with that tag [1]. Then it doesn't matter if the content reappears; it'll get pruned at some point. It is also easy to find any symlinks pointing at the unwanted content using `git-annex-find`.
[1] I actually move it to another remote on a 3TB drive which only has such unwanted content, and then delete it on there when I need more space on it. Doesn't matter if that drive fails, but allows a bit of a window for an "actually, I do need that" recovery.
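A minimal sketch of that workflow (the tag name `reject` and the remote name `junkdrive` are assumptions; `git annex find`, `drop` and `move` all accept `--metadata` matching options):

```sh
git annex metadata --tag=reject 2016/bad-photo.raw   # mark as unwanted
git annex find --metadata tag=reject                 # symlinks pointing at unwanted content
git annex drop --force --metadata tag=reject         # prune any local copies of it
git annex move --to junkdrive --metadata tag=reject  # or, as in [1], move it to a dedicated remote
```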