<h1>Blacklisting files (so that they get removed any time a copy is found)</h1>
<p>git-annex forum thread</p>
<p><strong>Sketch of implementation, request for comment</strong> (comment 1, stephane-gourichon-lpad, 2016-10-01)</p>
<p>From forum entry <a href="http://git-annex.branchable.com/forum/git_annex_drop_not_freeing_space_on_filesystem/">git annex drop not freeing space on filesystem</a> I understand that:</p>
<ul>
<li>removing a file from the tree hierarchy (with <code>git rm</code>, maybe other ways) opens the possibility of freeing space locally</li>
<li>syncing with remotes propagates this possibility</li>
<li><code>git annex dropunused --force</code> can free space</li>
</ul>
<p>So, if we have a way to remove files by content (through git, git-annex, etc.), e.g. starting from <a href="http://stackoverflow.com/questions/460331/git-finding-a-filename-from-a-sha1">this Stack Overflow question on finding a filename from a SHA-1</a>, we can sketch a naive implementation:</p>
<ul>
<li>(option 1, answers Q1 and Q2 above) run <code>git ls-tree</code> and filter the output for blacklisted hashes (does anyone know an efficient way to filter lines matching a pattern from a potentially large collection?), then run <code>git rm</code> on the paths obtained</li>
<li>(option 2, answers Q3 above) run <code>git filter-branch</code> with a filter spec that removes files by hash (is that possible?)</li>
<li>sync with the remotes (<code>git annex sync</code>)</li>
<li>on each remote <code>git annex dropunused --force</code></li>
</ul>
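<p>The steps above can be sketched in shell. This is only a sketch: <code>list_blacklisted_paths</code> is a hypothetical helper name, and it assumes a file <code>blacklist</code> in the repository root containing one annex key per line.</p>

```shell
#!/bin/bash
# Sketch of option 1: find every annexed path whose key is blacklisted.
# Assumes a file "blacklist" with one annex key per line (hypothetical).

list_blacklisted_paths() {
    local blacklist=$1
    # ${key} and ${file} are git-annex find format variables;
    # --include '*' also matches files whose content is not present here.
    git annex find --include '*' --format='${key} ${file}\n' |
    while read -r key file
    do
        grep -qxF -- "$key" "$blacklist" && printf '%s\n' "$file"
    done
}

# Usage, inside the repository:
#   list_blacklisted_paths blacklist | xargs -d '\n' git rm --
#   git commit -m 'remove blacklisted files'
#   git annex sync
#   # then on each remote:
#   git annex unused && git annex dropunused --force all
```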
<p>Does that make sense?</p>
<p>Thank you.</p>
<p><strong>comment 2</strong> (joey, 2016-10-04)</p>
<p>The answer to this question depends on how quickly you need to get rid of
the files, and whether you want git-annex to generally keep the content of
deleted and old versions of modified files or not.</p>
<p>If you only want git-annex to keep the content of files that are present in
the current branches and tags, then simply deleting a file and committing will
work. Later, use <code>git annex unused</code> to finish getting rid of the
content of deleted files that are not used by any branches/tags.</p>
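<p>Concretely, that simple workflow could look like the following sketch (<code>cleanup_deleted</code> is a hypothetical helper name; run it only if you do not need the content of deleted files in old history):</p>

```shell
#!/bin/bash
# Sketch of the "delete, commit, then clean up" workflow described above.

cleanup_deleted() {
    local file=$1
    # 1. Delete the file from the working tree and commit.
    git rm -- "$file"
    git commit -m "delete $file"
    # 2. Once the deletion has reached every branch and tag, list
    #    content no longer referenced anywhere, then drop all of it.
    git annex unused
    git annex dropunused --force all
}

# Usage: cleanup_deleted path/to/unwanted.jpg
```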
<p>If you want to keep the full history in general, but drop the content of
specific files, then you need to use git-annex to drop the content before
you delete the file. You can use <code>git annex whereis $file</code> to see everywhere that
the content has gotten to, and then <code>git annex drop $file --from</code> each location
(and from the local repository).</p>
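<p>As a sketch, dropping a file's content from every configured remote and then locally might look like this (<code>drop_everywhere</code> is a hypothetical name; special remotes not listed by <code>git remote</code> are not covered, and <code>--force</code> may be needed where numcopies would otherwise be violated):</p>

```shell
#!/bin/bash
# Sketch: drop a file's content from each git remote, then locally.

drop_everywhere() {
    local file=$1 remote
    git annex whereis -- "$file"   # show all known locations first
    for remote in $(git remote)
    do
        # May fail for remotes that are not annex-aware; that is fine here.
        git annex drop --from "$remote" -- "$file"
    done
    git annex drop -- "$file"
}

# Usage: drop_everywhere path/to/secret.pdf
```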
<p>If you need to immediately get rid of the content of some file, you can use
the same procedure to check where it is and drop it from those locations.</p>
<p>You don't need to filter old commits out of branches to use <code>git annex
unused</code>; it only looks at the most recent commit in each branch, so once a
file has been deleted from all branches it will be identified as unused.</p>
<p><strong>Work-in-progress, yet already usable, solution</strong> (comment 3, stephane-gourichon-lpad, 2017-03-09)</p>
<p>TL;DR: Thanks for clarifying the possibilities. I chose to maintain an explicit blacklist of files that are definitely uninteresting, because it is safer and more convenient on several levels (it protects against mistakes, makes reviewing easier, and relies on fewer assumptions). I share the proof of concept here in case anyone is interested in the future. I might share some scripts in a public repo, especially if anyone shows interest.</p>
<h2>Discussion</h2>
<blockquote><p>You don't need to filter old commits out of branches to use git annex unused; it only looks at the most recent commit in each branch, so once a file has been deleted from all branches it will be identified as unused.</p></blockquote>
<p>Okay, this is important to know (perhaps something to add to the documentation there).</p>
<blockquote><p>(...) depends on how quickly you need to get rid of the files, and whether you want git-annex to generally keep the content of deleted and old versions of modified files or not.</p>
<p>If you only want git-annex to keep the content of files that are present in the current branches and tags (...)</p></blockquote>
<p>In a perfect world, that would be enough.</p>
<blockquote><p>If you want to keep the full history in general, but drop the content of specific files, (...)</p></blockquote>
<p>I realize that I need to keep the full history in general, because (it has already happened) although I almost always make small independent commits and review changes before committing, large changes are relatively common. For example, renaming a directory that has many sub-directories full of files. If some files were removed by mistake since the last commit, that information is drowned in a big change. As git does not always track renames, it sometimes shows many additions and removals. Also, changes are sometimes committed by a <code>git annex sync</code> (which I avoid for this reason).</p>
<p>So a file being reported by <code>git annex unused</code> is not proof that it should be deleted.</p>
<p>So all in all, yes I "want to keep the full history in general, but drop the content of specific files".</p>
<blockquote><p>then you need to use git-annex to drop the content before you delete the file.</p></blockquote>
<p>Okay, this works locally. Yet I have started working another way; see below.</p>
<blockquote><p>You can use git annex whereis $file to see everywhere that the content has gotten to, and then git annex drop $file --from each location (and from the local repository).</p>
<p>If you need to immediately get rid of the content of some file, you can use the same procedure to check where it is and drop it from those locations.</p></blockquote>
<p>This assumes connectivity to the other repositories at cleanup time, which is not always practical.</p>
<p>I might make a number of commits including some that locally prune hundreds of files, several times, before having access to any other repository.</p>
<p>This does not leave a clear record of a "globally blacklisted" file.</p>
<p><strong>It seems to me that the situation requires maintaining some state that allows deferring the removal until the other repositories become available.</strong></p>
<h2>Work-in-progress solution</h2>
<p>This evolved into working another way:</p>
<ul>
<li>maintain (grow) a list of blacklisted keys that's stored somewhere in the repository and synced like any other (regular git) file.</li>
<li>have some script that can be run at any time from any repo and locally delete any file that's blacklisted.</li>
<li>variant: before locally dropping the file, check whether, as far as this repo knows, it is really unused everywhere.</li>
</ul>
<h3>Proof-of-concept implementation</h3>
<h4>Script <code>git_annex_drop_all_files_in_blacklist.sh</code></h4>
<p>Good for a small blacklist when <code>git annex unused</code> is slow.</p>
<pre><code>#!/bin/bash
set -eu # Safety: abort on any error or unset variable.

# Read "KEY REASON" pairs from the blacklist file and drop each key.
while read -r KEY REASON
do
    echo "Will drop blacklisted $KEY"
    git annex dropkey --force "$KEY"
done &lt;blacklist
</code></pre>
<h4>Script to show all unused files</h4>
<p>This gathers all unused files into a directory as symlinks.
They can then be sorted by date or EXIF date for review.</p>
<pre><code>#!/bin/bash
mkdir -p review_unused
cd review_unused
# Symlink each unused key's content into this directory for review.
time git annex unused | while read -r NUMBER KEY
do
    ln -s "$(git annex contentlocation "$KEY")" "$KEY"
done
xdg-open . &amp;
</code></pre>
<p>Files can then be interactively moved to other directories for further processing: reuse with <code>git add</code>, blacklist, etc.</p>
<h4>Other</h4>
<p>I have also written another experimental script that processes the output of <code>git annex unused</code>, automatically blacklists keys in some specific cases, and assists in reviewing or <code>git annex drop</code>ping them.</p>
<h2>Additional benefit</h2>
<p>Storing state in a blacklist has the advantage of being very explicit and global.</p>
<p>This has the advantage that the blacklist can potentially apply to content outside of any repository.
For example, suppose I find a digital camera memory card that I have not used for a long time. It has old copies of some photos. The blacklist allows me to immediately delete the uninteresting files; then git-annex can tell whether the remaining files are already known and can also be deleted. Then either the card becomes empty, or only the few remaining files need manual review. Quick and safe.</p>
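<p>That memory-card triage can be sketched as follows, assuming a git-annex version that provides the <code>calckey</code> subcommand and a <code>blacklist</code> file of keys in the repository root; <code>triage_card</code> is a hypothetical name, and the card path should be absolute:</p>

```shell
#!/bin/bash
# Sketch: delete files on an external card whose annex key is blacklisted.

triage_card() {
    local card=$1 repo=$2 f key
    find "$card" -type f | while read -r f
    do
        # Compute the key this file would have, without adding it.
        key=$(cd "$repo" && git annex calckey -- "$f")
        if grep -qxF -- "$key" "$repo/blacklist"
        then
            echo "blacklisted, deleting: $f"
            rm -- "$f"
        fi
    done
}

# Usage: triage_card /media/card /path/to/annex-repo
```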
<p><strong>comment 4</strong> (CandyAngel, 2017-03-10)</p>
<p>I use git-annex-metadata to tag files I don't want, then drop the content with that tag[1]. Then it doesn't matter if the content reappears; it will get pruned at some point. It is also easy to find any symlinks pointing at the unwanted content using git-annex-find <img src="http://git-annex.branchable.com/smileys/smile.png" alt=":)" /></p>
<p>[1] I actually move it to another remote on a 3TB drive which only holds such unwanted content, and then delete it there when I need more space. It doesn't matter if that drive fails, but it allows a bit of a window for an "actually, I do need that" recovery.</p>
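<p>For reference, the tag-and-prune approach described above can be sketched with git-annex's standard metadata matching ("unwanted" is a hypothetical tag name; the helper names are mine):</p>

```shell
#!/bin/bash
# Sketch of tagging unwanted files and pruning their content later.

tag_unwanted() {
    git annex metadata --tag unwanted -- "$1"
}

prune_unwanted() {
    # List all annexed files carrying the tag, then drop their content.
    git annex find --include '*' --metadata tag=unwanted
    git annex drop --force --metadata tag=unwanted
}

# Usage: tag_unwanted blurry.jpg; then, whenever convenient: prune_unwanted
```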