<h1>tips/How to retroactively annex a file already in a git repo — comments (git-annex.branchable.com)</h1>
<h2>comment 1 — edheil [wordpress.com], 2012-12-16</h2>
<p>Man, I wish you'd written this a couple of weeks ago. <img src="http://git-annex.branchable.com/smileys/smile.png" alt=":)" /> I was never able to figure out that incantation, and ended up unannexing and re-annexing the whole thing to get rid of a file I'd inadvertently checked into git instead of the annex.</p>
<h2>comment 2 — arand, 2013-03-13</h2>
<p>Based on the hints given here I've worked on a filter to both annex and add urls via filter-branch:</p>
<p><a href="https://gitorious.org/arand-scripts/arand-scripts/blobs/master/annex-filter">https://gitorious.org/arand-scripts/arand-scripts/blobs/master/annex-filter</a></p>
<p>The script above is very specific, but I think a few of its ideas can be used generally. The general structure is:</p>
<pre><code>#!/bin/bash
# links that already exist
links=$(mktemp)
find . -type l >"$links"
# remove from the index first so the annexing isn't blocked, then annex
git rm --cached --ignore-unmatch -r bin*
git annex add -c annex.alwayscommit=false bin*
# compare links before and after annexing, remove links that existed before
newlinks=$(mktemp -u)
mkfifo "$newlinks"
comm -13 <(sort "$links") <(find . -type l | sort) > "$newlinks" &
# rewrite links
while IFS= read -r file
do
# during filter-branch the link is created under .git-rewrite/t/; strip two leading ../ components for a correct target
ln -sf "$(readlink "$file" | sed -e 's%^\.\./\.\./%%')" "$file"
done < "$newlinks"
git annex merge
</code></pre>
<p>which would be run using</p>
<pre><code>git filter-branch --tree-filter path/annex-filter --tag-filter cat -- --all
</code></pre>
<p>or similar.</p>
<ul>
<li>I'm using <code>find</code> to make sure the only rewritten symlinks are for the newly annexed files, this way it is possible to annex an unknown set of filenames</li>
<li>When running several git-annex commands, using <code>-c annex.alwayscommit=false</code> on each and doing a single <code>git annex merge</code> at the end might be faster.</li>
</ul>
<h2>comment 3 — arand, 2013-03-18</h2>
<p>One thing I noticed is that git-annex needs to checksum each file even if it was previously annexed (obviously so, since without checksumming there is no general way to tell whether a file matches the old one). But in the specific case of replacing files that are already in git, we actually do have the sha1 of each file in question, and that can be used.</p>
<p>So, working from this, I wrote a filter script that annexes everything in the first commit and continuously writes out sha1 / filename / git-annex-object triplets to a global file. When it starts on the next commit, it compares the sha1s in the index with those in the global file, and any matches are symlinked directly to the corresponding git-annex object without checksumming.</p>
<p>I've done a few tests and this seems to be considerably faster than letting git-annex checksum everything.</p>
<p>This is from a git-svn import of the (free software) Red Eclipse game project, there are approximately 3500 files (images, maps, models, etc.) being annexed in each commit (and around 5300 commits, hence why I really, really care about speed):</p>
<ul>
<li>10 commits: ~7 min</li>
<li>100 commits: ~38 min</li>
</ul>
<p>For comparison of the old and new methods (the difference should grow with the number of commits):</p>
<ul>
<li>old, 20 commits: ~32 min</li>
<li>new, 20 commits: ~11 min</li>
</ul>
<p>The script itself is a bit of a monstrosity in bash (plus grep/sed/awk/git), and the set of files to annex is hardcoded (removed when forming <code>$oldindexfiles</code>), but it should be fairly easy to adapt:</p>
<p><a href="https://gitorious.org/arand-scripts/arand-scripts/blobs/master/annex-ffilter">https://gitorious.org/arand-scripts/arand-scripts/blobs/master/annex-ffilter</a></p>
<p>The usage would be something like:</p>
<pre><code>rm /tmp/annex-ffilter.log; git filter-branch --tree-filter 'ANNEX_FFILTER_LOG=/tmp/annex-ffilter.log ~/utv/scripts/annex-ffilter' --tag-name-filter cat -- branchname
</code></pre>
<p>I suggest you use it with at least two orders of magnitude more caution than normal filter-branch.</p>
<p>Hope it might be useful for someone else wrestling with filter-branch and git-annex <img src="http://git-annex.branchable.com/smileys/smile.png" alt=":)" /></p>
<h2>comment 4 — Stephen, 2013-06-22</h2>
<p>Thanks for the tip <img src="http://git-annex.branchable.com/smileys/smile.png" alt=":)" /> One question, though: how do I push this new history out to all my other annexes?
All I managed to do was revert the rewrite, so the raw file appeared again...</p>
<h2>large scale rewrite tips — Primiano, 2015-01-06</h2>
<p>I recently needed to kind-of-re-annex an unusually large repo (one of the largest?).
With some tricks and the right code I got it down to 19 minutes for 170,000 commits, extracting ~8GB of blobs.
I'm attaching the link here, as it might be helpful for very large projects, where git-filter-branch can become prohibitively slow:</p>
<p><a href="https://www.primianotucci.com/blog/large-scale-git-history-rewrites">https://www.primianotucci.com/blog/large-scale-git-history-rewrites</a></p>
<h2>Retroactively annex — Dilyin, 2016-06-01</h2>
<p>Hmm, guyz? Are you serious with these scripts?</p>
<ol>
<li><code>git rm -r --cached large_files</code><br />(the files are indexed as both removed and untracked, and are still in place)</li>
<li>commit the changes<br />(the files are now seen as untracked)</li>
<li><code>git annex add large_files</code><br />(the files are replaced with symlinks and staged in the index)</li>
<li>commit the changes again</li>
</ol>
<p>Make sure that you don't have annex.largefiles settings that would prevent annexing the files.</p>
<h2>"Hmm, guyz? Are you serious with these scripts?" Well, what's the matter? — stephane-gourichon-lpad, 2016-11-15</h2>
<h2>Wow, scary</h2>
<p>Dilyin's comment is scary. It suggests bad things can happen, but is not very clear.</p>
<p>Bloated history is one thing.<br />
Obviously broken repo is bad but can be (slowly) recovered from remotes.<br />
Subtly crippled history that you don't notice can be a major problem (especially once you have propagated it to all your remotes to "recover from bloat").</p>
<h2>More common than it seems</h2>
<p>There's a case probably more common than people actually report: mistakenly doing <code>git add</code> instead of <code>git annex add</code>, and realizing it only after a number of commits. Doing <code>git annex add</code> at that point leaves the file duplicated (once in regular git history, once in the annex).</p>
<p>Extra wish: when doing <code>git annex add</code> on a file whose content is already present in git history, <code>git-annex</code> could notice and say so.</p>
<h2>Simple solution?</h2>
<p>Can anyone elaborate on the scripts provided here, are they safe? What can happen if improperly used or in corner cases?</p>
<ul>
<li>"files are replaced with symlinks and are in the index" -> so what ?</li>
<li>"Make sure that you don't have annex.largefiles settings that would prevent annexing the files." -> What would happen? Also <code>.gitattributes</code>.</li>
</ul>
<p>Thank you.</p>
<h2>Walkthrough of a prudent retroactive annex — StephaneGourichon, 2016-11-24</h2>
<p>I've been using the one-liner. Despite the warning, I'm not dead yet.</p>
<p>There's much more to it than the one-liner, though. This post offers step-by-step instructions.</p>
<h1>First simple try: slow</h1>
<p>The first run was slow (an estimated &gt;600 s for 189 commits).</p>
<h1>In tmpfs: about 6 times faster</h1>
<p>I cloned the repository into /run/user/1000/rewrite-git, which is a tmpfs mount point. (The machine has plenty of RAM.)</p>
<p>There I also ran <code>git annex init</code>; git-annex found its state branches.</p>
<p>On the second try I also ran</p>
<pre><code>git checkout -t remotes/origin/synced/master
</code></pre>
<p>so that filter-branch would rewrite that branch, too.</p>
<p>There, the <code>filter-branch</code> operation finished in 90 s on the first try and 149 s on the second.</p>
<p><code>.git/objects</code> wasn't smaller.</p>
<h1>Practicing reduction on clone</h1>
<p>This produced no visible benefit:</p>
<pre><code>time git gc --aggressive
time git repack -a -d
</code></pre>
<p>Neither did cloning and retrying on the clone. Oh, but I should have used <code>git clone file:///path</code>, as stated in the git-filter-branch man page's section "CHECKLIST FOR SHRINKING A REPOSITORY".</p>
<p>This (as seen on <a href="https://rtyley.github.io/bfg-repo-cleaner/">https://rtyley.github.io/bfg-repo-cleaner/</a>) was effective:</p>
<pre><code>git reflog expire --expire=now --all && git gc --prune=now --aggressive
</code></pre>
<p><code>.git/objects</code> shrank from 148M to 58M.</p>
<p>All this was on a clone of the repo in tmpfs.</p>
<h1>Propagating cleaned up branches to origin</h1>
<p>This confirmed that filter-branch did not change the final tree:</p>
<pre><code>git diff remotes/origin/master..master
git diff remotes/origin/synced/master synced/master
</code></pre>
<p>These pushes, as expected, were refused:</p>
<pre><code>git push origin master
git push origin synced/master
</code></pre>
<p>On origin, I checked out the hash of the current master (detaching HEAD, since git refuses pushes to a checked-out branch), then from the tmpfs clone:</p>
<pre><code>git push -f origin master
git push -f origin synced/master
</code></pre>
<p>Looks good.</p>
<p>I'm not doing the aggressive shrink now, because of the "two orders of magnitude more caution than normal filter-branch" recommended by arand.</p>
<h1>Now what? Check that nothing precious broke</h1>
<p>I'm planning to do the same operation on the other repos, then:</p>
<ul>
<li>if everything seems right,</li>
<li>and if <code>git annex sync</code> works between all those fellows,</li>
<li>etc.,</li>
<li>then I will perform the reflog expire and gc prune on some, then all, of them.</li>
</ul>
<p>Joey, does this seem okay? Any comments?</p>