Hi,
We have several large git annex repos where all of the files are on remotes, and we want to go through and clean up the repositories by deleting some subset of files.
What is the fastest way to permanently delete files from a git annex repository with remotes?
I guess I can do git annex drop --numcopies=0 <file>; git rm <file>. Does that actually delete the file permanently?
Is there a faster way?
Thanks,
Mike
I experimented with this by making an empty directory with two empty files and one file with some content. I added them all, then ran
git annex drop --numcopies=0 <file>; git rm <file>
on one of the empty files. Interestingly, git annex deleted the empty file from .git/annex/objects but left the directory structure behind. In this case the link pointed to:
.git/annex/objects/pX/ZJ/SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855/SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
After the drop command what was left was the following empty directory:
.git/annex/objects/pX/ZJ/SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855/
Also interestingly, and terrifyingly: because the two empty files both pointed to the same object, the git annex drop command deleted that object from the objects directory, and now the second link points to nothing. The file is gone. This means that if you have a git annex repository with two copies of a file, and you think to yourself, "oh, let me just delete one, I don't need two", and you use the method above, you will permanently and irrevocably delete both files. Not good.
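To illustrate why both links ended up pointing at the same object, here is a rough sketch that rebuilds the key name from a file's content. It assumes the SHA256E backend, a file name with no extension, and that GNU sha256sum is available; the annex_key helper is made up for this example and is not part of git-annex.

```shell
# Hypothetical helper (not a git-annex command): rebuild the key name that
# the SHA256E backend would give a file, assuming the file has no extension.
annex_key() {
    size=$(wc -c < "$1" | tr -d ' ')            # content size in bytes
    hash=$(sha256sum < "$1" | cut -d' ' -f1)    # SHA-256 of the content
    printf 'SHA256E-s%s--%s\n' "$size" "$hash"
}

# Two separate empty files, as in the experiment above.
tmp=$(mktemp -d)
: > "$tmp/empty1"
: > "$tmp/empty2"

key1=$(annex_key "$tmp/empty1")
key2=$(annex_key "$tmp/empty2")
echo "$key1"
echo "$key2"

rm -rf "$tmp"
```

Both lines print the same key, the one that appears in the object path above, so there is only one object behind the two links, and dropping it affects both.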
Any better ideas on how to do this?
Tried another approach:
git annex unannex <file>; rm <file>
This does not delete the original, and it only works if you do
git annex get <file>
first. It won't update the remote unless you cd into that remote and run git annex sync there. After that there is the illusion that the file is gone, but its content is still in .git/annex/objects. In my test case I could open the file in question in the objects directory with vim, and it was still there.
So git annex drop deletes both copies of duplicate files, and so is too dangerous to use, and git annex unannex doesn't delete the file anywhere. I am a little stuck here; what do I do?
OK, I should have read more before writing.
It seems like the procedure is described here: http://git-annex.branchable.com/walkthrough/unused_data/
The process is:
1. rm files or directories
2. git annex sync in all remotes (that is a pain, I wish I only had to do it once)
3. git annex unused
4. git annex dropunused
5. Repeat 3 and 4 on any repository where the data is stored
That does work for me, it is just slightly cumbersome. If there is another way or if I am missing something, please let me know.
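For what it's worth, the per-repository part of that procedure could be wrapped up roughly like this. It is only a sketch: the run and cleanup_repo helpers and the DRY_RUN variable are names invented here, and by default the script only prints the commands it would run instead of running them.

```shell
# Hypothetical wrapper: run a command, or just print it when DRY_RUN=1
# (the default), so nothing is deleted by accident.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Hypothetical helper: the sync / unused / dropunused steps for one repository.
# Assumes the files were already removed and committed somewhere.
cleanup_repo() {
    repo=$1
    run git -C "$repo" annex sync            # pick up the removal
    run git -C "$repo" annex unused          # list content nothing refers to
    run git -C "$repo" annex dropunused all  # drop everything that was listed
}
```

Calling cleanup_repo with the path of each repository prints the three commands for it; setting DRY_RUN=0 would actually run them, so that is best tried on a test repository first.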
Thanks,
Mike
This post is misplaced, it is not a tip about how to use git-annex, but a question. I will be moving it to the forum after posting this comment.
The right answer is probably to run:
git annex drop $file
, with no --numcopies, no --force, etc. Just let git-annex do its job; it will check the remotes to ensure that enough copies of the file exist to make it safe to drop the content of the file from the local repository. (Note that --numcopies=0 is very unsafe: you're asking git-annex to delete even the last copy of your data without checking when you do that.)
If your goal is to get rid of every copy of this file from every repository that has a copy, I suggest just
git rm $file; git commit
, followed by running git annex unused in the various repositories to clean them up.
There is a faster way, which is to run
git annex drop --from $remote
for each remote that has the file. If you want to get rid of every copy of the file, for sure, you could add a --force to that.
git-annex deduplicates data, so it's completely expected that if two files have the same content, dropping one will remove the content of the other.
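The per-remote drop mentioned above could be looped over a list of remotes, something like the sketch below. The drop_from_remotes helper and the DRY_RUN variable are invented for this example, and the remote names are passed in by hand; by default it only prints the commands.

```shell
# Hypothetical helper: drop one file's content from each named remote.
# Prints the commands unless DRY_RUN=0 is set.
drop_from_remotes() {
    file=$1
    shift
    for remote in "$@"; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "would run: git annex drop --from $remote $file"
        else
            # No --force here, so git-annex still does its usual copy checks.
            git annex drop --from "$remote" "$file"
        fi
    done
}
```

For example, drop_from_remotes somefile remote1 remote2 prints one drop command per remote (the file and remote names here are placeholders). In a real repository the list could come from git remote, assuming every remote listed there is one you want to drop from.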
I cannot reproduce any .git/annex/objects/foo empty directories being left behind by git-annex after doing that. Perhaps you are not using a current version of git-annex?
Sorry about the misplacement, that was a complete accident.
What I am trying to do is to delete files as quickly as possible from every repository. In this case we are using git annex to move non-critical data from our main RAID drive to an external drive while still maintaining the full directory structure on the RAID drive. This is very valuable because we sometimes won't need the data for months or years, but then we may suddenly need a few files, and git annex makes it very easy to get them back. But we are talking about many terabytes and many thousands of files here, and sometimes we just want to completely get rid of that data because it takes up too much drive space. I wanted to make it as easy and safe as possible for people to just delete files from every repository, hence the question.
I am nervous about using
git annex drop --force
because it seems to me that if there are two identical copies of a file in a repository, that command will kill the content of both... or does that only happen with git annex drop --numcopies=0?
I think the best solution for me seems to be the
git rm <file>; git annex unused; git annex dropunused; git annex sync
series of commands. It would just be nice if it were possible to achieve the same results in every repository with a simple command such as git annex rm --all <file>. I recognise that this would be a dangerous command, but frankly I feel like in Linux, everyone should be aware of just how dangerous rm is in every context.