I have two copies of a large collection of photos on two computers. I have been using rsync to keep them synchronized, but I would like to use git annex instead. I have finished initializing one repo, and now it's time to initialize the other repo before I can sync them up. But it seems redundant for the other repo to compute its own hashes when I know that the two copies are identical. Can't I just copy the keys and tell git annex to assume that any file it sees has already been hashed?
As a start, I cloned the .git from the first machine to the second, but that wasn't enough: I now have an empty git-annex repo on the second machine. What else is missing?
git init; git annex init; git annex add .; git commit
then add the remotes in both directions and "sync". I'd say that method, or any similar set of steps, is the typical way to handle this.
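For concreteness, a rough sketch of those steps on the second machine, assuming the photos live at /data/photos on both machines and they can reach each other over ssh (the hostnames, paths, and repository description are placeholders):

# On the second machine, in the existing copy of the photos:
cd /data/photos
git init
git annex init "machine-b"
git annex add .        # hashes every file and moves it into the annex
git commit -m "add photo collection"

# Point the repositories at each other (run the matching 'git remote add'
# on the first machine too), then let git-annex reconcile everything:
git remote add machine-a ssh://machine-a/data/photos
git annex sync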
Sure, everything gets hashed twice. This is unlikely to waste enough time to make it worthwhile to develop a hack that only hashes once.
If you really want to develop such a hack, the plumbing command that you can use to make it happen is git annex setkey. So, you'd add all the files to the first repository, and then use git-annex find --format='${key} ${file}\n' to list all the files and the keys that resulted from hashing them. Then, in the second repository, you'd use that list to run git annex setkey and force the files into the annex without hashing them.

This will probably turn out to be slower than just re-hashing the files would be, since you'll have to run git annex setkey once per file. Adding a --batch option that reads from stdin would probably be called for to get it fast enough to bother with, although passing -c annex.alwayscommit=false might speed it up enough.
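A rough sketch of what that could look like, assuming the second repository is the clone of the first machine's .git described in the question, that the identical files are present in its working tree, and that the key list travels in a file such as /tmp/keys.txt (the loop and the clean-up steps at the end are illustrative, not a tested recipe):

# On the first machine: list every annexed file with its key.
git annex find --format='${key} ${file}\n' > /tmp/keys.txt

# On the second machine: feed that list to setkey, one file at a time.
# annex.alwayscommit=false avoids a git-annex branch commit per file;
# keys contain no spaces, so 'read' can split the key from the filename.
while read -r key file; do
    git -c annex.alwayscommit=false annex setkey "$key" "$file"
done < /tmp/keys.txt

# setkey moves each file's content into .git/annex/objects; checking out
# the tree puts the annex symlinks back in place, and the next git-annex
# command that commits (e.g. git annex merge) flushes the journalled
# git-annex branch updates.
git checkout -- .
git annex merge

Even with the per-file commits suppressed, starting a git-annex process once per file carries enough overhead that plain re-hashing may still win.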
In my super-heavy use case, the second hashing of the files is dwarfed by the 45-minute wait for git to update .git/index, so I would agree with this.