Hello,
first, I want to thank you for the work on git annex. I started using git annex to manage my media files (digital photo and video data), which means loads of raw photographs and big video files spread over several hard disks, whose locations and backups I need to keep track of. It works quite well so far.
A few years ago I built a small tool to offload camera memory cards with checksum/hash verification, as is common on film sets: it copies all the content of the card and afterwards rereads and compares all files to catch bit flips in transfer etc. It can also copy to multiple target devices at the same time.
I'd now like to integrate this workflow with git annex: copy a memory card into two or more annexed directories with this copy & verify workflow, use the resulting hash directly for git annex, and have both copies checked into the annex at the end, instead of copying first and adding to the annex afterwards as I do now. What would be an elegant solution?
My naive thinking would be to move the file to the .git/annex/objects/ folder after hash creation and get it correctly registered in the annex management data somehow. The reason is to save time: I do the copy & verify anyway and would prefer to have the file annex-ready with as little overhead as possible.
Cheers, Ingmar
Both git-annex setkey and git-annex reinject check the hash of the file before moving it into the annex. But you can set annex.verify=false to prevent setkey from hashing.
Eg, if you know you have a file /mnt/foo with size 11 and SHA1 hash x, you can add the file to the git-annex repository with a few lines of shell around examinekey and setkey.
To scale this to handle a lot of files, you can use the --batch option to examinekey to avoid starting a lot of processes. There is not currently a --batch option for setkey (maybe there should be), but setting annex.alwayscommit=false will speed up repeated runs of it some.
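A minimal sketch of those steps, under assumptions and not necessarily the snippet originally posted: the key is built by hand in git-annex's BACKEND-sSIZE--HASH form, setkey moves the file in without re-hashing, and fromkey (my choice here for creating the link in the working tree) points a file at the key. The helper names run, annex, annex_key, ingest_verified and ingest_many, the DRY_RUN switch, and the tab-separated input format are all invented for this example. The final git annex merge is there because annex.alwayscommit=false leaves the changes uncommitted.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the DRY_RUN switch are made up for illustration.

# Print the command when DRY_RUN is set, otherwise run it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# git-annex call that neither re-hashes the file (annex.verify=false)
# nor commits the git-annex branch after every command.
annex() {
    run git -c annex.verify=false -c annex.alwayscommit=false annex "$@" </dev/null
}

# Build a key of the form BACKEND-sSIZE--HASH from already-known values.
annex_key() {
    printf '%s-s%s--%s\n' "$1" "$2" "$3"
}

# Move an already-verified file into the annex, then link it into the tree.
# Must be run inside a git-annex repository.
ingest_verified() {
    local key=$1 src=$2 dest=$3
    annex setkey "$key" "$src" || return 1
    annex fromkey "$key" "$dest"
}

# Read tab-separated "KEY SRC DEST" lines on stdin, ingest each one,
# then commit everything staged for the git-annex branch in one go.
ingest_many() {
    local key src dest
    while IFS=$'\t' read -r key src dest; do
        ingest_verified "$key" "$src" "$dest" || return 1
    done
    run git annex merge
}

# Example with the values from above:
#   ingest_verified "$(annex_key SHA1 11 x)" /mnt/foo foo
```

Running with DRY_RUN=1 prints the git commands instead of executing them, which is a cheap way to check the key and path list before touching the repository.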
Hey, thank you for those comments.
Tried your example, joey, and it seems to do exactly what I was thinking of. Now I have a roadmap to adapt my tool to.
Just a few more small questions:
1.) How can I trigger the (recording state in git...) step after throwing in files with git -c annex.verify=false -c annex.alwayscommit=false annex setkey ...?
2.) Can setkey be used on multiple repositories for the same file and then have the file exist in multiple copies after this without an annex copy? (Of course with only one symlink generation and git add.)
So thanks again for the fast replies.
-- Ingmar
"How can I trigger the (recording state in git...) after throwing in files with git -c annex.verify=false -c annex.alwayscommit=false annex setkey ..." -- manpage says "You can later commit the data by running git annex merge (or by automatic merges) or git annex sync. You should beware running git gc when using this configuration, since it could garbage collect objects that are staged in git-annex's index but not yet committed." I myself was wondering if there is a more lightweight way of doing this.
"Can setkey be used on multiple repositories for the same file and then have the file exists in multiple copies after this without an annex copy?" -- I think setkey moves the file into annex? git-annex-import has a
--duplicate flag to preserve the original, but setkey doesn't. Also check out local caching of annexed files for how to use one annex as backing storage for several repos.
Ok, thanks.
My second question was not formulated very well. Of course I meant multiple checkouts of this one repository. In the end, that is my use case: having multiple checksum-verified copies of all my images/movie clips in one go.
Did quite a bit of testing over the past weeks and stuff seems to be working well so far.
Here is a sample bash script I used to experiment that shows the workflow if anyone is interested. (Just an example, seems to do basically the same as git annex add $FILE as it is):
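A minimal sketch of that kind of workflow (not the script from this post; the function names copy_and_verify and register_copy and all paths are assumptions): copy a file into several target directories, re-read each copy and compare its SHA-256 with the source, then print a key that setkey in each checkout can take without hashing again. It assumes sha256sum from coreutils and the SHA256 backend.

```shell
#!/usr/bin/env bash
# Sketch only: function names and paths are made up for illustration.

# Copy SRC into each target directory, re-read every copy and compare its
# SHA-256 against the source. Prints a git-annex style key on success.
copy_and_verify() {
    local src=$1; shift
    local want size dir copy got
    want=$(sha256sum < "$src" | cut -d' ' -f1) || return 1
    size=$(wc -c < "$src" | tr -d ' ')
    for dir in "$@"; do
        copy="$dir/$(basename "$src")"
        cp -- "$src" "$copy" || return 1
        got=$(sha256sum < "$copy" | cut -d' ' -f1)
        if [ "$got" != "$want" ]; then
            echo "checksum mismatch: $copy" >&2
            return 1
        fi
    done
    printf 'SHA256-s%s--%s\n' "$size" "$want"
}

# Hand one verified copy to the checkout at REPO without re-hashing it.
register_copy() {
    local repo=$1 key=$2 copy=$3
    git -C "$repo" -c annex.verify=false annex setkey "$key" "$copy"
}

# Example (placeholder paths):
#   key=$(copy_and_verify /media/card/clip.mov /mnt/a/incoming /mnt/b/incoming)
#   register_copy /mnt/a/repo "$key" /mnt/a/incoming/clip.mov
#   register_copy /mnt/b/repo "$key" /mnt/b/incoming/clip.mov
```

One caveat with this sketch: a re-read directly after the copy may be served from the page cache rather than the target device, so a real offload tool would want to flush or bypass the cache before verifying.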
Kind regards, Ingmar
Great that it's working.
Not sure if relevant to your case, but one other way to have multiple checkouts of one repo is Using git-worktree with annex. Each checkout must be of a different branch though. On the plus side, they all share one annex, so there is only one copy of each annexed file (for locked files).
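For illustration, a minimal sketch of creating such a second checkout with plain git worktree; the function name add_checkout and the example paths and branch name are made up, and nothing git-annex specific is shown here.

```shell
#!/usr/bin/env bash
# Sketch only: add a second checkout of REPO at DIR on a new branch BRANCH.
# git refuses to check out a branch that another worktree already uses,
# which is why each checkout gets its own branch.
add_checkout() {
    local repo=$1 dir=$2 branch=$3
    git -C "$repo" worktree add -b "$branch" "$dir"
}

# Example (placeholder paths):
#   add_checkout /mnt/a/repo /mnt/a/repo-second second-checkout
```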