I have come up with a moderately complex solution to a particular use case that I have and am posting it here in case it is useful to someone else, and to get suggestions on how to improve it.
The problem:
I have a large number of files that are accessed infrequently and stored off-line on DVD-Rs. I need to keep track of which files are on which disc so that when I want a file I can find it.
The solution:
I currently keep a text file to track which files are on which discs. I would like to organize all the files in a proper filesystem using git annex, allowing me better organization and the ability to keep some smaller related files online near the annexed large files.
Requirements:
1) Easily locate the DVD-R containing any specific offline file
This is easily taken care of with git annex whereis
2) Automatically de-duplicate stored files with the same contents
This is taken care of with one of the hash backends (e.g. SHA256)
3) The DVD-Rs still need to be usable without git or git-annex (e.g. the stored files should retain their normal human-readable names)
This requirement rules out the directory and rsync special remotes, since they store files named according to their hash. I have settled on making each disc a separate repo, which satisfies this requirement.
Future goals:
4) Easily incorporate the current DVD-Rs into the new system
I haven't found a way to fulfill this goal yet. I have some convoluted ideas, but nothing so easy as mount disc, run git annex command.
The solution in detail
Suppose you have the following tree:
~/mainrepo/thing1/file1.bin
~/mainrepo/thing1/description1.txt
~/mainrepo/thing2/file2.bin
~/mainrepo/thing2/description2.txt
You want to store thing1 on disc1 and thing2 on disc2, but you'd like to keep the descriptions online because they are small and useful for figuring out which thing you want later.
1) Create the main repo and annex the files:
cd ~/mainrepo
git init
git annex init mainrepo
git annex add .
git commit -m 'added files'
2) Create two new unrelated repos and populate them with their respective data and annex:
cd /tmp
mkdir disc1repo disc2repo
cd disc1repo
cp ~/mainrepo/thing1/* .
git init
git annex init disc1
git annex add .
git commit -m 'added files'
cd ../disc2repo
cp ~/mainrepo/thing2/* .
git init
git annex init disc2
git annex add .
git commit -m 'added files'
3) This is optional, but after annexing the files in these new repos, I replace the symlinks pointing into the .git/annex/objects/ directory with hard links. This makes the DVD-Rs usable from operating systems that can't deal with symlinks. (mkisofs handles hard links correctly.)
cd /tmp
find disc1repo/ disc2repo/ -type l -execdir sh -c "mv -iv {} {}.symlink && ln -L {}.symlink {} && rm {}.symlink" \;
4) Burn these repos onto DVD-Rs:
cd /tmp
# make the ISOs
mkisofs -volid disc1 -rational-rock -joliet -joliet-long -udf -full-iso9660-filenames -iso-level 3 -o disc1.iso disc1repo/
mkisofs -volid disc2 -rational-rock -joliet -joliet-long -udf -full-iso9660-filenames -iso-level 3 -o disc2.iso disc2repo/
# burn the ISOs (untested command)
cdrecord -v -dao disc1.iso
cdrecord -v -dao disc2.iso
5) Mount the DVD-Rs and add as a remote and fetch, then drop from the mainrepo:
cd ~/mainrepo
# disc1
mount /mnt/cdrom
git remote add disc1 /mnt/cdrom
git fetch disc1
git annex drop thing1/file1.bin
umount /mnt/cdrom
# disc2
mount /mnt/cdrom
git remote add disc2 /mnt/cdrom
git fetch disc2
git annex drop thing2/file2.bin
umount /mnt/cdrom
6) Enjoy! You can now find out what disc things are on simply using git annex whereis, and you can git annex get them or simply use them directly from the disc.
I'd appreciate any comments and helpful suggestions. Especially how to simplify the process or easily integrate all the things I already have stored on discs.
Maybe it would be possible to create a special remote using the hooks for the DVD-Rs.
Even though it is a bit tedious and complicated, the current process could be automated using a script.
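As a sketch of what such a script might look like, the following automates steps 2 through 4 for a single disc. The script name, its parameters, and the /tmp staging location are my own choices for illustration; it requires git-annex and mkisofs and has not been run against a real burner:

```shell
#!/bin/sh
# Sketch: stage one disc repo, de-symlink it, and build its ISO.
# Usage: ./make-disc.sh disc1 ~/mainrepo/thing1   (hypothetical invocation)
set -e
DISC="$1"   # disc/volume label, e.g. disc1
SRC="$2"    # directory whose files go onto this disc

mkdir "/tmp/${DISC}repo"
cd "/tmp/${DISC}repo"
cp "$SRC"/* .
git init
git annex init "$DISC"
git annex add .
git commit -m 'added files'

# Replace the annex symlinks with hard links, as in step 3 above,
# so the disc is usable without symlink support.
find . -type l -execdir sh -c \
  'mv -iv "$1" "$1.symlink" && ln -L "$1.symlink" "$1" && rm "$1.symlink"' _ {} \;

# Build the ISO; burning is left to cdrecord, as in step 4.
mkisofs -volid "$DISC" -rational-rock -joliet -joliet-long -udf \
  -full-iso9660-filenames -iso-level 3 -o "/tmp/${DISC}.iso" .
```
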
http://dar.linux.free.fr/doc/index.html
Would be nice to have this as another remote option for git-annex, since I too would like to have static (and possibly incrementally extended) remotes that span multiple DVDs
dar looks familiar, I'm sure I have run across it in the past. However, it is not suitable in this case; see requirement #3 above that the DVD-Rs be usable without git or git-annex.
What would work would be some sort of special remote that allows free-form data. Imagine that you create the DVD-R with the files on it, then you mount it and add the mount directory as a free-form special remote. git-annex checksums all the files under the specified directory and stores the relative path to each file somewhere. Then, when you want to fetch a specific hash from the remote, it looks up the relative path, appends it to the base directory, and transfers the file into the local .git/annex/objects/ store.
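The lookup table such a remote would need can be sketched with plain shell tools (function names and the index file location are made up for illustration):

```shell
#!/bin/sh
# Sketch: hash -> relative-path index for a "free-form" disc remote.

# Build an index: one "SHA256HASH  relative/path" line per file
# under the mount point.
build_index() {   # $1 = mount point, $2 = index file
    ( cd "$1" && find . -type f -exec sha256sum {} + ) > "$2"
}

# Look up the relative path stored for a given hash; git-annex would
# prepend the mount point and copy the content into .git/annex/objects/.
lookup_path() {   # $1 = hash, $2 = index file
    grep "^$1  " "$2" | sed 's/^[0-9a-f]\{64\}  //'
}
```
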
I have already stored a lot of large files on DVDs. I did that for archiving, so I cared about having several copies. But I want this to be more automated.
I take my disc (or one created by someone else, without any knowledge of Git), checksum its contents in git-annex, and in the projects where I'm using this content, I can check that the file is archived on at least N discs.
Also, I might enhance the content -- this would be reflected in a Git commit, so I also want to be able to check that the new version has been archived on several discs.
A special remote for such free-form read-only media would be very convenient.
This is starting to get interesting. A free-form remote would definitely simplify my use case, and also solve the "future goal" of easily incorporating my already existing DVD-Rs.
I haven't really looked into the git-annex internals up to this point, but looking at the hook page there doesn't seem to be a hook for init which would be needed to populate git-annex's index of files in the remote. (git-annex seems to assume that new special remotes are empty)
Another problem is where to store the hash to path relation information. On a RW remote it would be stored in the remote, but here we need to keep it in the repo somehow. This could be in the git-annex branch, or possibly another branch created specifically for this purpose.
1) initremote needs to populate git-annex's index of the files already present on the disc.
2) store and remove should just fail.
3) retrieve and check present seem straight forward.
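For reference, a read-only hook remote along these lines might be configured roughly as follows. The remote name, hook names, and the index file are assumptions; a checkpresent hook is expected to echo the key when the content exists:

```shell
# Sketch of a read-only hook special remote (names are illustrative).
git annex initremote dvdhook type=hook hooktype=dvdhook encryption=none

# 2) store and remove just fail: the disc is read-only.
git config annex.dvdhook-store-hook 'false'
git config annex.dvdhook-remove-hook 'false'

# 3) retrieve copies the file for $ANNEX_KEY from the mounted disc,
#    using a hypothetical hash->path index kept in the repo.
git config annex.dvdhook-retrieve-hook \
  'p=$(grep "^$ANNEX_KEY " .annex-dvd-index | sed "s/^[^ ]* *//"); cp "/mnt/cdrom/$p" "$ANNEX_FILE"'

# checkpresent echoes the key if the index lists it on this disc.
git config annex.dvdhook-checkpresent-hook \
  'grep -q "^$ANNEX_KEY " .annex-dvd-index && echo "$ANNEX_KEY"'
```
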
The assistant blog mentions adding support for read only remotes but I don't know anything about it: day 65 transfer polish (I'm still on 3.20120605)
Let me know if there is anything I haven't thought of yet.
I encourage playing around with the hook special remote and see how far you can make it go.
I may be doing something vaguely like this for desymlink, although I'm pretty sure it would still have a git repository associated with the directory of regular files.
One option is to use the web special remote, with file:// urls. Assuming a given disc will always end up mounted somewhere stable, such as /media/dvd1, /media/dvd2, etc, you could then just
git annex addurl file:///media/dvd1/$file

git annex whereis
will show the url, which has enough info to work out the disk to mount.

The web special remote did not support file:// urls, but I've just fixed that. The only downside is that, while it will identify files duplicated across disks, and
whereis
will show multiple urls for such files, there's only one web special remote, and so it only counts as 1 copy. This could perhaps be improved; git-annex may eventually get support for remotes reporting how many copies of a file they contain.

This works great! I first tried it with WORM, no-go. I can see why the SHA backends are so powerful: they appear to circumvent the commits which git usually uses for merging. When I first do the merge, it reports this:
warning: no common commits
Compared to how I've managed CD/DVD backups in the past, this is a quantum leap forward, and I don't find it convoluted in comparison. Yes, there is dar, but I prefer this method. In my case, it's the perfect solution for original files, which in general are treated as immutable and not accessed very often. They are usually large, too! I'm using them for digital pictures.
Hi Joey,
Thanks for the advice. I had thought of the web special remote, but as you may have noticed from my example, I don't use automount, so my DVDs and CDs all get mounted in the same place (/mnt/cdrom); the web special remote won't work for me.
I'll try to play around with the hook special remote this weekend. I thought it might be interesting to have it search for the DVDs in some common places, or even by parsing the mounted filesystems, and allow an override or augmentation through git config.
Albert,
Thanks for feedback! I'm glad that somebody else found the method I worked out useful. As I'm going to try and turn it into a proper special remote, let me know if there is any particular use case or feature you'd like me to address.
Note that in my testing, I found that you don't actually need to merge the DVD's branch into the local branch you are using for git annex to be able to find the files on it that are identical to files in your local branch.
I haven't played around with cloning the repo, but I will try that this weekend. I'm thinking it might be necessary to create local branches from the DVD remotes so that they'll get carried along when you clone the repo.
As far as the repos on the DVDs not having a shared ancestry with the main repo, that was a conscious choice. I wanted to add as little extra data to the DVDs as possible, since I usually fill them to the brim anyway. I didn't feel it would be beneficial for the DVDs to know about the history of the main repo and other files they don't contain. Furthermore, besides all the links and history, you'd be replicating all the files in the main repo that aren't annexed.
If you want to avoid the error, but still have a local branch for the DVD repos you should be able to do something like the following:
WARNING: these commands are untested!
Working from the original example, you should then get local branches for the DVDs that don't have a common ancestor with your master local repo. I haven't actually tested that though. Testing will have to wait for this weekend.
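Something along these lines should do it (a sketch against the original example; the branch names are chosen for illustration, and untested, as warned above):

```shell
# Create local branches tracking each DVD's master branch,
# without merging them into the local master.
cd ~/mainrepo
git fetch disc1
git branch disc1-master disc1/master
git fetch disc2
git branch disc2-master disc2/master
```
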
Thanks for the suggestion Joey, I found a way to make the web remotes work for adding the files from existing discs. I wound up adding a symlink farm to the repo with a link for each disk pointing at the mount point. This way when I try to retrieve a file, I see the URL which contains the name of the disc:
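A sketch of this symlink-farm approach, with hypothetical directory and file names (the real layout may differ):

```shell
cd ~/mainrepo
mkdir -p discs
# One symlink per disc, all pointing at the single shared mount point.
ln -s /mnt/cdrom discs/disc1
# Register the file's location through the per-disc path, so the
# recorded URL names the disc even though the mount point is shared.
git annex addurl --file thing1/file1.bin "file://$HOME/mainrepo/discs/disc1/file1.bin"
```
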
It took me a while as the version of git-annex in portage was rather old and I just didn't get around to updating git-annex on Gentoo for a while. If anybody wants to get git-annex 3.20121112 running under Gentoo, I detailed the process I used at http://git-annex.mysteryvortex.com
Now I'll have to try out the assistant! (Though I didn't get the webapp to compile due to a shakespeare-js error)
Without having read this, I've reported a very similar wishlist item at: http://git-annex.branchable.com/todo/wishlist:_recursive_directory_remote_setup47addurl
combining a recursive addurl (in my case using --fast) script with the suggestions regarding symlinks here, it's somewhat workable:
Ideally though, for optical media it would have a couple of more features (some already noted above):
not the original disc... trying anyway... file hash mismatch... please enable the remote disc with "MYLABEL" and creation date "2001-01-01"
This last item would make git-annex suitable to catalog existing WORM media.

In the past I have used some programs but was never satisfied with their graphical-UI-first approach or closed format. For example: gtktalog, cdcat, cdcollect, where is it, virtual volume view, gnome catalog, basenji. Ref: https://alternativeto.net/software/cdcollect/?platform=linux .
I also used at some point a plain old
find | {stat; md5} | gzip > ~/catalogs/my_volume_id.gz
then
grep mystring ~/catalogs/*gz
which, at the end of the day, has an overall good cost/benefit ratio.

IMHO git-annex has a sane foundation and the potential to do better than those tools.
Technically this looks indeed similar to a web special remote, but it needs to accommodate arbitrary mount points and keep count of copies.
To be honest, the DVD use case is not a priority for me at the moment, but it feels like a missing piece in an otherwise good puzzle. As if handling this case nicely would actually benefit other, more modern use cases.
Inspired by the OP I devised a slightly modified method for cataloguing my optical media library.
The goal is having a main git-annex repo tracking what's on HDD, while adding the optical media as one-time remotes for location tracking and easy future inspection of their contents without having to physically insert the disc in the drive and mounting it.
The current contents of the HDD and the contents of the total set of the optical discs are not a 1:1 match, both have files not present in the other.
Current plan:
copy already burnt discs to a directory on disk and create a git-annex repo with the same configuration as the main catalogue repo, using the disc label as description:
git annex init disc_label && git annex config --set annex.dotfiles true && git config annex.backend MD5 && git config annex.gitaddtoannex false
add the optical disc repo as remote in the main repo:
git remote add CD1 /mnt/CD1/ && git annex sync --no-push --jobs=cpus && git annex sync --cleanup && git remote remove CD1
create a branch for easily "mounting" each disc with git checkout without needing to
git log --oneline | grep disc_label
for finding the relevant commit hash:
git branch CD1 XXXXXXXXXX
edit working tree as desired, then override git-annex sync's automatic commit:
git add . && git commit --quiet --amend --message="add CD1"
repeat for other discs
The main repo should end up looking like so, each optical disc "dangling" from the main branch that tracks the HDD contents:
and any file given as argument to
git annex whereis
should be easily traceable:

Seems good enough for me, but considering I can barely wrap my head around all that git and git-annex are capable of, I'm posting here hoping for a sanity check that this solution does work as expected and/or suggestions for improvements before fully committing to it.
git-annex now has the ability to import a tree of files from a directory special remote, which results in a remote tracking branch, the same as you'd have after fetching a git remote.
The --no-content option avoids copying files to the local disk, although their content still will have to be read to hash them. If you want to copy the files from the disk at the same time, omit that option.
After that, you can use the dvd/master branch it created in whatever way you desire. Also if you want the discs files to end up in a subdirectory, that can be specified when you import, eg "master:dvd" will put the files into a dvd/ subdirectory.
Using this with multiple discs would probably work best if there was a way to mount each DVD to its own unique location.
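The commands for this would be along these lines; the remote name and mount point are assumptions for the sake of the example:

```shell
# Set up the mounted disc as an importable directory special remote.
git annex initremote dvd type=directory directory=/media/dvd1 \
    importtree=yes encryption=none

# Import the disc's tree to the dvd/master remote tracking branch,
# hashing the files but leaving their content on the disc.
git annex import master --from dvd --no-content
```
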
Thanks for the suggestion. I've looked into directory special remotes and using them for the optical media appears to undermine the intent of step 3 in my previous reply, of "mounting" each disc using
git checkout DISC_LABEL
, because the master branch contents are combined with the imported directory special remote contents.

The
git checkout
should leave the working directory with a 1:1 copy of the directory tree of the imported disc, except with all files replaced by broken annex symlinks.

But I'm considering the opposite now: using the directory special remote not for the optical discs but for the master branch of the repo instead, the one that tracks the local HDD tree:
The local dataset I want to use as the seed for the catalogue has multiple hardlinks so making a git-annex repository directly within it is out of the question as it would lead to duplicated data.
The initial plan to work around that was making a reflink copy of the directory tree, initialising the git-annex repo therein, and regularly updating its master branch by replacing the git working directory with a brand-new reflink clone and
git-annex add
'ing it.

If I understood git-annex right, this would imply a full re-read of the whole dataset because of the changed inode numbers of the new reflink clone, despite the contents, filenames, and mtimes of most files being 100% identical.
However, it seems that using a directory special remote would neatly circumvent that (at least until the current HDD dies and I'm forced to
mkfs
in the replacement) because git-annex would be smart enough to detect renames by looking at the stable inode and mtime of the moved files.

The local dataset is around 250K files and 4TiB in real size, ballooning to over 8TiB if hardlinked files were counted as copies. The updates (using
git-annex import master --from HDD --no-content
) to the catalogue master branch would happen with frequency somewhere between monthly and every 2 years.

Two questions:
Am I correct in assuming that re-importing a special remote would only read the newly added files and correctly detect all renames and deletions without re-reading, no matter how much time passes between re-imports of the master branch?
Are there any downsides (scalability, memory use, etc) to using a directory special remote for this use case instead of a regular git-annex repository?
Huh? Of course you can import to a separate branch.
I wasn't clear enough, or maybe I did something wrong when testing the directory special remote.
When using standard git-annex repositories for everything as explained in comment 14 I get this:
The master branch tracks only the contents of my local HDD (note that in step 4 I edit the working directory to my liking and amend the automatic
git-annex sync
commit), and the CD1 branch contains only what's on that disc and nothing else, such that
git checkout 2222222222
or
git checkout CD1
replaces everything in the working directory with the contents of CD1, similar to mounting the physical disc and browsing its filesystem.

Using the directory special remote, the contents of my master branch are combined with the contents of the special remote, so the same
git checkout CD1
command wouldn't replicate exactly what's on CD1; the master branch directory tree would still be present after the branch switch.

Use
git annex import --from=CD1 CD1
, which will import everything on the CD1 special remote to a branch named "CD1" which is completely independent from the master branch (it won't even share the same history).

Aha, that's what I was missing!
I was working under the wrong assumption that the git-annex location tracking was tied in some way to the master branch and that alternate branches required shared history with master to have their contents tracked.
Wholly independent branches for each disc is exactly what I want, thanks Lukey and Joey for your suggestions.