Hi everyone,
I want to lay out a couple of use cases here.
I have several large (1 TB +) media collections. Some are often mounted read-only. Others are very sensitive to changes -- I definitely don't want to risk anything that might munge timestamps, etc. So my requirements are:
- Must not modify the files in the existing collection in any way. No changing timestamps, no converting them to hard or sym links, etc.
- Must not store an additional copy of the data locally (I don't have space for that)
- Must be able to handle the data store being read-only mounted (.git can be read-write)
I want to use this for, in order of importance:
- Archival to external USB drives. Currently I do this with rsync and it's a real mess figuring out what's where and what to do when a drive fills up.
- Being able to easily selectively copy some of the files to a laptop or Linux-using tablet for offline viewing
- Being able to queue up files to add from a laptop/tablet
I'm not worried about the .git directory itself; I can bind-mount the existing store to be a subdirectory under a git-annex repo, so that would be fine.
So here's what I've looked into so far. All of these are run with git annex adjust --unlock (or the assistant, which does the same thing):
- A directory remote with importtree=yes would work well for use case #1. However, since the rsync special remote doesn't support importtree, it would be challenging for #2 (I guess I could make it work via sshfs, but that gets a bit nasty).
- I tried bind-mounting the existing data under a git-annex repo to use that as the source. This does work; however, presumably because it can't hard link the files into .git/annex, it results in doubling the storage space requirements for the data. That's not usable for me.
- I thought maybe a transport repo would help. So I could have, basically, source->transport<->laptop and source->transport->archive. The problem here is that git-annex can't copy directly from source to laptop or archive in this scenario without duplicating the data in transport. So I still can't just use get from the laptop to get things unless I use 2x the space, which again, I don't want to do.
- I thought about maybe adding git-annex directly to an existing directory. That risks changing things about it (since it is necessarily read-write to git-annex). I'm not really comfortable with that yet.
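For reference, the directory special remote setup from the first bullet could be sketched roughly as follows. This is only a sketch under assumptions, not a tested recipe: the remote name `source`, the branch `master`, and the path are placeholders I made up, and the script just prints the commands unless DRY_RUN=0 is set. Some git-annex versions may also want exporttree=yes next to importtree=yes, so check the documentation for your version.

```shell
#!/bin/sh
# Sketch: import an existing directory through a directory special remote.
# Prints the commands by default; set DRY_RUN=0 to really run them.
DRY_RUN="${DRY_RUN:-1}"
REMOTE=source                   # hypothetical remote name
BRANCH=master                   # hypothetical branch name
SRC_DIR=/PATH/TO/COLLECTION     # placeholder for the existing collection

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Any arguments are passed through to "git annex import".
import_collection() {
    # Register the existing directory as an importable special remote.
    run git annex initremote "$REMOTE" type=directory \
        directory="$SRC_DIR" importtree=yes encryption=none
    # Build a remote tracking branch from the files found there.
    run git annex import "$BRANCH" --from "$REMOTE" "$@"
    # Merge it in; the first merge may need --allow-unrelated-histories.
    run git merge "$REMOTE/$BRANCH"
}

import_collection "$@"
```

Run from inside the git-annex repo. Passing --no-content as an argument should make the import record the tree without copying file contents into the repo.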
Incidentally, I mentioned timestamps and didn't say how I'll preserve them for the archive drives. I can use mtree from Debian's mtree-netbsd package and do something like this on the source directory:
mtree -c -R nlink,uid,gid,mode -p /PATH/TO/REPO -X <(echo './.git') > /tmp/spec
And on the destination, restore the timestamps with:
mtree -t -U -e < /tmp/spec
I imagine some clever hooks would let me do this automatically, but I don't really feel the need for that. I think this is easier, for me, than the approaches in the "does not preserve timestamps" discussion.
First of all, you really want to look into (or migrate to) a reflink-capable filesystem like XFS or btrfs.
I don't know why you'd need to use the rsync special-remote for case #2. You create git-annex repos on your USB drives, add the existing collection as a directory special-remote with importtree=yes, and import everything. Then you clone the repo to your laptop and can git annex sync/get/copy from the USB drive however you like. I think you can even git annex enableremote the import special-remote on your laptop, and then git-annex will get files directly from it. Heck, you could even git annex import --no-content and only have the file metadata imported, with none of the content actually stored in git-annex, and then you can selectively git annex get files directly from the special-remote.
Also, you may want to set
git annex config --set annex.dotfiles true
on each of your repos. All of these options are documented in the git-annex manpage (also look at the git-annex-config manpage).
Thank you for these thoughts!
I should have mentioned that I intend the USB drives to often live offsite, so they would be disconnected. You are quite correct, though, that if they are onsite I could think of them as the sort of "hub" repository and do everything from them like that.
Doing the enableremote for the special directory remote on the laptop does require it to be mounted as a filesystem there, hence my mention of sshfs. That can work but is a bit clunky.
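To make the laptop side concrete, here is a rough sketch of the sshfs route described above. It assumes the special remote was already set up elsewhere under the name `source`; that name, the host `user@server`, the mount point, and the example file path are all made-up placeholders. The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: fetch selected files on the laptop straight from the source
# directory, mounted read-only over sshfs.
# Prints the commands by default; set DRY_RUN=0 to really run them.
DRY_RUN="${DRY_RUN:-1}"
REMOTE=source                        # hypothetical special remote name
SRC_HOST=user@server                 # hypothetical host holding the collection
SRC_DIR=/PATH/TO/COLLECTION          # placeholder path on that host
MOUNT_POINT=/mnt/collection          # hypothetical local mount point

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Arguments: the files (paths inside the repo) to fetch.
fetch_from_source() {
    # Mount the remote collection read-only so nothing there can be changed.
    run sshfs -o ro "$SRC_HOST:$SRC_DIR" "$MOUNT_POINT"
    # The directory remote needs its directory= given when it is enabled;
    # here it points at the local mount rather than the original path.
    run git annex enableremote "$REMOTE" directory="$MOUNT_POINT"
    # Fetch only the requested files from that remote.
    run git annex get --from "$REMOTE" "$@"
}

fetch_from_source "${@:-videos/example.mkv}"
```

The mount point has to exist beforehand, and unmounting afterwards is left out. This does not remove the clunkiness; it just collects the steps in one place.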