Hi everyone,
I want to lay out a couple of use cases here.
I have several large (1 TB +) media collections. Some are often mounted read-only. Others are very sensitive to changes -- I definitely don't want to risk anything that might munge timestamps, etc. So my requirements are:
- Must not modify the files in the existing collection in any way. No changing timestamps, no converting them to hard or sym links, etc.
- Must not store an additional copy of the data locally (I don't have space for that)
- Must be able to handle the data store being read-only mounted (.git can be read-write)
I want to use this for, in order of importance:
- Archival to external USB drives. Currently I do this with rsync and it's a real mess figuring out what's where and what to do when a drive fills up.
- Being able to easily selectively copy some of the files to a laptop or Linux-using tablet for offline viewing
- Being able to queue up files to add from a laptop/tablet
I'm not worried about the .git directory itself; I can bind-mount the existing store to be a subdirectory under a git-annex repo, so that would be fine.
So here's what I've looked into so far. All of these are run with git annex adjust --unlock (or the assistant, which does the same thing):
- A directory remote with importtree=yes would work well for use case #1. However, since the rsync backend doesn't support importtree, it would be challenging for #2 (I guess I could make it work via sshfs, but that gets a bit nasty)
- I tried bind-mounting the existing data under a git-annex repo to use that as the source. This does work; however, presumably because it can't hard link the files into .git/annex, it results in doubling the storage space requirements for the data. That's not usable for me.
- I thought maybe a transport repo would help. So I could have, basically, source->transport<->laptop and source->transport->archive. The problem here is that git-annex can't copy directly from source to laptop or archive in this scenario without duplicating the data in transport. So I still can't just use get from the laptop to get things unless I use 2x the space, which again, I don't want to do.
- I thought about maybe adding git-annex directly to an existing directory. That risks changing things about it (since it is necessarily read-write to git-annex). I'm not really comfortable with that yet.
Incidentally, I mentioned timestamps and didn't say how I'll preserve them for the archive drives. I can use mtree from Debian's mtree-netbsd package and do something like this on the source directory:
mtree -c -R nlink,uid,gid,mode -p /PATH/TO/REPO -X <(echo './.git') > /tmp/spec
And on the destination, restore the timestamps with:
mtree -t -U -e < /tmp/spec
I imagine some clever hooks would let me do this automatically, but I don't really feel the need for that. I think this is easier, for me, than the approach in the "does not preserve timestamps" discussion.
First of all, you really want to look into migrating to a reflink-capable filesystem such as XFS or btrfs.
I don't know why you'd need to use the rsync special-remote for case #2. You create git-annex repos on your usb drive, add the existing collection as a directory special-remote with importtree=yes and import everything. Then you clone the repo to your laptop and can git annex sync/get/copy from the usb drive however you like. I think you can even git annex enableremote the import special-remote on your laptop, and then git-annex will get files directly from it. Heck, you could even git annex import --no-content and only have the file metadata imported, but none of the content actually stored in git-annex, and then you can selectively git annex get files directly from the special-remote.
Also, you may want to set
git annex config --set annex.dotfiles true
on each of your repos. All of these options are documented in the git-annex manpage (also look at the git-annex-config manpage).
Thank you for these thoughts!
I should have mentioned that I intend the USB drives to often live offsite, so they would be disconnected. You are quite correct, though, that if they are onsite I could think of them as the sort of "hub" repository and do everything from them like that.
Doing the enableremote for the special directory remote on the laptop does require it to be mounted as a filesystem there, hence my mention of sshfs. That can work but is a bit clunky.
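For reference, the import-based workflow discussed above might look roughly like the sketch below. The remote name "collection", the branch name "master", the function names, and all paths are placeholders of mine, not anything from the thread, and whether --no-content behaves as hoped depends on the git-annex version, so treat this as a starting point to try on a scratch copy first.

```shell
#!/bin/sh
# Sketch only. "collection" (remote name) and "master" (branch name) are
# assumed placeholders; adjust them to your setup.

# Run once in the repository that should track the existing collection.
# $1 is the directory holding the collection (it may be mounted read-only).
setup_import_remote() {
    collection_dir=$1
    git annex initremote collection type=directory \
        directory="$collection_dir" encryption=none importtree=yes
}

# Record the collection's file tree in git without copying the content
# into .git/annex, then merge the resulting remote tracking branch.
import_metadata_only() {
    git annex import master --from collection --no-content &&
    git merge --allow-unrelated-histories collection/master
}

# Run in a clone where the collection is reachable as a filesystem path
# (for example through sshfs). $1 is that path; the rest are files to fetch.
fetch_selected() {
    collection_dir=$1
    shift
    git annex enableremote collection directory="$collection_dir" &&
    git annex get --from collection "$@"
}
```

Something like fetch_selected /mnt/sshfs/media some/file.mkv would then pull a single file onto the laptop without going through an intermediate repository.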
Just putting this out there, but if you are on ZFS or BTRFS, you can just duplicate the subvolume/dataset, remove what you want, and send it. It will by default verify your data integrity, and it is often faster.
On BTRFS, it is easy to do:
btrfs sub create send.RW; cp -r --reflink=always .git/annex/objects send.RW; btrfs sub snap -r send.RW send.RO; btrfs sub del send.RW
Then, on the target, I can reflink copy into the target repo's .git/annex/objects, and then run
git annex fsck --all --fast
since the send operation already verified the integrity.
Sometimes, if the target repo does not exist, I can take a snapshot of an entire repo, enter it, re-init it with the target uuid, force drop what I don't want, and then send it. If you're dealing with hundreds of thousands of files, it can be more practical to do that.
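Spelled out end to end, the stage / send / adopt steps could be sketched as below. The function names and every path are hypothetical, the staging directory is assumed to live on the same btrfs filesystem as the repo so that reflinks work, and the transfer step assumes the target filesystem is mounted locally (over a network you would pipe through ssh instead).

```shell
#!/bin/sh
# Sketch only; function names and paths are placeholders.

# On the source: reflink the annex objects into a fresh subvolume and
# freeze it as a read-only snapshot, which is what btrfs send requires.
# $1 = repo path, $2 = staging directory on the same btrfs filesystem.
stage_objects() {
    repo=$1
    stage=$2
    btrfs subvolume create "$stage/send.RW" &&
    cp -a --reflink=always "$repo/.git/annex/objects" "$stage/send.RW/" &&
    btrfs subvolume snapshot -r "$stage/send.RW" "$stage/send.RO" &&
    btrfs subvolume delete "$stage/send.RW"
}

# Stream the read-only snapshot into a btrfs filesystem mounted at $2.
send_objects() {
    btrfs send "$1/send.RO" | btrfs receive "$2"
}

# On the target: reflink the received objects into the repo, then let
# git-annex update its location tracking without re-hashing content.
# $1 = received snapshot path, $2 = target repo path.
adopt_objects() {
    received=$1
    repo=$2
    mkdir -p "$repo/.git/annex/objects" &&
    cp -a --reflink=always "$received/objects/." "$repo/.git/annex/objects/" &&
    (cd "$repo" && git annex fsck --all --fast)
}
```

git-annex normally keeps object directories read-only, so copying into an objects tree that already has content may need permissions loosened first; that part is not handled here.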
If you want to verify the integrity of an annexed file on ZFS or BTRFS, all you have to do is read it, and let the filesystem verify the checksums for you.
If you want a nice progress display, you can just do
pv myfile > /dev/null
I considered making a git-annex-scrub script that would check if the underlying fs supports integrity verification, then just read the file and update the log.
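A very rough sketch of what the core of such a script might look like is below. The function names are made up, the filesystem type is detected with stat -f -c %T (which I am assuming reports "btrfs" or "zfs" on those filesystems), and the step that records the result in git-annex's own log is left out entirely.

```shell
#!/bin/sh
# Sketch only; these helpers are hypothetical and not part of git-annex.

# Succeed if the given filesystem type name is one that checksums data
# and verifies it on read. Note that btrfs files created with nodatacow
# carry no data checksums; this check does not detect that case.
fs_has_checksums() {
    case "$1" in
        btrfs|zfs) return 0 ;;
        *) return 1 ;;
    esac
}

# Read a file from start to end and discard the data. On a checksumming
# filesystem a corrupted block should surface as a read error, so the
# exit status of cat is the verdict.
# Returns 2 if the filesystem type cannot be determined and 3 if the
# filesystem does not checksum data.
scrub_file() {
    fstype=$(stat -f -c %T -- "$1") || return 2
    if ! fs_has_checksums "$fstype"; then
        echo "skip: $1 is on $fstype, which does not checksum data" >&2
        return 3
    fi
    cat -- "$1" > /dev/null
}
```

Swapping cat for pv in scrub_file would give the progress display mentioned above.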
BTRFS uses hardware-accelerated crc32c by default, which is fine for detecting bitrot, but it is not secure against intentional tampering.