I want to use git-annex to keep track of and archive large tarballs (on the order of 10 to 100 GB each). One of the locations is a set of external HDDs that are formatted to exFAT.
Unfortunately, every git command takes hours to execute. For example, every time I use `git status` the index is refreshed, which takes about 3 hours, and committing a single file takes similarly long.
Is there anything I can do to speed things up?
As so often happens, you search the docs for hours and don't find anything, and then right after asking the question I found a (seemingly) relevant page I had somehow missed before: in this case, backends.
Creating a new annex repo on a drive that already had the data but no repo, and changing the backend to WORM (with `git config --local --add annex.backend WORM`), seems to have made things a little bit faster: the first 3.5 GB file was added in just under 2 minutes, and `git status` returned after 44 seconds when first run and instantaneously thereafter. However, adding all ~3 TB in the repo is shaping up to take multiple days anyway.
Is there anything else I've missed? I don't see what could be taking so long if all that is being checked is mtime, name and size.
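For reference, the commands were roughly the following; the repository description is just an example:

```
# Repo created directly on the exFAT drive, which already held the data
git init
git annex init "external HDD 1"

# Use the WORM backend (keys built from filename, size and mtime) instead
# of the default SHA256E, so adding doesn't hash every tarball in full
git config --local --add annex.backend WORM

git annex add .
git commit -m "add tarballs"
```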
Yep, apparently, exFAT can't do symlinks.
In general, it is best to use git-annex on non-shitty filesystems or one will run into problems with their limitations -- something git-annex can't really do much about.
But you can work around it by using a bare git repository on the HDD as I mentioned above (see here for how to do that), or by using the HDD just as a directory/rsync special remote from other git-annex repos. In both cases symlinks are not needed, and no expensive, slow piping through smudge/clean filters is done. The downside is that you can't work (i.e. add, read, modify files) directly on the HDD; you can only use it as a storage drive.
Thanks for your help. I've created a bare repo on one of the drives that didn't have a repo yet, and have been moving the files to my laptop's internal drive to add and commit, and then back with `git annex move`. This seems to be working much better. (I already have more than half the files in the repo after one night; previously I had left it running for days and couldn't get that far. And yes, the drives are very slow.)
Re the filesystems, I thought so, but unfortunately these have to be compatible with both Mac and windoze...
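In case it helps someone else, the workflow looks roughly like this (paths and the remote name are placeholders):

```
# Bare repo on the exFAT drive: no working tree, so no symlinks and no
# smudge/clean filters are needed on the drive itself
git init --bare /mnt/exfat-hdd/annex.git
git -C /mnt/exfat-hdd/annex.git annex init "external HDD (bare)"

# In the repo on the laptop's internal drive:
git remote add hdd /mnt/exfat-hdd/annex.git

# Add and commit locally, then move the content to the drive
git annex add big-tarball.tar
git commit -m "add big-tarball.tar"
git annex move big-tarball.tar --to hdd

# Push the git branches (including location tracking) to the drive
git annex sync hdd
```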
I'm not sure if this is caused by the disk's repo being bare, but after I have `git annex move`d a file there I still need to run `git annex sync` in the local source repo before I can find or copy the file to a third repo (in this case a second repo on my laptop). This is a little confusing because it seems the bare repo on the disk doesn't know it has the files that have been moved to it. Is this because it is bare, am I doing something wrong, or why doesn't `git annex move` result in the target knowing that it has a given file?

If you `git annex move`d a file from a local repo to the bare repo on the HDD, your local repo should know about this immediately. I'm not sure if an immediate `git annex move` on the HDD afterwards knows about this. Sounds like it should, but maybe it doesn't for performance reasons. That would explain the need for a subsequent `sync`. In general, I would recommend setting up preferred content expressions for each repo and then always just running `git annex assist` to have it sync everything. Slower than manual moving and copying, but less worrying.

To try out accessing the file from another location I created a second repo on my laptop. So I have the "real" local repo (`L`), the repo on the disk (`D`) and another local repo for testing (`T`). In `L` I added and committed `$FILE` and then moved it to `D`. If I now run `git annex whereis $FILE` in all the repos, `L` tells me it's in `D`, while both `D` and `T` tell me the file isn't known to git. Only when I run `git annex sync` in `L` does `T` know, and `D` still doesn't. Not a big issue, just a little surprising, but it's fine to have to remember to run `sync` in the local repo before disconnecting the disk.

I have started looking into preferred content and groups and I will most likely use them. At least to begin with I want to try doing things manually and then later move on to the more sophisticated tools.
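Concretely, the extra step I now run before unplugging the disk is roughly this (repo names as above):

```
# In L (the local source repo), after moving content to D and before
# disconnecting the disk: push the git changes, including the git-annex
# branch that records which repo has which file, out to the remotes
git annex sync

# or, to sync with the disk remote only:
git annex sync D
```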
The location tracking information lives in the `git-annex` branch, which is only synced with `git annex sync` or `git annex assist`. If you're on the manual route (i.e. no preferred content, no `git annex sync --content`, no `git annex assist`), then you are supposed to sync the git repos yourself, e.g. with `git annex sync`. It also makes sense from a performance standpoint: git syncing can be slow, especially on slow hardware. Maybe you don't want to sync the metadata after every copy/move/drop/etc., but rather batch it up. And as long as the info about where the files are is somewhere in a `git-annex` branch (e.g. in your local repo `L`), it's fine, as it will eventually be synced around.
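If you do eventually go the preferred-content route, a minimal setup could look something like the following sketch; the group choice and the expressions are only an illustration, not something specific to your repos:

```
# Illustrative only: put the HDD repo in the standard "backup" group,
# which wants a copy of everything ever committed
git annex group hdd backup
git annex wanted hdd standard

# The laptop only wants files that don't yet have a copy in a backup repo
git annex wanted . "not copies=backup:1"

# Then a single command pulls/pushes git (including the git-annex branch)
# and transfers/drops content according to the expressions above
git annex assist
```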
Makes sense. Thanks again, you've helped me a lot.
What might be happening on the exFAT drive is that every time the filesystem is mounted, it generates new inode numbers for all the files. So when you run `git status`, git sees the new inodes and needs to do work to determine whether the files have changed. When a file is an annexed file that is unlocked (which all annexed files necessarily are on this filesystem, since it doesn't support symlinks), `git status` needs to ask git-annex about it, and git-annex has to either re-hash the file (for SHA) or do a smaller amount of work (for WORM).

A bare repository does get around that. But what I tend to use in these situations is a directory special remote configured with `ignoreinodes=yes`.
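For anyone finding this later, such a directory special remote could be set up roughly as follows; the remote name, path and encryption choice are placeholders, and the exact options are best checked against the directory special remote documentation:

```
# Directory special remote on the exFAT drive: no git repo, no index,
# no symlinks on the drive itself
git annex initremote exfat-hdd type=directory \
    directory=/mnt/exfat-hdd/annex encryption=none ignoreinodes=yes

# Content is then copied to / fetched from it like any other remote
git annex copy big-tarball.tar --to exfat-hdd
git annex get big-tarball.tar --from exfat-hdd
```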
Hi joey,
if it was slow because the inodes changed, it should only be slow the first time `git status` etc. are run. What I experienced was that it got slower the more files were in the repo, while the drive was continuously connected.
Thanks for your suggestion to use a directory special remote, but it's not clear to me how that would be an improvement over a bare repo. The only drawback to using a bare repo is the lack of a working tree, and special remotes don't seem to have that either.
The speeds reported by `get` and `copy` were similar to what `rsync` reports when I just copy files to and from the disks. It was really just the work tree (and, I guess, the clean/smudge filters) being slow. And bare repos have the advantage that they carry the metadata and possibly the file history.
The fact of the matter is that HDDs have gotten shittier during the past decade or so, because most of them (except for sizes above 8TB and drives meant for the enterprise) already employ SMR (shingled magnetic recording) instead of conventional recording techniques. It seems SMR is poison for all sorts of workloads that involve small files and directories being rewritten -- which is exactly what git and git-annex are doing under the hood (not that SMR drives have exactly stellar sequential speeds either).

I bought a 6TB WD Elements desktop drive recently (knowingly an SMR unit, because non-SMR HDDs are rather expensive here in Finland) as a git-annex archival drive, and I was flabbergasted at how slow it turned out to be for merely syncing git metadata. I'm on Windows and have to use adjusted-unlocked branches, so there's that, but NTFS on Windows is not terribly slow -- maybe not as speedy as Linux filesystems, but it caches metadata rather well, so it's OK. The problem is that while my 4TB Seagate IronWolf non-SMR git-annex drive syncs in a minute or two, my new 6TB drive takes a minimum of ten minutes or so to do that. At this point I have just resigned myself to the fact that all my future archival drives where I use regular git remotes will be slow as molasses. I'd love to own big, fast non-SMR enterprise drives, but those will be outside my budget for years to come, I'm afraid.