very slow on exfat drives

I want to use git-annex to keep track of and archive of large tarballs (on the order of 10 to 100GB each). One of the locations are a set of external HDDs that are formatted to exFAT.

Unfortunately every git command takes hours to execute. e.g. every time I use git status the index is refreshed which takes about 3 hours, committing a single takes similarly long.

Is there anything I can do to speed things up?

RSS Atom

I guess this question is about backends

As so often happens, search the docs for hours and don't find anything and right after I asking the question I found a (seemingly) relevant page I had somehow missed before.

In this case backends.

Creating a new annex repo on a drive that already had the data but no repo and changing the backend to WORM (with git config --local --add annex.backend WORM) seems to have made things a little bit faster. Adding the first 3.5GB file in just under 2 minutes and git status returning after 44 seconds when first run and thereafter returning instantaneously. However adding all ~3TB in the repo is shaping up to take multiple days anyway.

Is there anything else I've missed? I don't see what could be taking so long if all that is being checked is mtime, name and size.

Comment by imlew — Thu Nov 16 12:51:59 2023

Remove comment

comment 2

Normally, using a bare repo helps on slow hardware/filesystems. This means making yokr repo on the HDD a bare repo, then adding your files somewhere else, where it's fast, and having git annex sync it over to the HDD. Uncool, but it seems lile exFAT or your HDD is rather on the shitty side. File size shouldn't matter much, the amount of files is often a problem.

Comment by nobodyinperson — Thu Nov 16 15:54:08 2023

Remove comment

comment 3

Ah, and make sure you're not on an adjusted-unlocked branch. And only use locked files. If exFAT can't do symlinks properly, that might be the problem. Unlocked gigantic files are also a bottleneck.

Comment by nobodyinperson — Thu Nov 16 15:55:49 2023

Remove comment

comment 4

Yep, apparently, exFAT can't do symlinks.

In general, it is best to use git-annex on non-shitty filesystems or one will run into problems with their limitations -- something git-annex can't really do much about.

But you can work around it by using a bare git repository on the HDD as I mentioned above (see here for how to do that) or to use the HDD just as a directory/rsync special remote from other git-annex repos. In both cases, symlinks are not needed and no expensive slow piping through smudge/clean filters is done. The downside is that you can't work (i.e. add, read, modify files) directly on the HDD, but you only use it as a storage drive.

Comment by nobodyinperson — Thu Nov 16 16:34:12 2023

Remove comment

not about backends after all

Thanks for your help. I've created a bare repo on one of the drives that didn't have a repo yet and have been moving the files to my laptops internal drive to add and commit and then back with git annex move, this seems to be working much better. (I already have more than half the files in the repo after one night, previously I had left it running for days and couldn't get that far. (And yes, the drives are very slow.))

re the filesystems, I thought so, but unfortunately these have to compatible with both mac and windoze...

Comment by imlew — Fri Nov 17 08:09:22 2023

Remove comment

one more question

I'm not sure if this is caused by the disk's repo being bare, but after I have git annex moved a file there I still need to run git annex sync in the local source repo before I can find or copy the file to a third repo (in this case a second repo on my laptop).

This is a little confusing because it seems the bare repo on the disk doesn't know it has the files that have been moved to it.

Is this because it is bare, am I doing something wrong or why doesn't git annex move result in the target knowing that it has a given file?

Comment by imlew — Fri Nov 17 08:23:32 2023

Remove comment

comment 7

I'm not sure exactly what you mean. If you git annex moved a file from a local repo to the bare repo on the HDD, your local repo should know about this immediately. I'm not sure if an immediate git annex move on the HDD afterwards knows about this. Sounds like it should but maybe it doesn't for performance reasons. That would explain the need for a subsequent sync. In general, I would recommend setting up preferred content expressions for each repo, and then always just run git annex assist to have it sync everything. Slower than manual moving and copying though, but less worrying.

Comment by nobodyinperson — Fri Nov 17 08:45:56 2023

Remove comment

comment 8

To try out accessing the file from another location I created a second repo on my laptop. So I have the "real" local repo (L), the repo on the disk (D) and another local repo for testing (T). In L I added and committed $FILE and then moved it to D. If I now run git annex whereis $FILE in all the repos L tells me it's in D, while both D and T tell me the file isn't know to git. Only when I run git annex sync in L does T know and D still doesn't.

Not a big issue, just a little suprising, but it's fine to have to remember to run sync in the local repo before disconnecting the disk.

I have started looking into preferred content and groups and I will most likely use them. At least to begin with I want to try doing things manually and then later on move on to the more sophisticated tools.

Comment by imlew — Fri Nov 17 09:41:19 2023

Remove comment

comment 9

Alright, that's how it's supposed to be. Copying files around and keeping track of the locations are two separate things. The location tracking, metadata, etc. are stored in the git-annex-branch, which is only synced with git annex sync or git annex assist. If you're on the manual route (i.e. no preferred content, no git annex sync --content, no git annex assist), then you are supposed to sync the git repos yourself, e.g. with git annex sync. It also makes sense from a performance standpoint. git syncing can be slow, especially on slow hardware. Maybe you don't want to sync the metadata after every copy/move/drop/etc., but you batch it up. And as long as the info where the files are is somewhere in a git-annex-branch (e.g. your local repo L), it's fine as it will eventually be synced around.

Comment by nobodyinperson — Fri Nov 17 09:49:02 2023

Remove comment

thank you

Makes sense. Thanks again, you've helped me a lot.

Comment by imlew — Fri Nov 17 11:43:39 2023

Remove comment

comment 11

What might be happening on the exfat drive is, every time that filesystem is mounted, it generates new inode numbers for all the files. So when you run git status, git sees the new inode and needs to do work to determine if it's changed. When the file is an annexed file that is unlocked (which all annexed files necessarily are on this filesystem since it doesn't support symlinks), git status needs to ask git-annex about it. And git-annex has to either re-hash the file (for SHA) or do a smaller amount of work (for WORM).

A bare repository does get around that. But what I tend to use in these situations is a directory special remote configured with ignoreinodes=yes.

Comment by joey — Wed Nov 29 17:36:28 2023

Remove comment

comment 12

Hi joey,

if it was slow because the inodes changed it should only be slow the first time git status etc are run. What I experienced was that it got slower the more files were in the repo whie the drive was continuously connected.

Thanks for your suggestion to use a directory special remote, but it's not clear to me how that would be an improvement over a bare repo. The only drawback to using a bare repo is the lack of a working tree and special remotes don't seem to have that either.

Comment by imlew — Tue Dec 12 11:56:05 2023

Remove comment

comment 13

A directory special remote is just a bunch of files. A bare repo has the git history and all the metadata for the bunch of files. Git itself on slow and bad filesystems is not fun and git-annex having to comb through many git objects to extract metadata for the actual annexed files is most likely the bottleneck here. Best is to not run any git-annex commands directly on the bad filesystem, but elsewhere and operate the bad filesystem repo as a remote. Then you let git-annex gather its information on a fast filesysetm and hardware and let it do only the copying of real files to and from the bad filesystem. At least that's my experience.

Comment by nobodyinperson — Tue Dec 12 12:44:38 2023

Remove comment

comment 14

The speeds reported by get and copy were similar to what rsync reports when I just copy files to and from the disks. It was really just the work tree (and I guess the clean/smudge filters) being slow.

And bare repos have the advantage that they carry for metadata and possibly file history.

Comment by imlew — Tue Dec 12 13:36:20 2023

Remove comment

comment 15

The fact of the matter is that HDDs have gotten shittier during the past decade or so because most of them (except for sizes above 8TB and drives meant for the enterprise) already employ SMR (shingled magnetic recording) instead of conventional recording techniques. It seems SMR is poison to all sorts of workloads having small files and directories being rewritten (not that drives employing it have sequential speeds that are exactly stellar either) like what git and git-annex are doing under the hood. I bought an 6TB WD Elements desktop drive recently (knowingly an SMR unit because non-SMR HDDs are rather expensive here in Finland) as an git-annex archival drive and I was flabbergasted at how slow it turned out for merely syncing git metadata. I'm on Windows and have to use adjusted-unlocked branches so there's that but NTFS on Windows is not terribly slow -- maybe not as speedy as Linux filesystems but it caches metadata rather well, so it's ok. The problem is that while my 4TB Seagate IronWolf non-SMR drive (git-annex) syncs in a minute or two, my new 6TB takes a minimum of ten minutes or so to do that. At this point I have just resigned myself to the fact that all my future archival drives where I use regular git remotes will be slow as molasses. I'd love to own big, fast non-SMR enterprise drives but those will be outside my budget for years to come, I'm afraid.

Comment by jkniiv — Tue Dec 12 19:25:16 2023

Remove comment

Workaround for WSL2 on Windows

Seems like a huge repository (mine is around 500GB) is barely feasible for git-annex on crippled filesystems. The workaround I'm currently using - is creating .vhdx virtual drives on Windows, attaching them to WSL2 and formatting as ZFS. Then I use it from WSL2. I've also made them available to Windows through FUSE bindfs --resolve-symlinks + samba share. This setup works almost all the time, except some programs (on Windows side) can't scan some directories with lots of files on samba due to timeouts. Though, it works perfectly inside the WSL2.

There are other downsides as well:

files can't be accessed on a system without WSL2
.vhdx sometimes grows in size, and it's not always easy to shrink it back

What is probably feasibly on ExFAT especially on Linux - is using nlinkfs which can simulate symlinks on crippled file systems. In theory it might even be possible to build it for Windows and use with Dokan. But I haven't tried it.

Comment by psxvoid — Tue Dec 31 05:49:31 2024

Remove comment

Add a comment

Last edited Wed Nov 15 16:37:12 2023