Please describe the problem.
Massive repo, maxing out the CPU during `git annex add .`. I had to interrupt the job, as it was processing one small file every 5 seconds after about a 3-hour run.
I am running it on the root of a large (currently 1TB) exFAT-based drive used for archiving.
The repo grew to 28G.
Is this a regular issue with exFAT? I've done quite a bit of searching. I'll do more.
What steps will reproduce the problem?
- install git-annex on El Capitan (latest) via Homebrew
- create a 1TB exFAT file store
- follow the walkthrough to set up the annex locally and on the external drive
- run `git annex add .`
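Roughly, the steps above as shell commands. This is a hypothetical sketch: a temporary directory stands in for the root of the exFAT drive, and the git-annex lines (which assume git-annex is installed) are left commented out:

```shell
# Sketch of the reproduction; a temp dir stands in for the exFAT drive root.
repo=$(mktemp -d)
cd "$repo"
git init -q
# The git-annex steps from the walkthrough (require git-annex to be installed):
# git annex init "archive drive"
# git annex add .    # the step that slowed to ~1 small file per 5 seconds
```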
What version of git-annex are you using? On what operating system?
git-annex version: 6.20160126
build flags: Assistant Webapp Pairing Testsuite S3(multipartupload)(storageclasses) WebDAV FsEvents XMPP ConcurrentOutput TorrentParser Feeds Quvi
key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 SHA1E SHA1 MD5E MD5 WORM URL
remote types: git gcrypt S3 bup directory rsync web bittorrent webdav tahoe glacier ddar hook external
El Capitan 10.11.3
Please provide any additional information below.
Have you had any luck using git-annex before? (Sometimes we get tired of reading bug reports all day and a lil' positive end note does wonders)
I'd love to say I have. You'll hear my shout of joy when I do.
If I've done the math right, one file every 5 seconds over 3 hours is only about 2000 files. The size of the files does matter, since git-annex has to read each one in full. You said the repo grew to 28G; does that mean you added 2000 files totalling 28G in size?
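That arithmetic as a quick shell check:

```shell
# One file per 5 seconds, sustained for 3 hours:
files=$((3 * 3600 / 5))
echo "$files"   # 2160, i.e. roughly 2000 files
```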
I can add 2000 tiny files (5 bytes each) in 2 seconds on an SSD on Linux.
By using a FAT filesystem, you've forced git-annex to use direct mode. Direct mode can be a little slower, but not a great deal. Adding 2000 files to a direct mode repo takes around 11 seconds here. (I did a little optimisation and sped that up to 7 seconds.)
Doing the same benchmark on a removable USB stick with a FAT filesystem was still not slow; 7 seconds again.
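The benchmark sketched in shell. The `git annex add` line assumes git-annex is installed, so it is left commented out here; the file names and directory are arbitrary:

```shell
# Create 2000 tiny (5-byte) files to benchmark adding them.
mkdir -p bench && cd bench
for i in $(seq 1 2000); do printf 'data\n' > "file$i"; done
# time git annex add .   # ~2s on a Linux SSD; ~7-11s in a direct mode repo
```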
But then I had Linux mount that FAT filesystem sync (so it flushes each file write to disk rather than buffering), and I started getting closer to your slow speed; the benchmark took 53 minutes.
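For anyone wanting to reproduce the sync-mount case on Linux, the mount options look like this (the device and mount point are assumptions, and the commands need root):

```shell
# Mount a FAT filesystem with synchronous writes (device/mountpoint are examples):
sudo mount -o sync /dev/sdb1 /mnt/usbstick
# Or remount one that is already mounted:
sudo mount -o remount,sync /mnt/usbstick
```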
So, I think the slow speed you're seeing is quite likely due to a combination of, in order from most to least important: OS X mounting the exFAT filesystem with synchronous writes, direct mode overhead, and the FAT filesystem itself.
Also, there is a fair amount of faff that git-annex does when adding a file, calling rename, stat, mkdir, etc. multiple times. It may be possible to optimise some of that to get some speedup on synchronous disks. But I'd not expect more than a few percentage points of speedup from such optimisation.
One other possibility is that you could be hitting an edge case where direct mode's performance is bad. One known such edge case is having a lot of files that all have the same content. For example, I made 2000 files that were all empty; adding them to a direct mode repository gets slower and slower, to the point it's spending 10 or more seconds per file.
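That edge case can be reproduced like so (the `git annex add` line is commented out since it assumes git-annex and a direct mode repository; names are arbitrary):

```shell
# 2000 files that all share the same content (all empty, so one annex key):
mkdir -p same-content && cd same-content
for i in $(seq 1 2000); do : > "empty$i"; done
# time git annex add .   # in direct mode this degrades to 10+ seconds per file
```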