Tracing "slow performance" of our git-annex'ification of zarr archives in dandisets. Initially I thought that may be a flat directory structure of the journal is a culprit but Joey showed that it is unlikely the case.
I think that part of the heavy IO/CPU of the `git-annex registerurl --batch --json --json-error-messages` call comes from the fact that some files in those zarr trees are identical across the hierarchy, leading to over 8k URLs being added for a single annex key. In turn, judging from strace, I think that git-annex, to update the listing of urls for a key, first creates a copy of the prior version of the .web log file under `othertmp/` and then moves it over to the journal location, e.g.
$> grep journal registerurl-3644019.log | head
3644019 openat(AT_FDCWD, ".git/annex/journal.lck", O_RDWR|O_CREAT, 0666) = 11
3644019 openat(AT_FDCWD, ".git/annex/journal/5a4_81f_MD5E-s560--82188063b1988362cc3050918f493320.log", O_RDONLY|O_NOCTTY|O_NONBLOCK <unfinished ...>
3644019 openat(AT_FDCWD, ".git/annex/journal/5a4_81f_MD5E-s560--82188063b1988362cc3050918f493320.log.web", O_RDONLY|O_NOCTTY|O_NONBLOCK <unfinished ...>
3644019 openat(AT_FDCWD, ".git/annex/journal.lck", O_RDWR|O_CREAT, 0666) = 11
3644019 openat(AT_FDCWD, ".git/annex/journal/5a4_81f_MD5E-s560--82188063b1988362cc3050918f493320.log.web", O_RDONLY|O_NOCTTY|O_NONBLOCK) = 16
3644019 mkdir(".git/annex/journal", 0777) = -1 EEXIST (File exists)
3644019 stat(".git/annex/journal", {st_mode=S_IFDIR|S_ISGID|0755, st_size=24299080, ...}) = 0
3644019 rename(".git/annex/othertmp/5a4_81f_MD5E-s560--82188063b1988362cc3050918f493320.log.web", ".git/annex/journal/5a4_81f_MD5E-s560--82188063b1988362cc3050918f493320.log.web") = 0
3644019 openat(AT_FDCWD, ".git/annex/journal.lck", O_RDWR|O_CREAT, 0666) = 11
3644019 openat(AT_FDCWD, ".git/annex/journal/5a4_81f_MD5E-s560--82188063b1988362cc3050918f493320.log", O_RDONLY|O_NOCTTY|O_NONBLOCK) = 16
and that file for the key eventually grows to 1 MB in size. So for 8k URLs, it would take about 4000 MB = 4 GB of write IO to keep rewriting such a growing file while adding those 8k URLs.
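A back-of-the-envelope check of that estimate (a minimal sketch; the 8k URL count and the ~1 MB final file size are the observed numbers from above):

```python
# Rough model of the rewrite cost: each of the n url additions rewrites
# the whole log file, and the file grows linearly with the number of urls.
n_urls = 8_000               # observed number of urls for one key
final_size_mb = 1.0          # observed final size of the .log.web file
line_mb = final_size_mb / n_urls

# Total write IO = sum of the file size at each of the n rewrites.
total_mb = sum(i * line_mb for i in range(1, n_urls + 1))
print(f"~{total_mb:.0f} MB written")   # ~4000 MB, i.e. ~4 GB
```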
Full strace output for those few seconds can be found here.
I wonder how we/git-annex could improve performance for such cases?
Maybe changes to those .web files in the journal could be done "in place" by appending to them, without needing to copy/move?
Maybe there is a way to "stagger" those --batch additions somehow, so that all thousands of URLs for a key are added in a single "run", thus needing only a single copy/move and one set of locking/stat'ing syscalls?
PS More information can be found at dandisets/issues/225
You have not ruled out the flat directory structure being a problem, since your system may have different performance than mine. It would be good if you could try the simple test I showed there, to check whether reading/writing a file in a large directory is indeed a problem.
Anyway, nice observation here; growing such a large log file one line at a time with rewrites is of course gonna be slow. That would be a nice optimisation target.
(Also, the redundant mkdir/stat/etc on every write are not helping performance. git-annex never rmdirs the journal, so those should be easy to eliminate by only doing a mkdir when a write fails due to the journal directory not existing.)
@yoh is your use of registerurl typically going to add all the urls for a given key in succession, and then move on to the next key? Or are the keys randomly distributed?
It sounds like it's more randomly distributed, if you're walking a tree and adding each file you encounter, and some of them have the same content so the same key.
But your strace shows repeated writes for the same key, so maybe they do bunch up? If it was not randomly distributed, a nice optimisation would be for registerurl to buffer urls as long as the key stays the same, and then do a single write for that key with all the urls. But it can't really buffer like that if the keys are randomly distributed; the buffer could then use a large amount of memory.
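For illustration, a minimal sketch of that buffering idea (Python rather than git-annex's Haskell; `write_log` is a hypothetical one-shot journal write, and the (key, url) stream stands in for the --batch input):

```python
from itertools import groupby
from operator import itemgetter

def register_batched(pairs, write_log):
    """Buffer consecutive urls that share a key, so each key's log is
    written once per run of urls instead of once per url.

    pairs: iterable of (key, url) in the order --batch receives them.
    write_log: hypothetical primitive doing one journal write for a key.
    With randomly distributed keys every group has length 1, so nothing
    is saved -- which is why the distribution matters."""
    for key, group in groupby(pairs, key=itemgetter(0)):
        urls = [url for _, url in group]
        write_log(key, urls)  # single write covering the whole run

# e.g. register_batched([("K1", "u1"), ("K1", "u2"), ("K2", "u3")], print)
```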
Simply appending would be a nice optimisation, but setUrlPresent currently compacts the log after adding the new url to it. That handles the case where the url was already in the log, so it does not get logged twice. Compacting here is not strictly necessary (the log is also compacted when it's queried), but committing a log file with many copies of the same url would slow down later accesses of the log file.
So it needs to check before appending if the url is already in the log (with the same or newer vector clock). This would still be worth doing, as it avoids the repeated rewrites. But there would still be the overhead of rereading the file each time.
Hmm... git-annex could cache the last journal file it wrote, and only use that cache to check if the line it's writing is already in the file before appending to the file. Using the cache this way seems like it could avoid needing to invalidate the cache when some other process modifies the journal file. There are two cases:
Unfortunately, with random distribution, as discussed above, that caching would not help, since git-annex can't cache every log for every key.
Anyway, writes are the more expensive thing, so it's still worth implementing appending, even if it still needs to read the log file first.
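A minimal sketch of that check-then-append approach (Python illustration, not the actual setUrlPresent; the log format is simplified to one url per line, ignoring vector clocks):

```python
def append_url(log_path, url):
    """Append url to the log only if it is not already there. Avoids
    rewriting the whole file on each addition, but still has to read the
    log once per addition to do the presence check."""
    try:
        with open(log_path) as f:
            if any(line.rstrip("\n") == url for line in f):
                return  # already logged; nothing to do
    except FileNotFoundError:
        pass  # no log yet; the append below creates it
    with open(log_path, "a") as f:
        f.write(url + "\n")
```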
I've optimised away the repeated mkdir of the journal.
Probably not a big win in this particular edge case, but a nice general win.
Writing to the journal is currently atomic. And git-annex does take advantage of that atomicity, by not locking the journal when it's reading from it in some cases, the most used of which is Annex.Branch.get.
But, it's not guaranteed that an append is atomic. A short enough append may be, but how short may vary, and it's not well defined. Here is an example of a short append that gets interrupted in the middle by a kill signal: https://bugzilla.kernel.org/show_bug.cgi?id=55651
So appending would need more locking of the journal, which would add some overhead to everything. And especially would hurt concurrency.
Also, the journal is currently crash-safe. Even if there's a sudden power loss, the write either completed or didn't happen. Appending would lose that nice property.
@yarikoptic ok, please check, and if you can do that, I'll implement the buffering of urls for a key.
It looks like appending is not feasible.
The only other approach I can think of would be to have a switch that makes git-annex buffer branch writes in memory, rather than using the journal, and commit at the end, or when the buffer gets too large.
apparently (according to John) it would require us to prefetch the entire listing of the bucket instead of starting to work on the obtained information as soon as the first page comes in. So, "no free lunch" again.
Re append: I wonder if it would be at all faster, at least for me, if the append was done through `cp --reflink=auto` of that file first into `othertmp/`, appending there, and then `mv`'ing it back... most likely not, or not enough to be worth it.

Operating (optionally) a journal in memory (if I understood "git-annex buffer branch writes in memory" correctly) sounds like a possible cool speedup overall, but I am not sure how that could work across multiple git-annex instances.
I think that CoW and append could indeed speed it up. An append on a CoW filesystem should be able to keep the original file without copying it, and just add a new block for the appended data to the other file. I did a quick test on btrfs, starting with a 100 MB file, making a `cp --reflink` copy, and appending to it. All operations took less than 10 ms.
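That quick test is easy to reproduce (a sketch shelling out to coreutils; the file names are made up, and `--reflink=auto` silently falls back to a full copy on filesystems without CoW support):

```python
import subprocess, time

src, clone = "big.log", "big.log.new"  # made-up file names

# Start with a ~100 MB file, as in the test described above.
with open(src, "wb") as f:
    f.write(b"x" * (100 * 1024 * 1024))

t0 = time.time()
subprocess.run(["cp", "--reflink=auto", src, clone], check=True)  # CoW clone
with open(clone, "ab") as f:
    f.write(b"one more url\n")  # only a new block is written, not 100 MB
print(f"reflink copy + append: {1000 * (time.time() - t0):.1f} ms")
```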
The buffering in memory would be while the process was running, then it would commit it to the git-annex branch. So if no other process needs to see that data while the process is running, you'd be ok.
Rather than buffering in memory, it could buffer to a temporary journal, and merge that with the main journal at the end. That would let it append in place without worrying about locking, and memory use would not matter either. In the common case, files could just be moved from the temp journal to the main journal, which would be cheap.
Same as buffering in memory, this would change the normal behavior where other processes can see the changes made by a --batch process while it's running. So it would need to be a non-default mode.
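A rough sketch of that temp-journal mode (hypothetical directory names; the real thing would live in git-annex's Haskell):

```python
import os, shutil

MAIN = ".git/annex/journal"
TEMP = ".git/annex/journal-tmp"  # hypothetical process-private journal

def merge_temp_journal():
    """Fold the process-private journal into the main one at the end.
    The common case is a plain rename, which is cheap; only files that
    also changed in the main journal need an actual merge (simplified
    here to an append)."""
    for name in os.listdir(TEMP):
        src, dst = os.path.join(TEMP, name), os.path.join(MAIN, name)
        if not os.path.exists(dst):
            os.rename(src, dst)            # cheap common case
        else:
            with open(src) as s, open(dst, "a") as d:
                shutil.copyfileobj(s, d)   # naive merge by appending
            os.remove(src)
```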
I think though, that before implementing any of these things, I should first benchmark how much overhead there would be in locking the journal around read operations.
Ran `git-annex whereis --quiet` over 10000 annexed files. With journal locking on read, it took 11.71 seconds. Without journal locking, it took 11.73 seconds. No speed difference. And strace showed why: this only opened the journal directory once, noticed it was empty, and skipped ever trying to read any files from it! If there are files, it stages them and still manages to not need to read from the journal after that. Nice optimisation from earlier this year.
I thought that --batch commands would still check the journal files, but surprisingly, they don't seem to. That was a bug: batch commands miss journalled changes made while running
After fixing that, I benchmarked feeding 10000 filenames into `git-annex whereis --batch`. With journal locking on read, it took 18.43 seconds. Without journal locking, it took 17.22 seconds. Before that bug fix, with or without journal locking, it took 16.59 seconds.

So, if the slowdown caused by journal locking on read is a problem for anyone, a mode could be added that makes --batch not check the journal for changes made after the command started. That would make it run as fast as before that bug fix.
There might be commands other than --batch commands that both read and write git-annex branch data, and so end up checking the journal on every read, since writing invalidates the above optimisation. Not sure which commands those would be, maybe `git-annex drop`? Anyway, such commands are probably doing more expensive things than locking the journal; they're not query commands.

That makes me ok with adding the locking on read, if needed for append. (Or similar added overheads to journal reads.) For now, I've committed it to the `append` branch.

The remaining problem with appending is crash safety. If an append is not atomic, a journal file could end up having a truncated line written to it.
That seems unlikely, but see the bugzilla page above; it can happen on a kill signal at least.
So, can append somehow be made atomic? How about this:
Make `.git/annex/journal-append/`, which contains append files, which are the same as journal files but in the process of being appended to. And make it also contain size files, which contain a number: the size of the append file before anything got appended to it. Then, to append to a journal file:
When reading journalled files, it would need to also check the append file, and read only up to the recorded size. When both the append file and the journal file exist, it would read both and combine them. This change would slow down reads slightly, though as seen in comment #10, mostly only for --batch commands.
(It may not be necessary to lock on read actually. It can check for the append file and read the size file. If a write is happening at the same time, the size file may not exist yet, or may have been deleted already. In either case, reading the whole append file is ok. Should be possible to make this race-safe without locking.)
When staging the journal, it would need to first handle any interrupted appends, by checking if any append files exist.
When a new git-annex is doing an append and an old git-annex is also in use, the old git-annex will not see files in the journal that are in the process of being appended to. So it might use out of date information for queries. When it's making a write, it always did first read with the journal locked, so it will block until the append is complete. So it will not use out of date information for writes.
Only when something was written to the journal but not committed to the branch, and then an append happened but got interrupted, will the old git-annex miss data. It will not see that data, and might make its own divergent changes that get committed to the branch. The new git-annex will need to deal with this when handling interrupted appends.
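To make the design above concrete, here is a rough sketch of the write and read sides (Python illustration, not the real Haskell; locking and the merge with the plain journal file are omitted, and recovery of interrupted appends is assumed to happen when staging, as described):

```python
import os

APPEND_DIR = ".git/annex/journal-append"  # hypothetical layout per the design

def append_with_sizefile(name, data):
    """Record the pre-append size, then append, then drop the size record.
    If this crashes mid-append, the leftover size file marks how much of
    the append file is still good."""
    path = os.path.join(APPEND_DIR, name)
    size_path = path + ".size"
    old_size = os.path.getsize(path) if os.path.exists(path) else 0
    with open(size_path, "w") as f:
        f.write(str(old_size))   # crash before the append: nothing changed yet
    with open(path, "ab") as f:
        f.write(data)            # crash here: size file says where good data ends
    os.remove(size_path)         # append completed successfully

def read_append_file(name):
    """Read an append file, truncating to the recorded size if an append
    is in progress or was interrupted."""
    path = os.path.join(APPEND_DIR, name)
    with open(path, "rb") as f:
        data = f.read()
    size_path = path + ".size"
    if os.path.exists(size_path):
        with open(size_path) as f:
            data = data[: int(f.read())]
    return data
```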
FWIW, I am not sure when/if I would use that mode then, fearing that it may introduce some inconsistencies. Or do you see a legit use case (like mine) where it should be safe?
The `append` branch has basic appending implemented, but it's not yet done atomically. For benchmarking, I'm using this command.
ITER=2000
Old: 52s
Appending: 28s
ITER=4000
Old: 190s
Appending: 111s
So an improvement of about 50%. But it remains nonlinear even when appending, because it needs to read the existing log file each time to determine whether it can append or needs to compact it. (Disk cache didn't work as well as I had hoped.)
What this suggests to me is that it would be good to also add a mode that blindly appends without compacting. Or, possibly, to blindly append, but then compact the journalled file before committing it to the git-annex branch.
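For illustration, the compact-before-commit variant could look something like this (a sketch; log lines are treated as opaque strings and compaction just drops exact duplicates, ignoring vector clocks):

```python
def compact_log(path):
    """Rewrite a journalled log file keeping one copy of each line, before
    it gets committed to the git-annex branch, so blind appends during the
    run do not bloat the branch."""
    with open(path) as f:
        lines = f.readlines()
    seen, out = set(), []
    for line in lines:
        if line not in seen:
            seen.add(line)
            out.append(line)
    with open(path, "w") as f:
        f.writelines(out)
```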
@yarikoptic, the new git-annex would resolve the inconsistency the next time it ran. Only with annex.alwayscommit=false would there be any time window where the old git-annex missed something written by a git-annex process that ran before the one that got interrupted. This does not seem like a large problem.
What about running `whereis` and the `registerurl --batch` at the same time -- i.e. would `whereis` have up-to-date information about the most recent `registerurl`-added url? Although maybe in our particular case that particular `whereis` + `registerurl` combination would not be used.

Added the mode that appends without first reading. Enable with `-c annex.alwayscompact=false`. Now some real speed on the benchmark!
ITER=2000
Appending, annex.alwayscompact=false: 2.6s
ITER=4000
Appending, annex.alwayscompact=false: 2.8s
ITER=8000
Appending, annex.alwayscompact=false: 3.14s
I think annex.alwayscompact=false will be ok for what you're doing in this case. It's probably not a good idea to set it generally, although the worst case is bloat of the git-annex branch causing slow reads from it.
An example of when not to use it: if you have annex.alwayscommit=false and annex.alwayscompact=false, and you run git-annex get followed by git-annex drop, the location log will be larger than it needs to be. That does not cause any problem other than slow reads from the git-annex branch, and will be cleared up by the first update of that log when annex.alwayscompact=true.
Still need to implement the atomic appending.
whereis would have up-to-date information, except when it was the old git-annex that does not understand atomic appends. Then there would be a narrow window, during each atomic append, where it would not see any of the changes that registerurl had written to the log file, and would instead see an old value from the git-annex branch.
Hmm. Maybe that is a problem. I dunno.
This could be repository version 11, but there is unfortunately no way to guarantee that an old git-annex process from before the repository upgrade is not still running and getting confused. The v9-v10 upgrade waits an entire year to ensure no such processes are left, and a similar wait would be needed here.

Still, you could force-upgrade right away the repositories that you need to be fast, or initialize new repositories at the new repository version. Since most people are not hit by this kind of performance problem, waiting a while before it's available everywhere is not a problem.
If we don't need to worry about breaking an old git-annex process, thanks to an upgrade process that makes sure there are not any, a simpler and faster method for atomic appends is as follows:
On upgrade, add a trailing NUL to all existing journal files.
When writing to a journal file, use a temp file that is renamed into place atomically like now, but add a trailing NUL to it.
When appending to a journal file, first read the last byte. If it is a trailing NUL, write the current size of the journal file to the size file, then rewind back to before the NUL, append there, followed by a new trailing NUL, and finally delete the size file.
On append, if the last byte is not a trailing NUL, read the size file and seek to that size, thus discarding a previous partial append. Then proceed to append, write the trailing NUL, and delete the size file.
When reading from a journal file, discard the trailing NUL. If there is no trailing NUL, a partial append has been detected. Get the size from the size file, and read only that much of the journal file, to discard the partial append.
Reading may not need the journal lock, if it can otherwise recover from races with append. Treating an append that is still in progress as an interrupted append is fine for the purposes of reading. And when there's no trailing NUL and it reads the size file, the append may have finished and so deleted the size file. If so, retry the read of the journal file, and look again for a trailing NUL.
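A condensed sketch of that NUL-sentinel append (Python illustration; here the size file records the offset where the good data ends, i.e. the sentinel's position, and locking is omitted):

```python
import os

def nul_append(path, data):
    """Append to a journal file that always ends in a NUL sentinel.
    data should be one or more complete lines. A missing sentinel means
    a previous append was interrupted; the size file then records where
    the good data ends."""
    size_path = path + ".size"
    with open(path, "r+b") as f:
        f.seek(-1, os.SEEK_END)
        if f.read(1) == b"\0":
            good = f.tell() - 1            # position of the sentinel
            with open(size_path, "w") as s:
                s.write(str(good))         # crash after this is recoverable
        else:
            with open(size_path) as s:     # recover from a partial append
                good = int(s.read())
        f.seek(good)
        f.write(data + b"\0")              # new content, new sentinel
        f.truncate()                       # drop any stale partial tail
    os.remove(size_path)
```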
Is there some command (`annex fsck` or `annex repack`?) we could run after an extensive `git annex registerurl -c annex.alwayscompact=false` and friends, to potentially re-compact all the .log etc files in the git-annex branch?

Oh, even simpler -- let the journal assume that all the files are a series of lines. Which they always will be, I'm sure. Then:
When appending to a journal file, first read the last byte. If it's not a newline, then seek back to the previous newline, to discard a previous interrupted append.
When reading from a journal file, similarly check the last byte and if not a newline, discard back to the previous newline.
Reading would not need any locking either. A read that happened while an append was in progress and read an incomplete line would discard it.
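The line-based variant is even shorter (a sketch under the stated assumption that valid journal files always end with a newline):

```python
import os

def append_line(path, line):
    """Append a complete line. A file not ending in a newline means a
    previous append was interrupted, so first cut back to the last newline."""
    if not os.path.exists(path):
        with open(path, "w") as f:
            f.write(line + "\n")
        return
    with open(path, "r+b") as f:
        f.seek(0, os.SEEK_END)
        end = f.tell()
        if end > 0:
            f.seek(end - 1)
            if f.read(1) != b"\n":
                f.seek(0)                          # rare recovery path:
                f.seek(f.read().rfind(b"\n") + 1)  # back to the last newline
        f.write(line.encode() + b"\n")
        f.truncate()  # discard any partial tail beyond what was written

def read_lines(path):
    """Return complete lines only; a trailing partial line from an
    in-progress or interrupted append is ignored, so no lock is needed."""
    with open(path, "rb") as f:
        data = f.read()
    if not data.endswith(b"\n"):
        data = data[: data.rfind(b"\n") + 1]  # drop the partial line
    return data.decode().splitlines()
```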
Old versions of git-annex would not get confused by NULs with this method, since the format of the branch files is not changed. So it's significantly safer, but not fully, since old git-annex versions would not deal well with an interrupted append. So it would still be good to upgrade the repo version in a way that prevents an old git-annex process from lingering around.
I've finished implementing appending.
In order to avoid stacking up a year-long v10-v12 upgrade behind the current v8-v10 upgrade, appending will for now only be done when annex.alwayscompact=false. (v11 changes have been made to remind me to revisit this later.)
That config is necessary for optimal speed in your situation anyway.
You should avoid setting annex.alwayscompact=false if there is an older git-annex also installed on the machine, that could be run in the repository at the same time as a git-annex process that is doing appends. Otherwise, you do risk the old process seeing partial/interrupted appends and getting confused.
@yarikoptic let's discuss such a hypothetical command in another todo if you turn out to need it.
If you have a way to avoid feeding too much redundant data into git-annex registerurl, you won't need it. I guess that the 8k redundant urls would not be enough to be a problem, since it would only slow down reading a bit.
If you had a cron job that was adding the same url 8k times every run, so accumulating a growing number over time, then such a command would be useful.
Thank you Joey -- we will give it a shot soonish! (maybe we should somehow add "conda" devel builds to the datalad/git-annex repo). A possibly unnecessary question, just for myself to feel better:
isn't it racy between "first read the last byte" and the subsequent "If"? I.e., if some process dumps/flushes more into that file in the interim, wouldn't you potentially end up with some garbled line? (or is there a lock?)