Since last post, I've worked on speeding up git annex watch's startup time in a large repository.

The problem was that its initial scan was naively staging every symlink in the repository, even though most of them are, presumably, staged correctly already. This was done in case the user copied or moved some symlinks around while git annex watch was not running -- we want to notice and commit such changes at startup.

Since the scan already has the stat info for each symlink, it can look at the ctime to see if the symlink was made recently, and only stage it if so. This sped up startup in my big repo from longer than I cared to wait (10+ minutes, or half an hour while profiling) to a minute or so. Of course, inotify events are already serviced during startup, so making it scan quickly is really only important so people don't think it's a resource hog. First impressions are important. :)

But what does "made recently" mean exactly? Well, my answer is possibly over-engineered, but most of it is really groundwork for things I'll need later anyway. I added a new data structure for tracking the status of the daemon, which is periodically written to disk by another thread (thread #6!) to .git/annex/daemon.status. Currently it looks like this; I anticipate adding lots more info as I move into the syncing stage:

lastRunning:1339610482.47928s
scanComplete:True
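
For illustration only, here is a minimal Python sketch of reading a file in that shape. It is not git-annex's own code; the function name `parse_daemon_status` is made up for this example, and it assumes that `lastRunning` is a count of seconds since the Unix epoch with a trailing "s", and that each line is a single `key:value` pair.

```python
def parse_daemon_status(text):
    """Parse 'key:value' lines like those in the sample above into a dict.

    Hypothetical helper: the field meanings are assumptions based on the
    two-line sample, not a specification of the real file format.
    """
    raw = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        key, _, value = line.partition(":")
        raw[key.strip()] = value.strip()

    status = {}
    if "lastRunning" in raw:
        # Assumed: seconds since the epoch, followed by a literal "s".
        status["lastRunning"] = float(raw["lastRunning"].rstrip("s"))
    if "scanComplete" in raw:
        status["scanComplete"] = raw["scanComplete"] == "True"
    return status
```

A caller would read the status file, pass its contents to this function, and treat a missing `lastRunning` as "never ran", which would mean a full scan.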

So, only symlinks made after the daemon was last running need to be expensively staged on startup. Although, as RichiH pointed out, this fails if the clock is changed. But I have been planning to have a cleanup thread anyway, that will handle this, and other potential problems, so I think that's ok.
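
The ctime test can be sketched roughly like this in Python. Again, this is an illustration under assumptions rather than the actual implementation: `symlinks_needing_stage` is a hypothetical name, `last_running` is assumed to be a Unix timestamp in seconds, and skipping only the top-level `.git` directory is a simplification.

```python
import os
import stat


def symlinks_needing_stage(root, last_running):
    """Yield symlinks under root whose ctime is later than last_running.

    last_running is assumed to be seconds since the epoch, e.g. the
    lastRunning value recorded by the daemon.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        if dirpath == root and ".git" in dirnames:
            dirnames.remove(".git")  # don't descend into the git directory
        # os.walk lists symlinks to directories under dirnames and all
        # other symlinks (including dangling ones) under filenames.
        for name in filenames + dirnames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat: inspect the link, not its target
            if stat.S_ISLNK(st.st_mode) and st.st_ctime > last_running:
                yield path
```

Everything this yields would go through the expensive staging step; older symlinks are skipped on the assumption that they were staged before the daemon last stopped. Like the approach described above, the sketch gives wrong answers if the clock has been changed.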

Stracing its startup scan, it's fairly tight now. There are some repeated getcwd syscalls that could be optimised out for a minor speedup.


Added the sanity check thread. Thread #7! It currently only does one sanity check per day, but the sanity check is a fairly lightweight job, so I may make it run more frequently. OTOH, it may never ever find a problem, so once per day seems a good compromise.

Currently it's only checking that all files in the tree are properly staged in git. I might make it run git annex fsck later, but fscking the whole tree once per day is a bit much. Perhaps it should only fsck a few files per day? TBD
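
One way to approximate a "is everything staged?" check from outside the daemon is to ask git for files it does not know about. The Python sketch below does that with `git ls-files --others --exclude-standard -z`; it is an assumption about how such a check could look, not a description of what the sanity check thread does internally, and the helper names `split_nul` and `untracked_files` are invented for the example.

```python
import os
import subprocess


def split_nul(output):
    """Split NUL-terminated output (as produced by git's -z) into paths."""
    return [os.fsdecode(p) for p in output.split(b"\0") if p]


def untracked_files(repo):
    """Return paths in repo's work tree that git is not tracking.

    Runs git with its standard ignore rules applied; -z keeps unusual
    filenames intact by separating entries with NUL bytes.
    """
    result = subprocess.run(
        ["git", "ls-files", "--others", "--exclude-standard", "-z"],
        cwd=repo,
        check=True,
        stdout=subprocess.PIPE,
    )
    return split_nul(result.stdout)
```

An empty list means git sees nothing untracked; anything else is a candidate for the "fix and log" treatment.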

Currently any problems found in the sanity check are just fixed and logged. It would be good to do something about getting problems that might indicate bugs fed back to me, in a privacy-respecting way. TBD


I also refactored the code, which was getting far too large to all be in one module.

I have been thinking about renaming git annex watch to git annex assistant, but I think I'll leave the command name as-is. Some users might want a simple watcher and stager, without the assistant's other features like syncing and the webapp. So the next stage of the roadmap will be a different command that also runs watch.

At this point, I feel I'm done with the first phase of inotify. It has a couple known bugs, but it's ready for brave beta testers to try. I trust it enough to be running it on my live data.