I’m just experimenting with git-annex on a repo of ~150k files in about the same number of directories (WORM-backend). Calling, e.g., «git annex status» will take several minutes while stat()-ing all the files (for changes, I presume).
This might already be on your todo list, but I was wondering whether it is possible to increase the performance of «annex status» (or related commands) when «annex watch» is running in the background. In that case, «status» could rely on cached data built up at some point during initialization, plus the data accumulated via inotify since then. (Hopefully, all this won’t even be needed anymore on btrfs at some point in the future.)
(I’m not very knowledgeable in these things, so just out of curiosity: I noticed that, even though the «status» invocation takes ages, no HDD activity occurs, and all the metadata is probably already in the Linux caches from a run I conducted immediately beforehand. Why do you figure that is? Is context switching so hugely expensive?)
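For the curious, here is a rough way to see that pure syscall overhead on warm caches can add up (a made-up micro-benchmark in plain Python; the file count and layout are illustrative, and none of this is git-annex code):

```python
import os
import tempfile
import time

# Toy micro-benchmark (not git-annex code): how long do N stat() calls
# take when all the metadata is already warm in the kernel caches?
def bench_stats(n=20000):
    d = tempfile.mkdtemp()
    paths = []
    for i in range(n):
        p = os.path.join(d, "f%d" % i)
        open(p, "w").close()
        paths.append(p)
    for p in paths:          # warm the dentry/inode caches first
        os.stat(p)
    t0 = time.perf_counter()
    for p in paths:          # each call is still a user/kernel round
        os.stat(p)           # trip, so the overhead accumulates even
    return time.perf_counter() - t0  # with zero disk I/O

if __name__ == "__main__":
    print("%.3fs for 20000 cached stats" % bench_stats())
```

On a repo with hundreds of thousands of files, multiplying that per-call cost out already accounts for a noticeable wait, before any of git-annex’s own per-file work.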
Neither «git annex watch» nor the assistant persistently stores any data about the files in the repository in memory. They cannot speed up «git annex status».

I’m not sure what the point is of running «git annex status» while the daemon is running, since the daemon will normally notice changes immediately and commit them. The status output should thus generally be empty.

FWIW, «git annex status» takes 0.3s in my largest repo (55k files), on an SSD. That’s in indirect mode, and the time is almost entirely spent running «git ls-files --modified», which is probably about as fast as it can be. In a direct mode repository, it will be rather slower, since it has to query git for the key that was last committed for each file, in order to look up that key’s info and see whether the file has been modified.

Yes, you’re probably right that the benefit of this is slim when the watching daemon auto-adds new files. The «status» output would then never change, and would just keep showing whatever status held before the daemon was started.
The reason I brought this up is that I recall reading a comment of yours somewhere on the site, to the effect that the assistant can sometimes speed up certain operations, because it can make valid assumptions about the state of the repo, having monitored it all along. I don’t recall which operations those were, though. That’s why I wondered whether there might be a daemon that just monitors via inotify, neither adds nor syncs, and only provides information to certain commands to speed them up under some circumstances.
In general, is it accurate to say that git-annex mostly takes the «space» option when making space/time trade-offs? I noticed that memory consumption is really slim most of the time, and wondered whether there might be ways of speeding operations up by relying on more memory instead (perhaps also doing persistent caching). On the other hand, in some regards you are probably committed to the time/memory trade-offs taken by vanilla git, so maybe there’s not much room for improvement on the git-annex side…
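To illustrate the kind of trade-off I mean, here is a toy sketch (the cache file, its JSON format, and the «size + mtime» heuristic are all my own invention, not anything git-annex actually does): spend a little disk on a persistent stat cache so that a later run only has to look closely at files whose metadata changed:

```python
import json
import os

# Toy persistent stat cache (my own sketch, not git-annex's design):
# remember each file's (size, mtime) and report only the files whose
# metadata differs from the cached value.
def changed_files(paths, cache_path):
    try:
        with open(cache_path) as f:
            cache = json.load(f)
    except FileNotFoundError:
        cache = {}          # first run: nothing cached yet
    changed = []
    for p in paths:
        st = os.stat(p)
        sig = [st.st_size, st.st_mtime]
        if cache.get(p) != sig:   # new file, or metadata changed
            changed.append(p)
            cache[p] = sig
    with open(cache_path, "w") as f:
        json.dump(cache, f)       # persist for the next invocation
    return changed
```

Of course, a size+mtime signature can miss some edits that git’s own index machinery would catch, which is presumably part of why such shortcuts aren’t taken lightly.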
Maybe direct-mode repos on the order of 100k files are just not practical. I’m using indirect mode for my really big repos now, and it’s responsive enough to use (except for «annex unused», which is inherently expensive, as you once explained). At least committing won’t take tens of minutes that way. I’ll just have to make the software play nicely with the symlinks.
BTW, the filesystem seems to have a huge impact on this. My large direct-mode annex is practically unusable on ext (tens of minutes per commit), but still usable on btrfs (a few minutes). I’m migrating one disk to btrfs at home and will run some controlled benchmarks then. An added bonus is that directories don’t always take up a full block.