Kickstarter is over. Yay!
Today I worked on the bug where git annex watch turned regular files
that were already checked into git into symlinks. So I made it check
if a file is already in git before trying to add it to the annex.
The tricky part was doing this check quickly. Unless I want to write my
own git index parser (or use one from Hackage), this check requires running
git ls-files once per file to be added. That won't fly if a huge
tree of files is being moved or unpacked into the watched directory.
Instead, I made it only do the check during git annex watch's initial
scan of the tree. This should be OK, because once it's running, you
won't be adding new files to git anyway, since it'll automatically annex
new files. This is good enough for now, but there are at least two problems
with it:
- Someone might git merge in a branch that has some regular files, and it would add the merged-in files to the annex.
- Once git annex watch is running, if you modify a file that was checked into git as a regular file, the new version will be added to the annex.
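The current behavior and its second gap can be sketched in a few lines. This is only an illustration in Python, not git-annex's actual Haskell code; `should_annex`, `in_initial_scan`, and `tracked` are hypothetical names made up for the example.

```python
from typing import Set


def should_annex(path: str, in_initial_scan: bool, tracked: Set[str]) -> bool:
    """Decide whether a file seen by the watcher gets added to the annex.

    Hypothetical model of the policy described above: the "is it already
    in git?" check only happens during the initial scan of the tree.
    """
    if in_initial_scan and path in tracked:
        # Already checked into git as a regular file: leave it alone.
        return False
    # After the initial scan nothing is checked, so everything is annexed.
    return True


tracked = {"README"}  # assumed set of files already in git
print(should_annex("README", True, tracked))    # skipped during the scan
print(should_annex("README", False, tracked))   # annexed if modified later
```

The last call is the second problem from the list: once the initial scan is over, a modified regular file is treated like any new file.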
I'll probably come back to this issue, and may well find myself directly querying git's index.
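One possible middle ground, short of parsing the index directly, would be to ask git about many files in a single git ls-files call instead of one call per file. The following is a rough Python sketch under that assumption, not something git-annex does; `parse_ls_files` and `tracked_files` are hypothetical helper names, and paths are assumed to be relative to the top of the repository.

```python
import subprocess
from typing import Iterable, Set


def parse_ls_files(output: bytes) -> Set[str]:
    """Split NUL-separated `git ls-files -z` output into a set of paths."""
    return {
        p.decode("utf-8", "surrogateescape")
        for p in output.split(b"\0")
        if p
    }


def tracked_files(repo: str, candidates: Iterable[str]) -> Set[str]:
    """Return the candidates that are already in git's index.

    Runs git once for the whole batch. Untracked paths are simply
    absent from the output.
    """
    paths = list(candidates)
    if not paths:
        return set()
    result = subprocess.run(
        # --literal-pathspecs keeps characters like '*' from being
        # interpreted as glob patterns.
        ["git", "-C", repo, "--literal-pathspecs",
         "ls-files", "-z", "--", *paths],
        check=True,
        stdout=subprocess.PIPE,
    )
    return parse_ls_files(result.stdout)
```

A very large batch would still need to be split into chunks to stay under the operating system's limit on command-line length.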
I've started work to fix the memory leak I see when running git annex watch
in a large repository (40 thousand files). As always with a Haskell
memory leak, I crack open Real World Haskell's chapter on profiling.
Eventually this yields a nice graph of the problem:
So, it looks like a few minor memory leaks, and one huge leak. I stared at this for a while, tried a few things, and got a much better result:
I may come back later and try to improve this further, but it's not bad memory usage. It's still rather slow to start up in such a large repository, though, and its initial scan is still doing too much work. I need to optimize more.