In the light of the recent debugging of git-annex+datalad special remote tandem, I would have appreciated if there was a way to see reflection of git-annex actions beyond current --debug
: what is annex intending to do internally (not just reporting which commands to be ran etc). Most likely such higher level of debug information would generally be not needed, so probably not worth "enriching" default --debug
output, so I wonder if there could be an option or config setting which would set git-annex
on the path to provide more debug output if requested (e.g. like those on closing stdin/stdout/stderr/stderr-handler), may be it could be a config setting which would allow for DEFAULT|HEAVY|VERYHEAVY
or alike values.
I hope that I would never need to use that, but it might end up being "days-savior"
Every call to debugM does add overhead both in time to run it (checking if debugging is enabled is nonzero overhead and this will add up if it's done a few million times or whatever) and in time to write a useful explanation.
It seems to me that debugging is best added when trying to debug something, or when there's a readily available value that can be output (such as the safety proof that is debugged when dropping, or external special remote messages).
Trying to add enough useful debugging information to 60,000 lines of code that any possible thing it might do will have a useful debug trace seems like a recipe to end up with 100,000 lines of code. It would be easier to run the code in a debugger and step through it at that point.
debug-vs-not could be decided at start up once, and debug function could be set to noop if no debugging output (given the level requested) was desired. I really doubt that any reasonable number of noop function calls would add any notable burden on top of operation/interaction with external git commands and remotes.
yes, but then first you need to identify the code where to add it and know how to add it... users (myself included) would not be be able to do so that easily.
sure thing should not be added to every code line! But there would be value from knowing some critical "interface" points where e.g. hanging or incorrect operation could happen, e.g. if I saw earlier the output of what
trace.patch
you provided today I might have saved myself quite a lot of time, so sounds like a location of value to add someheavydebug()
logging.might not be that easy especially when relevant git-annex call is sandwiched deep in the hierarchy of datalad/git/git-annex/datalad/git-annex/datalad/git-annex calls through all the special remotes etc.
It would be possible to split out of AnnexState a value that is immutable, and have that part not gated behind a MVar, and so avoid the MVar lookup overhead, since it would just be part of the Annex Reader monad. And the debug action in that could just be a noop when it's disabled, and so be essentially zero overhead. One little problem is that option parsing (eg for --debug) currently runs in the Annex monad, so too late to change that immutable value.
Obviously having debug logging in a place where you just needed debug logging would have been useful. It does not logically follow that we can determine such places prior to your needing debug logging, or that we will need debug logging there later. After all, it seems to have been debugged now.
I suppose we can say that anything involving canceling a thread, or closing a file descriptor connected to a pipe, is fairly delicate and so it makes sense to wrap and debug log those. (Which were the things needing debug logging in your case.) But the only information available to display is "thread N" or "FD N", so it would also need to debug log each thread spawn and FD open point in order to work out which are interesting. (Or wrap the values with more information, but that rapidly gets ugly.)
Anyway, I rarely find myself adding debugM calls when debugging git-annex, generally just playing with the test case rapidly lets a theory be formed about what's happening, and then do something to test the theory, and I'm probably most of the way to a fix or at least understanding the problem.
The main way the current debug output is useful to me is it often hints at the proximate location of the problem, because of whatever git command git-annex ran just before it. It would certianly be useful to have additional debugM wherever there's a lot of code that does not involve anything that currently logs.. But I don't know a good way to identify such stretches of code.
I benchmarked calls to debugM when no debug output is enabled, and it looks like around 1 million can be made per second on my netbook. So if a 1% slowdown due to debugging is acceptable (which seems a bit above the upper bound to me), that would allow 10 thousand debugM calls per second.
Current call rate is probably always under 10 per second. Keeping it out of tight loops would probably be sufficient.
Once place I sometimes wish had debugging is reads from the git-annex branch. But running eg "git-annex get --from foo" in a repository with many thousands of files, none of them in foo, does tends of thousands of reads per second in a tight loop. Benchmarked that command with debugM of reads added, and it was 0.05% slower than without debugM.
So, to add debugM there, would need to do what comment #3 discussed.
Re comment #3, it occurs to me that currently anything that accesses AnnexState does an MVar access, which will have around the same overhead as a call to debugM. But much more frequently used.
So, anything in AnnexState that is frequently used and is immutable and could be split out to Reader will avoid that MVar access and be a free performance win. Good reason to split AnnexState, whether or not it's used to speed debugging.
I've implemented --debugfilter which can select the debug output you want to see.
Also started splitting out the immutable parts of AnnexState, but that does not yet include debug options, due to the way option parsing currently works. Once that gets done, there's a new fastDebug that can be used inside tight loop code paths.
fastDebug is implemented.. Benchmarking, I get:
So I was right that fastDebug would be faster, but I guess there's still some small overhead to check the data structure to see if debugging is enabled, and it may not be possible to avoid that. (Maybe unboxing would help?)
I'm not bothered by such a small slowdown; git-annex has been sped up a lot in recent times and it's ok to pay that. (Although I actually finessed it by not debugging cache hits.) But cumulative slowdowns from adding debugging lots of places that end up all being used in the same tight loop is still something to keep an eye on.
With that said, it should be easy to add additional debug info to most parts of git-annex now, so get in touch if you have something specific you want debug info for. I'm closing this todo here.
(See also, move readonly values to AnnexRead)