As seen in https://github.com/datalad/datalad/pull/6510 datalad is having to parse error messages to determine when nonexistant filenames were passed to git-annex. It would be better if there was some kind of API that it could use, since error messages can (and have) changed.
What would this API look like? It seems it could not be a special exit status of git-annex, because exit status 101 is already allocated for indicating when things like --size-limit and --time-limit have caused it to skip processing certian files.
Perhaps something like --json-error-unmatch
which would make it output
an additional line with a JSON record indicating that git ls-files
--error-unmatch had errored out.
Or --json
could enable this additional JSON record output.
The advantage would be that datalad would not need to check the git-annex
version to see if the option is supported. It would need to output a
JSON record even when git-ls-files did not fail. Perhaps something like
this:
{"git-ls-files exit status": 0}
{"git-ls-files exit-status": 1}
This does risk breaking things that parse the existing JSON and fall over on the new record, but I think git-annex should be free to add new records and fields to its JSON output in general, and it has probably at least added new fields before. --Joey
Although sounds logical, FWIW, not sure if DataLad would be robust to such an assumption since I don't think it was exercised before and indeed all returned records were typically associated with some particular
"path"
so our code might rely on that. Also it makes it a bit unclear on what datalad code should do with "unrecognized" outputs -- are they sign of a problem, or could be safely ignored?Ideally, IMHO and IIRC, DataLad should not concern itself with exit codes of
git
commands whichgit-annex
ran -- it should be up togit-annex
to decide on what any particular underlyinggit
invocation supposed to do etc. I guess only in the cases of some faulty, unknown to git-annex how to handle, behavior where git-annex itself reports an error, might be worth including details on underlying failedgit
invocation in the output, so datalad also could channel it to the user.But datalad is apparently concerning itself with just that, since there was a bug when git-annex's "not found" stderr changed.
I do think this needs to have a good motivation before it's implemented, and that's currently all I have, that datalad for some reason cares about it. Maybe there is not a good reason for datalad to care though..
You may well be right about adding new JSON objects breaking parsers.
I see that datalad parses the error messages to get the filename that caused it, and so an exit status is not enough information.
(See
./datalad/support/annex_utils.py
_get_non_existing_from_annex_output
)So the additional json would need to include the filename that didn't exist.
this closed todo argues persuasively that --json-error-messages cannot be extended to include these errors as-is. In particular, putting them in the current json would mean outputting a record without a "key" field in situations where that field is always present, so json consumers might misbehave without it.
bpoldrack makes a good point here:
So eg:
That seems about right to me, and it future proofs git-annex to be able to report other exceptions in json output later on.
What should enable that new json output for exceptions, --json-error-messages or a new option like --json-exceptions?
The risk is that, since this is a new object type, it breaks a json consumer that expects to find a
{"command":...}
object.I do think that git-annex should be free to add new fields to existing json objects. But adding a new object type seems considerably more risky. git-annex has never done that before (eg --json-progress outputs the "command" object with updated values).
I took a look at datalad's parsing of json from git-annex, to see what it might do if it got a new object type unexpectedly. I'm not convinced it wouldn't misbehave. For example, in
datalad/interface/results.py
annexjson2result
:Without a "status" field this will yield "error" which could go on to cause unexpected behavior
So, I think this would need to be a new --json-exceptions option.
Yes -- having some UNIQUEIDs for errors might be quite great!
As for specific field to identify what that error applicable to: the other changed behavior which broke our code/tests was the change to behavior of
annex info
to no longer list unknown to annex path in"file"
and rather list it in"input"
for which the PR was cooked up already. So may be here it also just should be"input"
field ifannex
desires to not make claims on either it is a"file"
or may be some "pathspec" (well -- in DataLad nobody was brave enough to open that -- path-vs-pathspec -- pandora box see issues/6933)?Just for completeness -- do we really need a new type of record for this? In immediate use case which is in the heart of the path escaping change discussion let's consider other commands. E.g.
addurl
already reports--json-errors
just fine in a similar caseso why just not to make
add
'sinto
or if you want to expand with unambigous ID:
or alike without requiring some new CLI option?
Oh well spotted yarikoptic! I wish I had noticed your comment 2 hours ago, but I was head down implementing --json-exceptions.
Ok, so
addurl
doesgiveup "bad url"
and that does indeed result in json output that lacks akey
and has"file":null
as well.I'm sure that somewhere in
git-annex add
, it's possible for it togiveup
with an error too. Oh of course... a device file!Other perhaps more likely cases where add can
giveup
include when it's unable to remove all write perms due to an xattr, and probably some permissions problems.So, json consumers of add already have to deal with the
key
being missing.Now, it might be that some git-annex commands don't
giveup
, so this would be a new complication for consumers of their json. But so would adding agiveup
for any reason to a command, and I don't worry about that.While I've implemented --json-exceptions, I don't like the complexity, so this new information inclines me to rip it back out and instead handle the case of the nonexistant file like:
Ok, implemented the simple alternative. Here's how it looks:
The errorid will remain stable. I can add those to other error messages now, on request BTW.
Note that when git-annex relies on
git ls-files --error-unmatch
to complain about nonexistant or non-git files, the error messages from git will still be displayed to stderr, not this nice json. So datalad will need to keep its parser for that part."error-messages"
, would it may be make sense to name new one"error-id"
and not"errorid"
for consistency?I decided to rename it "message-id" since it could go on warning or other messages too conceivably.