Please describe the problem.
our datalad tests started to fail recently (in this PR is the effort to troubleshoot etc).
Here is what we see with recent version using such simple script:
#!/bin/bash
export PS4='> '
set -eu
set -x
cd "$(mktemp -d ${TMPDIR:-/tmp}/dl-XXXXXXX)"
git init
git annex init
n='gl\orious'
# touch "$n"
git annex add --json --json-error-messages "$n"
that now
❯ ( source /home/yoh/git-annexes/10.20230407+git63-g3d1d77a1bb.env ; bash escaped.sh )
>> mktemp -d /home/yoh/.tmp/dl-XXXXXXX
> cd /home/yoh/.tmp/dl-OAXQ1CE
> git init
Initialized empty Git repository in /home/yoh/.tmp/dl-OAXQ1CE/.git/
> git annex init
init ok
(recording state in git...)
> n='gl\orious'
> git annex add --json --json-error-messages 'gl\orious'
git-annex: "gl\\orious" not found
add: 1 failed
so we get \\
instead of \
in the output printed by git-annex
previously was all fine
❯ ( source /home/yoh/git-annexes/10.20230407.env ; bash escaped.sh )
>> mktemp -d /home/yoh/.tmp/dl-XXXXXXX
> cd /home/yoh/.tmp/dl-1TzrWdi
> git init
Initialized empty Git repository in /home/yoh/.tmp/dl-1TzrWdi/.git/
> git annex init
init ok
(recording state in git...)
> n='gl\orious'
> git annex add --json --json-error-messages 'gl\orious'
git-annex: gl\orious not found
add: 1 failed
The next release is going to escape and quote filenames that contain special characters similarly to how git does. (But json will not be affected due to already being escaped.)
This will affect filenames output in error messages. So if you are parsing error messages or non-json output with filenames that contain characters that need to be escaped, you will need to deal with the change.
See terminal escapes in filenames for the full details of this change.
One thing I noticed about your example is that git-annex add doesn't display that particular filename the same as git add does:
But, that is an inconsistency in git itself. More commonly it uses the same display as git-annex for this filename:
So I don't think there's a problem with git-annex's behavior here. With that said, we can talk about adding something to make back-compatability easy for you, or whatever. An config like core.quotePath but that also affects special characters, not just unicode, for example.
hm, I didn't look inside
git
butgit diff
is likely to have it escaped becausepatch
(and/or other unified diff operating tools) expect it such. In other words --git diff
must encode paths escaped because the "diff standard" expects it such.On the other hand, as you confirmed,
git add
just displays the name on the screen, and as such it does not bother escaping it since may be I just cut/paste it as a string which is "raw" and thus not expecting any escape characters.RTFMing git-config on core.quotePath I spotted
so it talks about double-quotes.
git
status
,diff
report paths in double ("
) not single ('
) quotes. I wonder if that is where/howgit
is consistent since in your example that is the difference too:that git uses
'
(and does not escape) while git annex uses"
(and escapes)? Did you see git doing escaping in paths where it reports them within single ('
) quotes?and thus git-annex should have just wrapped in
'
to become consistent with git in :git escapes filenames like this extensively:
This message from
git add
escapes slightly differently, but it still escapes some characters:Git only does this type of escaping when displaying a fatal error (it's
vreportf
in the git source, used by things likedie
). It's basically a last-ditch filtering of a string, which may contain a filename or other untrusted data, to avoid displaying escape characters. git-annex does contain such a last-ditch filtering too (safeOutput) but type safety let me avoid needing to use it to handle this filename here. I don't think it's at all necessary for git-annex to be bug-for-bug equivilant with git in its display of error messages; but it is important that it escape somehow. Git's double-quoted escaping is documented, and this other escaping is not.Since either behavior would be a behavior change from before when git-annex didn't escape the filename in the error message with either method, it seems to me either one would likely break your assumption. So I don't know why you're arguing for one way over the other way.
I am "arguing" because ideally I prefer not to handle some not quite standardized un-escaping.
_SSH_ESCAPED_CHARACTERS = '\\#&;
|*?~<>^()[]{}$\'" '`. The same here?add --json --json-error-messages
to produce a proper machine readable json record for "unknown" input, or there is a reason why there is no json record here in--json
mode?Also I just wanted to make sure that we are not missing some aspect, like what I felt was an unnoticed (to me at least) difference between
'
and"
escaping methods, and possibly some original reason on whygit
has them different in those cases -- may be there is some good reason?I dug into your code, and datalad is parsing git-annex's (and perhaps git's in some cases) stderr to find error messages like this one for files that don't exist, and then it internally dummies up something as if git-annex were outputting a --json-error-messages record for the file. See
./datalad/support/annex_utils.py
_get_non_existing_from_annex_output
Ok, I can understand now how needing to do an additional form of unescaping on top of that existing pain point would cause the reaction I have seen in this bug report.
api for telling when nonexistant or non git files passed is a todo item I opened the last time I became aware of this error message parsing ugliness. (Also relevant is this closed todo where I discuss why --json-error-messages cannot include these errors as-is.)
So the choice is between implementing api for telling when nonexistant or non git files passed and changing datalad to use that. Or adding a git config that avoids escaping filenames. The latter would be easy to do (and easier for datalad to use), but it kicks the can down the road. Datalad parsing error messages would continue to be a problem going forward. (Imagine if git-annex gets localized error messages..)
Note that I should avoid releasing git-annex with the added escaping until this is addressed.
api for telling when nonexistant or non git files passed is implemented now.
In datalad, all you should need to do now is check for a json object with
errorid:"FileNotFound"
and thefile
field is the name of the file.Note that the parser for error messages like "did not match any file(s) known to git" from
git ls-files --error-unmatch
will still be needed in datalad.I'm going to leave this open as a git-annex release blocker until the necessary changes get made to datalad.
Also yarik mentioned this change https://github.com/datalad/datalad/pull/7372/commits/45ddd4b12ff637c6c77e982225c0e9d9eb53c1b6 which was caused by a0e6fa18eb3c16c3c8079bb41c18151e6ea8b554, which was part of my series for escaping control characters.
I think git-annex needs to be returned to the old behavior there, even though
git-annex info dne
is not technically operating on a file when it doesn't exist.Update: Fixed this.
This has been blocking git-annex release for the past month.
Apparently datalad has mostly been updated now. https://github.com/datalad/datalad/pull/7372 is still not merged, but I'm not clear if the 2 failed tests there are caused by this change or something else.
I'm feeling now that I've waited long enough for datalad. I've closed this bug and don't consider it as blocking release.