Please describe the problem.
From our inspection, the annex metadata call relies on ls-files to determine whether a file is known to annex. But if an addurl --batch process is running and adding those files, git is not immediately aware of them, so git annex metadata returns "peacefully" without actually performing any operation on metadata.
What steps will reproduce the problem?
This is an extract from what we are doing in datalad, so it might be insufficient (let us know if so).
- run git annex addurl --batch and add a file
- run git annex metadata to assign some metadata to the just addurl'ed file
What version of git-annex are you using? On what operating system?
6.20180316+gitg308f3ecf6-1~ndall+1
Please provide any additional information below.
$> git -c receive.autogc=0 -c gc.auto=0 annex --debug metadata --json -s verbal_iq+=116 -s performance_iq+=89 -s age+=16.77 -s handedness+=ambidextrous -s site_id+=pitt -s session_count+=1 -s dsm_iv_tr+=autism -s sex+=Male -s project+=abide_initiative -s full_iq+=103 -s MRI+=yes -s diagnosis+=autism -s participant_id+=50002 -s session_id+=1 -s species+=homo-sapiens -- sub-50002/ses-1/T1_rep-0.mgz
[2018-03-27 11:25:15.953624157] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","ls-files","--cached","-z","--","sub-50002/ses-1/T1_rep-0.mgz"]
[2018-03-27 11:25:15.957761277] process done ExitSuccess
# no error exit code
$> px | grep addurl.*batch
yoh 26767 0.0 0.0 17892 2184 pts/16 T 11:05 0:00 | \_ /usr/lib/git-annex.linux/exe/git --library-path /usr/lib/git-annex.linux//usr/lib/x86_64-linux-gnu/gconv:/usr/lib/git-annex.linux//usr/lib/x86_64-linux-gnu/audit:/usr/lib/git-annex.linux//etc/ld.so.conf.d:/usr/lib/git-annex.linux//lib64:/usr/lib/git-annex.linux//usr/lib/x86_64-linux-gnu:/usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git/git -c receive.autogc=0 -c gc.auto=0 annex addurl --with-files --json --batch
yoh 26790 0.0 0.3 1074180296 63604 pts/16 Tl 11:05 0:00 | \_ /usr/lib/git-annex.linux/exe/git-annex --library-path /usr/lib/git-annex.linux//usr/lib/x86_64-linux-gnu/gconv:/usr/lib/git-annex.linux//usr/lib/x86_64-linux-gnu/audit:/usr/lib/git-annex.linux//etc/ld.so.conf.d:/usr/lib/git-annex.linux//lib64:/usr/lib/git-annex.linux//usr/lib/x86_64-linux-gnu:/usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git-annex/git-annex addurl --with-files --json --batch
P.S. It might be a related observation that git-annex metadata does exit with a non-zero exit code whenever it is run on a non-existing file, but exits with 0 (without performing any action) when run on a file that exists but is not known to git. I wonder whether there could be, if not a change in behavior, then a flag to make annex metadata PATHs exit with a non-zero code if it did not handle some of the provided path(s). Then we could use it within our "set metadata" to guarantee that we do not omit any file for which we expected the metadata operation to be performed.
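Until such a flag exists, a caller could detect skipped files itself by comparing the paths it passed against the records printed by --json. The following is only a sketch of that idea. The function name unhandled_paths is made up for illustration, and it assumes that every processed file produces one JSON object with a "file" field and that a skipped file produces no output at all (as in the debug log above). It only works when files are listed explicitly, not when directories are passed.

```python
import json


def unhandled_paths(requested, json_lines):
    """Return the requested paths that never show up in the --json output.

    requested:  paths that were passed to `git annex metadata --json`
    json_lines: the lines the command printed on stdout, one JSON object each
    """
    reported = set()
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumption: each record names the file it describes in "file".
        if record.get("file") is not None:
            reported.add(record["file"])
    # Keep the caller's ordering so error messages are predictable.
    return [path for path in requested if path not in reported]
```

A wrapper could raise an error whenever the returned list is non-empty, which gives the "do not omit any file" guarantee on the caller's side.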
Setting annex.skipunknown to false will make git-annex error out in this situation. That will become the default in a couple of years, but it can already be set by those who don't like the skipping behavior.
In the case of addurl --batch though, do see my first comment for a way to avoid any errors.
I guess you expected annex.queuesize to be set to 1. However, that would mean every single file that addurl adds needs the whole git index file to be rewritten so other commands can immediately know about it. Which could be very slow, which is why that is not the default.
What you can do is use
git annex addurl --batch --json
and observe the key that it reports it has added. Then pass that key into
git annex metadata --batch --json
to add metadata to the key, which will work before the file ever gets added to the git index, and much more efficiently than relying on the index.
(Pretty sure this has come up before and that I suggested the same thing then.)
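As a rough illustration of that flow, here is a sketch that turns one reply line from addurl into one request line for metadata. The helper name metadata_request_for_key is hypothetical. It assumes the addurl reply is a JSON object carrying "success" and "key", and it uses the simplified batch input form with a "key" field and a "fields" mapping of lists. Note that "fields" sets the listed values rather than appending as += does, as far as I understand, which should make no difference for a key that was only just added.

```python
import json


def metadata_request_for_key(addurl_reply, fields):
    """Build one input line for `git annex metadata --batch --json`.

    addurl_reply: one JSON line printed by `git annex addurl --batch --json`
    fields:       mapping of metadata field name -> value or list of values
    """
    reply = json.loads(addurl_reply)
    if not reply.get("success", False):
        raise ValueError("addurl did not report success: %r" % (reply,))
    key = reply.get("key")
    if not key:
        raise ValueError("addurl reply carries no key: %r" % (reply,))
    normalized = {}
    for name, value in fields.items():
        # Metadata values are sent as lists of strings.
        values = value if isinstance(value, (list, tuple)) else [value]
        normalized[name] = [str(v) for v in values]
    # One JSON object per line, as batch mode reads line by line.
    return json.dumps({"key": key, "fields": normalized}, sort_keys=True)
```

The returned line would be written, followed by a newline, to the stdin of a long-running git annex metadata --batch --json process, which replies with one line per request.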
I think if there's a bug here, it's entirely about git-annex's behavior when passed a non-annexed file, or a file that is not checked into git, of silently skipping the file.
Users are fairly frequently surprised by that.
(See also [[bugs/unlock_should_warn_if_file_isn39t_in_repo]] and probably others that have been closed or handled in the forum.)
What git commit does is, if a file/directory exists but is not in git:
On the other hand "git commit $dir" just ignores such files as it recurses the directory tree (as long as something in the directory tree is known to git).
That would be fairly reasonable behavior for git-annex to have too. But it would be a behavior change. If someone is used to "git annex get foo*" getting all annexed files, but skipping the "foo~" temp file that is not in git, then they would have to change scripts and workflows.
Implementing it may be as simple as passing --error-unmatch to git ls-files (and disabling git-annex's code that checks for parameters that are not existing files).
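To show what --error-unmatch gives you, here is a sketch of the same check done from a calling script rather than inside git-annex. The function name known_to_git and the injectable run parameter are made up for illustration. The only thing relied on is that git ls-files --error-unmatch exits non-zero when the given path matches nothing in the index.

```python
import subprocess


def known_to_git(path, run=subprocess.run):
    """Return True if `git ls-files --error-unmatch` accepts the path."""
    result = run(
        ["git", "ls-files", "--error-unmatch", "--", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    # git exits non-zero when the pathspec matches nothing in the index.
    return result.returncode == 0
```

The run parameter exists only so the check can be exercised without a repository. In normal use it is left at its default and the function is called from inside the work tree.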
It could be an option, but I don't really consider an option as fixing the surprising behavior. And once you know git-annex behaves this way, I think it's rarely surprising, so the benefit of an option may not justify adding one. I'd rather remove surprising behavior, if possible, than add an option to paper over it.
I think this would need a transition plan. Eg:
Also, it would need to be decided what to do about files that are checked into git but are not annexed files. It seems to make sense for git-annex get and drop of "foo*" to ignore "foo.txt" that is not annexed. But what about git-annex metadata on the same file? Could be argued that is throwing away the provided metadata, so should error, the same as if the file was not checked into git at all. I don't like the behavior varying between subcommands, so if metadata should error, so should get and drop.
There's code that currently skips those files and could error. It would need to remember the input list of files, and check if the non-annexed file was explicitly listed.
Oh, but what about the case where the non-annexed file is in a directory and the directory is explicitly listed and contains no other annexed files? Seems it ought to error for consistency there, but not if the directory does contain another file that is annexed, in addition to the one that is not. Implementing an error there seems to need a hash containing all the passed filenames, and then files can be deleted from it as it finds annexed files they expanded to, and at the end it can error out about any others. Kind of ugly and the hash lookups for each file would slow things down some.
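The bookkeeping described above could look roughly like this. It is a Python sketch of the idea, not git-annex code. The function name is made up, and it assumes all paths are relative to the top of the repository and use "/" as the separator.

```python
import posixpath


def params_without_annexed_files(params, annexed_files):
    """Return the parameters that expanded to no annexed file at all."""
    # Hash of every passed parameter, keyed by its normalized form.
    pending = {posixpath.normpath(p): p for p in params}
    for path in annexed_files:
        if not pending:
            break  # every parameter is already accounted for
        # An annexed file satisfies the parameter naming it directly
        # and every listed directory above it.
        candidate = posixpath.normpath(path)
        while candidate not in ("", ".", "/"):
            pending.pop(candidate, None)
            candidate = posixpath.dirname(candidate)
        # Relative paths all live under ".", so "." is satisfied too.
        pending.pop(".", None)
    return sorted(pending.values())
```

Whatever is left at the end is what the command would error out about. The walk up the directory chain is the per-file cost mentioned above. The early exit once the hash is empty keeps that cost small when only a few parameters are passed.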
I think that users are less frequently bitten by git-annex ignoring non-annexed files though.
Implemented annex.skipunknown git config, that will make it error out when given a file that git doesn't know about.
Not the default yet; it will be in a couple of years (see: complete annex.skipunknown transition in 2022).
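For callers that want the erroring behavior now without touching the repository's config, the setting can be passed per invocation with git -c, the same way the report passes gc.auto=0. A sketch, with a made-up helper name, assuming a git-annex version that already knows annex.skipunknown:

```python
def metadata_argv(path, additions, skipunknown=False):
    """Build the argv for `git annex metadata` on a single path.

    additions:   iterable of (field, value) pairs, each added with FIELD+=VALUE
    skipunknown: value passed for annex.skipunknown for this one call
    """
    setting = "annex.skipunknown=%s" % ("true" if skipunknown else "false")
    argv = ["git", "-c", setting, "annex", "metadata", "--json"]
    for field, value in additions:
        argv += ["-s", "%s+=%s" % (field, value)]
    # "--" keeps a path that starts with "-" from being read as an option.
    return argv + ["--", path]
```

Running the result with subprocess.run(argv, check=True) would then raise on the non-zero exit, assuming git-annex errors out as described.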
As to git-annex skipping non-annexed files, I'm leaning toward keeping it the way it is, and it's not really the subject of this bug report, except maybe that it's not entirely consistent with the annex.skipunknown behavior for non-git files. If users complain about it, I'll consider it again.