--batch was added to annex add 6 years ago, but only now did we get to using it in datalad (PR: 5722). Unfortunately, tests failed on the initial attempt, and looking into it I found that its behavior differs from non-batched annex add when given directories. That is described in the original issue, in the implementation notes, as "made it not recurse into directories". Such a limitation forbids consistent use of annex add --batch as a drop-in replacement for plain annex add without reimplementing git-annex's logic for treating dotdirs etc. to decide which files should actually be added to git-annex.
For that reason I think it would be great if add --batch would gain the "super-power" to handle directories as well.
edit: I forgot to add my lovely reproducer to show inconsistency "just in case":
$> bash annex-add-batch-dir.sh
> set -eu
>> mktemp -d /home/yoh/.tmp/dl-XXXXXXX
> cd /home/yoh/.tmp/dl-zmItF0C
> git init
Initialized empty Git repository in /home/yoh/.tmp/dl-zmItF0C/.git/
> git annex init
init ok
(recording state in git...)
> mkdir d-cmd d-batch
> touch d-cmd/file d-batch/file
> git annex add d-cmd
add d-cmd/file
ok
(recording state in git...)
> echo d-batch
> git annex add --batch
> git status
On branch master
No commits yet
Changes to be committed:
(use "git rm --cached <file>..." to unstage)
new file: d-cmd/file
Untracked files:
(use "git add <file>..." to include in what will be committed)
d-batch/
so annex add and annex add --batch behave differently on folder paths.
Comments seem clear that this should not be changed, and there are easy things users of --batch can do to recurse directories themselves.
Also I checked and the documentation of --batch for git-annex add (and other commands) does say that it reads a file from stdin, so there's at least that documentation that it does not support directories.
--json mode. But in --json mode I would expect this to be doable.
The reason it does not recurse is that doing so would make the number of lines/JSON records of output vary depending on the number of files in the directory, which would complicate parsing git-annex's output.
git-annex add is not actually special here. You can't pass a directory to git-annex drop --batch either.
It seems to me that it's not hard to recurse directories yourself, and so it's better to offload the need to do that onto users of --batch, rather than making them deal with increased parsing complexity.
Even in --json mode there would be difficulties. Consider the case of an empty directory or a directory that is all gitignored. If you tried to read a response after sending such a directory, and it did not output anything, you'd block.
I wish we were talking about "new functionality" (as in the days when add --batch was being added). Then maybe --json output could introduce some new semantics for add --json: a response carrying "file": [], signaling that no files were added for that input. For the files which were added, it would provide one response per file (it cannot populate "file" with all of them, since that could get too long).
Reflecting on my initial argument for this feature: is there a git-annex interface to announce which paths in a folder git-annex would consider for adding (in the case of add) or dropping (in relation to the comment that drop cannot do that either)? Without that, IMHO it is unfair to require a client to somehow duplicate git-annex's logic for which files/paths it would consider for that specific operation.
OK, I am doomed to retract the "unfairness" argument, since there is now annex add --dry-run, so something built on it would work.
I do think there is value in the simplicity of the current json, even if a "better" one could be constructed if we did not need to worry about backwards compatibility. A new JSON object that was only seen when recursing a directory would need to be documented, otherwise a JSON user would be likely to only implement support for the JSON objects they did see.
git-annex uses git plumbing to handle this, so it's easy to do very close to the same thing:
For most git-annex commands except add, you can get a list of files with git ls-files --cached. That will include annexed files and other files, but of course commands like drop will skip the non-annexed files, and that can be handled with the existing --batch interface.
For add, use git ls-files --others --exclude-standard
(For add, it also looks at git ls-files --modified, but you only need that if you want to add files that got modified.)
ls-files etc to first identify the files given the user-provided list of paths, so I do not expect that code to see directories that often
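The git ls-files approach suggested in the thread can be sketched in shell roughly as follows. This is only a minimal illustration: the function names list_add_candidates and list_tracked are hypothetical helpers, not part of git or git-annex, and the result is meant to be close to, not guaranteed identical to, what a non-batched git-annex command would select.

```shell
# Hypothetical helpers; the names are illustrative and not part of git-annex.

# List files under the given paths that are untracked and not ignored,
# i.e. roughly the new files a plain `git annex add PATH` would consider.
list_add_candidates() {
    git ls-files --others --exclude-standard -- "$@"
}

# List files under the given paths that git already tracks. This includes
# non-annexed files, which commands such as `drop` are said to skip.
list_tracked() {
    git ls-files --cached -- "$@"
}
```

With these, a directory could be expanded on the client side before batching, for example `list_add_candidates d-batch | git annex add --batch`, so that each input line is a single file and the one-response-per-input property of --batch is preserved.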