The Problem
Apparently, .gitattributes-based configuration (of e.g. numcopies, largefiles, addunlocked (not even implemented due to the inefficiency), etc.) is slow, as every file needs to be queried individually for its attributes (git check-attr under the hood, I guess).
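To make the cost concrete, here is a rough Python sketch of what querying attributes through git check-attr --stdin -z involves. It is not how git-annex does it internally; the function names parse_check_attr_z and check_attrs are made up for illustration. The only git behaviour assumed is the documented -z output format of repeated path, attribute, value records, each terminated by NUL.

```python
import subprocess
from typing import Dict, Iterable, List


def parse_check_attr_z(output: bytes) -> Dict[str, Dict[str, str]]:
    """Parse `git check-attr -z` output: <path> NUL <attr> NUL <value> NUL, repeated."""
    fields = output.split(b"\0")
    # The final NUL terminator leaves one empty field at the end; drop it.
    if fields and fields[-1] == b"":
        fields.pop()
    if len(fields) % 3 != 0:
        raise ValueError("unexpected check-attr output")
    result: Dict[str, Dict[str, str]] = {}
    for i in range(0, len(fields), 3):
        path, attr, value = (
            f.decode("utf-8", "surrogateescape") for f in fields[i:i + 3]
        )
        result.setdefault(path, {})[attr] = value
    return result


def check_attrs(paths: Iterable[str], attrs: List[str],
                repo: str = ".") -> Dict[str, Dict[str, str]]:
    """Hypothetical helper: ask git for some attributes of many paths at once."""
    stdin = b"".join(p.encode("utf-8", "surrogateescape") + b"\0" for p in paths)
    proc = subprocess.run(
        ["git", "-C", repo, "check-attr", "--stdin", "-z", *attrs],
        input=stdin, stdout=subprocess.PIPE, check=True,
    )
    return parse_check_attr_z(proc.stdout)
```

Inside a repository, check_attrs(["a.txt", "data/b.bin"], ["annex.largefiles", "annex.numcopies"]) would return one dictionary of attribute values per path. Every value still makes a round trip through a pipe to a separate process.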
The Motivation
From a user's perspective, .gitattributes-based configuration has several benefits over the git annex config --set annex.... approach:
- .gitattributes can differ between branches.
- .gitattributes lists file name matches in a much more readable way, while e.g. git annex config --set annex.largefiles 'include=*.txt and include=*.md and include=*.bla and mimetype shenanigans and largerthan and whatnot...' gets confusing quickly.
- .gitattributes nests well in subdirs, enabling quite concise and fine-grained control (e.g. all files in THAT folder should be annexed, but if I delete the folder at some point, never mind, my git config --get annex.largefiles won't stay cluttered with that path config).
Furthermore, Datalad relies on .gitattributes configuration to specify the backend and e.g. the text2git procedure.
Suggestion
Couldn't the separate-git-tree-for-diffing technique you used lately to speed up repeated imports be used to cache .gitattributes for all (or relevant) files in a git tree (e.g. have the same paths in that tree, but with the file contents being the attributes)? Querying the attributes would then be a matter of querying this tree, and updating them would just require re-querying the touched paths.
One problem I see with this, though, is that it wouldn't be possible to cache the user's .git/info/attributes settings, which can change independently.
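The caching idea can be sketched roughly as follows. This is only an illustration: it uses a plain in-memory dictionary instead of a git tree, and the class name ResolvedAttrCache and the injected query function are hypothetical. It also assumes that a change to a .gitattributes file should invalidate every cached path below that file's directory, since those paths may now resolve differently.

```python
from typing import Callable, Dict


class ResolvedAttrCache:
    """Cache resolved attributes per path; re-query only what was touched."""

    def __init__(self, query: Callable[[str], Dict[str, str]]):
        # `query` stands in for the slow lookup (e.g. via git check-attr).
        self._query = query
        self._cache: Dict[str, Dict[str, str]] = {}

    def get(self, path: str) -> Dict[str, str]:
        """Return cached attributes, running the slow query only on a miss."""
        if path not in self._cache:
            self._cache[path] = self._query(path)
        return self._cache[path]

    def touched(self, path: str) -> None:
        """Forget cached entries that a change to `path` may have made stale."""
        directory, _, name = path.rpartition("/")
        if name == ".gitattributes":
            # Everything at or below this directory may resolve differently now.
            prefix = directory + "/" if directory else ""
            for cached in [p for p in self._cache if p.startswith(prefix)]:
                del self._cache[cached]
        else:
            self._cache.pop(path, None)
```

The sketch shares the limitation mentioned above: it has no way to notice changes to .git/info/attributes.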
I think that the most likely way to speed it up is for git-annex to include its own .gitattributes parser. It could then cache .gitattributes files; caching them in memory for the duration of a single command would probably be sufficient.
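As a very rough illustration of that idea (in Python rather than Haskell, and not a proposal for the actual implementation), the sketch below parses each .gitattributes file at most once and answers lookups from memory. The names parse_gitattributes, _matches and AttrCache are invented here. Pattern matching is deliberately simplified: it uses fnmatch, which does not reproduce git's pattern semantics. Quoted patterns, macros and .git/info/attributes are ignored as well.

```python
import fnmatch
import os
from typing import Dict, List, Tuple

Rule = Tuple[str, Dict[str, str]]


def parse_gitattributes(text: str) -> List[Rule]:
    """Parse the basic `pattern attr -attr !attr attr=value` line format."""
    rules: List[Rule] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        pattern, *tokens = line.split()
        attrs: Dict[str, str] = {}
        for tok in tokens:
            if tok.startswith("-"):
                attrs[tok[1:]] = "unset"
            elif tok.startswith("!"):
                attrs[tok[1:]] = "unspecified"
            elif "=" in tok:
                name, value = tok.split("=", 1)
                attrs[name] = value
            else:
                attrs[tok] = "set"
        rules.append((pattern, attrs))
    return rules


def _matches(pattern: str, rel_path: str) -> bool:
    # Simplification: fnmatch only approximates git's matching rules.
    if "/" in pattern:
        return fnmatch.fnmatchcase(rel_path, pattern.lstrip("/"))
    return fnmatch.fnmatchcase(rel_path.rsplit("/", 1)[-1], pattern)


class AttrCache:
    """Read each directory's .gitattributes once; serve lookups from memory."""

    def __init__(self, repo_root: str):
        self.repo_root = repo_root
        self._files: Dict[str, List[Rule]] = {}

    def _rules_for_dir(self, rel_dir: str) -> List[Rule]:
        if rel_dir not in self._files:
            path = os.path.join(self.repo_root, rel_dir, ".gitattributes")
            try:
                with open(path, encoding="utf-8") as fh:
                    self._files[rel_dir] = parse_gitattributes(fh.read())
            except (FileNotFoundError, NotADirectoryError):
                self._files[rel_dir] = []
        return self._files[rel_dir]

    def lookup(self, rel_path: str, attr: str) -> str:
        parts = rel_path.split("/")
        value = "unspecified"
        # Walk from the repository root down to the file's own directory, so
        # that deeper files and later lines override earlier matches.
        for depth in range(len(parts)):
            rel_dir = "/".join(parts[:depth])
            remainder = "/".join(parts[depth:])
            for pattern, attrs in self._rules_for_dir(rel_dir):
                if attr in attrs and _matches(pattern, remainder):
                    value = attrs[attr]
        return value
```

With this, AttrCache(".").lookup("data/notes.txt", "annex.largefiles") touches the filesystem only the first time each directory is visited. The hard part is everything the sketch leaves out, above all matching patterns exactly the way git does.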
Stracing git check-attr --stdin shows it has a small in-memory cache. And it is pretty fast. The problem is that its batch interface is not well suited to querying multiple different gitattributes, and the roundtrips through stdin are not very fast compared with what could be a very quick in-memory calculation.
See "annex.addunlocked in gitattributes" for a currently rejected todo that it would probably make sense to revisit if this were implemented. That todo's comments also have some information about gitattributes query speed and other arguments in favor of supporting them for more stuff, including https://github.com/datalad/datalad/issues/5383#issuecomment-770108778.
I've always felt a gitattributes parser might be worth doing. But the pattern syntax used by git is pretty complicated, and it would need to imitate it perfectly.
Anyway, if a patch doing this landed in my inbox (or someone wanted to fund a medium-sized project), I think I'd accept it. So I'll confirm this todo.