The Problem

Apparently, .gitattributes-based configuration (of e.g. numcopies, largefiles, or addunlocked, the latter never even implemented because of this inefficiency) is slow, since every file needs to be queried individually for its attributes (presumably git check-attr under the hood).
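For what it's worth, git check-attr can already be batched over stdin, which removes the one-process-per-file cost even if it doesn't make the attribute lookups themselves cheaper. A minimal sketch in a throwaway repo (the attribute value is hypothetical):

```shell
# Throwaway repo for illustration:
tmp=$(mktemp -d) && cd "$tmp" && git init -q .
printf '*.txt annex.largefiles=nothing\n' > .gitattributes
touch a.txt b.dat
git add -A

# The slow pattern: one git check-attr process per file.
for f in $(git ls-files); do
    git check-attr annex.largefiles -- "$f"
done

# Batched: a single process answers for all paths at once,
# NUL-delimited on both input and output.
git ls-files -z | git check-attr -z --stdin annex.largefiles
```

Whether git-annex already batches like this internally I don't know; if it does, the remaining cost is the attribute resolution itself, which is what the caching idea below would address.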

The Motivation

From a user's perspective, .gitattributes-based configuration has several benefits over the git annex config --set annex.... approach:

  • .gitattributes can differ between branches
  • .gitattributes lists filename patterns in a much more readable form, while e.g. git annex config --set annex.largefiles 'include=*.txt and include=*.md and include=*.bla and mimetype shenanigans and largerthan and whatnot...' gets confusing quickly.
  • .gitattributes nests well in subdirectories, enabling quite concise and fine-grained control (e.g. all files in THAT folder should be annexed; if I delete the folder at some point, the rule vanishes with it, and the output of git config --get annex.largefiles won't stay cluttered with that path's configuration)
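For illustration, the long include=... expression and the per-folder override from the bullets above could be expressed like this (attribute values are hypothetical, not a recommendation):

```
# top-level .gitattributes: keep text-ish files in git
*.txt annex.largefiles=nothing
*.md  annex.largefiles=nothing
*.bla annex.largefiles=nothing

# that-folder/.gitattributes: annex everything in here
* annex.largefiles=anything
```

Deleting that-folder/ deletes its rule along with it, with no global configuration left behind.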

Furthermore, Datalad relies on .gitattributes-based configuration to specify the backend and, e.g., to implement its text2git procedure.

Suggestion

Couldn't the separate-git-tree-for-diffing technique you used recently to speed up repeated imports be used to cache the .gitattributes values for all (or the relevant) files in a git tree (i.e. a tree with the same paths, but where each file's contents are its resolved attributes)? Querying the attributes would then be a matter of querying this tree, and updating them would only require re-querying the touched paths.
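As a rough illustration of the idea, pure git plumbing can build such a cache tree in a throwaway repo (this is not git-annex's actual internals; the layout and attribute value are my assumptions):

```shell
# Throwaway repo for illustration:
tmp=$(mktemp -d) && cd "$tmp" && git init -q .
printf '*.txt annex.largefiles=nothing\n' > .gitattributes
touch a.txt b.dat
git add -A

# Build a cache tree: same paths as the worktree, but each blob's
# content is that path's resolved attributes. Use a separate index
# so the real one stays untouched.
files=$(git ls-files)
export GIT_INDEX_FILE=.git/attr-cache-index
for f in $files; do
    blob=$(git check-attr -a -- "$f" | git hash-object -w --stdin)
    git update-index --add --cacheinfo "100644,$blob,$f"
done
cachetree=$(git write-tree)
unset GIT_INDEX_FILE

# Looking up a path's attributes is now a single object read:
git cat-file blob "$cachetree:a.txt"
```

Updating the cache after a change to some .gitattributes file would then mean re-running check-attr only for the paths it can affect and writing a new tree, which git stores cheaply since unchanged subtrees are shared.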

One problem I see with this though is that it wouldn't be possible to cache the user's .git/info/attributes settings, which can change independently of any tree.