Recent comments posted to this site:

Both problem cases involve git add'ing already annexed files. So if the new git add behavior could be limited to already-annexed files, these problem cases would be addressed, without creating the problems discussed above. Since already "git-annex abuses the fact that git provides the clean filter with the work tree filename, and reads and cleans the file itself", the work tree filename is known. The question is how to know, when git add calls git-annex-clean, which files are already annexed.

"Suppose you have a mixture of unlocked files and files that are added directly to git, and you've modified several of them. Now, if you run git commit -a, you would surely hope that the annexed ones stay annexed and don't get committed directly to git. Well, git add . ; git commit is normally equivilant, so it should behave the same. It follows that git add does need to add some files to the annex." -- for the unlocked files, the version in the index would be the pointer file, so git-annex would know what they are.

"Suppose you have an unlocked file in your repo, and you rename it (not using git move), and then git add it." -- catching this requires keeping track of inodes of unlocked files. But since already "git-annex installs post-checkout, post-merge, and pre-commit hooks, which update the working tree files to make content from git-annex available", the hooks could do this, maybe with a Bloom filter? You'd only consult the Bloom filter if the git index entry isn't there and file matches annex.largefiles. Or maybe the inode info in the git index could be used.

Comment by Ilya_Shlyakhter Tue Oct 22 20:01:51 2019

Instrumenting, this is fed into git mktree:

160000 blob 6941fd9c7ad9640f75a02c993245b8de784105e1\tqux\NUL

So the problem is it's got the mode for a submodule, but the wrong type, blob.

Internally, git-annex has generated this, which seems wrong. It should be a TreeCommit.

TreeBlob (TopFilePath {getTopFilePath = "qux"}) 57344 (Ref "6941fd9c7ad9640f75a02c993245b8de784105e1")

(57344 = 160000 oct)
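
As a quick check of the arithmetic and of what the line fed to git mktree should look like, here is a small sketch. The helper names `expected_type` and `mktree_line` are made up for illustration; only the `<mode> <type> <object>TAB<file>` layout and the 160000 gitlink mode come from git itself.

```python
GITLINK_MODE = 0o160000  # mode git uses for submodule (gitlink) entries
TREE_MODE = 0o040000     # mode git uses for subtree entries


def expected_type(mode: int) -> str:
    """Object type that should accompany a tree entry of this mode."""
    if mode == GITLINK_MODE:
        return "commit"
    if mode == TREE_MODE:
        return "tree"
    return "blob"


def mktree_line(mode: int, sha: str, path: str) -> str:
    """Format one entry the way git mktree expects it (without the NUL)."""
    return f"{mode:06o} {expected_type(mode)} {sha}\t{path}"


# The decimal value from the TreeBlob dump is the gitlink mode.
assert 57344 == GITLINK_MODE

print(mktree_line(57344, "6941fd9c7ad9640f75a02c993245b8de784105e1", "qux"))
```

With the type derived from the mode, the entry for qux comes out as a commit rather than a blob.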

So need to debug the parsing of the input git tree next..

Comment by joey Tue Oct 22 19:43:21 2019
Reproduced using Felix's recipe, with current git-annex.
Comment by joey Tue Oct 22 19:30:57 2019

inode based clean filter for less surprising git add seems to open the door to adding such a config as this.

Although, if that were implemented, I suspect that demand for such a config might dry up..

Comment by joey Tue Oct 22 18:29:44 2019
"It is entirely reversable by undoing the git config changes" -- just to note, before undoing the git config changes, you'd want to lock any unlocked files (after first committing any pending changes to them).
Comment by Ilya_Shlyakhter Tue Oct 22 18:25:11 2019

Several commenters seem to be under the misapprehension that git add of a modified file that is stored in git before will annex the new version. It does not. That case is already handled, by git-annex noticing if the old file was annexed, and if not, letting git add it to git as usual (unless annex.largefiles is configured, in which case it uses that configuration).
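
The rule described above can be written out as a short sketch. This is only an illustration of the stated behavior, not the actual implementation; `add_modified_file` and its parameters are hypothetical, with `largefiles` standing in for a configured annex.largefiles matcher (or `None` when it is not configured).

```python
from typing import Callable, Optional


def add_modified_file(old_was_annexed: bool,
                      largefiles: Optional[Callable[[str], bool]],
                      path: str) -> str:
    """Return "annex" or "git" for a git add of a modified file."""
    if old_was_annexed:
        # The old version was annexed, so the new version stays annexed.
        return "annex"
    if largefiles is not None:
        # annex.largefiles is configured, so it makes the decision.
        return "annex" if largefiles(path) else "git"
    # Old version was stored in git and nothing is configured: plain git add.
    return "git"
```

For example, a modified file that was previously stored in git, in a repository with no annex.largefiles setting, gives "git".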

Comment by joey Tue Oct 22 18:16:15 2019

maybe, git-annex could keep track of local unlocked files by inode, not just by path name?

That's an interesting idea. If it could be made to work well, I think it would address my concerns from comment 2 while freeing git add to otherwise behave however it might be desired to behave by the user.

I've expanded on the idea in [[todo/inode_based_clean_filter_for_less_surprising_git_add]]

Thanks!

Comment by joey Tue Oct 22 17:49:43 2019

Ilya, by the time the pre-commit hook runs, git add would have already written the large file into the object file, so stuff like git gc would pay the price of it even if it were kept out of a commit.

In other words, that has the same problems that v5 unlocked files had when git add or git commit was run on them. I've seen plenty of users bitten by that with v5. Fixing that problem was a (minor) motivation for v7.

Comment by joey Tue Oct 22 17:45:43 2019

I am not convinced that making this configurable does anything other than give users a new way to shoot themselves in the foot, and add another config setting I have to tease out of them in the ensuing bug report.

My comment on lets discuss git add behavior describes scenarios where such a git add behavior would result in accidentally adding large files to the git object store.

The only way that would seem to avoid that is a config setting that disables all support for populating annexed unlocked files, as well as making git-add not smudge them, so the repository is basically treated just like a v5 repository.

Comment by joey Tue Oct 22 17:23:54 2019

Can we please not use language like "hijacked" and "man in the middle attack" about this.

At least, not if you want me to engage constructively with this thread.

Comment by joey Tue Oct 22 17:14:24 2019