In datalad test builds with git-annex 7.20191114+git43-ge29663773, one of the new test failures is due to an unexpectedly dirty repository (related datalad issue). The dirty status comes from a file that was tracked in Git switching over to an annex pointer file. Here's a script that distills enough of the test to trigger the failure on my end.
```sh
#!/bin/sh
set -eu

assert_clean () {
    if test -n "$(git status --porcelain)"
    then
        printf "\n\nUnexpectedly dirty:\n" >&2
        git status >&2
        git diff >&2
        exit 1
    fi
}

cd "$(mktemp -d --tmpdir gx-pointer-dirty-XXXXXXX)"
git init && git annex init

printf content-git >file-git
git -c annex.largefiles=nothing annex add -- file-git
git commit -m'file-git added'
assert_clean

printf content-annex >file-annex
git -c annex.largefiles=anything annex add -- file-annex
git commit -m'file-annex annexed'
assert_clean
```
On Travis as well as my local machine, the failure is intermittent, but seems to happen much more often than not. In the failing case, the last assert_clean call shows:
```
Unexpectedly dirty:
On branch master
Changes not staged for commit:
	modified:   file-git

no changes added to commit

diff --git a/file-git b/file-git
index d1c416a..b41ca32 100644
--- a/file-git
+++ b/file-git
@@ -1 +1 @@
-content-git
\ No newline at end of file
+/annex/objects/SHA256E-s11--726732d25826965592478fcc7c145d5a10fa1aa70c49fe3a4f847174b6d8889c
```
I see the failure with git-annex built from the latest master b962471c2 (2019-12-12). Bisecting against the git-annex repo (with a commit being marked "bad" if there was a failure within ten runs of the above script), points to ec08b66bd (shouldAnnex: check isInodeKnown, 2019-10-23) as the first bad commit. Just looking at the topic of the commit, that result seems plausible to me.
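Since the failure is intermittent, a commit was only marked "bad" after a failure within ten runs. A `git bisect run` helper along these lines captures that logic (my reconstruction of the approach, not the exact script used; building git-annex for each bisected commit is elided, and the reproducer path is an assumption):

```shell
#!/bin/sh
# Hypothetical "git bisect run" helper: a commit counts as "bad" if the
# reproducer fails at least once in ten runs. "$@" is the reproducer command;
# rebuilding git-annex for the bisected commit would happen before this.
bisect_step () {
    i=0
    while [ "$i" -lt 10 ]; do
        "$@" || return 1   # any failure => nonzero => git bisect marks "bad"
        i=$((i + 1))
    done
    return 0               # ten clean runs => "good"
}
```

It would be invoked via something like `git bisect run sh -c '<build>; bisect_step ./repro.sh'`.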
Other details
My git version is 2.24.1, and locally I'm building git-annex through Guix. On the failing Travis run, git-annex 7.20191114+git43-ge29663773 came from NeuroDebian, and the git version was 2.24.0.
Hopefully the script above is sufficient to trigger the issue on your end. Thanks for having a look.
The title makes it sound like a work tree file gets replaced with a dangling pointer file, which is not the case. A worktree file that was not annexed is being added to the annex, if you choose to commit that state.
For whatever reason, git becomes confused about whether this file is modified. I seem to recall that git distrusts information it recorded in its own index if the mtime of the index file is too close to the mtime recorded inside it, or something like that. (Likely as a workaround for mtime granularity issues with various filesystems.) Whatever the reason, git-annex is not involved in it; it will happen sometimes even when git-annex has not initialized the repo and is not being used.
It's not normally a problem that git gets confused or distrusts its index or whatever, since all it does is stat the file, or feed it through the clean filter again, and if the file is not modified, nothing changes.
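To make those clean-filter re-runs visible, one can install a logging pass-through filter in a throwaway repo. This is a sketch using plain git only (no git-annex needed); the filter name `logdemo` and all paths are arbitrary:

```shell
#!/bin/sh
# Sketch: a pass-through clean filter that logs each invocation, to observe
# when git decides to (re-)clean a file.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
log="$repo/clean.log"
# A clean filter reads the worktree content on stdin and writes the staged
# content to stdout; this one just records that it ran and passes through.
git config filter.logdemo.clean "echo ran >> '$log'; cat"
echo '* filter=logdemo' > .gitattributes
echo content > file
git add .gitattributes file    # the clean filter runs on the added files
git status --porcelain >/dev/null  # may re-run it if the entry is racily clean
echo "clean filter invocations: $(wc -l < "$log")"
```

Comparing the invocation count right after `git add` with the count after a later `git status` shows whether git distrusted its index and fed the file through the filter again.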
Why does the clean filter decide to add the file to annex in this case?
Well, because this is all happening inside the script's `git -c annex.largefiles=anything annex add -- file-annex` call.
And there you've told it to add all files to the annex with annex.largefiles=anything. So it does.
To complete the description of what happens: `git-annex add` runs `git add` on the `file-annex` symlink it's adding. `git add file-annex`, for whatever reason, decides to run the clean filter on file-git. The annex.largefiles=anything gets inherited through this chain of calls. While the resulting "change" does not get staged by `git add` (it was never asked to operate on that file), the clean filter duly ingests the content into the annex, and remembers its inode. So when the clean filter later gets run by `git status`, it sees an inode it knows it saw before, and assumes it should remain annexed. (This is why the commit that checks for known inodes was fingered by the bisection.)

Note that you can accomplish the same thing without setting annex.largefiles, assuming a current version of git-annex:
I think the only reason for setting annex.largefiles in either of the two places you did is if there's a default value that you want to temporarily override?
Also, just touching file-git before the annex.largefiles=anything operation causes the same problem: again, `git-annex add` runs `git add file-annex`, which runs the clean filter on file-git, which this time is legitimately modified.
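The inode-remembering behavior described above can be modeled with a toy state file. This is purely illustrative shell (all names made up); git-annex's actual isInodeKnown check consults its own database, not a flat file:

```shell
#!/bin/sh
# Toy model of "known inode" tracking: once the clean filter has ingested a
# file, it remembers the inode, and later treats files with a known inode as
# "should stay annexed".
set -eu
cd "$(mktemp -d)"
known=known-inodes
: > "$known"

inode_of () { ls -i "$1" | awk '{print $1}'; }

remember_inode () { inode_of "$1" >> "$known"; }

inode_known () { grep -qx "$(inode_of "$1")" "$known"; }

printf content-git > file-git
inode_known file-git || echo "inode not yet known"
remember_inode file-git          # as if the clean filter ingested the file
inode_known file-git && echo "inode known: file would be kept annexed"
```

In the bug's scenario, the spurious clean-filter run during `git add file-annex` plays the role of `remember_inode`, and the later `git status` run plays the role of the final `inode_known` check.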
Possible ways to improve this short of improving git's behavior:

* `git annex` could set `annex.gitaddtoannex=false` when it runs `git add`. Since git-annex never relies on `git add` adding files to the annex, that seems entirely safe to always do (perhaps even when running all git commands, aside from git-annex commands of course). But, that would not help with a variant where, rather than `git-annex add`, this is run:
* The clean filter could delay long enough that git stops distrusting its index based on timestamps. A 1-second sleep if the file's mtime is too close to the current time works; I prototyped a patch doing that. But that does not deal with the case mentioned above where file-git gets touched or legitimately modified.
* The clean filter could check if the file is already in the index but is not annexed, and avoid converting it to annexed. But that would prevent legitimate conversions from git to annexed as well, which rely on the same kind of use of annex.largefiles.
* Temporary overrides of annex.largefiles could be ignored by the clean filter. Same problem as the previous item.
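For reference, the prototyped 1-second delay could look roughly like this in shell form. This is a sketch of the idea only, not the actual git-annex patch (which is Haskell):

```shell
#!/bin/sh
# Sketch: before cleaning a file, sleep if its mtime is within a second of
# "now", so git's racy-index heuristic no longer distrusts the index entry.
set -eu
cd "$(mktemp -d)"

avoid_racy () {
    f=$1
    now=$(date +%s)
    # GNU stat first, BSD stat as fallback
    mtime=$(stat -c %Y "$f" 2>/dev/null || stat -f %m "$f")
    if [ $((now - mtime)) -lt 1 ]; then
        sleep 1
    fi
}

touch fresh-file
avoid_racy fresh-file   # fresh file: sleeps so the mtime is safely in the past
```

The obvious downside, noted above, is the cost: a clean filter that sometimes sleeps a full second makes adding many freshly written files noticeably slower.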
So, I think that fixing this will involve adding a new interface for converting between git and annexed files that does not involve -c annex.largefiles. That plus having the clean filter check for non-annexed files seems like the best approach.
On second thought, making the clean filter check for non-annexed files would prevent use cases like annex.largefiles=largerthan(100kb) from working as the user intended, letting a small file start out non-annexed and get annexed once it grows too large. Users certainly rely on that, and this bug, which only affects an edge case, does not justify breaking it.
What would work is to make the clean filter detect when a file's content has not changed, even though its mtime (or inode) has. In that case, it's reasonable for the clean filter to ignore annex.largefiles and keep the content represented in git however it already was (non-annexed or annexed).
To detect that, in the case where the file in the index is not annexed: First check if the file size is the same as the size in the index. If it is, run git hash-object on the file, and see if the sha1 is the same as in the index. This avoids hashing any unusually large files, so the clean filter only gets a bit slower.
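A shell sketch of that check for the non-annexed case (size comparison first, then `git hash-object`); this is my illustration of the proposed logic, not git-annex's code:

```shell
#!/bin/sh
# Sketch: decide whether a worktree file is byte-identical to the non-annexed
# blob recorded in the index, comparing sizes before hashing so unusually
# large files are never hashed.
set -eu
cd "$(mktemp -d)"
git init -q

unchanged () {
    # sha1 of the blob staged in the index for this path
    sha=$(git ls-files --stage -- "$1" | awk '{print $2}')
    [ "$(git cat-file -s "$sha")" -eq "$(wc -c < "$1")" ] &&
        [ "$(git hash-object -- "$1")" = "$sha" ]
}

printf content-git > file-git
git add file-git
touch file-git                   # new mtime, same content
unchanged file-git && echo "file-git: content unchanged"
printf extra >> file-git         # now the content really differs
unchanged file-git || echo "file-git: content changed"
```

In the touched-but-unmodified case the size check already matches, so one `git hash-object` is the only extra work; for a genuinely grown file the size check alone settles it.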
And when the file in the index is annexed, check if the file size is the same as the size of the annexed key. If it is, verify that the file content matches the key (typically by hashing). Cases where keys lack a size or don't use a checksum could lead to false positives or negatives, though. Although, I've not managed to find a version of this bug that makes an annexed file get converted to git unintentionally, so maybe this part does not need to be done?
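In the annexed case, the size comparison is cheap because the key name itself usually carries the size, e.g. the `-s11` field in the key shown in the diff earlier. A sketch of extracting it (assuming the simple `BACKEND-s<size>--<hash>` key form seen in this report; real keys can carry extra fields):

```shell
#!/bin/sh
# Sketch: pull the "-s<bytes>" size field out of a git-annex key name, so a
# worktree file's size can be compared against it before any hashing is done.
set -eu

key_size () {
    printf '%s\n' "$1" | sed -n 's/^[^-]*-s\([0-9][0-9]*\)--.*/\1/p'
}

key=SHA256E-s11--726732d25826965592478fcc7c145d5a10fa1aa70c49fe3a4f847174b6d8889c
key_size "$key"   # prints 11, the byte count of "content-git"
```

A key without a size field yields empty output here, which corresponds to the "keys lack size" caveat above.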
Or... Since the root of the problem is temporarily overriding annex.largefiles, it could just be documented that it's not a good idea to use -c annex.largefiles=anything/nothing, because such broad overrides can affect files other than the ones you intended. (And since the documented methods of converting files from annexed to git and git to annexed use such overrides, that documentation would need to be changed.)
A variant of this where an annexed unlocked file is added first, then the file is touched, and then some other file is added with -c annex.largefiles=nothing does result in the clean filter sending the whole annexed file content back to git, rather than keeping it annexed. For whatever reason, git does not store that content in .git/objects or update the index for that file though, so it doesn't show up as a change.
So apparently that variant is only potentially an expensive cat of a large annexed file, and does not need to be dealt with. Unless git sometimes behaves otherwise.
It's almost possible to get the same unwanted conversion without any git races:
In this case, git currently does not run the modified file-git through the clean filter in the last line, so the annex.largefiles=anything doesn't affect it.
But, as far as I can see, there's nothing preventing a future version of git from deciding it does want to run file-git through the clean filter in this case.
I am not going to try to prevent such a thing from happening. As far as I can see, anything that the clean filter can possibly do to avoid such a situation will cripple existing use cases of annex.largefiles, like largerthan() as mentioned above. The user has told git-annex to annex "anything", and if git decides to run the clean filter while that is in effect, caveat emptor.
Which is not to say I'm not going to fix the specific case this bug was filed about; I actually have a fix developed now. But setting annex.largefiles=anything/nothing temporarily is a blunt instrument, you risk accidental conversion when using it, and so it would be a good idea not to do that.
One idea: Make `git-annex add --annex` and `git-annex add --git` add a specific file to annex or git, bypassing annex.largefiles and all other configuration and state. This could also be used to easily switch a file from one storage to the other. I'd hope the existence of that would prevent one-off setting of annex.largefiles=anything/nothing. (Related todo: "git annex add option to control to where".)

Thanks for the explanation and the fix.
I see. I think the problem and associated workaround you're referring to are described in git's Documentation/technical/racy-git.txt.
Right. DataLad's methods that are responsible for calling out to `git annex add` have a `git={None,False,True}` parameter. By default (`None`), DataLad just calls `git annex add ...` and lets any configuration in the repo control whether the file goes to git or is annexed. But with `git=True` or `git=False`, the `annex add` call includes a `-c annex.largefiles=` argument with a value of `nothing` or `anything`, respectively.

Noted. As mentioned above, DataLad's default behavior is to honor the repo's `annex.largefiles` configuration. And the documentation for `datalad save`, DataLad's main user-facing entry point for `annex add`, recommends that the user configure .gitattributes rather than using the option that leads to calling `annex add` with `-c annex.largefiles=nothing`.

As far as I can see, those flags would completely cover DataLad's one-off setting of `annex.largefiles=anything/nothing`. They map directly to DataLad's `git=False/True` option described above. So, from DataLad's perspective, they'd be very useful and welcome.

I've added `git-annex add --force-large` and `--force-small`, which would be good to use to avoid this kind of too-broad overriding problem in the future.