A couple of failing tests on DataLad's end pointed to a recent
change in git-annex creating commits on the git-annex branch despite
-c annex.alwayscommit=false
being specified. One of those tests
checks that, when we're doing a bulk operation adding metadata to
files, we don't generate many separate commits on the git-annex branch
for each git annex metadata
call. Based of off that, here's a
script that shows that -c annex.alwayscommit=false
isn't honored on
the second call to git annex metadata
.
set -x
cd "$(mktemp -d ${TMPDIR:-/tmp}/gx-XXXXXXX)"
git init && git annex init
>one && git annex add one && git commit -m'add one'
ga_head=$(git rev-parse git-annex)
git annex metadata -c annex.alwayscommit=false --set a=1 one
git annex metadata -c annex.alwayscommit=false --set b=2 one
git log --oneline --stat $ga_head..git-annex
git annex merge
git log --oneline --stat $ga_head..git-annex
[...]
+ git annex metadata -c annex.alwayscommit=false --set b=2 one
metadata one (recording state in git...)
a=1
a-lastchanged=2020-04-14@21-16-48
b=2
b-lastchanged=2020-04-14@21-16-48
lastchanged=2020-04-14@21-16-48
ok
+ git log --oneline --stat 8e7323181b1481aa6d8ed1059a46f8da29de0629..git-annex
84b987f update
...0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.log.met | 1 +
1 file changed, 1 insertion(+)
+ git annex merge
merge git-annex (recording state in git...)
ok
+ git log --oneline --stat 8e7323181b1481aa6d8ed1059a46f8da29de0629..git-annex
2cb57d6 update
...0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.log.met | 1 +
1 file changed, 1 insertion(+)
84b987f update
...0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.log.met | 1 +
1 file changed, 1 insertion(+)
Before eedd73b84 (fix reversion caused by earlier optimisation to
git-annex branch reads, 2020-04-10), the output of the first git log
call would instead be empty, and the second log call (after git annex
merge
) would show a single commit for the metadata changes.
I wasn't able to make much progress understanding eedd73b84 or the
commit it references, aeca7c220 (Sped up query commands that read the
git-annex branch by around 5%, 2020-04-09). The main change in
eedd73b84 is the addition of the else
branch, so not too
surprisingly I can restore the change in behavior by guarding the
else
branch with a condition that checks annexAlwaysCommit
(diff
below). But I don't have a clear enough understanding to write a
commit message that justifies that change or to be confident that this
change is good overall (though git annex test
did not report any
failures).
diff --git a/Annex/Branch.hs b/Annex/Branch.hs
index 712c5c10f..f248b3eb4 100644
--- a/Annex/Branch.hs
+++ b/Annex/Branch.hs
@@ -179,8 +179,9 @@ updateTo' pairs = do
- a commit needs to be done. -}
when dirty $
go branchref dirty [] jl
- , when dirty $
- lockJournal $ go branchref dirty []
+ , whenM (annexAlwaysCommit <$> Annex.getGitConfig) $
+ when dirty $
+ lockJournal $ go branchref dirty []
)
else lockJournal $ go branchref dirty tomerge
return $ not $ null tomerge
That change would lead to buggy behavior, because git-annex would then not stage the journal files, and the optimisation prevents it reading the journal files, so it would operate with out of date information.
The right fix, I think is to avoid making a commit, and instead only stage the journal files into the annex index.
But probably only when annex.alwayscommit=false and still commit when it's true, because leaving staged changes in the annex index without committing them risks git gc deleting the objects used, which is a documented gotcha with annex.alwayscommit=false but not something users should otherwise need to worry about.
Tried to implement that approach, but it has a problem: Once the journal gets staged into the index, the journal is no longer dirty. So, when git-annex is later run with annex.alwayscommit=true, it won't know it needs to commit the index to the branch. It would only do so after some other subsequent change is made. So
git annex merge
would not commit the staged changes as it does now.Solving that would need some other indication that the index has staged changes in it, eg a flag file or some comparison of index file timestamp with the git-annex branch ref timestamp. But whatever it is would need to be checked for when annex.alwayscommit=true. So git-annex would become slower in the common case to support annex.alwayscommit=false.
The slowdown would be much less than the win of the optimisation that caused this, but still I am not much of a fan of slowing things back down for an uncommon case.
(The only way around that, maybe, would be to leave a file in the journal after staging it, so git-annex treats the journal as dirty without needing to check anything else, and so knows it needs to commit the annex index later. But then the journal would be treated as dirty every time git-annex starts up, and so would be staged into the index. Which would make running with annex.alwayscommit=false slower. Maybe the file could have a special name that git-annex knows it does not need to stage? This feels like a lot of complexity being added.)
So, I'm now leaning more toward disabling the optimisation when annex.alwayscommit=false. The journal won't be staged to the index then, and git-annex will have to check for journalled changes as long as it's running with that configuration. No worse than it was before.
Found another problem with that optimisation: When annex.merge-annex-branches=false, the journal does not get staged or committed, but the optimisation assumes it always does. So, git-annex won't see any journalled changes then.
Thanks for the fixes, as well for providing a nice description of your analysis for others to follow along.