Please describe the problem.
References of the struggle with more background:
- https://github.com/datalad/datalad/issues/7608
- https://github.com/datalad/datalad/issues/7609
What steps will reproduce the problem?
$> (set -e; cd /tmp/; rm -rf ds002144*; git clone http://github.com/OpenNeuroDatasets/ds002144 ; cd ds002144; git fsck; mkdir /tmp/ds002144-2; (cd /tmp/ds002144-2; git init; git annex init; ); git remote add --fetch datalad-public /tmp/ds002144-2; git fsck; git annex merge; git fsck; )
What version of git-annex are you using? On what operating system?
10.20240430+git26-g5f61667f27-1~ndall+1
Please provide any additional information below.
$> (set -ex; cd /tmp/; rm -rf ds002144*; git clone http://github.com/OpenNeuroDatasets/ds002144 ; cd ds002144; git fsck; mkdir /tmp/ds002144-2; (cd /tmp/ds002144-2; git init; git annex init; ); git remote add --fetch datalad-public /tmp/ds002144-2; git fsck; git annex merge; git fsck; )
...
+/bin/zsh:80> git fsck
Checking object directories: 100% (256/256), done.
Checking objects: 100% (4759/4759), done.
+/bin/zsh:80> git annex merge
(merging datalad-public/git-annex into git-annex...)
(recording state in git...)
Remote origin not usable by git-annex; setting annex-ignore
http://github.com/OpenNeuroDatasets/ds002144/config download failed: Not Found
merge git-annex ok
+/bin/zsh:80> git fsck
Checking object directories: 100% (256/256), done.
Checking objects: 100% (4759/4759), done.
broken link from tree 4089998623737d39cd3f5d6fdfa89b164898e464
to tree ae2937297eb1b4f6c9bfdfcf9d7a41b1adcea32e
broken link from tree 8ba58233cd121b97d5c918a6ba7c3a8c56fd38b1
to tree b78b723042e6d7a967c806b52258e8554caa1696
missing tree b78b723042e6d7a967c806b52258e8554caa1696
missing tree ae2937297eb1b4f6c9bfdfcf9d7a41b1adcea32e
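For the record, the dangling entries can be seen by listing the surviving parent trees directly; a small illustration using only the SHAs from the fsck output above:

$> git ls-tree 4089998623737d39cd3f5d6fdfa89b164898e464
$> git ls-tree 8ba58233cd121b97d5c918a6ba7c3a8c56fd38b1

Each listing still records an entry pointing at one of the missing trees, even though the objects themselves are gone.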
Have you had any luck using git-annex before? (Sometimes we get tired of reading bug reports all day and a lil' positive end note does wonders)
There are good days and there are some bad days.
Reproduction recipe works, thanks!
Happens back to 10.20240129 at least; this is not recent breakage.
There are some interesting things in the git-annex branch history, including some git-annex:export.tree grafting and also a continued transition.
I made a new empty repo, initialized it, and annexed some files. Running the same script but cloning that repo, this problem does not occur. I also tried exporting a tree in that repo, and still the problem doesn't occur. I even tried running git-annex forget in there and still can't cause the problem. So something about this specific repo's git-annex branch history is triggering the problem, and I don't know what. I've archived the current state of this repo in my big repo as git-annex-test-repos/ds002144.tar.gz to make sure I can continue to reproduce this.
The first git-annex branch commit that is missing its tree object is a git-annex:export.tree graft commit. That is 3 commits above the git-annex branch pulled from github:
Very interesting. Especially since the point of those export.tree graft commits is to make sure that the exported tree objects are referenced and so don't get gc'ed out from under us.
Resetting the repo's git-annex branch all the way back to the 1st commit in it is sufficient to reproduce this bug.
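A minimal sketch of that reset, assuming a throwaway copy of the repo at /tmp/ds002144 and a single root commit on the branch:

$> (cd /tmp/ds002144; first=$(git rev-list --max-parents=0 git-annex | tail -n1); git update-ref refs/heads/git-annex "$first")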
Hmm. That ref contains an export.log that references some tree SHAs.
Those seem familiar:
So ok.. We have here a transition that forgot git history, but it kept an export.log that referenced two trees in that now-forgotten git history.
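For reference, the export.log can be displayed straight from the branch (or from any older ref in its history):

$> git show git-annex:export.log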
Everything else seems to follow from that. Grafting those trees back into the git-annex branch in order to not forget them is a bad move, since they're already forgotten. So it could just avoid doing that when the tree object is missing, I suppose.
There might be a deeper bug though: if we want to git-annex export, in either the original repo with forgotten history or in a clone, it won't be able to refer to those tree objects. So it won't know what has been written to the special remote. E.g., if we export a tree that deletes a file compared to one of these trees, it wouldn't delete the file from the special remote. I think this problem might not happen when exporting in the original repo, because there the export database also records the same information. More likely it will happen in a clone.

So, action items: git-annex export needs to refuse to touch the affected special remote, or warn the user that it's lost track of what files were sent to the special remote.

Occurs to me that one way to get a repository into this situation would be to do a git-annex export, then git-annex forget, and then manually reset the git-annex branch to git-annex^^ (or similarly push git-annex^^ to origin). There is a commit after the transition commit that re-grafts the exported tree back into the git-annex branch, and a manual reset would cause exactly this situation.
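A hedged sketch of that sequence, assuming an already-configured special remote named myremote (hypothetical name):

$> git annex export master --to myremote    # writes export.log and a graft commit
$> git annex forget                         # transition rewrites the git-annex branch
$> git update-ref refs/heads/git-annex git-annex^^   # drop the re-graft commit

After this, export.log references a tree that nothing else keeps alive, so git gc can remove it, producing exactly the fsck breakage shown above.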
I doubt OpenNeuro is manually resetting the git-annex branch when creating these repos, but stranger things have happened...
Spot-checked a few other OpenNeuro datasets in the same numeric range and they seem ok, so this may have been a one-off problem. It would be good to check all 1.1k datasets.
performTransitionsLocked, when neednewlocalbranch = True, first writes the new git-annex branch, and then calls regraftexports, which adds a second commit onto it. In the window before regraftexports finishes, interrupting git-annex will leave the repository in this state.
There may be some other way this could happen, but that seems like a likely cause. It needs to avoid updating the git-annex branch ref until it's grafted the exports into it.
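In shell terms, the needed ordering looks roughly like this (a hypothetical sketch only; the real fix is in git-annex's Haskell code, and $transitioned_tree / $graft_tree stand in for trees it computes):

# build the transitioned branch AND the export graft first, off to the side
$> new=$(git commit-tree -m "transition" "$transitioned_tree")
$> new=$(git commit-tree -p "$new" -m "graft in exported tree" "$graft_tree")
# only then move the ref, in a single step an interrupt can't land inside
$> git update-ref refs/heads/git-annex "$new"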
Decoding the export.log, we have these events:
- Tue Aug 4 13:44:10 2020 (PST): An export is run on an openneuro worker, sending to s3-PRIVATE, of b78b723042e6d7a967c806b52258e8554caa1696, which is now lost to history. After that export completed, there was a subsequent started but not completed export of ae2937297eb1b4f6c9bfdfcf9d7a41b1adcea32e, also lost to history.
- Fri Jan 19 21:04:26 2024: An export run on the same worker, sending to a s3-PUBLIC (not the current one, but one that has been marked dead and forgotten), of ae2937297eb1b4f6c9bfdfcf9d7a41b1adcea32e. After that export completed, there was a subsequent started but not completed export of 28b655e8207f916122bbcbd22c0369d86bb4ffc1.
- Later the same day: An export run on the same worker, sending to s3-PUBLIC (the current one), of 28b655e8207f916122bbcbd22c0369d86bb4ffc1. This export completed.

Interesting that two exports were apparently started but left incomplete. This could have been because git-annex was interrupted, which would go a way toward confirming my analysis of this bug. But it is also possible there was an error exporting one or more files.
According to Nell, the git history of main was rewritten to remove a large file from git. The tree 28b655e8207f916122bbcbd22c0369d86bb4ffc1 appears to still contain the large binary file. No commit in main references it. It did get grafted into the git-annex branch which is why it was not lost.
Fixed performTransitionsLocked to create the new git-annex branch atomically.
Found another way this could happen: interrupting git-annex export after it writes export.log but before it grafts the tree into the git-annex branch. Fixed that one too.

So hopefully this won't happen to any more repositories with these fixes. That still leaves the question of how to recover from the problem.
Note that at least in the case of ds002144, its git-annex branch does not contain grafts of the missing trees. The grafts only get created in the clone when dealing with a transition.
So, it seems that to recover from the problem, at least in the case of this repository, it will be sufficient for git-annex to avoid regrafting trees if the object is missing.
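The check is cheap; a sketch of the equivalent guard in shell, using one of the missing trees from this repo:

$> git cat-file -e ae2937297eb1b4f6c9bfdfcf9d7a41b1adcea32e && echo "present, ok to regraft" || echo "missing, skip the regraft"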
Done that, and so I suppose this bug can be closed. I'd be more satisfied if I knew how this repository was produced though.
While I don't think this affects the ds002144 repository (because the repository with the missing tree is dead), here's what happens if the export.log's tree is missing, master has been reset to a previous tree that was exported earlier, and in a clone we try to get a file that is present in both trees from the remote:
Note that the "bad object" message only appears the first time this is run. Afterwards it only says "unknown export location".
Even if the tree object later somehow gets pulled in, it will keep failing, because the exportdb at this point contains the tree sha and it won't try to update from it again.
To recover from this situation, the user can make a change to the tree (e.g., add a file) and export. It will complain one last time about the bad object, and then the export.log gets fixed to contain an available tree. However, any files that were in the missing tree that do not get overwritten by that export will remain in the remote, without git-annex knowing about them. If the remote has importtree=yes, importing from it is another way to recover.
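A hedged sketch of both recovery paths, assuming the affected special remote is named myremote (hypothetical name):

$> touch recovered && git add recovered && git commit -m "force a new tree"
$> git annex export master --to myremote   # complains once about the bad object, then records the now-available tree
# or, if the remote was configured with importtree=yes:
$> git annex import master --from myremote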