Please describe the problem.
This has happened a few times now, which is why I'm reporting it. Git-annex seems to do something that causes my git repos to randomly corrupt on their own, which is very worrying because I need to trust these repos to keep their content safe. It has never happened with regular repos and happens on both my Linux machine and my MBP. It hasn't happened yet on my NAS (SOTERIA), which is exclusively accessed remotely, so I suspect this is an issue with the assistant/daemon.
What steps will reproduce the problem?
Just have the assistant running, committing and syncing. No idea what other factors might play into this.
What version of git-annex are you using? On what operating system?
git-annex version: 8.20210428
build flags: Assistant Webapp Pairing Inotify DBus DesktopNotify TorrentParser MagicMime Feeds Testsuite S3 WebDAV
dependency versions: aws-0.22 bloomfilter-2.0.1.0 cryptonite-0.28 DAV-1.3.4 feed-1.3.2.0 ghc-8.10.4 http-client-0.6.4.1 persistent-sqlite-2.11.1.0 torrent-10000.1.1 uuid-1.3.14 yesod-1.6.1.1
key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 BLAKE2B256E BLAKE2B256 BLAKE2B512E BLAKE2B512 BLAKE2B160E BLAKE2B160 BLAKE2B224E BLAKE2B224 BLAKE2B384E BLAKE2B384 BLAKE2BP512E BLAKE2BP512 BLAKE2S256E BLAKE2S256 BLAKE2S160E BLAKE2S160 BLAKE2S224E BLAKE2S224 BLAKE2SP256E BLAKE2SP256 BLAKE2SP224E BLAKE2SP224 SHA1E SHA1 MD5E MD5 WORM URL X*
remote types: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar git-lfs httpalso borg hook external
operating system: linux x86_64
supported repository versions: 8
upgrade supported from repository versions: 0 1 2 3 4 5 6 7
local repository version: 8
https://github.com/Atemu/nixpkgs/tree/498831397e77a265c240cf8f8a7d15e738f2f05b
Please provide any additional information below.
# If you can, paste a complete transcript of the problem occurring here.
# If the problem is with the git-annex assistant, paste in .git/annex/daemon.log
[2021-05-09 17:55:13.653645489] Committer: Committing changes to git
(recording state in git...)
[2021-05-09 17:55:13.702843262] Pusher: Syncing with SOTERIA
To ssh://192.168.101.24/~/Annex/Documents.git/
041f52d19..40cc59126 master -> synced/master
git-annex: internal error: evacuate: strange closure type 4325399
(GHC version 8.10.4 for x86_64_unknown_linux)
Please report this as a GHC bug: https://www.haskell.org/ghc/reportabug
(recording state in git...)
(scanning...) ControlSocket .git/annex/ssh/atemu@192.168.101.24 already exists, disabling multiplexing
(started...)
(recording state in git...)
Unpacking all pack files.
remote: error: cannot lock ref 'refs/heads/synced/git-annex': is at 9166c83718695df3021d5bbf4fef2297d3f4cc84 but expected 7f761edbbe3c2a4710c629e9e7fdfb730c0639e7
remote: error: cannot lock ref 'refs/heads/synced/master': is at 42cab466f705c2496fb19249e272487bceb38808 but expected 40cc591261675e872f79b6a9ea966215d3f73581
To ssh://192.168.101.24/~/Annex/Documents.git/
! [remote rejected] git-annex -> synced/git-annex (failed to update ref)
! [remote rejected] master -> synced/master (failed to update ref)
error: failed to push some refs to 'ssh://192.168.101.24/~/Annex/Documents.git/'
To ssh://192.168.101.24/~/Annex/Documents.git/
7f761edbb..9166c8371 git-annex -> synced/git-annex
40cc59126..42cab466f master -> synced/master
fatal: early EOF
git-annex: internal error: evacuate: strange closure type 4325399
(GHC version 8.10.4 for x86_64_unknown_linux)
Please report this as a GHC bug: https://www.haskell.org/ghc/reportabug
(recording state in git...)
ControlSocket .git/annex/ssh/atemu@192.168.101.24 already exists, disabling multiplexing
Auto packing the repository in background for optimum performance.
See "git help gc" for manual housekeeping.
error: Could not read 1f2ab777cdd4eb9add96d42de5022bde0a4a9a8b
error: Could not read 0f74453f84ff23d31b00a9bc9c331c00988f465a
error: Could not read 0f74453f84ff23d31b00a9bc9c331c00988f465a
error: Could not read 1f2ab777cdd4eb9add96d42de5022bde0a4a9a8b
error: Could not read 0f74453f84ff23d31b00a9bc9c331c00988f465a
error: Could not read 1f2ab777cdd4eb9add96d42de5022bde0a4a9a8b
error: Could not read 0f74453f84ff23d31b00a9bc9c331c00988f465a
error: Could not read 1f2ab777cdd4eb9add96d42de5022bde0a4a9a8b
error: Could not read 0f74453f84ff23d31b00a9bc9c331c00988f465a
hint: Using 'master' as the name for the initial branch. This default branch name
hint: is subject to change. To configure the initial branch name to use in all
hint: of your new repositories, which will suppress this warning, call:
hint:
hint: git config --global init.defaultBranch <name>
hint:
hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
hint: 'development'. The just-created branch can be renamed via this command:
hint:
hint: git branch -m <name>
Initialized empty Git repository in /tmp/tmpreporFb6Gq/.git/
Trying to recover missing objects from remote SOTERIA.
fatal: 'SOTERIA' does not appear to be a git repository
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
Trying to recover missing objects from remote SOTERIA.
fatal: 'SOTERIA' does not appear to be a git repository
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
65 missing objects could not be recovered!
To force a recovery to a usable state, retry with the --force parameter.
# End of transcript or log.
I'm not precisely sure when the actual crash happened (could be that it happened after starting up the machine today) as the log is a bit ambiguous. All I know is that it was corrupted today and that I was making commits till yesterday ~18:00.
The objects mentioned in the log are both 6 days old; one is a commit on master and one on git-annex.
Git fsck spat out a huge list of broken blobs and trees.
Have you had any luck using git-annex before? (Sometimes we get tired of reading bug reports all day and a lil' positive end note does wonders)
There are a surprising amount of weird bugs and quirks like these but it's such an amazing tool, there's nothing like it.
fixed, although I have opened "assistant repair misfires" for part of the cause of this. --Joey
git-annex does not write directly to your git repository; all writes go via git. So it is hard to see how any corruption like this could be due to a bug in git-annex.
What it usually is caused by is a filesystem losing data, eg because of an unclean shutdown or a crash, or sometimes because of a disk error.
Setting the git config core.fsyncObjectFiles may help. Configuring the filesystem to write data to disk more promptly (eg enabling journalling of file contents) may help.
git-annex repair --force may be able to recover from this situation.
The "evacuate: strange closure" error does seem notable, something strange is going on in the memory of git-annex. Presumably due to either a compiler bug or some kind of memory problem.
This also looks suspicious:
And after that something (git-annex?) is complaining about missing git objects. So maybe you can try to set gc.auto to 0 and see if that works better.
The weird thing is, this only ever happens with my Documents git-annex repo, the only one where I use the assistant. It has never happened in 4 years of actively using regular git nor my manually committed and synced git-annex repos.
Another corruption just occurred but this time on the MBP:
This happened just after I started the assistant/webapp and you can see that I shut it down in the end.
My filesystems are ZFS on the Linux machine and APFS on the MBP. Being an Apple product, I'd expect APFS to be built to high standards (though I don't fully trust it) but ZFS corrupting anything or losing data is almost unheard of.
I'll try setting core.fsyncObjectFiles on this repo and update this issue accordingly when I see another corruption with it on. Shouldn't take too long as this occurs worryingly frequently.
Btw, repair always fails in my setup because git seems to be unable to find the SOTERIA remote; is it trying to use the remote name literally as an address perhaps?
Something peculiar just happened. I had the MBP with me while I was out and was editing documents with no connection to any of my machines.
I usually stop the assistant at home so that it only runs on one machine at a time because of https://git-annex.branchable.com/bugs/OSX58_Pushed_changes_are_autocommited/ but because I was out, I started it.
On startup, it said that it was trying to repair my repo. I shut it down because it couldn't do any repairs without a connection to my machines (and it wouldn't succeed even if it did unfortunately).
I ran git fsck to see what happened but, to my surprise, it returned no errors besides missing reflog entries from previous corruptions. git gc didn't error either.
I committed my changes manually while I was out.
That was about two days ago. I synced my changes manually at home where an unrelated and trivial issue happened during which I ran another successful git fsck and, after fixing that issue, I started the webapp.
It was attempting to repair again and was complaining about a thread having crashed (logs below).
Now however, the repo is actually broken with missing blobs, trees and links everywhere.
daemon.log.1:
daemon.log:
I haven't set the fsync option yet but this system hasn't been restarted in a long time and the corruption happened while the system was on; right in front of my eyes basically.
I'll now manually fix the corruption again by rsync'ing the .git/objects/ dir to the state of a working repo's.
Another crash on the Linux machine as I was starting the assistant today. Usually it doesn't happen nearly this frequently though.
core.fsyncObjectFiles was set to true this time.
log.3 (2d ago 19:xx):
log.2 (today 10:16):
At the time of log.1 (10:55), corruption has already set in.
I'm sniffing an issue with the auto-repair feature wrongly detecting corruption, trying to fix it and thereby actually corrupting the repo. I'd like to turn it off for investigation, is that possible somehow? By disabling consistency checks perhaps?
Note that the strange closure crash is a separate bug, and has been fixed in git master. I do not think that bug could corrupt git repos: as I mentioned before, git-annex doesn't write directly to the git repo, so a bug of that kind should not be able to corrupt it.
Anyway, it would probably be a good idea to upgrade to a current daily build to get past that bug.
strace for mac os and make it log every syscall of git-annex assistant and its children. This would show which git action corrupts the repo and how.
strace itself but I think it'd be extremely hard to spot anything useful in that and probably is not worth the performance cost.
A new crash just happened. I haven't upgraded git-annex to master yet but the strange closure bug didn't occur this time. Annotated daemon.log:
Git is now trying to merge refs/synced/d7d728f7-891a-4035-a758-c7ee80a8017a/master which points at a commit made on the MBP 4d ago. This commit has previously been merged into the original branch on this machine and contains a change to a txt render of a different .etherpad file. This also means that the commit before it contains a change to the corresponding .etherpad file.
My branch is now a single commit made at 11:49:21 that adds the whole state of the working dir. However, the diff for the .etherpad file says (without textconv):
It added these as unlocked largefiles despite my largefiles rule excluding them. I think this has happened once before, but I committed manually at that time.
The other .etherpad files I had in the tree were not added as largefiles.
I remember having issues with textconv and annex before too and disabling a few rules that I didn't really need.
The git-annex branch seems to be intact. However, its commits are a bit weird. Here is git log synced/git-annex..git-annex --patch:
synced/master is in the same state as it was 4d ago, the date of the merge commit git is now trying to merge. There have only been smallfile changes made to master since.
I'm especially confused by the changes to schedule.log.
Looks like there are some major incompatibilities with textconv and git-annex that need to be documented and/or fixed.
Smallfiles don't seem to be handled that well either, perhaps my other issues are related to using those almost exclusively in this repo?
Automatic merges have never worked for me for example. (Though I probably wouldn't want them anyways).
Is anyone else having issues with this combination of features? I can't be the only one, right?
This seems to show conclusively that the repair is somehow triggering unnecessarily and also corrupting the repo in this situation.
The comment #3 log shows that the repair is started, and then 1 minute later a git object is missing.
(It's odd that the log shows a second fsck run after the repair was already triggered. I do not see a way that this would happen unless fscks are scheduled very close together.)
The automatic repair is supposed to be a non-destructive repair; the destructive repair only happens after prompting in the UI.
This also reminds me of a persistent issue with a git-annex repo, using the assistant, on my sister's laptop corrupting itself.
The repair process moves all pack files to a temp dir and then unpacks the loose objects from them. So, there is a time window, when the repair is running, where git objects that were present before will be missing. And if the assistant stops before that is complete, it would leave it in that state. Unpacking pack files can take a long time, so this might be a sufficient explanation.
But then, something must be causing it to incorrectly think it needs a repair in the first place. Assuming it is incorrect, of course. Either git fsck is exiting nonzero for some reason, or git-annex is thinking it sees git fsck complain about a missing object, that is not really missing. While there are fsck outputs that it can misinterpret, it double-checks by trying to cat the object, which should avoid the latter problem.
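The double-check described above can be sketched roughly as follows. This is only an illustration, not git-annex's actual code: the function names, the regular expression, and the choice of git cat-file -e as the probe are assumptions made for this example. It pulls every 40-hex object ID out of some fsck output and keeps only the ones git really cannot find.

```python
import re
import subprocess
from typing import Callable, List

# Assumption: SHA-1 object IDs (40 hex digits); a SHA-256 repo would need 64.
SHA_RE = re.compile(r"\b[0-9a-f]{40}\b")


def object_exists(sha: str, repo: str = ".") -> bool:
    """Ask git whether the object is readable.

    `git cat-file -e` exits with status 0 when the object exists and is valid.
    """
    result = subprocess.run(
        ["git", "cat-file", "-e", sha],
        cwd=repo,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def confirmed_missing(
    fsck_output: str, exists: Callable[[str], bool] = object_exists
) -> List[str]:
    """Return the object IDs mentioned in fsck output that really are unreadable.

    Any ID that appears in the output is only a candidate; it counts as
    missing only if the `exists` probe also fails for it.
    """
    candidates = sorted(set(SHA_RE.findall(fsck_output)))
    return [sha for sha in candidates if not exists(sha)]
```

Passing the probe in as a parameter keeps the parsing step separate from the git call, so the filtering logic can be exercised without a repository. An empty result here would mean fsck's complaints did not correspond to any object that is actually unreadable.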
To avoid moving the pack files, repair could set GIT_OBJECT_DIRECTORY to a temp directory, and copy each pack file into it in turn, and unpack. And after each unpack, move the unpacked objects from the temp directory to the real object directory, followed by deleting the pack file (in case it's corrupt).
Removing repair from the assistant (and git-annex repair) should be on the table as a solution to this. It's a whole lot of complexity that might fix a few users' repos sometimes, but is outside of git-annex's scope and is mostly only used by assistant users.
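For illustration, the pack-by-pack unpacking proposed above (GIT_OBJECT_DIRECTORY pointed at a temp directory, one pack at a time) might look something like the sketch below. It is not the code that went into git-annex or git-repair. The function names are hypothetical, a non-bare repo with a .git directory is assumed, and companion files other than the .idx (such as .keep or .bitmap) are ignored. The point it tries to show is that each original pack is deleted only after its objects have been moved into the real object directory, so an interruption never leaves objects absent.

```python
import os
import shutil
import subprocess
import tempfile
from pathlib import Path


def git_unpack(pack: Path, scratch: Path, repo: Path) -> None:
    """Unpack one pack into `scratch` rather than the repo's own object dir."""
    env = dict(os.environ, GIT_OBJECT_DIRECTORY=str(scratch))
    with open(pack, "rb") as stream:
        # -r asks git to recover as many objects as it can from a corrupt pack.
        # A nonzero exit is tolerated; whatever was unpacked is still kept.
        subprocess.run(
            ["git", "unpack-objects", "-r"],
            stdin=stream, cwd=repo, env=env, check=False,
        )


def unpack_packs_one_at_a_time(repo: Path, unpack=git_unpack) -> int:
    """Explode every pack into loose objects; return the number of packs handled."""
    objects = repo / ".git" / "objects"
    packs = sorted((objects / "pack").glob("*.pack"))
    for pack in packs:
        # The temp dir lives inside objects/ so the moves below are plain
        # renames on the same filesystem.
        with tempfile.TemporaryDirectory(dir=objects) as tmp:
            scratch = Path(tmp) / "objects"
            scratch.mkdir()
            copy = Path(tmp) / pack.name
            shutil.copy2(pack, copy)
            unpack(copy, scratch, repo)
            # Move loose objects (two-character fan-out dirs) into place.
            for fanout in scratch.iterdir():
                if not fanout.is_dir() or len(fanout.name) != 2:
                    continue
                dest_dir = objects / fanout.name
                dest_dir.mkdir(exist_ok=True)
                for obj in fanout.iterdir():
                    dest = dest_dir / obj.name
                    if not dest.exists():
                        os.replace(obj, dest)
        # Only now, with its objects safely in place, drop the original pack.
        pack.unlink()
        pack.with_suffix(".idx").unlink(missing_ok=True)
    return len(packs)
```

The unpack step is passed in as a parameter so the move-then-delete ordering can be checked without invoking git. If the process is killed partway through, the worst case under these assumptions is a leftover temp directory and some objects that exist both loose and packed.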
I have verified that an interrupted repair results in data loss, when .git/objects contains pack files.
Fixed that. Which leaves 2 open questions:
Although the answer to the second question doesn't matter much if the data loss bug is fixed -- if there's a problem of some kind causing unnecessary repairs, it would only be some excess CPU load.
Hi Joey, thank you for looking into this again.
Note that I started the assistant again in that log. The consistency check didn't finish the first time because somehow corruption was detected and the assistant crashed itself in trying to repair (I didn't accept any repair prompts in the GUI that time IIRC).
AFAICT the consistency check is what triggers the issue because, since turning it off (and creating new repos on my clients), I haven't experienced a single repo corruption.
The schedule was set to whatever the assistant nudges you to set by default.
That sounds a lot more sane.
Perhaps the repair functionality could use a complete overhaul. I've also found it to be extremely slow compared to simply copying missing or corrupt objects/packs from another remote.
Ideally it should be able to restore a git repo to a working state without .git/objects present at all, as it should only need remotes and refs.
Extracting objects from the corrupt objects dir which aren't in any remote should be done if possible though because important local-only objects are a thing (stashes, additional branches, branches that are ahead of the remote's)
Manual repair and detection can stay IMO but any automatic repair needs to go entirely or be optional.
I've actually (tried) using it from the CLI too, having it as an option there would be useful.
I agree that it's out of scope though. It would probably be better off as a separate project as it's useful in any git repo, not just git-annex ones. Some integration with git-annex would be appreciated though.
Perhaps this functionality could be fully spun out into git-repair which could then become an optional add-on to git-annex.
The laptop being powered off etc. was not the case. Though I do shut down the daemon quite often because of https://git-annex.branchable.com/bugs/OSX58_Pushed_changes_are_autocommited/ (via the GUI or --stop). Do the shutdown and restart procedures check for a running repair?
I also sometimes had to kill all git processes for one reason or another but I don't think that ever happened after I pressed the repair button (I wait that out for good reason).
So current issues distilled from this thread so far:
If the assistant crashed in the middle of a repair, that confirms my analysis, and my fix will avoid both the data loss problem and the crash.
Only remaining question to me is why would it trigger an unnecessary repair.
But that could be anything that causes git fsck to exit nonzero. Or it might be that git fsck found an actual problem, but not one that was preventing the repo from working. Eg, a missing/corrupt object used somewhere deep in the history.
(Note that git-repair already exists. git-annex integrates it.)