bugs/git fsck duplicateEntries errors when using adjusted branch
git-annex http://git-annex.branchable.com/bugs/git_fsck_duplicateEntries_errors_when_using_adjusted_branch/ git-annex ikiwiki 2021-03-02T22:09:45Z
comment 1 http://git-annex.branchable.com/bugs/git_fsck_duplicateEntries_errors_when_using_adjusted_branch/comment_1_1a6674245aed0c325361043d1100daec/ justin.lebar 2018-07-08T19:53:08Z 2018-07-08T19:53:08Z
<p>OK, that systematic check for duplicates is not right. A better check is to count how many blobs in the <code>git ls-tree</code> output are duplicated and how many are not:</p>
<pre><code># Not duplicated blobs
$ git fsck |& grep duplicateEntries | cut -f 4 -d ' ' | sed -e 's/://' | xargs -n1 git ls-tree | grep -v ' tree ' | uniq -u | wc -l
324
# Duplicated blobs
$ git fsck |& grep duplicateEntries | cut -f 4 -d ' ' | sed -e 's/://' | xargs -n1 git ls-tree | grep -v ' tree ' | uniq -d | wc -l
416
</code></pre>
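<p>For anyone who wants to cross-check the two counts without the shell pipeline, here is a minimal Python sketch. It assumes the <code>git ls-tree</code> lines for the affected trees have already been collected into a list, and the function name <code>count_unique_and_duplicated</code> is made up for illustration. Like <code>uniq</code>, it only compares adjacent lines, so it mirrors <code>uniq -u | wc -l</code> and <code>uniq -d | wc -l</code>.</p>

```python
from itertools import groupby


def count_unique_and_duplicated(lines):
    """Return (unique, duplicated) counts over runs of adjacent equal lines.

    `unique` matches `uniq -u | wc -l` (runs of length 1) and
    `duplicated` matches `uniq -d | wc -l` (runs of length 2 or more).
    """
    unique = 0
    duplicated = 0
    for _, run in groupby(lines):
        if len(list(run)) == 1:
            unique += 1
        else:
            duplicated += 1
    return unique, duplicated


# Illustrative input only; the object ids are placeholders, not real hashes.
sample = [
    "120000 blob aaaa\tX",
    "120000 blob aaaa\tX",
    "120000 blob bbbb\tY",
]
unique, duplicated = count_unique_and_duplicated(sample)
```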
comment 2 http://git-annex.branchable.com/bugs/git_fsck_duplicateEntries_errors_when_using_adjusted_branch/comment_2_ccadd646ec0fe7abec461c110bfb7ae7/ joey 2018-07-16T15:59:34Z 2018-07-16T15:27:13Z
<p>Something more is involved than git annex add and sync on an adjusted
branch, because when I try that the problem does not appear.
(Also tried on OSX in case it was somehow OSX specific.)</p>
<p>I see that the mode of the files in the problem tree you showed
is 120000, which tells me that tree is one that was committed to the master
branch, not the adjusted unlocked branch. That suggests that the problem is
in reverseAdjustedTree. But I don't think I will be able to find it by
staring at the code; I need a way to reproduce the problem.</p>
<p>What <em>might</em> help is if you can show the full directory tree in the
repository (with names mangled for privacy as needed). It may be that it's
somehow caused by adding a file at a particular location in the tree,
since a lot of the complication in the code is around handling such things.</p>
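<p>The mode argument above can be illustrated with a small Python sketch that labels each <code>git ls-tree</code> line by its mode. The mode values are standard git ones (120000 is a symbolic link, which is how locked annexed files are stored; 100644 is a regular file). The helper name <code>describe_entry</code> and the sample line are assumptions for illustration only.</p>

```python
# Standard git tree entry modes as printed by `git ls-tree`.
GIT_MODES = {
    "040000": "directory (tree)",
    "100644": "regular file",
    "100755": "executable file",
    "120000": "symbolic link",
    "160000": "submodule (gitlink)",
}


def describe_entry(ls_tree_line):
    """Split '<mode> <type> <object>\\t<name>' and label the mode."""
    meta, _, name = ls_tree_line.partition("\t")
    mode, _obj_type, _obj_id = meta.split(" ")
    return name, GIT_MODES.get(mode, "unknown mode " + mode)


# Placeholder object id, not taken from any real repository.
name, kind = describe_entry("120000 blob 0123abcd\tX")
```

<p>A tree full of "symbolic link" entries would be consistent with a commit on master rather than on the adjusted unlocked branch.</p>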
comment 3 http://git-annex.branchable.com/bugs/git_fsck_duplicateEntries_errors_when_using_adjusted_branch/comment_3_2a6ef51c96a914be5d248b526ad37394/ jlebar 2018-07-17T05:12:03Z 2018-07-17T05:12:03Z
<p>Thanks a lot, Joey.</p>
<p>I wrote a script that replaces each directory or filename with a salted hash:</p>
<pre><code>#!/usr/bin/env python
import sys
import hashlib

def hash(s):
    m = hashlib.sha256()
    m.update(s)
    m.update('<secret>')
    return m.hexdigest()

for line in sys.stdin:
    print '/'.join(hash(p) for p in line.split('/'))
</code></pre>
<p>Then I ran</p>
<pre><code>$ git ls-files | python hash_paths.py | bzip2 > repo_paths.bz2 # attached
</code></pre>
<p>To make something you can correlate with the git fsck errors, I ran</p>
<pre><code>$ git fsck |& grep duplicateEntries | cut -f 4 -d ' ' | sed -e 's/://' | xargs -n1 git ls-tree | grep -v ' tree ' > blobs
$ paste <(cut -f 1 blobs) <(cut -f 2 blobs | python hash_paths.py) | bzip2 > fsck_errors.bz2 # attached
</code></pre>
<p>So the second column in fsck_errors is the salted+hashed filename like in the comment above. You should be able to correlate the "filenames" in fsck_errors with the paths in repo_paths.</p>
<p>I'll email you the relevant files.</p>
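<p>The correlation step could look roughly like the sketch below. It assumes both attachments have been decompressed and read into Python lists: one of hashed paths from repo_paths, one of hashed names from the second column of fsck_errors. Since <code>git ls-tree</code> was run without recursion, each name is a single path component, so the sketch matches it against the last component of each path. The function name <code>find_paths_for_names</code> is hypothetical.</p>

```python
def find_paths_for_names(hashed_paths, hashed_names):
    """Map each hashed file name to the hashed repo paths ending in it."""
    by_basename = {}
    for path in hashed_paths:
        basename = path.rsplit("/", 1)[-1]
        by_basename.setdefault(basename, []).append(path)
    # Names with no matching path map to an empty list.
    return {name: by_basename.get(name, []) for name in hashed_names}


# Short stand-ins for the real 64-character hex digests.
paths = ["d1/d2/f1", "d1/f2", "d3/f1"]
matches = find_paths_for_names(paths, ["f1", "zz"])
```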
comment 4 http://git-annex.branchable.com/bugs/git_fsck_duplicateEntries_errors_when_using_adjusted_branch/comment_4_f9588887f75c0d17b87f2718b0584a94/ joey 2018-07-17T15:04:09Z 2018-07-17T14:59:26Z
<p>Looking at those files, all the duplicated files
are located at least 2 subdirectories deep, and most more like 5-7
deep. But, almost all of the repo is inside such
subdirectories, so that is not conclusive.</p>
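<p>A depth comparison like this can be sketched in a few lines of Python. The sketch assumes a list of slash-separated paths (the hashed ones work too, since hex digests contain no slashes) and counts how many directories enclose each file. The name <code>depth_histogram</code> is made up for illustration.</p>

```python
from collections import Counter


def depth_histogram(paths):
    """Count paths by the number of directories that enclose them."""
    return Counter(path.count("/") for path in paths)


# Illustrative paths only.
histogram = depth_histogram(["a/b/X", "a/b/c/d/e/Y", "Z"])
```

<p>Comparing the histogram for the duplicated files against the one for the whole repository would show whether depth really sets them apart.</p>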
<p>Aha! I managed to reproduce it:</p>
<pre><code>joey@darkstar:~/tmp/t#master(unlocked)>mkdir ook
joey@darkstar:~/tmp/t#master(unlocked)>cd ook
joey@darkstar:~/tmp/t/ook#master(unlocked)>mkdir boop
joey@darkstar:~/tmp/t/ook#master(unlocked)>cd boop
joey@darkstar:~/tmp/t/ook/boop#master(unlocked)>mkdir beep
joey@darkstar:~/tmp/t/ook/boop#master(unlocked)>cd beep
joey@darkstar:~/tmp/t/ook/boop/beep#master(unlocked)>mkdir yeep
joey@darkstar:~/tmp/t/ook/boop/beep#master(unlocked)>cd yeep
joey@darkstar:~/tmp/t/ook/boop/beep/yeep#master(unlocked)>date > X
joey@darkstar:~/tmp/t/ook/boop/beep/yeep#master(unlocked)>git annex add
add X ok
(recording state in git...)
joey@darkstar:~/tmp/t/ook/boop/beep/yeep#master(unlocked)>git annex sync
commit
[adjusted/master(unlocked) fe11872] git-annex in joey@darkstar:~/tmp/t
1 file changed, 1 insertion(+)
create mode 100644 ook/boop/beep/yeep/X
ok
joey@darkstar:~/tmp/t/ook/boop/beep/yeep#master(unlocked)>git fsck
error in tree a5fcd5b3aa5189ed8916f025cf035fce74098a1a: duplicateEntries: contains duplicate file entries
Checking object directories: 100% (256/256), done.
</code></pre>
<p>Ok, it seems it's simply caused by deep paths in the tree and nothing
else. Will debug from here.</p>
comment 5 http://git-annex.branchable.com/bugs/git_fsck_duplicateEntries_errors_when_using_adjusted_branch/comment_5_a9d6cf6cd38bfa6f391beaf6831b0ef4/ joey 2018-07-17T18:42:00Z 2018-07-17T15:12:12Z
<p>I wrote a reproduction script, which worked several times, and then stopped
working. Now I can't reproduce it using that script. Some kind of race condition?
Only happens before coffee o-clock? I don't know. <img src="http://git-annex.branchable.com/smileys/ohwell.png" alt=":-/" /></p>
<p>The script:</p>
<pre><code>#!/bin/sh
sudo rm -rf /tmp/repo
git init /tmp/repo
cd /tmp/repo
git annex init
date > foo
git annex add foo
git annex sync
git annex upgrade
git annex adjust --unlock
mkdir -p ook/boop/beep/yeep
date > ook/boop/beep/yeep/x
git annex add
git annex sync
git fsck
</code></pre>
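<p>Since the failure seems intermittent, one way to measure how often it happens is to run the script repeatedly and scan the output for the fsck error. The Python sketch below is only an outline under assumptions: the script is saved under a hypothetical name such as <code>repro.sh</code>, the error line has the same shape as the one shown earlier in this thread, and the helper names are invented. Note that the script deletes /tmp/repo on each run.</p>

```python
import re
import subprocess

# Matches lines like:
#   error in tree <sha>: duplicateEntries: contains duplicate file entries
DUPLICATE_RE = re.compile(
    r"^error in tree ([0-9a-f]+): duplicateEntries:", re.MULTILINE
)


def trees_with_duplicates(fsck_output):
    """Return the tree ids that fsck flagged with duplicateEntries."""
    return DUPLICATE_RE.findall(fsck_output)


def count_failures(script_path, attempts=20):
    """Run the reproduction script repeatedly; count runs that hit the bug."""
    failures = 0
    for _ in range(attempts):
        result = subprocess.run(
            ["sh", script_path],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # git fsck reports errors on stderr
            text=True,
        )
        if trees_with_duplicates(result.stdout):
            failures += 1
    return failures


# Usage (not executed here): count_failures("repro.sh", attempts=50)
```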
<p>I also noticed that, once a tree gets duplicate entries in it, they are
carried forward into the new trees when other commits are made to that
directory. This may explain why fsck is finding so many trees to complain
about in your repository.</p>
<p>The commit made to the adjusted branch does not have a duplicate in the tree.
The reverse adjusted commit made to master does. So it must involve
adjustTree somehow.</p>
<p>I don't see anything likely to cause a race condition in adjustTree.
However, I do think that v6 mode has many bugs, some of which may be
race conditions, so perhaps the root cause is not adjustTree.</p>
<p>This seems likely related in some way to the bug
"add fails with v6 repo when four levels deep",
which I incidentally reproduced for the first time (and then fixed) while
trying again to reproduce this bug.</p>
comment 6 http://git-annex.branchable.com/bugs/git_fsck_duplicateEntries_errors_when_using_adjusted_branch/comment_6_cfefa7ac1f538566ee028aa3996e22cd/ joey 2021-03-02T22:09:45Z 2021-01-29T19:23:19Z
<p>An older bug, "git-annex wants to repair because of duplicateEntries in git fsck",
also had duplicate directory entries. It's not clear to me whether adjusted
branches were involved there.</p>
<p>I can no longer reproduce it with the script that worked for me for
a while.</p>