double rsync run and test failure

might be a flaky test/operation, happened while building in i386 buster environment:

neurodebian@smaug ..7.20190730+git103-g070fbd693-1~ndall+1 % grep -B20 -A1 FAIL git-annex_7.20190730+git103-g070fbd693-1~ndall+1_i386.build
             20 100%    0.00kB/s    0:00:00 (xfr#1, to-chk=0/1)
(checksum...) (recording state in git...)
ok
(recording state in git...)
OK (1.00s)
    move (ssh remote):                                    move foo (from origin...) 
(checksum...) ok                  
(recording state in git...)
move foo (to origin...) 
ok                                
(recording state in git...)
OK (0.99s)
    copy:                                                 copy foo (from origin...) 
SHA256E-s20--e394a389d787383843decc5d3d99b6d184ffa5fddeec23b911f9ee7fc8b9ea77
             20 100%    0.00kB/s    0:00:00 (xfr#1, to-chk=0/1)

SHA256E-s20--e394a389d787383843decc5d3d99b6d184ffa5fddeec23b911f9ee7fc8b9ea77
             20 100%    0.00kB/s    0:00:00 (xfr#1, to-chk=0/1)
failed
copy: 1 failed
FAIL (0.34s)
      Test.hs:530:

I have tried to rebuild (debian/rules binary) second time while in the same environment -- didn't succeed... but then I proceeded to build amd64 fine, and redid i386 build just fine

fixed though w/o a true root cause analysis --Joey

RSS Atom

comment 1

This is the same problem I tried to deal with in f27c5db5c566bdc0baae256b67df04a50027679f. Apparently not fully

Or at least the double run of rsync with no indication why it thought the first one failed appears to be the same. And IIRC I saw that sometimes without the permissions error and it still failed.

I was having to run git-annex test in a loop for an hour or so to reproduce it.

So, for some reason git-annex thinks the rsync failed, although it got to 100%. If that happens once, it retries automatically. The second time it thinks rsync has failed, it sees the file didn't get any larger, and so gives up. That makes sense, but I don't know how rsync would get to 100% and then apparently exit nonzero, without displaying any error message.

Comment by joey — Tue Aug 13 17:28:15 2019

Remove comment

comment 2

and git-annex cannot rely on the check that size of the receiving file has reached the target size?

Comment by yarikoptic — Tue Aug 13 18:15:36 2019

Remove comment

comment 3

No, it cannot in general rely on that. Consider a file transfer program that allocates the whole file full of 0 to start.

This needs to be reproduced in strace so we can see what it happening with rsync.

Comment by joey — Tue Aug 13 19:14:41 2019

Remove comment

comment 4

And even if we assume rsync never pre-allocates a file before receiving it, it probably does do some things at the end, like setting the final permissions and timestamp.

The permissions error looked like this:

get foo (from origin...)
SHA256E-s20--e394a389d787383843decc5d3d99b6d184ffa5fddeec23b911f9ee7fc8b9ea77
             20 100%    0.00kB/s    0:00:00  ^M             20 100%    0.00kB/s    0:00:00 (xfr#1, to-chk=0/1)
(from origin...)
SHA256E-s20--e394a389d787383843decc5d3d99b6d184ffa5fddeec23b911f9ee7fc8b9ea77
             20 100%    0.00kB/s    0:00:00  ^M             20 100%    0.00kB/s    0:00:00 (xfr#1, to-chk=0/1)
rsync: open "/home/joey/src/git-annex/.t/tmprepo1103/.git/annex/tmp/SHA256E-s20--e394a389d787383843decc5d3d99b6d184ffa5fddeec23b911f9ee7fc8b9ea77" failed: Permission denied (13)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1207) [sender=3.1.3]

That looked as if the first rsync had gotten as far as removing the write bit from the file. Interestingly, the second rsync seems to have received the whole (small) file content before ever trying to open the file for write.

The only recent change I can think of involving rsync was the CoW probing change, but I don't see how that could possibly lead to this behavior.

And I'm not sure this is a new problem. The test suite has been intermittently failing for several months, around this same point. The failure did not include any useful error message, so I could not debug it, and I have IIRC done a few things to try to get the test suite to display an error message. Perhaps I succeeded.

The intermittent test suite failure looks like this:

copy: [adjusted/master(unlocked) 05b89a6] empty
adjust ok
copy foo (from origin...)
SHA256E-s20--e394a389d787383843decc5d3d99b6d184ffa5fddeec23b911f9ee7fc8b9ea77
20 100% 0.00kB/s 0:00:00 20 100% 0.00kB/s 0:00:00 (xfr#1, to-chk=0/1)

SHA256E-s20--e394a389d787383843decc5d3d99b6d184ffa5fddeec23b911f9ee7fc8b9ea77
20 100% 0.00kB/s 0:00:00 20 100% 0.00kB/s 0:00:00 (xfr#1, to-chk=0/1)
failed
copy: 1 failed
FAIL (1.12s)

I am not sure if it only happens when testing adjusted unlocked branches/ v7 unlocked files.

I've run git-annex get; git-annex drop in a tight loop for thousands of iterations on an adjusted unlocked branch, and can't reproduce the failure that way.

I've made git-annex display rsync's exit status when it's not 0 or 1, it has a lot of other exit statuses, so perhaps that will help track down how it's failing.

Comment by joey — Thu Aug 15 18:00:42 2019

Remove comment

comment 5

I tested a git-annex made to say when the rsync process succeeded, and rsync is actually not failing.

So something in git-annex is silently making the transfer fail. Could be, for example, that it thinks the file was modified while it was being transferred.

Comment by joey — Thu Aug 15 21:18:20 2019

Remove comment

comment 6

It is indeed detecting that the file it was sending appears to have been modified after the download started.

Probably the file has not really been modified. Instead the inode cache for it is somehow wrong.

An instrumented run had cached:

InodeCache (InodeCachePrim 21765306 20 (MTimeHighRes 1565905801.196804558s)
InodeCache (InodeCachePrim 22158907 20 (MTimeHighRes 1565905801.196804558s))

And after the download the file it was sending had:

InodeCache (InodeCachePrim 21765305 20 (MTimeHighRes 1565905802.380791792s))

Note that the test suite moves the same file from origin, and then moves it back, and a while after that get fails. So at least one of the cached inodes is for the old copy of the file. The other one is probably the work tree copy.

Still have not had luck reproducing outside the test suite with a tight loop of move --from and --to and get.

Hypothesis: The test suite uses Annex.main for many of its runs of git-annex, and so sqlite does not do whatever it normally does atexit. If flushing a change to the db file gets deferred for whatever reason, a later call to Annex.main might see old information.

If so, converting the test suite to run git-annex instead of Annex.main would avoid the problem.

Comment by joey — Thu Aug 15 21:55:43 2019

Remove comment

comment 7

Confirmed; avoiding calling main has gotten the test suite to be able to run at least 500 iterations over 12 hours w/o failing. It had been failing in under 1 hour.

Comment by joey — Fri Aug 16 15:03:11 2019

Remove comment

Add a comment