Please describe the problem.
Ideally annex should detect all "paranormal" cases such as running on an NFS mounted partition, but according to https://git-annex.branchable.com/bugs/huge_multiple_copies_of___39__.nfs__42____39___and___39__.panfs__42____39___being_created/ it does not. Happily ignorant, we were running annex (5.20151116-g76139a9) on an NFS mounted partition until we filled up the 2TB of space allocated to us with .nfs* files. Well -- apparently according to the above we should have tried pidlock... trying it now but it doesn't work :-/
*$> git clone smaug:/tmp/123 123-clone && cd 123-clone && git config annex.pidlock true && echo 124 > 124.dat && git annex add 124.dat && git commit -m 'added 124' && git annex move --to=origin 124.dat
Initialized empty Git repository in /home/yhalchen/123-clone/.git/
remote: Counting objects: 22, done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 22 (delta 3), reused 0 (delta 0)
Receiving objects: 100% (22/22), done.
Resolving deltas: 100% (3/3), done.
total 1
1 123.dat@ 1 README.txt
(merging origin/git-annex into git-annex...)
(recording state in git...)
add 124.dat ok
(recording state in git...)
[master 0f1092a] added 124
1 files changed, 1 insertions(+), 0 deletions(-)
create mode 120000 124.dat
move 124.dat (checking origin...) git-annex: content is locked
$> echo $?
1
BTW running move in our old, now somewhat screwed up annex results in a differently expressed error: http://www.onerussian.com/tmp/2016-02-29.png
Looks to me like I fixed the reported problem 3 years ago, in 1a8ba7eab4677be18fef5d6e1c2e59607594c446.
That fix was never confirmed to work, but I see no reason why it wouldn't, so if it didn't please reopen. done
(As far as detecting nfs and enabling pidlock, there is https://git-annex.branchable.com/bugs/huge_multiple_copies_of___39__.nfs__42____39___and___39__.panfs__42____39___being_created/) --Joey
Oddly, I cannot reproduce this, although I can reproduce the behavior in http://git-annex.branchable.com/bugs/thread_blocked_indefinitely_in_an_STM_transaction__while_moving_within__a_local_clone/a
(smaug:/tmp/123 has permissions that do not let me access it.)
I've fixed the STM transaction bug. Need either more info to reproduce this bug, or you could test and see if it still occurs when git-annex is upgraded to ad888a6b760e8f9d31f8d99c51912bcdaa7fb0c1
If we could remove those .nfs* files, it would indeed not be that bad, but we can't.
smaug:/tmp/123 -- sorry about permissions but it is a regular annex nothing special, so the bug should show itself with other repos as well I think. I gave you access to it now and also there is /tmp/123.tar.gz archive of it just in case.
That ssh lock file is created by this code:
But, that does not ever actually take a lock on the file, so NFS should not make its .nfs thing in this case. Unless NFS does it when a FD is simply opened with close-on-exec set.
Can you get a strace of the creation of files under .git/annex/ssh/ that result in these .nfs things?
Backs up what I thought git-annex should be doing; it's not fcntl locking that file.
Ah, I'll bet it's not git-annex at all this time. It runs ssh with -S .git/annex/ssh/smaug, and ssh probably does its own locking around setting up that control socket.
If so, disabling annex.sshcaching will avoid the problem.
So -- shouldn't annex at least sense upon init whether the repo is under NFS? To do it in a platform-independent way, it could run a small check somewhere under .git/annex/tmp, so that if a .nfs* file gets generated, the repo is under NFS. Seems to work for me in a limited set of tests -- the assertion fails all the time under NFS.
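A minimal sketch of such a check, in Python rather than the snippet originally posted here. The function name `leaves_nfs_placeholder` and the `nfsprobe-` prefix are hypothetical, and the sketch assumes the NFS client renames a deleted-but-still-open file to a `.nfs*` name, which may not hold for every NFS configuration.

```python
import os
import shutil
import tempfile


def leaves_nfs_placeholder(parent):
    """Return True if deleting a still-open file leaves a .nfs* entry behind.

    Hypothetical helper: it assumes the NFS client "silly renames" such
    files to names starting with ".nfs", which is not guaranteed.
    """
    # Use a fresh subdirectory so that pre-existing .nfs* files are not counted.
    probe_dir = tempfile.mkdtemp(prefix="nfsprobe-", dir=parent)
    try:
        fd, path = tempfile.mkstemp(dir=probe_dir)
        try:
            os.unlink(path)  # delete the file while its descriptor is still open
            leftovers = os.listdir(probe_dir)
        finally:
            os.close(fd)  # closing lets the NFS client drop the placeholder
        return any(name.startswith(".nfs") for name in leftovers)
    finally:
        shutil.rmtree(probe_dir, ignore_errors=True)
```

It would be called with a scratch directory inside the repository, for example `leaves_nfs_placeholder(".git/annex/tmp")`. A False result only means no `.nfs*` placeholder was seen, not that the filesystem is definitely local.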
Looking at the strace finally...
Here's git-annex creating the dummy posix lock file, soon after the ssh socket is removed. (I think that's ssh removing the socket before, but the jumble of strace output is hard to follow here.)
And later ssh is told to stop using a related socket:
Which it seems does not exist by then:
Seems like this would have to be sshCleanup running, and enumSocketFiles seeing ".nfs00000000099d85d000000002" existing at that point due probably to the ssh/smaug socket being in the process of being removed.
So, it assumes it's a socket file and tries to clean it up, creating the dummy posix lock file. And posix lock files don't get deleted (it's impossible to do so in a race-free way), so this results in more and more .nfs*.lock files.
Need to find a way to prevent enumSocketFiles from listing these files. Seems that ".nfs00000000099d85d000000002" is probably the deleted form of the "smaug" socket, so it really is a socket file, so checking if a file is a socket won't help. Of course it could explicitly filter out ".nfs" prefixed files. That could be a case of kicking the can down the road, since NFS could always choose to use another name for these files.
Hmm.. Before a ssh socket file gets created, git-annex always locks the associated lock file. And as noted, the posix lock file is never removed. (On Windows the lock file is also created and never deleted.) So, if $file does not have an associated $file.lock then $file must not be a ssh socket, and enumSocketFiles can skip it.
I've added that check. Unable to test it because the NFS mount has been lost in the intervening time. @yoh can you test this fixed it?
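The rule described above can be sketched as follows. This is an illustration in Python, not the actual git-annex change (git-annex is written in Haskell); the function name `enum_socket_files` and the choice to skip the `.lock` files themselves are assumptions made for the example.

```python
import os


def enum_socket_files(ssh_dir):
    """List candidate ssh socket files in ssh_dir (hypothetical helper).

    A file only counts as a socket if a matching "<file>.lock" exists,
    since the lock file is created before the socket and never removed.
    """
    names = set(os.listdir(ssh_dir))
    return sorted(
        name
        for name in names
        # Assumption: lock files themselves are never treated as sockets.
        if not name.endswith(".lock") and name + ".lock" in names
    )
```

With a directory holding `smaug`, `smaug.lock` and `.nfs00000000099d85d000000002`, only `smaug` is listed, so no new lock file would be created for the `.nfs` entry.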
Re detecting NFS, I'm not sure that ".nfs" files are always used for deleted files by all NFS configurations. And, there are probably NFS configurations that do properly support Posix locks, and others that don't. So connecting a check for ".nfs" files with turning on annex.pidlock seems problematic.
If we had a good way to detect systems that don't support Posix locks, annex.pidlock could be auto-enabled. But for some reason embedding a large portion of a Posix test suite into git-annex does not fill me with joy.