Please describe the problem.
happens while running git annex get -J5 .
on http://datasets.datalad.org/labs/hasson/narratives/derivatives/fmriprep/.git
The filesystem is that NFS-mounted Lustre where I thought we had resolved all the issues (spotted by annex test, yet to redo).
The entire run failed with get: 128 failed
It is consistent. I tried another dataset, http://datasets.datalad.org/labs/hasson/narratives/derivatives/freesurfer/.git/, and after one crashed run (Kerberos key timeout, leading to other errors) started a new annex get -J5,
for which the full log is at http://www.onerussian.com/tmp/annex-get-freesurfer-5.log and which ends with get: 89 failed
What version of git-annex are you using? On what operating system?
The conda linux nodep (standalone) build 8.20211012-geb95ed486, and then another standalone build (Debian, extracted on that CentOS) 8.20211117+git14-ge1f38b9dd
done provisionally, followup if further testing shows otherwise --Joey
I think that this error comes from Utility.LockFile.PidLock.tryLock, which has the only getFileStatus involving the pidlock whose exceptions are not caught. The file is assumed to exist since it was just created, and normally nothing deletes it.
While looking at where this might come from, I refreshed my memory of how Lustre can do insane stuff like having 2 different files with the same name in a directory. checkInsaneLustre tries to deal with that by deleting one of them, but since this is all behavior undefined by POSIX, maybe that sometimes deletes both of them. Or the file doesn't appear after being created for some other POSIX-defying reason.
I've changed it to catch exceptions from that getFileStatus, which will test this theory.
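To illustrate the idea only: the sketch below is not git-annex's Haskell code, and the names try_lock and stat_just_created are hypothetical. It shows, in Python, a lock attempt that creates the lock file exclusively and then stats it, treating a failed stat as "lock not taken" rather than letting the exception propagate.

```python
import os


def stat_just_created(path):
    """Stat a file we believe we just created; return None if the stat fails."""
    try:
        return os.stat(path)
    except OSError:
        # On a POSIX-conforming filesystem this should not happen for a
        # freshly created file. Report it to the caller instead of crashing.
        return None


def try_lock(lockfile):
    """Hypothetical pid-lock attempt: True if the lock was taken, else False."""
    try:
        # O_EXCL makes creation fail if the lock file already exists.
        fd = os.open(lockfile, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    os.close(fd)
    # If the file has vanished between creation and stat, give up on the lock.
    return stat_just_created(lockfile) is not None
```

With this shape, a lock file that disappears right after creation results in a failed lock attempt that the caller can retry, rather than an uncaught error that aborts the whole run.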
8.20211231+git69-ga55fc567c-1~ndall+1
-- thank you!