Please describe the problem.
IMHO `git -c annex.retry=5 annex get -J...` should do its best (retrying up to 5 times per file) to obtain content. It is really not pragmatic to demand looping on top of such a call just to ensure that we eventually get all the content.
Possibly relevant older issues/todos I ran into:

- https://git-annex.branchable.com/forum/is_there_a_way_to_automatically_retry_when_special_remotes_fail63/
- https://git-annex.branchable.com/todo/more_extensive_retries_to_mask_transient_failures/
What steps will reproduce the problem?
I cloned http://datasets.datalad.org/labs/hasson/narratives/derivatives/freesurfer/.git and ran

    git annex get -J5 . 2>&1 | tee /tmp/annex-get-freesurfer-patched-annex-1.log

even though retries were already configured:

    [d31548v@discovery7 freesurfer]$ git config annex.retry
    3

That log file is available from http://www.onerussian.com/tmp/annex-get-freesurfer-patched-annex-1.log . The call ended with `get: 2948 failed`.
I dumped the output of `git annex list` into a file and then ran

    grep '^_' /tmp/annex-get-freesurfer-patched-annex-1-listafter.out | awk '{print $2;}' | xargs git -c annex.retry=10 annex --debug get -J5 2>&1 | tee /tmp/annex-get-freesurfer-patched-annex-1-getabsent.log

The log for that run is at http://www.onerussian.com/tmp/annex-get-freesurfer-patched-annex-1-getabsent.log and it finished with `get: 130 failed` -- so clearly the original failures were not a real inability to fetch the content.
What version of git-annex are you using? On what operating system?
    [d31548v@discovery7 git-annexes]$ git annex version
    git-annex version: 8.20211123+git21-ga6699be79-1~ndall+1
    build flags: Assistant Webapp Pairing Inotify DBus DesktopNotify TorrentParser MagicMime Feeds Testsuite S3 WebDAV
    dependency versions: aws-0.22 bloomfilter-2.0.1.0 cryptonite-0.26 DAV-1.3.4 feed-1.3.0.1 ghc-8.8.4 http-client-0.6.4.1 persistent-sqlite-2.10.6.2 torrent-10000.1.1 uuid-1.3.13 yesod-1.6.1.0
    key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 BLAKE2B256E BLAKE2B256 BLAKE2B512E BLAKE2B512 BLAKE2B160E BLAKE2B160 BLAKE2B224E BLAKE2B224 BLAKE2B384E BLAKE2B384 BLAKE2BP512E BLAKE2BP512 BLAKE2S256E BLAKE2S256 BLAKE2S160E BLAKE2S160 BLAKE2S224E BLAKE2S224 BLAKE2SP256E BLAKE2SP256 BLAKE2SP224E BLAKE2SP224 SHA1E SHA1 MD5E MD5 WORM URL X*
    remote types: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar git-lfs httpalso borg hook external
    operating system: linux x86_64
    supported repository versions: 8
    upgrade supported from repository versions: 0 1 2 3 4 5 6 7
This is on that "fancy" NFS-mounted Isilon on a local HPC (a "POSIX" mount this time).

    [d31548v@discovery7 freesurfer]$ git config annex.pidlock
    true

per my setting in ~/.gitconfig (which also has annex.retry = 3).
It seems that every failure in that log is preceded by "transfer already in progress, or unable to take transfer lock". It does not make sense to retry in such a case, because the transfer will probably still be in progress and the retry would fail too.

So I don't think this really involves retrying (which certainly does support S3). The question is why several transfers are failing this way. My suspicion would be something to do with pid locking.
And indeed, I can reproduce the same behavior in a local repo getting from another local repo, once I set annex.pidlock.
What's happening is that Utility.LockFile.PidLock.tryLock calls trySideLock, and in these failure cases that returns Nothing, because another thread also happens to have taken the side lock. Due to the concurrency, the pid lock file always already exists, so linkToLock fails. When the thread is able to take the side lock, it treats this as a stale pid lock file situation and takes over the pid lock.

(This could potentially also affect locks other than transfer locks.)
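To make the failure mode concrete, here is a much-simplified model of it in plain Haskell (not the actual git-annex code; the names are invented analogues of trySideLock and linkToLock): two threads race for the side lock, the winner takes over the "stale" pid lock, and the loser fails outright instead of waiting.

    import Control.Concurrent (forkIO)
    import Control.Concurrent.MVar

    data LockResult = TookOverPidLock | FailedToLock deriving Show

    -- The pid lock file already exists (this process created it), so the
    -- linkToLock step always fails and everything hinges on the side lock.
    tryPidLock :: MVar () -> IO LockResult
    tryPidLock sideLock = do
        gotSide <- tryTakeMVar sideLock   -- trySideLock analogue, non-blocking
        pure $ case gotSide of
            Just () -> TookOverPidLock    -- treats the pid lock file as stale
            Nothing -> FailedToLock       -- another thread holds the side lock

    main :: IO ()
    main = do
        sideLock <- newMVar ()
        done <- newEmptyMVar
        let worker n = do
                r <- tryPidLock sideLock
                putStrLn ("thread " ++ show n ++ ": " ++ show r)
                putMVar done ()
        mapM_ (forkIO . worker) [1, 2 :: Int]
        mapM_ (\_ -> takeMVar done) [1, 2 :: Int]

Run with runghc, one thread should print TookOverPidLock and the other FailedToLock, which is the same shape as the "unable to take transfer lock" failures in the log.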
One way to solve it would be to use a LockPool to take the side lock; then multiple concurrent threads could all hold the side lock.

Or it could special-case the situation where the pid in the pid lock file is the same as the current pid, and handle that in the lock file takeover code.
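For the second option, a rough sketch of the pid comparison (hypothetical helper names, and an assumed simplified lock file whose first word is the pid; the real pid lock file records more than that):

    import System.Posix.Process (getProcessID)
    import System.Posix.Types (ProcessID)
    import Text.Read (readMaybe)

    -- Read the pid recorded in a (simplified) pid lock file.
    readLockedPid :: FilePath -> IO (Maybe ProcessID)
    readLockedPid lockfile = do
        s <- readFile lockfile
        pure $ case words s of
            (w:_) -> fromInteger <$> (readMaybe w :: Maybe Integer)
            []    -> Nothing

    -- True when the pid lock file was written by this very process, so
    -- another thread of this process already holds the lock and the
    -- takeover code can treat it as held rather than stale.
    pidLockHeldByThisProcess :: FilePath -> IO Bool
    pidLockHeldByThisProcess lockfile = do
        me <- getProcessID
        p  <- readLockedPid lockfile
        pure (p == Just me)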
Now, annex.pidlock is supposed to be a big top-level lock, which is used instead of the fine-grained locking. What if two threads each want to take a lock before operating on the same resource? If either of the solutions above is implemented, then both threads will "succeed" at locking even though it's a shared pidlock. Which could result in all kinds of undefined behavior.

And in all the cases where it does not fail to take the transfer lock, but instead takes over the pid lock, we're perilously close to exactly that happening! The only reason it's not a problem in the case of transfers is that OnlyActionOn is used, which prevents two threads from transferring the same key. (It's also used for dropping, etc.)
But this makes me worry that using annex.pidlock with concurrency enabled is flirting with disaster, and perhaps it should refuse to use concurrency to avoid such potential problems. Unless there's a way to avoid them entirely.
Hmm... Normally all locking in git-annex uses LockPool, which handles locking both between processes and between threads of the same process. If LockPool is used, one of those two threads will win at the lock and the other one will wait for it. But Annex.LockPool.PosixOrPid.tryPidLock/pidLock do not use LockPool for fine-grained locking when pid locking is enabled.

So I think it's potentially buggy in the unsafe direction of letting two threads "lock" the same resource, as well as in the safe direction of sometimes unnecessarily failing. Both need to be fixed.

I am leaning toward a process taking the pid lock only once and holding it until exit, with LockPool used to make that thread-safe, and adding fine-grained locking via LockPool when doing pid locking.
Utility.LockPool.PidLock does use a LockPool for the pid lock already, but with LockShared, which seems rather pointless since it will never make any thread block and multiple threads can still try to make the pid lock file at the same time.
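For the fine-grained part, here is a minimal sketch of the idea (this is not git-annex's LockPool, just an illustration): an in-process STM table keyed by lock file path gives threads mutual exclusion even while the coarse pid lock is shared by the whole process.

    import Control.Concurrent.STM
    import qualified Data.Set as S

    type FineGrainedPool = TVar (S.Set FilePath)

    newPool :: IO FineGrainedPool
    newPool = newTVarIO S.empty

    -- Non-blocking: succeed only if no other thread of this process
    -- currently holds this lock file.
    tryLockFine :: FineGrainedPool -> FilePath -> IO Bool
    tryLockFine pool f = atomically $ do
        held <- readTVar pool
        if f `S.member` held
            then pure False
            else do
                writeTVar pool (S.insert f held)
                pure True

    -- Blocking: wait until the lock file is free, then take it.
    lockFine :: FineGrainedPool -> FilePath -> IO ()
    lockFine pool f = atomically $ do
        held <- readTVar pool
        if f `S.member` held
            then retry
            else writeTVar pool (S.insert f held)

    unlockFine :: FineGrainedPool -> FilePath -> IO ()
    unlockFine pool f = atomically $ modifyTVar' pool (S.delete f)

Only the bookkeeping between threads is shown here; the pid lock file handling itself is unchanged by it.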
I've fixed the transfer locking problem. Also, drop -J sometimes failed due to a similar locking problem; that should be fixed too.
Still want to add the fine-grained STM locking when doing pid locking, as discussed above.
I noticed that there is a FD leak when pidlock is enabled: for about 1 in 10 files processed, a handle to ".git/annex/pidlock.lck" is left open. This is not a regression caused by my fix above; the bug is in older git-annex too.
Also, I have the fine-grained locking implemented in the pidlockfinegrained branch now, but am waiting to merge it until it has been tested and the above problem is resolved.
I gave the 8.20211123+git32-g9c0f3d1de-1~ndall+1_amd64 build a test run on that system, and it performed nicely (it did get all files in my limited `get -J5` attempt on that dataset). Will be waiting for the ultimate solution to try.

Unfortunately, the more I dig into pid locking, the more deep problems I'm finding...
When the pid lock's LockHandle is dropped, it drops the pid lock, and that happens even if there is another LockHandle for the pid lock, which is possible since it's a shared lock. So the pid lock may go away while a thread is still operating as if it were present. I think the pid lock probably needs to stop using the LockPool and use a single LockHandle that is ref counted.
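A minimal sketch of what such ref counting could look like (invented names, not git-annex's actual LockHandle): the action that removes the pid lock file only runs when the last reference is released.

    import Control.Concurrent.STM

    data RefCountedLock = RefCountedLock
        { refCount :: TVar Int
        , dropLock :: IO ()   -- action that actually removes the pid lock file
        }

    -- Wrap a freshly taken lock with an initial reference count of 1.
    newRefCounted :: IO () -> IO RefCountedLock
    newRefCounted dropAction = do
        n <- newTVarIO 1
        pure (RefCountedLock n dropAction)

    -- Take another reference to an already-held lock.
    retain :: RefCountedLock -> IO ()
    retain l = atomically $ modifyTVar' (refCount l) (+ 1)

    -- Release one reference; only the final release drops the real lock.
    release :: RefCountedLock -> IO ()
    release l = do
        isLast <- atomically $ do
            n <- readTVar (refCount l)
            writeTVar (refCount l) (n - 1)
            pure (n - 1 <= 0)
        if isLast then dropLock l else pure ()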
`getLockStatus` and `checkSaneLock` look at the status of the pid lock, but not yet at the fine-grained STM lock status. And as implemented, I think they never worked at all, since they check for posix locks on the pid lock file.

Unfortunately, 66b2536ea0aa5c88f5b744eeace3322a8a4a10b6 broke shared locking when pid locks are not used, causing a FD leak, and had to be reverted. Which further breaks concurrent pid locking.
At this point, when -J is used with pidlock, it may drop the lock but still think it holds it internally. Comment #7 found another way that could happen, and the conclusion is that the pid lock must migrate away from using the lock pool.
Update: All problems noted above are fixed, fine-grained locking is merged, and it seems to work well!