annex.pidlock can cause git-annex get -J to fail to take transfer lockyohhttp://git-annex.branchable.com/projects/datalad/bugs-done/annex_get_should_retry_failed_downloads_from_S3/git-annexikiwiki2023-01-05T17:30:31Zcomment 1http://git-annex.branchable.com/projects/datalad/bugs-done/annex_get_should_retry_failed_downloads_from_S3/comment_1_4de90238d57de0d49b75418411135dbc/joey2023-01-05T17:30:31Z2021-12-01T16:39:05Z
<p>It seems that every failure in that log is preceeded by "transfer already
in progress, or unable to take transfer lock".</p>
<p>It does not make sense to retry in such a case because the transfer will
probably still be in progress so the retry would fail.</p>
<p>So I don't think this really involves retrying (which certianly supports
S3). The question is why are several transfers failing this way.
My suspicion would be something to do with pid locking.</p>
<p>And indeed, I can reproduce the same behavior in a local repo getting from
another local repo, once I set annex.pidlock.</p>
comment 2http://git-annex.branchable.com/projects/datalad/bugs-done/annex_get_should_retry_failed_downloads_from_S3/comment_2_5da78a05b781029fa7f0f9d8ead7e093/joey2023-01-05T17:30:31Z2021-12-01T17:05:14Z
<p>What's happening is Utility.LockFile.PidLock.tryLock
calls trySideLock and in these failure cases,
that returns Nothing, because another thread also happens to
have taken the sidelock.</p>
<p>Due to the concurrency, the pid lock file always already exists, so
linkToLock fails. When it is able to take the side lock, it treats this as
a stale pid lock file situation, and takes over the pid lock.</p>
<p>(This could also affect other locks than transfer locks, potentially.)</p>
<p>One way to solve it would be to use a LockPool instead to take the side
lock. Then multiple concurrent threads could all lock the side lock.</p>
<p>Or, it could special case when the pid in the pid lock file is the same as
the current pid, and handle it in the lock file take over code.</p>
<hr />
<p>Now, annex.pidlock is supposed to be a big top-level lock, which is used
instead of the fine-grained locking. What if 2 threads are each wanting
to take a lock before operating on the same resource? If either of the
solutions above is implemented, then both threads will "succeed" at locking
even though it's a shared pidlock. Which could result in any undefined
behavior.</p>
<p>And, in all the cases where it does <em>not</em> fail to take the transfer lock,
but instead takes over the pid lock, we're perilously close to such a thing
happening! The only reason it's not a problem, in the case of transfers
is that OnlyActionOn is used and prevents two threads transferring the same
key. (Also used for dropping etc.)</p>
<p>But this makes me worry that using annex.pidlock with concurrency enabled
is flirting with disaster, and perhaps it should refuse to use concurrency
to avoid such potential problems. Unless there's a way to avoid them
entirely.</p>
<p>Hmm.. Normally all locking in git-annex uses LockPool, which
handles inter-process locking. If LockPool is used, one of those 2 threads
will win at the lock and the other one will wait for it. But,
Annex.LockPool.PosixOrPid.tryPidLock/pidLock do not use LockPool
for fine-grained locking when pid locking is enabled.</p>
<p>So, I think it's potentially buggy in the unsafe direction of letting
2 threads "lock" the same resource, as well as in the safe direction of
sometimes unncessarily failing. Both need to be fixed.</p>
<p>I am leaning toward a process only taking the pid lock once and holding it
until exit, with LockPool used to make that thread safe. And add fine grained
locking using LockPool when doing pid locking.</p>
<p>Utility.LockPool.PidLock does use a LockPool for the pid lock
already, but with LockShared, which seems rather pointless since it will never make
any thread block and multiple threads can still try to make the pid lock
file at the same time.</p>
comment 3http://git-annex.branchable.com/projects/datalad/bugs-done/annex_get_should_retry_failed_downloads_from_S3/comment_3_b279da2a2db1ba5a24a76a9112e691aa/joey2023-01-05T17:30:31Z2021-12-01T20:25:18Z
<p>I've fixed the transfer locking problem. Also drop -J did sometimes fail
due to a similar locking problem and should be fixed.</p>
<p>Still want to add the fine-grained STM locking when doing pid locking,
as discussed above.</p>
comment 4http://git-annex.branchable.com/projects/datalad/bugs-done/annex_get_should_retry_failed_downloads_from_S3/comment_4_089f1d2d7de435a98dc1098ebf1cf3df/yarikoptic2023-01-05T17:30:31Z2021-12-02T13:05:55Z
Just a note that I believe locking changes broke windows build, can see log <a href="https://github.com/datalad/git-annex/runs/4390439197?check_suite_focus=true#step:18:205">here</a>
comment 5http://git-annex.branchable.com/projects/datalad/bugs-done/annex_get_should_retry_failed_downloads_from_S3/comment_5_3b840d50a1ecc9cb6e161e3c039fe332/joey2023-01-05T17:30:31Z2021-12-03T18:07:56Z
Indeed, fixed that.
comment 6http://git-annex.branchable.com/projects/datalad/bugs-done/annex_get_should_retry_failed_downloads_from_S3/comment_6_2c83482ead69a902c99fca3ae63dfc34/joey2023-01-05T17:30:31Z2021-12-03T20:37:18Z
<p>I noticed that there is a FD leak when pidlock is enabled.
About 1 file in 10 that is processed,
there is a handle to ".git/annex/pidlock.lck" that is left open.
This is not a regression caused by my fix above, bug is in older git-annex
too.</p>
<p>Also, I have the fine-grained locking implemented in the pidlockfinegrained
branch now, but am waiting to merge it for testing and to resolve the above
problem first.</p>
comment 7http://git-annex.branchable.com/projects/datalad/bugs-done/annex_get_should_retry_failed_downloads_from_S3/comment_7_51f6745c4abce7b3cdfcf1c5998f8329/yarikoptic2023-01-05T17:30:31Z2021-12-03T21:41:53Z
Great - thank you! FWIW I gave <code>8.20211123+git32-g9c0f3d1de-1~ndall+1_amd64</code> build a test run on that system, and it performed nicely (did get all files in my limited <code>get -J5</code> attempt on that dataset). Will be waiting for the ultimate solution to try.
comment 7http://git-annex.branchable.com/projects/datalad/bugs-done/annex_get_should_retry_failed_downloads_from_S3/comment_7_d8ccfe3fe0d5123c1065ace73b623b44/joey2023-01-05T17:30:31Z2021-12-03T22:35:16Z
<p>Unfortunately the more I dig into pid locking, the more deep problems I'm
finding..</p>
<p>When the pid lock's LockHandle is dropped, it drops the pid lock, and that
happens even if there is another LockHandle for the pid lock, which is
possble since it's a shared lock. So the pid lock may go away despite
a thread is still operating as if it is present. I think probably the pid
lock needs to stop using the LockPool and use a single LockHandle, which is
ref counted.</p>
<p><code>getLockStatus</code> and <code>checkSaneLock</code> look at the status of the pid lock,
but not yet at the fine-grained STM lock status. And as implemented,
I think they never worked at all, since they check for posix locks on the
pid lock file.</p>
comment 9http://git-annex.branchable.com/projects/datalad/bugs-done/annex_get_should_retry_failed_downloads_from_S3/comment_9_df98c47016e79f15218880a009e671ea/joey2023-01-05T17:30:31Z2021-12-06T16:36:45Z
<p>Unfortunately, <a href="http://source.git-annex.branchable.com/?p=source.git;a=commitdiff;h=66b2536ea0aa5c88f5b744eeace3322a8a4a10b6">66b2536ea0aa5c88f5b744eeace3322a8a4a10b6</a>
broke shared locking when pid locks are not used, causing a FD leak,
and had to be reverted. Which further breaks concurrent pid locking.</p>
<p>At this point, when -J is used with pidlock, it may drop the lock but
still think it has it held internally. Comment #7 found another way that
could happen, and the conclusion is that the pid lock must migrate away
from using the lock pool.</p>
comment 10http://git-annex.branchable.com/projects/datalad/bugs-done/annex_get_should_retry_failed_downloads_from_S3/comment_10_1cbef11391d23210230872a75f8e5414/joey2023-01-05T17:30:31Z2021-12-06T19:11:20Z
<p>Update: All problems noted above are fixed, fine-grained locking is merged,
and it seems to work well!</p>