Please describe the problem.
Well, I remember complaining about this before. It causes notable inconvenience when trying to automate use of annex: I am left with no choice but to re-get files multiple times, without knowing what actually caused the failures in the first place. Originally I observed it while content had to be wget'ed, so I thought "oh well, might be some connection overload". But now I have tried it on a repository which is local to that drive, where there should be no problem at all accessing multiple files at once. On one run with -J5 I got error: 613, ok: 255; on the next call error: 415, ok: 198, and so on. It converges eventually, but it shouldn't be this hard. I suspect a race condition in annex itself is preventing correct operation.
What steps will reproduce the problem?
Run git annex get -J10 on a reasonably sized annex, possibly with the original annex to fetch content from located on the same drive.
What version of git-annex are you using? On what operating system?
6.20170408+gitg804f06baa-1~ndall+1
Please provide any additional information below.
$> git annex get -J5 --json | grep -v '"success":true' 2>&1 | head
{"command":"get","wanted":[{"here":false,"uuid":"3db23446-8c40-441e-97ec-55ffc86b4fc0","description":"yoh@smaug:~/proj/datalad/datalad/.git/travis-ci/origin-annex [origin]"}],"note":"Try making some of these repositories available:\n\t3db23446-8c40-441e-97ec-55ffc86b4fc0 -- yoh@smaug:~/proj/datalad/datalad/.git/travis-ci/origin-annex [origin]\n","skipped":[],"success":false,"key":"SHA256E-s328--c2eb8088cdc71a0d4cbd660312bef5421a47ce7da3655efdb17712e7188be4a1.txt.gz","file":"3728/3728.9-None.txt.gz"}
{"command":"get","wanted":[{"here":false,"uuid":"3db23446-8c40-441e-97ec-55ffc86b4fc0","description":"yoh@smaug:~/proj/datalad/datalad/.git/travis-ci/origin-annex [origin]"}],"note":"Try making some of these repositories available:\n\t3db23446-8c40-441e-97ec-55ffc86b4fc0 -- yoh@smaug:~/proj/datalad/datalad/.git/travis-ci/origin-annex [origin]\n","skipped":[],"success":false,"key":"SHA256E-s62840--c20189a229fac622bb781650af394cf40367b5563a833885480f
825fdbf29b47.txt.gz","file":"3729/3729.1-0.txt.gz"}
...
$> git annex get --key SHA256E-s328--c2eb8088cdc71a0d4cbd660312bef5421a47ce7da3655efdb17712e7188be4a1.txt.gz
get SHA256E-s328--c2eb8088cdc71a0d4cbd660312bef5421a47ce7da3655efdb17712e7188be4a1.txt.gz (from origin...)
SHA256E-s328--c2eb8088cdc71a0d4cbd660312bef5421a47ce7da3655efdb17712e7188be4a1.txt.gz
328 100% 0.00kB/s 0:00:00 (xfr#1, to-chk=0/1)
(checksum...) ok
(recording state in git...)
Let's not diagnose a concurrency problem prematurely.
I've run many concurrent gets and never seen masses of failures, or indeed any failures without either an error message explaining why it failed, or something obviously wrong (like the drive not being mounted).
The json output is making it harder than necessary to understand what's going on. It seems you should easily be able to reproduce the same problem without --json.
didn't need to go far ;)
FWIW, the repository in question is this one: http://datasets.datalad.org/devel/travis-buildlogs/.git/
That looks like concurrent git config calls setting remote.origin.annex-uuid are failing. I have not reproduced the .git/config error, but with a local clone of a repository, I have been able to reproduce some intermittent "transfer already in progress, or unable to take transfer lock" failures with git annex get -J5, happening after remote.origin.annex-uuid has been cached. So, two distinct bugs I think.
Debugging, the lock it fails to take always seems to be the lock on the remote side, which points to the local clone being involved somehow.
Debugging further, Utility.LockPool.STM.tryTakeLock is what's failing. That's supposed to only fail when another thread holds a conflicting lock, but as it's implemented with orElse, if the main STM transaction retries due to other STM activity on the same TVar, it will give up when it shouldn't.

That's probably why this is happening under heavier concurrency loads; it makes that failure case much more likely. And with a local clone, twice as much locking is done.
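To make the failure mode concrete, here is a minimal sketch, not git-annex's actual Utility.LockPool.STM code, assuming a simplified lock pool held in a single TVar; all the names are stand-ins.

import Control.Concurrent.STM
import qualified Data.Map as M

data LockMode = LockShared | LockExclusive deriving Eq
type LockPool = TVar (M.Map FilePath LockMode)

conflicts :: LockMode -> LockMode -> Bool
conflicts LockShared LockShared = False
conflicts _ _ = True

-- Blocking variant: retries until no conflicting lock is held.
waitTakeLock :: LockPool -> FilePath -> LockMode -> STM ()
waitTakeLock pool file mode = do
    m <- readTVar pool
    case M.lookup file m of
        Just held | conflicts held mode -> retry
        _ -> writeTVar pool (M.insert file mode m)

-- Fragile pattern: orElse falls through to Nothing on *any* retry coming
-- out of waitTakeLock. In this tiny example the only retry is a genuine
-- conflict, but in a larger transaction, or one touching more shared
-- state, other retries get misreported as "lock already held".
tryTakeLockFragile :: LockPool -> FilePath -> LockMode -> STM (Maybe ())
tryTakeLockFragile pool file mode =
    (waitTakeLock pool file mode >> return (Just ()))
        `orElse` return Nothing

-- More robust: inspect the pool directly, and report failure only when a
-- conflicting lock really is held.
tryTakeLock :: LockPool -> FilePath -> LockMode -> STM (Maybe ())
tryTakeLock pool file mode = do
    m <- readTVar pool
    case M.lookup file m of
        Just held | conflicts held mode -> return Nothing
        _ -> do
            writeTVar pool (M.insert file mode m)
            return (Just ())

The essential point is that success or failure is decided from the pool contents themselves, so only a genuinely conflicting lock makes tryTakeLock return Nothing.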
I've fixed this part of it!
The concurrent git config part remains. Since git-annex can potentially have multiple threads running different git config commands for their own reasons concurrently, it seems it will need to add its own locking around that.

The concurrent .git/config access happens because the remote list is only generated on demand, and when running with -J nothing demands it until all the threads have been spun up; each thread then has its own state, so each one generates the remote list.
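For the in-process part, a minimal sketch of serializing git config writes behind a shared mutex could look like the following; ConfigLock and setGitConfig are hypothetical names, not git-annex's real API.

import Control.Concurrent.MVar
import System.Process (callProcess)

newtype ConfigLock = ConfigLock (MVar ())

newConfigLock :: IO ConfigLock
newConfigLock = ConfigLock <$> newMVar ()

-- Only one thread at a time gets to rewrite .git/config; the others
-- block here until the previous git config invocation has finished.
setGitConfig :: ConfigLock -> String -> String -> IO ()
setGitConfig (ConfigLock lock) key value =
    withMVar lock $ \_ -> callProcess "git" ["config", key, value]

Note that this only serializes threads inside one git-annex process; it does nothing to protect .git/config against writes from other processes.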
There don't look to be any other git config settings that would cause problems for concurrency, apart from the ones run while generating the remote list.
So, generating the remote list before starting concurrent threads would avoid that problem, and would also lead to a slightly faster startup, since the remote git config only has to be read once, etc.
The only risk in doing that would be if generating a Remote opens some kind of handle, which can't be used concurrently, or is less efficient if only opened once and then used by multiple threads.
I've audited all the Remote.*.gen methods, and they all seem ok. For example, Remote.External.gen sets up a worker pool of external special remote processes, and new ones are allocated as needed. And Remote.P2P.chainGen sets up a connection pool.
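As an illustration of that ordering, here is a minimal sketch using the async package, with hypothetical stand-ins (Remote, remoteList, getKey are placeholders, not git-annex's real types): the remote list is forced once in the main thread, and the -J workers simply share it.

import Control.Concurrent.Async (forConcurrently_)

data Remote = Remote { remoteName :: String }   -- placeholder type
type Key = String                               -- placeholder type

-- Stand-in for the on-demand remote list: reads remote git configs once.
remoteList :: IO [Remote]
remoteList = return [Remote "origin"]

-- Stand-in for transferring one key from whichever remote has it.
getKey :: [Remote] -> Key -> IO ()
getKey _ key = putStrLn ("get " ++ key)

main :: IO ()
main = do
    remotes <- remoteList        -- forced once, before any worker starts
    let keys = ["key1", "key2", "key3"]
    forConcurrently_ keys (getKey remotes)   -- concurrent -J style workers

Because the list is built before any worker thread exists, no two threads end up running git config against .git/config at the same time.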
Ok, gone ahead with this fix.
Thank you Joey! It seems to work very nicely now! Not a single one lost out of 1550!