Please describe the problem.
For an unclear reason, some files/keys fail to transfer when running get in parallel.
What steps will reproduce the problem?
In my case it is getting files from the datalad-archives special remote which fetches the tarball key, extracts, and copies for annex.
I've implemented locking in datalad-archives around getting the tarball key and extracting the archive, so now we can run get -JX and it generally works. But when there are lots of files in the tarball, the transfer fails for some of them (seemingly up to X of them, the X from -JX).
git annex simply reports
$> grep already git-annex-getJ5-5.log
{"command":"get","wanted":[{"here":false,"uuid":"79080a38-0e94-4a0a-bd89-9022eada547b","description":"yoh@smaug:/mnt/btrfs/datasets/datalad/crawl/crcns/aa-1"},{"here":false,"uuid":"895b9a07-6613-4c8a-95ae-280d8119475c","description":"[datalad-archives]"}],"note":"transfer already in progress, or unable to take transfer lock\nUnable to access these remotes: datalad-archives\nTry making some of these repositories available:\n\t79080a38-0e94-4a0a-bd89-9022eada547b -- yoh@smaug:/mnt/btrfs/datasets/datalad/crawl/crcns/aa-1\n \t895b9a07-6613-4c8a-95ae-280d8119475c -- [datalad-archives]\n","skipped":[],"success":false,"key":"MD5E-s1001--2cd1bc42ddd745e7d5c00edb07d8c9d4","file":"MLd_cells/gg0304_6_A/conspecific/spike20"}
.... 4 more
$> find -lname */MD5E-s1001--2cd1bc42ddd745e7d5c00edb07d8c9d4
./MLd_cells/gg0304_6_A/conspecific/spike20
So it is not that the key is used for multiple files (I remember we had that before, so I checked for that first). According to the logs (git-annex and datalad), that key is not even passed from annex to our special remote, so somehow it freaks out and skips it.
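For anyone wanting to pull all such failures out of a log, here is a minimal sketch that does the same job as the grep above but prints only the key and file. It assumes the log holds one JSON record per line with the `note`, `success`, `key` and `file` fields seen in the record above; the function name is made up for illustration, and the second sample record is invented to show a non-matching line.

```python
import json


def failed_transfer_lock_keys(lines):
    """Yield (key, file) for records that failed with the transfer-lock note."""
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip anything that is not a JSON record
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or otherwise malformed lines
        note = record.get("note", "")
        if record.get("success") is False and "transfer already in progress" in note:
            yield record.get("key"), record.get("file")


# First record is trimmed from the log above; the second one is invented.
sample = [
    '{"command":"get","note":"transfer already in progress, or unable to take transfer lock\\n",'
    '"success":false,"key":"MD5E-s1001--2cd1bc42ddd745e7d5c00edb07d8c9d4",'
    '"file":"MLd_cells/gg0304_6_A/conspecific/spike20"}',
    '{"command":"get","note":"ok","success":true,"key":"MD5E-s1--hypothetical","file":"other"}',
]

failures = list(failed_transfer_lock_keys(sample))
for key, path in failures:
    print(key, path)
```

To run it on a real log, pass an open file object (for example `open("git-annex-getJ5-5.log")`) instead of `sample`.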
If needed, I could probably provide you a singularity image with the environment with datalad pre-installed with that branch so you could troubleshoot.
What version of git-annex are you using? On what operating system?
Tried with bleeding edge 6.20180308+gitg3962ca71b-1~ndall+1 although originally detected with 6.20180220+gitg811d0d313-1~ndall+1
I was able to reproduce this with two normal git-annex repos and git annex get -J10 (6.20180227). tryLockExclusive is returning Nothing when this happens.
This seems very similar to the problem fixed in "parallel get can fail some downloads and require re-getting", which really seemed to be fixed back then.
This is not the same as the previous bug; the STM code seems ok. It is finding an open exclusive lock handle in the STM lock pool for the transfer lock file, and so tryLockExclusive fails.
Instrumented tryLockExclusive, and it's somehow being called twice or more on the same transfer lock file despite there being no duplicate keys.
Aha.. sizeOfDownloadsInProgress calls getTransfers, which locks the transfer info files in order to check which transfers are running. So one of the other worker threads is calling that at just the wrong time, and so contending with the thread that is starting up the transfer.
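The contention can be illustrated with a toy model. This is not git-annex's Haskell code, only a sketch under stated assumptions: the `LockPool` class, `check_transfer_running` and the lock path are invented for illustration, and the interleaving is written out by hand instead of using real threads so that it is deterministic.

```python
class LockPool:
    """Toy per-process lock pool: at most one exclusive handle per lock file."""

    def __init__(self):
        self._held = set()

    def try_lock_exclusive(self, path):
        # Return a handle (just the path here), or None if one is already open.
        if path in self._held:
            return None
        self._held.add(path)
        return path

    def release(self, path):
        self._held.discard(path)


def check_transfer_running(pool, lock_path):
    """Probe a transfer by trying to take its lock.

    Returns (running, handle). When the lock was free, the probe itself now
    holds it until the caller releases the handle.
    """
    handle = pool.try_lock_exclusive(lock_path)
    return (handle is None), handle


pool = LockPool()
lock_path = "transfer/download/KEY.lck"  # hypothetical path

# Thread B (size accounting) sees the info file and probes the lock first.
running, probe = check_transfer_running(pool, lock_path)

# Thread A (the real transfer) now tries to take the same lock and fails.
worker_handle = pool.try_lock_exclusive(lock_path)

# Thread B finishes its probe and drops the lock.
pool.release(probe)

# Had thread A tried only now, it would have succeeded.
retry_handle = pool.try_lock_exclusive(lock_path)

print(running, worker_handle, retry_handle)
```

In this model the probe reports that no transfer is running, yet its short-lived handle is enough to make the worker's attempt fail, which matches the "unable to take transfer lock" symptom.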
Bug was introduced by 3cd47f997873ff9d50b35c0f4440763364766d93.
It's interesting that the transfer info file is being created before the transfer lock file. If the lock file were always created first, then getTransfers would not see the transfer before its lock file is locked, and this bug would be avoided. On the other hand, this exact ordering of file creation is why 3cd47f997873ff9d50b35c0f4440763364766d93 is necessary.
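A small model of why the ordering matters, again only a sketch and not the actual git-annex code. The assumptions are that a scanner probes only transfers whose info file exists, that its probe takes the lock when nobody holds it, and that the probe is still holding the lock when the worker runs its next step. All names are made up for illustration.

```python
def start_transfer(order, scan):
    """Run the two start-up steps in `order`, with `scan` running in between.

    Returns True if the worker ends up holding the transfer lock.
    """
    state = {"info": False, "locked_by": None}

    def write_info():
        state["info"] = True

    def take_lock():
        # A try-lock: it only succeeds when nobody holds the lock.
        if state["locked_by"] is None:
            state["locked_by"] = "worker"

    steps = {"info": write_info, "lock": take_lock}
    steps[order[0]]()
    scan(state)  # another thread scans at the worst possible moment
    steps[order[1]]()
    return state["locked_by"] == "worker"


def scan(state):
    # Only transfers with an info file are visible to the scanner; probing a
    # visible, unlocked transfer takes its lock for the duration of the probe.
    if state["info"] and state["locked_by"] is None:
        state["locked_by"] = "scanner"


info_first = start_transfer(("info", "lock"), scan)
lock_first = start_transfer(("lock", "info"), scan)
print("info file first:", info_first)
print("lock file first:", lock_first)
```

With the info file written first, the scanner can grab the lock before the worker does; with the lock taken first, the scanner either does not see the transfer yet or finds it already locked.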
There are a couple of ways the files could be created in the wrong order. The commit message of 3cd47f997873ff9d50b35c0f4440763364766d93 describes one, which does not apply to the test case for this bug.
Hmm, mkProgressUpdater was very recently changed to write the transfer info file, in 24df95f0f6ab474119aff3bbd942251373754ab2, and that comes before the transfer lock is created. That is probably the recent change that exposed this mess.
Ok, fixed the reversion in a fairly decent way, verified with 1000 files and -J10.
Cool, thanks, will test it out. Would this one somehow reveal itself also in non-J mode? (just double-checking)
This same bug can happen without -J if two git-annex processes are running at the same time and both downloading (different) files.
So, the way datalad special remote runs git-annex, for example...