While running git-annex addurl --batch --with-files --jobs 10 --json --json-error-messages --json-progress --raw, I occasionally run into files that fail to download for no discernible reason, and the "error-messages" key in the command's output is an empty list. This makes it hard to figure out exactly why the download is failing.
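To make the symptom easier to spot, here is a minimal Python sketch that filters the JSON lines emitted by the command above and keeps only the records that report failure with an empty "error-messages" list. It assumes each final result is a JSON object with "success", "file" and "error-messages" fields, and that the progress updates interleaved by --json-progress carry no top-level "success" field. The function name silent_failures is hypothetical.

```python
import json


def silent_failures(json_lines):
    """Yield result records that failed without any error message.

    Assumption: final results have a top-level "success" field, while
    progress updates from --json-progress do not.
    """
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if "success" not in record:
            continue  # assumed to be a progress update, not a final result
        if not record["success"] and not record.get("error-messages"):
            yield record
```

Passing sys.stdin to silent_failures while piping the addurl output into the script would list the affected files, which may help narrow down whether the failures cluster around repeated urls.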
Is it reproducible with a particular url? Does it only happen with -J?
The version would also be good to know. There were recent relevant changes, e.g. 4f42292b13dc5a6664eeb19b5c9d48991eaef292.
I've spent a while hunting for a code path where it fails without displaying a warning, and have not found one. Since the code in addurl is structured as "return Nothing, and hopefully display a warning beforehand" rather than as "throw an error", it's certainly possible that this happens.
--jobs option is omitted, but that's not viable for the current project we're using git-annex for.
Aha, that makes sense! addurl constructs a url-based Key to use while downloading, and the key transfer machinery prevents redundant downloads of the same Key at the same time.
Arguably, the problem is not where the message gets put, but that it fails when adding an url to two different paths at the same time.
I have, though, moved that message so it will appear in error-messages.
The best solution I can find is for it to notice when another thread is downloading the same url, and wait until it finishes. Then proceed with downloading the url for a second time.
It's not very satisfying to re-download. But once the url Key is downloaded, it does not keep that url Key populated, but hashes the content and moves the content to the final Key. It would be a real complication to communicate, across threads, what Key the content ended up at, and have the waiting thread use that. And addurl is already complicated well beyond a point I am comfortable with.
Also, the content of an url can of course change over time. If I feed "$url foo" into git-annex addurl --batch -J10 and then some time later, I feed "$url bar", I might expect that file bar gets whatever content the url has now, not the content that the url had back when I added the same url to file foo. And if I cared about avoiding re-downloading, I could add the url to the first file, and then copy the annex link to the second file myself.
Implemented this approach.
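
The wait-then-download-again behaviour described above can be illustrated with a small Python sketch. This is not git-annex's actual Haskell code, only an illustration of the idea under stated assumptions: one lock per url, a thread that finds the url already being downloaded blocks until the other thread finishes, and then performs its own download. The names UrlDownloadGate and fetch are hypothetical.

```python
import threading


class UrlDownloadGate:
    """Serialise downloads of the same url across threads (illustrative only)."""

    def __init__(self):
        self._guard = threading.Lock()  # protects the lock table
        self._locks = {}                # url -> per-url lock

    def _lock_for(self, url):
        with self._guard:
            return self._locks.setdefault(url, threading.Lock())

    def download(self, url, fetch):
        # Blocks while another thread is downloading this url, then
        # downloads it again rather than reusing the other thread's result.
        with self._lock_for(url):
            return fetch(url)
```

Each caller passes its own fetch callable, so two threads adding the same url to different files both end up with content, at the cost of a second download. Different urls use different locks and still download in parallel.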