extend external special remote protocol so a process can refuse to PREPARE, making -J not use it concurrently

Please describe the problem.

Sorry if it doesn't really fit into the "bug" category, although may be I could have titled it "does not consider another instance of the same special remote in case one fails to initialize in parallel (-J) mode" may be?

What steps will reproduce the problem?

In datalad-archives we probably should restrict to having just a single instance of that special remote running to avoid messing with tracking of which instance is taking care about which archive and stage (download, extraction, etc). So I thought I would just implement that by assuring that there is a single instance of the special remote running per repository and all others just reporting 'PREPARE-FAILURE' and then git-annex would try another instance it has in the pool.

But it seems to be not the case:

[2018-03-05 14:23:06.725630284] chat: /home/yoh/proj/datalad/datalad/venvs/dev/bin/git-annex-remote-datalad-archives []
[2018-03-05 14:23:07.019016095] git-annex-remote-datalad-archives[2] --> VERSION 1
[2018-03-05 14:23:07.019267772] git-annex-remote-datalad-archives[2] <-- EXTENSIONS INFO
[2018-03-05 14:23:07.019461422] git-annex-remote-datalad-archives[2] --> UNSUPPORTED-REQUEST
[2018-03-05 14:23:07.019596061] git-annex-remote-datalad-archives[2] <-- PREPARE
[2018-03-05 14:23:07.019643377] git-annex-remote-datalad-archives[1] --> VERSION 1
[2018-03-05 14:23:07.01973844] git-annex-remote-datalad-archives[1] <-- EXTENSIONS INFO
[2018-03-05 14:23:07.019888855] git-annex-remote-datalad-archives[1] --> UNSUPPORTED-REQUEST
[2018-03-05 14:23:07.019917528] git-annex-remote-datalad-archives[2] --> PREPARE-SUCCESS
[2018-03-05 14:23:07.019972279] git-annex-remote-datalad-archives[1] <-- PREPARE
[2018-03-05 14:23:07.02001778] git-annex-remote-datalad-archives[2] <-- TRANSFER RETRIEVE MD5E-s5238--ead47341c9363e7f49c1a50c895d170b.txt .git/annex/tmp/MD5E-s5238--ead47341c9363e7f49c1a50c895d170b.txt
[2018-03-05 14:23:0detailed-methods.txt (from datalad-archives...) ] --> PREPARE-FAILURE Failed to prepare <datalad.customremotes.archives.ArchiveAnnexCustomRemote object at 0x7f7c1ce84390> due to Cannot lock repo /home/yoh/datalad__/crcn
get docs/crcns-aa1-conditions.txt (from datalad-archives...) failed
get docs/crcns-aa1-detailed-methods.txt (from datalad-archives...)

if one instance fails -- annex immediately reports that the url failed to download and doesn't try another instance it had to be ran in parallel.

I wondered if there is an easy way to restrict a single instance for some special remotes (may be with recently added "FEATURES" to describe special remotes) or adjust parallel download logic to loop through available (not failed) instances thus reusing only one instance if all others fail?

What version of git-annex are you using? On what operating system?

6.20180220+gitg811d0d313-1~ndall+1

rejected --Joey

RSS Atom

comment 1

When I try this with ssh remotes, which should act the same as external special remotes as far as git-annex get behavior is concerned, git-annex get moves on to the next remote that has the file when the first one fails.

I think the difference might come down to the handling of the failed PREPARE. That throws an exception, so terminates the get action for that file.

Indeed, there are quite a few giveup in Remote/External.hs, and while some of them are reasonable exceptions to throw, it needs an audit for ones that throw an exception when there's a better way to indicate failure like return False.

Comment by joey — Tue Mar 6 16:08:23 2018

Remove comment

comment 2

Ok, audited all exceptions thrown in there, and the only other one that stood out is that TRANSFER-FAILURE throws an exception -- but that one is ok because it's in a Retriever, all of which exceptions are caught.

But hmm, PREPARE-FAILURE throwing an exception when git-annex is preparing before retrieving a key is in the same Retriever, so that exception also should not be a problem. There might be other cases where PREPARE-FAILURE throwing an exception is not desirable, but this is not one.

Oh, I see, the bug report is really about some -J specific issue!

I'm having difficulty reproducing it. I have two external special remotes that both send PREPARE-FAILURE, and when a file is present in both, git annex get -J2 does try both of them:

[2018-03-06 12:45:21.123121045] chat: /home/joey/bin/git-annex-remote-directory []
[2018-03-06 12:45:21.125524418] git-annex-remote-directory[1] --> VERSION 1
[2018-03-06 12:45:21.125702152] git-annex-remote-directory[1] <-- EXTENSIONS INFO
[2018-03-06 12:45:21.126005492] git-annex-remote-directory[1] --> UNSUPPORTED-REQUEST
[2018-03-06 12:45:21.126186941] git-annex-remote-directory[1] <-- PREPARE
[2018-03-06 12:45:21.126518978] git-annex-remote-directory[1] --> PREPARE_FAILURE OOK
[2018-03-06 12:45:21.129247706] chat: /home/joey/bin/git-annex-remote-directory []
[2018-03-06 12:45:21.131475605] git-annex-remote-directory[1] --> VERSION 1
[2018-03-06 12:45:21.131636898] git-annex-remote-directory[1] <-- EXTENSIONS INFO
[2018-03-06 12:45:21.131848068] git-annex-remote-directory[1] --> UNSUPPORTED-REQUEST
[2018-03-06 12:45:21.131953342] git-annex-remote-directory[1] <-- PREPARE
[2018-03-06 12:45:21.132127177] git-annex-remote-directory[1] --> PREPARE_FAILURE OOK
get foo (from d1...) (from d2...)
  Unable to access these remotes: d1, d2

Comment by joey — Tue Mar 6 16:33:41 2018

Remove comment

comment 3

re reproduce: may be it is relevant that in my case it is THE SAME (the same uuid, hardcoded similarly to web "remote" uuid) remote.

re locking: have some ugly (almost) working fix -- but yet to troubleshoot why some files still manage to get "failed" status (may be it is somehow related to above)

Comment by yarikoptic — Tue Mar 6 19:07:37 2018

Remove comment

comment 4

Ok so I misunderstood this, it's entirely about -J mode and external special remote processes. One is started per parallelism level, because otherwise nothing much would actually be done in parallel.

You wanted to start only one because reasons, which IIRC from talking with you were in the meantime fixed in your special remore program using locking.

Comment by joey — Mon Apr 2 17:13:33 2018

Remove comment

comment 5

PREPARE-FAILURE is documented as "the special remote cannot be used" and so I don't think it makes sense for git-annex to use the previously started instance of the program if a later one fails like that.

It would need a new response to PREPARE. And some possibly not insignificant changes in Remote/External.hs to support it. In particular, Remote/External.hs currently delays sending PREPARE until the first time it uses a special remote, but this seems to need PREPARE to be sent earlier, when it starts up the special remote, so it can detect the new response and remember that it should not try to start up any more concurrent instances and instead use any already started instance.

The best argument for doing it, I think, is if several different external special remote programs really only support a single instance running at a time, and if supporting that inside git-annex would be enough of a win, rather than making those programs do their own locking.

Hmm, an external special remote program can't just block its response to PREPARE when another instance is running, because it would never be able to un-block. So it seems they would have to use finer-grained locking when responding to eg TRANSFER. I don't know if anything other than datalad needs such locking (and IIUC datalad already got the necessary locking), but it does seem like it would be worth adding an extra PREPARE response to avoid needing to so complicate external special remote programs.

Comment by joey — Tue Apr 3 19:17:32 2018

Remove comment

comment 6

So the idea here is to add a PREPARE-NONCONCURRENT or whatever for the external to use to signal that it's not able to do anything because another one is already running concurrently.

If the first instance that git-annex starts up responds that way, I suppose it would just be treated the same as PREPARE-FAILURE.

So, if an external is being used for one git-annex command, another git-annex command that also needs to use it would just fail to start it, and so fail entirely.

Just for example, git annex whereis needs to start up the external program to check WHEREIS, and would need to fail if it responded PREPARE-NONCONCURRENT. So a long-running git-annex copy (or even the assistant) would then prevent other git-annex commands from working.

So, seem this is blocked on external remote querying transition.

I'm also not really sure it's worth implementing this, is it really going to be useful compared with external remote programs doing their own locking around whatever operations cannot be run concurrently?

Comment by joey — Thu May 28 16:15:34 2020

Remove comment

comment 7

The workaround I had mentioned above originally seems to work fine for us, so indeed no need for this feature from DataLad side ATM.

Comment by yarikoptic — Thu May 28 23:53:58 2020

Remove comment

comment 8

The async extension to the protocol guarantees only a single process will be run. The remote might be asked to start several operations concurrently, but if it wants to queue them sequentially, that should be fine.

While it might be a bit of a round about way to get this functionality, since the extension complicates the protocol a bit, I'm inclined to feel it's enough, and not add this other extension, at least without some more compelling use case.

Comment by joey — Mon Sep 28 14:00:00 2020

Remove comment

comment 9

Sorry for being anal with my questions and not just trying it out... but by The async extension to the protocol guarantees only a single process will be run. which of the following scenarios do you mean

a. there will be no parallel workers if --jobs N is specified (thus a single process for an external remote)
b. there are parallel workers, and a single instance of an external special remote will be ran and either
- b.1. only one worker would be able to talk to it
- b.2. parallel workers will talk to the same async external special remote

or may be some other c. or b.3 ?

Comment by yarikoptic — Mon Sep 28 19:44:05 2020

Remove comment

Add a comment