Per our brief discussion: at the moment git-annex allows prioritizing URLs only by assigning them to be handled by different special remotes and giving those remotes different costs.
This doesn't allow for prioritization within the built-in "web" special remote, which is the most frequent use case. Our use case:

```
(base) dandi@drogon:/mnt/backup/dandi/dandizarrs/ea8c43c7-757e-4653-8e4a-a6d356120836$ git annex whereis 0/0/0/3/6/169
(recording state in git...)
whereis 0/0/0/3/6/169 (2 copies)
  	00000000-0000-0000-0000-000000000001 -- web
   	86da9d10-da54-4371-8d6f-7559c6a236f5 -- dandi@drogon:/mnt/backup/dandi/dandizarrs/ea8c43c7-757e-4653-8e4a-a6d356120836 [here]

  web: https://api.dandiarchive.org/api/zarr/ea8c43c7-757e-4653-8e4a-a6d356120836.zarr/0/0/0/3/6/169
  web: https://dandiarchive.s3.amazonaws.com/zarr/ea8c43c7-757e-4653-8e4a-a6d356120836/0/0/0/3/6/169?versionId=h3qb0rOswsssHxEdfN8QAWUMoVhddQrY
ok
```
Here we have an API-server-based URL, which we do not want to access unless really, really needed (it would be the slowest and would put load on the server), and then direct access to the public bucket, which is the fastest (unless some other local remote has it even better).
Joey envisioned potentially being able to assign priorities via e.g.
```
git-annex enableremote web url-priority-1=s3.amazonaws.com/ url-priority-2=/api.dandiarchive.org/
```
but I also wondered if there could just be some way to provide costs (or adjustments to costs) for different URLs, so they are all considered alongside the costs of other remotes.
E.g., maybe I have a URL which is fast (an S3 bucket), then a bunch of average regular remotes with decent speed (e.g. Dropbox), and then a URL to some slow archive (the API server). Both URLs are served by the web remote, and there would be no way to order all data access schemes/remotes in the optimal sequence of costs unless different URLs could have different costs considered along with different remotes.
PS: somehow I have an odd memory of seeing some config option to provide git-annex a script which would output a cost given a URL... I disliked that approach since it would require me to write the script, and thus didn't use it. Did I dream it up?
It does happen to try the urls in the order listed in the log file.
That said, the order of lines in files in the git-annex branch is not guaranteed to be preserved when, e.g., merging.
See previous discussion in the since-closed todo "assign costs per URL or better repo-wide (regexes)".
In that todo I suggested using --sameas, and yarikoptic thought the datalad special remote could use that to handle urls and do its own prioritization. I suppose that probably didn't get done in datalad.
But I do like the idea of using --sameas. It avoids several problems with providing "url-priority-N=" configs to the web special remote:
So, what if it were possible to initremote versions of the web special remote that were limited to particular urls, and skipped over any other urls:
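A sketch of what that might have looked like, assuming the urllimit= config named in the next paragraph and borrowing the remote name fastweb from later in the discussion (this syntax was only a proposal; urlinclude=/urlexclude= were implemented instead):

```shell
# Hypothetical: a second instance of the web special remote,
# limited to urls on the fast S3 host. urllimit= was only
# proposed in this discussion and was never implemented.
git annex initremote fastweb type=web \
    urllimit='*//dandiarchive.s3.amazonaws.com/*'
```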
As well as adding the urllimit= config, that would need the web special remote to allow initremote of other instances of it. Currently, that will fail:
Which is not ideal when it comes to using autoenable=true because using a current git-annex after this gets implemented would try to autoenable the remote, and display all that. Compare with how autoenable handles remote types it does not know -- it silently skips them. This could be avoided by using something other than type=web for these.
I've implemented support for multiple web special remotes, and have added the urlinclude= and urlexclude= configurations (both case-insensitive globs).
Example use:
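The example block itself appears to have been lost in formatting; based on the text that follows (remote name fasthost, cost 150), it was presumably along these lines (a reconstruction, not the original invocation):

```shell
# A second web special remote limited to urls on one host.
git annex initremote --sameas=web fasthost type=web \
    urlinclude='*//fasthost.com/*'
# The cost= remote config did not exist yet at this point in the
# discussion, so the cost of 150 would have been set via the
# per-remote git config:
git config remote.fasthost.annex-cost 150
```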
And then

```
git-annex get --from fasthost
```

will only use urls on that host, not any other urls.

```
git-annex get --from web
```

will still use any urls. The cost of 150 makes `git-annex get` use fasthost before web.

That's enough to handle the example you gave; just use `urlinclude='*//dandiarchive.s3.amazonaws.com/*'`.
But I don't think this is quite sufficient, because it should also be possible to deprioritize urls, and there's not yet a good way to do that.
In particular, this doesn't work:
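The example seems to have been dropped here; judging from the explanation that follows, it was a high-cost web remote limited to the slow host, roughly (reconstructed):

```shell
# A web remote for slowhost.com urls only, given a high cost so
# it would, in theory, be tried last (invocation reconstructed
# from the surrounding text).
git annex initremote --sameas=web slowhost type=web \
    urlinclude='*//slowhost.com/*'
git config remote.slowhost.annex-cost 300
```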
Because when getting a file, the main web special remote is tried before this high-cost slowhost one, and will use any url, including slowhost.com urls.
Now you can instead do this:
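The example is missing; judging from the next paragraph it was a single urlexclude= on the main web remote, something like (reconstructed):

```shell
# Keep the main web remote from using the slow host's urls,
# leaving them for the high-cost remote dedicated to that host.
git annex enableremote web urlexclude='*//slowhost.com/*'
```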
But when there's a second slow host, that approach falls down, because you can't specify urlexclude= twice. And even if you could, there would be a distributed-config merging issue, the same as discussed in comment #3.
I think what's needed is for the main web special remote to notice that a web remote such as fastweb or slowweb exists, and automatically exclude the urls that other web remote is configured to use. Which will be a little bit tricky to implement, but seems doable.
Can't I just give the entire web remote the highest cost, so it is then considered last by default, and all the prior "*web" remotes with custom costs for fast/slow would be considered in their "weighted" fashion?

Older versions of git-annex will fail with the error shown in comment #3 when trying to enable the second instance of the web remote.
When autoenabling, it won't be a fatal error, but the ugly error message will still be shown.
Setting the default web remote to a higher cost than the other web remotes doesn't avoid this problem: if you have a bunch of urls on different sites that are of average speed, and one site that is slow, you'd have to make a new web remote for each individual site, and for each new site.
100-some lines of code later, I got deprioritizing urls working well. Eg:
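The example block did not survive; given that the next paragraph names the remote slowweb, it was presumably along these lines (reconstructed):

```shell
# A deprioritized web remote: the regular web remote now
# automatically skips urls matching this urlinclude.
git annex initremote --sameas=web slowweb type=web \
    urlinclude='*//slowhost.com/*'
git config remote.slowweb.annex-cost 300
```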
Now when the regular "web" special remote is asked to get a file, it will skip any urls that match the urlinclude of other web remotes. So, it won't use the slowhost.com urls, leaving those for slowweb to later use if necessary.
I think this todo is not fully done yet, because every web special remote that is split out this way still needs to have its cost configured in each clone.
So this seems to be pointing to needing a global way to configure the default cost of a special remote, similar to `git-annex config`. Local configs would of course need to override that.

One way that might make sense is to add a cost=N setting to all special remotes. Then when generating the Remote, it can just look at the value set there, and use that for Remote.cost. Simple and efficient too.
(That assumes that only special remotes should have their default cost be configurable, not git repositories. Which seems right, since the same git repo can have different costs depending on whether it's accessed locally or remotely, etc.)
Went ahead and implemented cost=, so now all you need is:
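The example is missing here; with the new cost= setting, the earlier two-step setup presumably collapses to a single command, roughly:

```shell
# One command now sets up the deprioritized remote, with its
# default cost stored in the remote's own config rather than
# per-clone git config (invocation reconstructed).
git annex initremote --sameas=web slowweb type=web \
    urlinclude='*//slowhost.com/*' cost=300
```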
Great, thanks! Should I also urlexclude that URL somehow from the regular web remote, so it is not considered by the web remote with lower cost? Or, for each url, would you first see which remotes handle it and choose the one with the highest cost for that url to be considered?
You do not need to urlexclude from the regular web remote; it will automatically exclude urls that are an urlinclude of some other web remote.