I could have bet all my savings (well, not that much anyway - kids keep eating) that there was a way to assign different costs to different URLs. I believe it was in the same spirit as remote.<name>.annex-cost-command: a command which could then be used to prioritize one url over another. But I fail to find any hint of it, e.g. nothing in the cost-related config vars:
$> grep '^\*.*cost' doc/git-annex.mdwn
* `remote.<name>.annex-cost`
* `remote.<name>.annex-cost-command`
$> grep '^\*.*web' doc/git-annex.mdwn
* `webapp`
* `remote.<name>.annex-webdav`
* `annex.web-options`
$> grep '^\*.*url' doc/git-annex.mdwn
* `addurl [url ...]`
* `rmurl file url`
* `importfeed [url ...]`
* `registerurl [key url]`
* `unregisterurl [key url]`
* `remote.<name>.annex-rsyncurl`
* `annex.security.allowed-url-schemes`
But even such a command would not really fit my need -- I would like to assign costs and propagate that information through the clones without requiring users to install additional commands or configure their clones.
Use case:
For the same file we provide an API end-point which redirects to a minted url (with content-disposition etc), and the file can also be accessed directly from that url. E.g. both https://api.dandiarchive.org/api/dandisets/000027/versions/draft/assets/ff453f4c-a435-4a5d-a48b-128abca5ec47/download/ and https://dandiarchive.s3.amazonaws.com/blobs/2db/af0/2dbaf0fd-5003-4a0a-b4c0-bc8cdbdb3826 point to the same (small) file.
I would like git-annex to know both, since if we migrate backend storage, I would like users to still be able to access the file via the api url. But by default, while the api url still works, I would like them to access it via the direct url to s3.
ATM, regardless of the order in which I add those two urls to a file, git-annex seems (didn't look in the code etc) to use them in a "sorted" order, so it goes for the API one, thus causing unnecessary url minting etc. I would like to give the direct url a lower cost so that it is preferred over the api one.
Maybe there is a way already?
If not, I think it could be flexibly achieved if in the git annex config I could provide url regexes with costs. Then I could encode that information while allowing users/clones to tune it if desired. If possible, something like
# assuming default cost of web remote 100 if it is
cost url-https?://api.dandiarchive.org/.* = 200
or alike in my case. Maybe it could even be like [+-]COST, so it would then add to or subtract from the cost of the remote, thus allowing to account for the remote cost if assigned/set, with costs considered across remotes and their URLs.
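To make the proposal concrete, here is a minimal sketch of how such rules could be evaluated. Nothing like this exists in git-annex; the rule format, the names `RULES`, `url_cost` and `order_urls`, and the default remote cost of 100 are all assumptions taken from the example above.

```python
import re

# Hypothetical rules mirroring the proposed config syntax: (url regex, cost spec).
# A spec starting with "+" or "-" adjusts the remote's cost; any other spec replaces it.
RULES = [
    (r"https?://api\.dandiarchive\.org/.*", "200"),
]


def url_cost(url, remote_cost=100.0, rules=RULES):
    """Return the effective cost of url under the first matching rule."""
    for pattern, spec in rules:
        if re.fullmatch(pattern, url):
            if spec[0] in "+-":
                # Relative form: offset from the cost of the remote itself.
                return remote_cost + float(spec)
            # Absolute form: the rule's cost overrides the remote's cost.
            return float(spec)
    # No rule matched: the url simply inherits the remote's cost.
    return remote_cost


def order_urls(urls, remote_cost=100.0, rules=RULES):
    """Sort urls by effective cost; sorted() is stable, so ties keep input order."""
    return sorted(urls, key=lambda u: url_cost(u, remote_cost, rules))
```

With the rule above, the api url would get cost 200 and the s3 url would keep the assumed remote cost of 100, so the s3 url would be tried first regardless of the order in which the urls were added.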
Alternative could be - a cost per url (set with registerurl or addurl), but IMHO that might be too much, and harder to tune per specific clone "globally".
PS question -- how to see the costs? annex whereis --json FILE seems to not provide them.
Closing this since yarik said my suggestion would work. Even though I think neither of us is entirely satisfied with it, I can't think of a better approach. done --Joey
Costs are information about the connection from the local repository to a remote, which is why they are stored only locally in the git config -- I may have much different cost than you to access the same repository.
When there are multiple urls all claimed by the web remote, that's a single remote, and the code that looks at costs decides which remote is lowest cost, and tries that one first. When it gets down to the web remote, it tries the urls in whatever order it happens to have them.
So, the proposed way to assign costs to urls could only change the order that urls are presented to the web remote. It would not let git-annex try to get first from web (cheap url), followed by another repo, followed by web (expensive url).
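A toy model of the two-level ordering described above may help show why per-url costs cannot interleave with other remotes. This is only an illustration of the description, not git-annex's actual code; the function name `attempt_order` and the tuple layout are assumptions.

```python
def attempt_order(remotes):
    """Model the order of download attempts.

    remotes is a list of (name, cost, urls) tuples. Remotes are ordered by
    cost first; only then are a remote's own urls tried, in the order given.
    """
    attempts = []
    for name, _cost, urls in sorted(remotes, key=lambda r: r[1]):
        # A remote without urls still gets one attempt, recorded with url None.
        for url in urls or [None]:
            attempts.append((name, url))
    return attempts
```

In this model all urls of the web remote always sit next to each other in the result, so reordering them can never place another repository between the cheap url and the expensive one.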
What you can do is have a special remote that claims all the relevant urls and then does its own ordering of them. Or perhaps different special remotes that claim different sets of urls, and assign costs to those remotes. (Using --sameas so the several remotes do not count as more than 1 copy.)
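As a rough sketch of the "does its own ordering" part, a special remote could rank the urls it is given by hostname before trying them. The table `HOST_RANK` and the function `prioritize` are hypothetical names, and the ranking simply encodes the preference from the use case above (direct s3 url first, api url as fallback).

```python
from urllib.parse import urlsplit

# Hypothetical preference table: lower rank is tried first.
HOST_RANK = {
    "dandiarchive.s3.amazonaws.com": 0,
    "api.dandiarchive.org": 1,
}


def prioritize(urls, host_rank=HOST_RANK):
    """Return urls ordered by host preference; unknown hosts go last."""
    unknown = len(host_rank)
    # sorted() is stable, so urls on the same host keep their original order.
    return sorted(urls, key=lambda u: host_rank.get(urlsplit(u).hostname, unknown))
```

Because the preference would live inside the special remote, it would travel with that remote's code rather than with each clone's git config.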
FWIW, verified that
git annex --debug initremote --sameas web datalad externaltype=datalad type=external encryption=none autoenable=true
makes git-annex use the datalad special remote to handle those urls. And since we do not have any prioritization handling in datalad, we also grab the first one (the api. one) returned by git-annex and proceed with it.
So, indeed, if you do not like (or even just feel lukewarm about) the idea of adding costs within the built-in web remote, feel welcome to close, and we will still have a way forward by providing such handling within the datalad external special remote. It would be a bit sub-optimal since it would require people to install datalad, but at least it would enable the desired prioritization in some use cases (e.g. for a QA annex fsck --fast run).
And indeed, with the singular cost (not even a range of costs) assigned/returned by a remote, and no cost provisioned to be returned by CLAIMURL, I guess there is no (easy) way to mix URL-based costs into the overall decision making that orders the remotes.
NB with the --sameas trick above, git-annex doesn't even ask datalad with CLAIMURL and immediately passes TRANSFER of the key to the datalad external remote. Without --sameas, git-annex (8.20210330-g0b03b3d) doesn't even bother asking datalad (within whereis at least) whether it could CLAIMURL those... even if I assign annex-cost = 1.0 for the datalad remote. Not sure yet if that is "by design".
FWIW - I think I have tried to add them in different orders but it always went for the api one, so I concluded that the order it has them in is sorted and there is no way to "tune it up".
P.S. I still wonder why I have some memory of git-annex supporting some (external) way to prioritize URLs... maybe it was indeed "craft a special remote to do that"...