Please describe the problem.
Reference: issue/discovery in repronim/containers while adding neurodesk images
- apparently no URLs were registered for the images despite running registerurl KEY ANNEX
- some images do have urls
It took a while to grasp what was going on, and then I found an unfinished reproducer from Mar 15 2021, annex-claimurl.sh, with no recollection of why I had not finished it. It seems that it might be "operator error" somehow, but that seems unlikely... Might it be a datalad special remote bug?
Summary of the problem: if there is an external git-annex-remote which claims the URL (CLAIMURL), git-annex registerurl does not associate that URL with any remote (neither that external one nor web), and thus does not make the key available to the user despite knowing the url.
Should it, btw, default to web if no remote is associated with it?
Filed a complementary registerurl --remote REMOTE TODO, since in this case I would have preferred to just register against the web remote.
What steps will reproduce the problem?
Here is a new "quick" reproducer, but you need datalad installed to get git-annex-remote-datalad.
#!/bin/bash
export PS4='> '
set -eu
set -x
cd "$(mktemp -d "${TMPDIR:-/tmp}/dl-XXXXXXX")"
git init
git annex init
# It works fine if we do not enable datalad special remote!
# so it is something about interaction there
git annex initremote datalad externaltype=datalad type=external encryption=none autoenable=true uuid=65b6c36b-debd-4a23-8fa3-675cbd200496
git annex enableremote datalad
git annex info
# so it seems that addurl does it right
git annex addurl --debug --file 123.dat http://www.oneukrainian.com/tmp/123.dat
# but if I do via registerurl -- not quite so
echo 124 > 124.dat
git annex add 124.dat
key=$(readlink -f 124.dat | xargs basename)
git annex registerurl --debug "$key" http://www.oneukrainian.com/tmp/124.dat
git commit -m 'added those two files with urls'
git annex whereis --debug 123.dat
git annex whereis --debug 124.dat
git checkout git-annex
: # URLs are known for both
git grep oneukrainian
: # but only 123.dat would be associated with datalad remote
git grep 65b6c36b-debd-4a23-8fa3-675cbd200496
The full log is here; with the --debug lines filtered out, it ends like this:
❯ grep -v '^\[' annex-claimurl-2023.sh.log | tail -n 29
(recording state in git...)
> git commit -m 'added those two files with urls'
2 files changed, 2 insertions(+)
create mode 120000 123.dat
create mode 120000 124.dat
> git annex whereis --debug 123.dat
whereis 123.dat [2023-03-31 18:29:27.56573965] (Utility.Process) process [1429290] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","-c","annex.debug=true","cat-file","--batch"]
(2 copies)
62c53770-5274-40d4-a45a-de308c234ea9 -- yoh@bilena:~/.tmp/dl-FbOrptq [here]
65b6c36b-debd-4a23-8fa3-675cbd200496 -- [datalad]
datalad: http://www.oneukrainian.com/tmp/123.dat
ok
> git annex whereis --debug 124.dat
whereis 124.dat [2023-03-31 18:29:27.857735575] (Utility.Process) process [1429322] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","-c","annex.debug=true","cat-file","--batch"]
(1 copy)
62c53770-5274-40d4-a45a-de308c234ea9 -- yoh@bilena:~/.tmp/dl-FbOrptq [here]
ok
> git checkout git-annex
Switched to branch 'git-annex'
> :
> git grep oneukrainian
060/68b/SHA256E-s4--ca2ebdf97d7469496b1f4b78958f9dc8447efdcb623953fee7b6996b762f6fff.dat.log.web:1680301767.477711756s 1 :http://www.oneukrainian.com/tmp/124.dat
ae1/21c/SHA256E-s4--181210f8f9c779c26da1d9b2075bde0127302ee0e3fca38c9a83f5b1dd8e5d3b.dat.log.web:1680301767.037966322s 1 :http://www.oneukrainian.com/tmp/123.dat
> :
> git grep 65b6c36b-debd-4a23-8fa3-675cbd200496
ae1/21c/SHA256E-s4--181210f8f9c779c26da1d9b2075bde0127302ee0e3fca38c9a83f5b1dd8e5d3b.dat.log:1680301767.038748415s 1 65b6c36b-debd-4a23-8fa3-675cbd200496
remote.log:65b6c36b-debd-4a23-8fa3-675cbd200496 autoenable=true encryption=none externaltype=datalad name=datalad type=external timestamp=1680301766.517251391s
uuid.log:65b6c36b-debd-4a23-8fa3-675cbd200496 datalad timestamp=1680301765.789226249s
So both keys have urls, but only the 123.dat one is associated with the datalad special remote, and only it has its url reported by whereis.
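The manual comparison of the two `git grep` results can be sketched as a small helper. This is only an illustration, not part of git-annex: the function name `keys_with_url_but_not_on` is made up, it assumes the current directory is a checkout of the git-annex branch laid out as in the output above (`KEY.log.web` next to `KEY.log`), and it simplifies by treating any `1 <uuid>` line as "present" without comparing timestamps.

```shell
#!/bin/bash
# List keys that have a URL recorded (KEY.log.web) but no presence line
# for the given remote UUID in the matching location log (KEY.log).
# Hypothetical helper; assumes a checked-out git-annex branch.
keys_with_url_but_not_on() {
    local uuid=$1 weblog loclog
    find . -name '*.log.web' -not -path './.git/*' | sort |
    while read -r weblog; do
        loclog=${weblog%.web}
        # Location log lines look like: "<timestamp> 1 <uuid>"
        if ! grep -q " 1 $uuid\$" "$loclog" 2>/dev/null; then
            basename "$weblog" .log.web
        fi
    done
}

# Example (after `git checkout git-annex` in the reproducer):
#   keys_with_url_but_not_on 65b6c36b-debd-4a23-8fa3-675cbd200496
```

In the reproducer above, this would be expected to list the key of 124.dat, since its location log only mentions the local repository.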
What version of git-annex are you using? On what operating system?
10.20230126, but I also tried the older 8.20210803 since I thought it must be a regression -- the same result.
This is intentional, see 451171b7c1eaccfd0f39d4ec1d64c6964613f55a which changed setUrlPresent to only update presence info when the url belongs to the web but not when it's claimed by other special remotes.
It makes sense for registerurl to be symmetric with rmurl, and rmurl only updates presence info when the url is a web url.
To the extent I've been able to follow the complex reasoning there for why, part of it is clear: The web special remote is different from other special remotes in that content cannot be dropped from it by git-annex, and the url is the only pointer to content. So when rmurl removes the last web url, it makes sense to treat the content as no longer present on the web. But if the url is claimed by another special remote, which does support dropping content, the content would still be present on it after removing its url, and would be accessible w/o using that url, and git-annex fsck --fast --from would notice it was present and fix up the location log if it didn't show the content as present.
Also note that the rmurl man page documents this when it says:
All you need to do is use git-annex setpresentkey along with registerurl.
Yet to re-review that reasoning, but does it mean that to merely register a URL a client needs to
- call annex registerurl
- inspect to which remote the URL was added/was claimed (is there a way? whereis is silent)
- if it was claimed by some special remote other than web -- use annex setpresentkey
? Sounds like too much / too fragile, and somewhat different from how addurl behaves, which does it all just fine regardless of whether it is web or some claimurl'ed remote.
So to some degree it is a regression / broken behavior which initially worked just fine with registerurl -- tried the 6.20180913+git149-g23bd27773 version and it performed "as expected". Eh, never enough tests.
I have looked at that commit's changelog and detailed description. Not fully grasping yet why registerurl should not behave symmetrically with addurl in being sufficient by itself to add a url to content so it becomes usable for get right away, without some other dances like setpresentkey. I think I do get the rmurl "ambiguity", but more on that is reflected below.
Rereading your comment above:
This is just an assumption about some "special nature of the web remote"; e.g. the datalad remote also doesn't support dropping, and the URL is also just the pointer to content. And the CLAIMURL functionality came, IIRC, exactly for that use case, before some kind of duality was added for having content accessible both directly from a special remote and via url.
That is yet another assumption, since e.g. in the case of the datalad remote the rmurl effect would be identical to that for the web remote, and there is no other way to get content from that remote (so there is no duality mentioned above).
This somewhat contradicts the above "the content would still be present on it after removing its url", which suggests that presence of a URL for the remote is already a sufficient indication of being present on the remote.
Overall, there seem to be some assumptions about URLs and external remotes which ideally should be avoided. Maybe it should somehow be reflected in the external remote protocol, to indicate that CLAIMing a URL means the content is present at that URL, and that there is no other way to access that content from the remote besides via the URL.
As a workaround I will of course now either use setpresentkey or just reassign all urls to be handled directly by the web remote somehow. But in the long run I think it is a problematic design, since registerurl doesn't even report to which remote the URL was registered, so how could I generally know the proper invocation of setpresentkey to follow it up?
Whups, I forgot about the newish unregisterurl! That's the true inverse of registerurl. So rmurl is really more the inverse of addurl.
I think I've fully understood the situation that led to this reversion now. I do think it was a reversion. That change was all about SETURLPRESENT and SETURLMISSING in the external special remote protocol, as well as rmurl; I think that the effect on registerurl was not considered.
So while I'd like to simplify registerurl to as basic a plumbing command as possible, and would prefer it not to update location tracking, there's the matter of backward compatibility. Especially for simple cases like adding regular web urls with it. It would be ok to change it back to update location tracking for remotes that claim an url. As long as unregisterurl can be symmetric with it --- can it?
rmurl also has its own wacky behavior in this area:
Is that a bug? It's certainly not ideal for the bittorrent special remote, which can't download the file once the url is removed. (It is documented behavior though.)
While thinking about those questions, I thought of this situation:
At the end there, it's still able to drop the content from s3.
Now, consider hypothetically, if I decide to make the S3 remote CLAIMURL urls that are in the S3 bucket. As things stand, that won't change the above scenario. (Although the key won't be recorded as located in the web after registerurl.)
But... If unregisterurl is changed to update remote tracking for other remotes than web, after the S3 CLAIMURL change, the behavior of that scenario will not be the same! After unregisterurl, it will no longer consider the content to be present in S3. Now you're racking up S3 charges with content that git-annex stored in S3, but that it refuses to delete. That seems bad.
So, that scenario is leading me to think that I should not change unregisterurl (or rmurl) to update location tracking of remotes other than web. And so changing registerurl is also looking like a bad idea.
What I'm inclined to do is add a --remote= parameter to registerurl and unregisterurl. If the specified remote does not claim the url, have it fail to add it. (See also registerurl --remote REMOTE)
So, you can then use registerurl with --remote=$uuid, check that it succeeded, and then use setpresentkey to mark it present on that uuid. Without the fragility you complained of.
Update: The --remote parameter is implemented now.
(Could registerurl with --remote update location tracking itself? Maybe, but I'd worry about a scenario like in the previous comment.)
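A minimal sketch of that two-step recipe, assuming a git-annex new enough to have registerurl --remote and assuming that registerurl exits non-zero when the given remote does not claim the url, as described above. The wrapper name `register_url_on_remote` is made up for illustration.

```shell
#!/bin/bash
# Register URL for KEY against one specific remote and, only if that
# succeeded, mark the key as present on that remote.
# Hypothetical wrapper; not part of git-annex itself.
register_url_on_remote() {
    local key=$1 url=$2 remote_uuid=$3
    # Expected to fail if the remote does not claim the url (assumption).
    git annex registerurl --remote="$remote_uuid" "$key" "$url" || return 1
    # Location tracking is not updated by registerurl here, so do it explicitly.
    git annex setpresentkey "$key" "$remote_uuid" 1
}

# Example, using the key and UUID from the reproducer:
#   register_url_on_remote "$key" http://www.oneukrainian.com/tmp/124.dat \
#       65b6c36b-debd-4a23-8fa3-675cbd200496
```

The caller still has to know which remote should handle the url; the wrapper only removes the need to guess afterwards which remote claimed it.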
Obviously, as the author of the referenced wishlist, I would welcome the addition of a --remote option to both those commands.
But IMHO the addition of the option doesn't solve the initial/naive/programmatic user-oriented use case, where the user doesn't know which remote could or should handle the URL and just wants, analogously or complementary to addurl, to extend the list of urls available for some key. There is not even a user-level interface to ask "what remotes can handle this url" in order to build some tandem of commands to register extra URLs for a key. So I don't see how the addition of the option would solve the problem.
Well, unregisterurl and rmurl can't safely update location tracking for remotes other than the web. Unless there were some way to know that simply removing an url was sufficient, like it is for the web, and unlike how it would be with my S3 remote scenario above.
But, the only issue with registerurl updating location tracking is that it's not symmetric with unregisterurl.
So is that symmetry more important than comment 6? I don't know. In both cases, some users are going to be surprised by inconsistent behavior.
The only way to avoid all user surprise would be to go back in time and make these plumbing commands not update location tracking from the start.
Guess I'll come down on the side of restoring old behavior which was changed w/o warning (and without the new behavior ever being documented).
And on the side of user experience showing the current behavior is surprising.
The future users who get surprised by the resulting inconsistency of unregisterurl not unsetting location tracking will just have to live with it. Sigh.