If the same data storage can be accessed via two protocols, two different special remotes could be configured that access the same data, and so should have the same uuids.
This is not possible though, because remote.log is uuid-based and so the special remote configs stored in it for a given uuid would apply to both special remotes.
It is already possible of course for two git remotes to have the same uuid, and also for a special remote and git remotes to have the same uuid.
One approach would be to add some kind of namespace to the configs in remote.log. But this seems problematic: the user would need to juggle remote names and namespaces.
Another approach is to have a way to make two uuids be treated as equivalent. Eg, to make uuid B be treated the same as uuid A.
Suppose there's an equivalence.log that contains "ts B A". Then when git config is read, a remote with uuid B will result in construction of a Remote with uuid A, but with the RemoteConfig of uuid B.
Seems like this would be the only place the equivalence.log would need to be used; once the Remote is constructed using the equivalent uuid, the rest of the code will work as-is.
That would add overhead of an additional git-annex branch read on every program start. That could be avoided by instead putting the equivalence in the remote.log. Eg, "B sameas=A foo=bar ..."
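A minimal sketch of that lookup, in Python for illustration only (git-annex itself is written in Haskell). It assumes a simplified remote.log made of "uuid key=value ..." lines; the real log has more to it, and the names `Remote`, `parse_remote_log` and `construct_remote` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Remote:
    uuid: str     # uuid the rest of the code sees
    config: dict  # RemoteConfig, taken from the remote's own log line


def parse_remote_log(text):
    """Parse simplified 'uuid key=value ...' lines into {uuid: config}."""
    configs = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        uuid, *fields = line.split()
        configs[uuid] = dict(field.split("=", 1) for field in fields)
    return configs


def construct_remote(configured_uuid, configs):
    """Build a Remote with the sameas uuid but the remote's own config."""
    config = configs.get(configured_uuid, {})
    effective_uuid = config.get("sameas", configured_uuid)
    return Remote(uuid=effective_uuid, config=config)


configs = parse_remote_log("A type=S3\nB sameas=A foo=bar\n")
remote = construct_remote("B", configs)
```

Here `remote` ends up with uuid A while keeping foo=bar from B's line, so nothing past construction needs to know about the equivalence.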
done; this is implemented as git annex initremote --sameas
--Joey
Dug a little into implementing this.
One problem is that things like git annex dead look up a name in the remote list, and then use the uuid of the returned remote. But if remote foo has sameas=bar-uuid, then the remote in the remote list that it looks up will have that uuid, and so the uuid that will be marked as dead is almost certainly not the one that the user expected.
And the user can't pass the masked uuid of the sameas remote to git-annex dead, because there will be no remote in the list with that uuid.
And for that matter, the user is not likely to know the masked uuid, because things like git annex info won't display it.
Another gotcha is that the user might make remote B with sameas=A-uuid, and remote C with sameas=B-uuid. That really needs to resolve to A-uuid, so it needs to do multiple lookups, but then a sameas loop becomes a problem.
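To make the chained-lookup gotcha concrete, here is a small Python sketch of following sameas pointers while guarding against a loop. The function name and the plain dict of pointers are assumptions made for illustration, not anything from git-annex.

```python
def resolve_sameas(uuid, sameas):
    """Follow sameas pointers from uuid until a uuid with no pointer.

    sameas maps a remote's own uuid to the uuid it points at.
    Raises ValueError if the pointers form a loop.
    """
    seen = {uuid}
    while uuid in sameas:
        uuid = sameas[uuid]
        if uuid in seen:
            raise ValueError("sameas loop involving " + uuid)
        seen.add(uuid)
    return uuid


pointers = {"B-uuid": "A-uuid", "C-uuid": "B-uuid"}
resolved = resolve_sameas("C-uuid", pointers)
```

Raising an error is only one way to deal with a loop; breaking it at an arbitrary point would work as well.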
Yet another problem with the sameas idea is that old versions of git-annex won't know what the sameas= value means, and will ignore it. So they'd proceed to use the wrong uuid for the special remote. That could result in a big mess.
Also, two remotes using the same underlying data need to be encrypted the same way. Including using the same cipher= value, which is not a value that the user provides. Basically, all the encryption parameters need to be shared between two such remotes.
Also the chunksize parameter needs to be shared, or at least be set on both if not to the same value.
Alternate idea:
The second command would inherit the encryption etc fields from the foo remote, and set up the foo-rsync remote with the same uuid as it. And it would add additional fields to the remote.log:
When enableremote foo-rsync is later run and fails to find a name=foo-rsync, it can look for a remote with the "type+foo-rsync" field, and generate a RemoteConfig with type=rsync rsyncurl=server:/foo encryption=shared cipher=... From there the enableremote would proceed as usual.
(And, if enableremote foo-rsync is passed new/changed parameters, they need to get stored under its namespace.)
I picked + as the separator because it's not likely to be in a remote name (although it could be) and it seems fine to not support field names containing it. (I had first used period, but there may well be special remotes with field names that contain a period.) There's no parsing ambiguity: 'x+y+z=bar' means the x field of a remote named "y+z".
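The parsing rule can be sketched in a few lines of Python: split the key at the first + only. The helper names are hypothetical, and the rsyncurl+foo-rsync field in the usage below is an assumption modelled on the "type+foo-rsync" field mentioned above. Inheriting encryption settings from the parent is not shown.

```python
def split_field(entry):
    """Split 'field+remote=value' into (field, remote name or None, value)."""
    key, value = entry.split("=", 1)
    # Only the first + separates; the remote name may itself contain +.
    field, sep, remote_name = key.partition("+")
    return field, (remote_name if sep else None), value


def config_for(name, entries):
    """Collect the fields namespaced under the remote called name."""
    config = {}
    for entry in entries:
        field, remote_name, value = split_field(entry)
        if remote_name == name:
            config[field] = value
    return config


entries = [
    "name=foo",
    "type=S3",
    "type+foo-rsync=rsync",
    "rsyncurl+foo-rsync=server:/foo",
]
rsync_config = config_for("foo-rsync", entries)
```

With these entries, `rsync_config` holds type=rsync and rsyncurl=server:/foo, while the un-namespaced fields stay with foo.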
Started a sameas branch for this.
Logs.Remote.configSet will need some changes because it currently works on the basis of UUID, and so can't know when it's supposed to change a sameas remote. It will need an added RemoteName parameter.
The RemoteConfig is generated each run from the remote.log, and so the handling of sameas remotes needs to be done in Logs.Remote.readRemoteLog not by enableremote.
readRemoteLog makes a Map UUID RemoteConfig, which will need to change to Map (UUID, RemoteName) RemoteConfig
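As a rough illustration of what that key change means for callers, here is a Python dict keyed by (uuid, remote name). The data and the `names_for_uuid` helper are made up for the example.

```python
from collections import defaultdict

# (uuid, remote name) -> RemoteConfig
remote_configs = {
    ("uuid-1", "foo"): {"type": "S3"},
    ("uuid-1", "foo-rsync"): {"type": "rsync", "rsyncurl": "server:/foo"},
}


def names_for_uuid(configs):
    """Group remote names by uuid; a uuid may now have several names."""
    by_uuid = defaultdict(list)
    for uuid, name in configs:
        by_uuid[uuid].append(name)
    return dict(by_uuid)


names = names_for_uuid(remote_configs)
```

Any caller that only has a uuid now gets several candidate configs back, so it needs the remote name as well to pick one.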
Digging into changing readRemoteLog, there are several problems. Here are some of the less tractable ones:
Remote.List.remoteGen looks up RemoteConfig by UUID. While it does have a Git remote and could look up the name of the remote from that, if the user renames a remote in .git/config, that would confuse it. That is not an acceptable tradeoff. So, a sameas remote would need to have some additional git config be set, giving the namespace that's used for it in the remote.log. If that's missing, it's un-namespaced. initremote/enableremote need to set that git config.
Annex.SpecialRemote.autoEnable uses readRemoteLog. It would likewise need to look at the git config for namespace to tell which sameas remotes have been auto-enabled.
Preferred content looks at the preferreddir= value from RemoteConfig, and only a uuid is available. So it would have to look at the preferreddir values from all RemoteConfigs for remotes with that uuid and somehow pick one consistently. Or, preferreddir could be inherited like encryption settings are, and not allowed to be set in a sameas remote's config.
Further problem with namespaces: If two people init new sameas remotes with the same uuid at the same time, on merge one of them will be lost.
Revisiting the idea of using different uuids with a sameas= parameter:
If one remote is marked dead, it ought to be the one that sameas= points to, since that's the uuid in the location log. So that's ok.
As long as enableremote does not allow changing the sameas= parameter, sameas loops could only occur maliciously, not in normal operation. So it's fine to break such a loop in an arbitrary way.
There would need to be a way to prevent a remote with sameas= from being used by a version of git-annex that does not support it. One way would be to omit the name= parameter from remote.log, and use some other parameter for the name. Then old git-annex could not enableremote with the wrong uuid.
Using remote.name.annex-uuid-sameas=uuid instead of remote.name.annex-uuid would prevent old git-annex from using initialized sameas remotes. (Need a better name, since the uuid stored there should be the remote's own uuid (needed to get its RemoteConfig), not the one that sameas= points to.)
Seems that encryption parameter inheritance would happen the same way as has been discussed above. When constructing the RemoteConfig, copy over the encryption parameters from the parent remote.
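A sketch of that copy step in Python. The list of inherited field names is an assumption: the discussion names the encryption parameters, including cipher=, but does not give a complete list.

```python
# Assumed field names; the complete set of inherited parameters is
# not spelled out in this discussion.
INHERITED_FIELDS = ("encryption", "cipher")


def with_inherited(own_config, parent_config):
    """Return own_config with the parent's encryption parameters copied in."""
    config = dict(own_config)
    for field in INHERITED_FIELDS:
        if field in parent_config:
            config[field] = parent_config[field]  # the parent's value wins
    return config


parent = {"type": "S3", "encryption": "shared", "cipher": "abc"}
child = {"type": "rsync", "sameas": "parent-uuid"}
effective = with_inherited(child, parent)
```

The child's own fields such as type are left alone; only the listed fields are taken from the parent.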
All in all, using separate uuids instead of name= seems perhaps better.
"It is already possible of course for two git remotes to have the same uuid, and also for a special remote and git remotes to have the same uuid" -- but, in general, that's a situation to be avoided, right? Other than two protocols accessing the same datastore, are there times when you'd want that?
(Related: git-annex-reinit, reinit current repo to new uuid)
Looked into the extent of changes needed for the sameas parameter approach.
The only thing that looks at the "name" parameter is Annex.SpecialRemote, so the new alternative name parameter for sameas remotes can be handled entirely there.
That's good, but its specialRemoteMap will need to be changed since it assumes each uuid has a single associated name, which stops being the case.
Either Annex.SpecialRemote or Logs.Remote.readRemoteLog will need to handle the sameas parameter. Both have their problems. Comment 4 discussed how changing readRemoteLog would cause difficulties for some callers. But if Annex.SpecialRemote handles the sameas parameter, there will be times when a RemoteConfig contains sameas inherited encryption etc, and times when it does not. Would be worth making two different data types for those.
Remote.List.remoteGen gets the cached UUID and looks it up in the readRemoteLog map, so if readRemoteLog does not handle the sameas parameter, that will need to change to use something that does.
(There could be other readRemoteLog users that will similarly be problems.)
Logs.Remote.configSet will need to be changed as discussed in comment 4.
To avoid using remote.name.annex-uuid for sameas remotes, Remote.Helper.Special.gitConfigSpecialRemote will need to somehow know that it's a sameas remote. (It could look at the RemoteConfig for a sameas parameter.)
There are a couple of other places that set remote.name.annex-uuid, like Remote.GCrypt, so will need to factor out all setting of that into something that is sameas-aware.
Per-remote state is an added complication. A sameas remote should not use the same per-remote state, because what's stored in it is up to the remote backend and would conflict.
So Logs.RemoteState would need to use something other than a UUID, which contains the underlying uuid of the sameas remote. (Logs.MetaData too for per-remote metadata.) That would have to be passed in when constructing the remote.
And, git-annex forget would need to be made to remove the per-remote state of sameas remotes that point to a dead uuid.
It might be possible to isolate the sameas changes only to things involving the location log. Use different uuids for sameas remotes. When updating the location log, substitute the sameas uuid.
There would need to be a sameas-aware way to check if a uuid is in the location log. Currently, loggedLocations is used to both see what remotes to try to get a key from, and for numcopies checking and related stuff (like skipping dropping entirely when loggedLocations does not have enough items in it). So there would need to be two variants of it. That seems likely to be a source of mistakes.
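A Python sketch of why two variants would be needed, under the assumption that the location log only ever holds the substituted (parent) uuid. All function names here are hypothetical stand-ins, not the real loggedLocations.

```python
def location_log_uuid(uuid, sameas):
    """The uuid written to the location log for a remote."""
    return sameas.get(uuid, uuid)


def remotes_to_try(logged, remote_uuids, sameas):
    """Variant 1: which remotes (by own uuid) may be able to provide a key."""
    return [u for u in remote_uuids if location_log_uuid(u, sameas) in logged]


def copies_count(logged):
    """Variant 2: how many copies exist, counting logged uuids only."""
    return len(set(logged))


sameas = {"B": "A"}  # remote B accesses the same data store as A
logged = ["A"]       # the location log records only A
candidates = remotes_to_try(logged, ["A", "B", "C"], sameas)
copies = copies_count(logged)
```

Two remotes can serve the key, but there is only one copy. Using the first variant for numcopies checking would overcount, which is the kind of mistake the paragraph above worries about.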
Another small problem with this idea is that a special remote may record its uuid somehow in the data store and check that it has the right uuid later (S3 does this with an "annex-uuid" in the bucket), and if two remotes with different uuids did that, there would be a conflict between them.
Also, it couldn't only be the location log; sameas mapping would also need to be done when using the chunk log. And a bit of encryption config inheritance would still be needed.
Comment 6 talked about how to prevent old git-annex from getting confused when used in a repo with sameas remotes.
If remote.name.annex-uuid contains the uuid that sameas pointed to, then old git-annex will load the RemoteConfig for that uuid. Which is kind of ... ok? The other gitconfig settings for the remote may or may not work with that RemoteConfig. But if accessing that remote fails with old git-annex, no problem. The only concerning thing I think would be if checkpresent somehow reported all content as missing from the remote... But if a misconfiguration of the gitconfig can do that, the special remote implementation is arguably already buggy.
So, I think it's ok to set remote.name.annex-uuid to the sameas uuid. There will need to be a new config key that indicates the uuid to get the RemoteConfig from.
Old git-annex enableremote still needs to be prevented from initializing a sameas remote, as it would set annex-uuid to the wrong uuid.
Looking over all my comments now that I have an implementation..
git annex dead on a sameas remote name marks the parent remote dead. I think this is ok; dead means the content is gone, so which remote is used to access it is immaterial; they're all dead.
sameas loops are not a problem; it only looks up the sameas-uuid value once, so it will not loop.
old versions of git-annex are prevented from enabling a sameas remote, since it has no name=
old git-annex with an enabled sameas remote will see the annex-uuid of the parent, and treat it as the parent. Some git config values needed to use the parent may not be set, or may potentially be set differently than for the parent. Unlikely to cause any bad behavior, other than the remote not working.
encryption and legacy chunking settings are inherited, and cannot be overridden
RemoteConfig always contains any inherited parameters of a sameas remote. Logs.Remote.configSet filters those out.
Logs.Remote.configSet is a little bit less safe; if its caller passed the RemoteConfig from a sameas remote, it needs to make sure to not pass the uuid of the parent remote, or it will overwrite the wrong config. All calls to it handle that now.
per-remote state still to be done.
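The configSet points above (inherited parameters filtered out, and the config stored under the remote's own uuid) can be sketched in Python as follows. This is an illustration under assumptions, not the actual Logs.Remote.configSet: the inherited field names are guesses, and `config_set` with a dict as the log is a hypothetical stand-in.

```python
# Assumed names of the inherited fields; not taken from git-annex.
INHERITED_FIELDS = ("encryption", "cipher")


def config_set(log, own_uuid, config):
    """Store config under the remote's own uuid, dropping inherited fields."""
    if "sameas-uuid" in config:
        config = {k: v for k, v in config.items() if k not in INHERITED_FIELDS}
    log[own_uuid] = dict(config)


log = {"parent-uuid": {"type": "S3", "encryption": "shared", "cipher": "abc"}}
sameas_config = {
    "type": "rsync",
    "sameas-uuid": "parent-uuid",
    "encryption": "shared",  # inherited from the parent
    "cipher": "abc",         # inherited from the parent
}
# The caller passes the sameas remote's own uuid, not the parent's.
config_set(log, "own-uuid", sameas_config)
```

Passing "parent-uuid" as the second argument instead would overwrite the parent's config, which is the hazard described above.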