The ability to copy key availability information from one git-annex'ed repo to another has been a use case we have needed for quite some time. Michael even implemented datalad copy-file remotes - that was an immediate use case whenever it was necessary to craft "custom" smaller datasets from a much larger (super)dataset. And now we have come to the same need in order to facilitate "splitting" a larger dataset into sub-datasets: issues/600.
I think, in particular with https://git-annex.branchable.com/todo/hiding_a_repository/, something like git annex copy-key --from hidden-repo KEY could become the ultimate tool for such operations. Alternatively, maybe it could be git annex copy-key --from-repo /repo/path KEY so there would not even be a need to link the original one as some hidden repo, but rather even "include" the original one as a git submodule (e.g. original/).
Such a command would need to pretty much do whatever git annex merge already does, just limiting its effects to only the relevant key(s) and only the remotes which are known to have that key.
Although I concentrated on "copy-key", pretty much the same feature could be useful to provide a git annex copy-file PATH1 PATH2 ... DEST where the PATH?s would be paths in other git-annex repo(s) -- so it would be pretty similar to datalad copy-file (just better, since not only URLs would be copied).
A --batch mode in both cases would be super handy to streamline use of the command(s).
WDYT Joey?
In progress in the filter-branch branch. --Joey
If I understand correctly, this might be as straightforward as extracting the various per-key logs from the git-annex branch, and shoving them into a new git tree, which the user can then do whatever with. Something like:
The obvious thing for the user to do with such a tree would be to git push it to the other repo with a name ending in /git-annex, which git-annex there will then auto-merge into that repo's git-annex branch.
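To make the "extract the per-key logs" step concrete, here is a minimal sketch in Python of how one might pick out the branch files that belong to a set of keys. It is not git-annex's implementation. It assumes the list of paths comes from something like git ls-tree -r --name-only git-annex, that per-key logs sit under two hash directories, and that the log suffixes are the ones listed in the git-annex internals documentation; the function name select_key_logs and the example keys are hypothetical, and any escaping git-annex applies to key names in filenames is ignored here.

```python
from pathlib import PurePosixPath

# Suffixes of per-key log files on the git-annex branch.
# ASSUMPTION: taken from the git-annex internals documentation; may be incomplete.
PER_KEY_SUFFIXES = (
    ".log", ".log.web", ".log.met", ".log.rmt",
    ".log.rmet", ".log.cid", ".log.cnk",
)


def select_key_logs(branch_paths, keys):
    """Return the branch paths that are per-key logs for one of `keys`.

    Top-level files such as uuid.log or trust.log are never selected,
    because per-key logs are assumed to live under two hash directories.
    """
    selected = []
    for path in branch_paths:
        parts = PurePosixPath(path).parts
        if len(parts) != 3:
            continue  # not of the form hashdir/hashdir/file
        name = parts[-1]
        for suffix in PER_KEY_SUFFIXES:
            # Strip the log suffix and see if what remains is a wanted key.
            if name.endswith(suffix) and name[: -len(suffix)] in keys:
                selected.append(path)
                break
    return selected


# Hypothetical listing of a git-annex branch.
paths = [
    "uuid.log",
    "trust.log",
    "abc/def/SHA256E-s3--aaa.txt.log",
    "abc/def/SHA256E-s3--aaa.txt.log.web",
    "123/456/SHA256E-s5--bbb.log",
    "123/456/SHA256E-s5--bbb.log.met",
]
wanted = select_key_logs(paths, {"SHA256E-s3--aaa.txt"})
```

The selected paths (and their blobs) are what would go into the new tree; whether trust.log, remote.log and similar files should come along is a separate question.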
When you start talking about hidden repos though, things get more complex, because exporting key logs from a hidden repo would necessarily expose the uuid of that repo, and maybe other logged information that was only stored in the hidden repo as well. If it has to filter it to only the parts of logs that are used by non-hidden repos, that would be a lot more complex.
It also seems to me that, if you're splitting a repo, you would also want to include things like trust.log and remote.log, or at least parts of them for some remotes?
Yes. Even when not splitting but just copying a key (or multiple keys), since special remote configuration etc. might be needed.
Hmm, it seems possible that two repos could use the same uuid for a remote, but have different configurations for it. Eg, an internal use repo that might even embed creds for the remote, and a public use repo that relies on public http urls to download from the remote.
So there would then be 3 things that need to be able to be specified:
Might as well add, for completeness:
Could get more granular than this, eg only copying some metadata fields and not others, or description but not trust log, but I'd want to see a use case. A line has to be drawn somewhere or it just gets ridiculous, and the user might as well pull up internals and git-filter-branch and post-process the tree generated by this command.
So a UI for these 3 or 4 things..
Eg:
The other axis is, I guess, should it include past commits to the git-annex branch, or only the current data? I'm inclined toward only the current data. The only thing that uses past data really is git-annex log and it's just not worth the added time expense. And also git annex forget already throws away the past data.
There is the added wart of exported treeishes being grafted into the git-annex branch (to avoid them being lost in GC in some edge cases). It would need to do like git annex forget was recently fixed to, and include those grafts when throwing away the rest of the history. (See 8e7dc958d20861a91562918e24e071f70d34cf5b)
The filtering of uuids from logs this command needs is very closely related to how the git-annex branch is filtered when dropping dead uuids and keys.
Annex.Branch.Transitions.dropDead could almost be used as-is, just providing it a trustmap that has the excluded uuids marked as dead.
But, it does not currently modify the trustLog, which makes sense for transitions, but for this the trust log needs to include only the desired uuids.
And, providing a trustmap does have the problem that, if a uuid is mentioned in the branch without being in uuid.log, it would not be in the trustmap, and so it would not be excluded. One way for that to happen is, well, using this command to copy only the per-key info for a remote, but not the config for that remote. Hmm. Using a filtering function, rather than a trustmap, would avoid this problem. But, dropDead does some processing to handle sameas-uuid pointing to a dead uuid, including a special case involving remoteLog.
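The difference between the two approaches can be illustrated with a small Python sketch. This is not the Haskell code in Annex.Branch.Transitions, and it ignores the sameas-uuid handling entirely. It assumes location log lines of the form "timestamp status uuid"; the uuids repo-a, repo-b and repo-c and both function names are hypothetical placeholders.

```python
def uuid_of(line):
    """Return the uuid field of a 'timestamp status uuid' log line, or None."""
    fields = line.split()
    return fields[2] if len(fields) >= 3 else None


def filter_with_trustmap(lines, trustmap):
    """Drop only lines whose uuid the trustmap marks as dead.

    A uuid that is missing from the trustmap is kept, which is the
    problem described above.
    """
    return [l for l in lines if trustmap.get(uuid_of(l)) != "dead"]


def filter_with_predicate(lines, keep):
    """Keep only lines whose uuid satisfies the `keep` function."""
    return [l for l in lines if keep(uuid_of(l))]


# Hypothetical location log: repo-c is mentioned here but, in this
# scenario, has no entry in uuid.log and so never made it into the trustmap.
log = [
    "1600000000.1s 1 repo-a",
    "1600000001.2s 1 repo-b",
    "1600000002.3s 1 repo-c",
]

# Goal: export only repo-a. repo-b is excluded by marking it dead.
trustmap = {"repo-a": "semitrusted", "repo-b": "dead"}
via_trustmap = filter_with_trustmap(log, trustmap)

wanted = {"repo-a"}
via_predicate = filter_with_predicate(log, lambda u: u in wanted)
```

With the trustmap, the repo-c line leaks through because nothing marks it dead; with the filtering function, anything not explicitly wanted is dropped.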
Implementation plan: