copy-key (--batch) to copy/merge availability info

The ability to copy key availability information from one git-annex'ed repo to another has been a use case we have needed for quite some time. Michael even implemented datalad copy-file remotes - that was an immediate use case whenever it was necessary to craft "custom" smaller datasets from a much larger (super)dataset. And now we have come to the same need in order to facilitate "splitting" a larger dataset into sub-datasets: issues/600.

I think, in particular with https://git-annex.branchable.com/todo/hiding_a_repository/, something like git annex copy-key --from hidden-repo KEY could become the ultimate tool for such operations. Alternatively, maybe it could be git annex copy-key --from-repo /repo/path KEY, so there would not even be a need to link the original one as some hidden repo; one could instead "include" the original one as a git submodule (e.g. original/).

Such a command would need to do pretty much whatever git annex merge already does, just limiting its effects to only the relevant key(s) and only the remotes which are known to have that key.

Although I concentrated on "copy-key", pretty much the same feature could be useful to provide a git annex copy-file PATH1 PATH2 ... DEST where PATH?s would be paths in other git-annex repo(s) -- so it would be pretty similar to datalad copy-file (just better, since not only URLs would be copied).

--batch mode in both cases would be super handy to streamline use of the command(s).

WDYT Joey?

In progress in the filter-branch branch. --Joey

done --Joey


comment 1

If I understand correctly, this might be as straightforward as extracting the various per-key logs from the git-annex branch, and shoving them into a new git tree, which the user can then do whatever with. Something like:

git-annex copy-key path/
8e1c53342eb6461eedf13ee7d15038b400c70269

The obvious thing for the user to do with such a tree would be to git push it to the other repo with a name ending in /git-annex, which git-annex there will then auto-merge into that repo's git-annex branch.
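A minimal sketch of that extraction in Python, assuming the git-annex branch has already been read into a dict mapping file paths to contents. The `select_key_logs` helper and the suffix list are assumptions for illustration only; the real set of per-key log file names is defined inside git-annex and may differ.

```python
import posixpath

# Assumed per-key log suffixes; illustrative, not an authoritative list.
KEY_LOG_SUFFIXES = (".log", ".log.web", ".log.met", ".log.rmt", ".log.cnk")


def select_key_logs(branch_files, keys):
    """Return the entries of branch_files that are per-key logs for keys.

    branch_files: dict mapping a path in the git-annex branch to its content.
    keys: iterable of key names whose logs should be kept.
    """
    # Per-key logs are matched by file name, ignoring the hash directories.
    wanted = {key + suffix for key in keys for suffix in KEY_LOG_SUFFIXES}
    return {
        path: content
        for path, content in branch_files.items()
        if posixpath.basename(path) in wanted
    }


# Hypothetical branch contents: two keys plus two non-key logs.
branch = {
    "abc/def/SHA256E-s3--aaa.log": "location log",
    "abc/def/SHA256E-s3--aaa.log.web": "url log",
    "123/456/SHA256E-s9--bbb.log": "location log",
    "uuid.log": "repository descriptions",
    "trust.log": "trust settings",
}
new_tree = select_key_logs(branch, ["SHA256E-s3--aaa"])
```

Here `new_tree` holds only the two logs of the first key; writing it out as a git tree is left out of the sketch.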

When you start talking about hidden repos though, things get more complex, because exporting key logs from a hidden repo would necessarily expose the uuid of that repo, and maybe other logged information that was only stored in the hidden repo as well. If it has to filter it to only the parts of logs that are used by non-hidden repos, that would be a lot more complex.

It also seems to me that, if you're splitting a repo, you would also want to include things like trust.log and remote.log, or at least parts of them for some remotes?

Comment by joey — Tue May 4 14:58:17 2021


comment 2

It also seems to me that, if you're splitting a repo, you would also want to include things like trust.log and remote.log, or at least parts of them for some remotes?

Yes. Even when not splitting but just copying a key (or multiple keys), since special remote configuration etc. might be needed.

Comment by yarikoptic — Tue May 4 17:50:45 2021


comment 3

Hmm, it seems possible that two repos could use the same uuid for a remote, but have different configurations for it. Eg, an internal use repo that might even embed creds for the remote, and a public use repo that relies on public http urls to download from the remote.

So there would then be 3 things that need to be able to be specified:

keys to copy
uuids whose per-key information should be copied (or ones to skip)
uuids whose non-per-key information should be copied (or ones to skip) (remote description, special remote config, trust, group, preferred content, etc)

Might as well add, for completeness:

whether to copy global config settings, or not (numcopies, mincopies, git-annex-config, group-preferred-content, difference.log)
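To make these selections concrete, here is a rough Python sketch of a container for them. The `FilterSpec` class, its field names, and the choice to model only the "ones to skip" variant are all assumptions made for illustration; nothing here is git-annex's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class FilterSpec:
    """Hypothetical container for the three or four selections above."""

    # Keys whose per-key information should be copied.
    keys: set = field(default_factory=set)
    # Uuids whose per-key information should be skipped.
    exclude_key_info_for: set = field(default_factory=set)
    # Uuids whose non-per-key information (config) should be skipped.
    exclude_config_for: set = field(default_factory=set)
    # Whether global config settings are copied at all.
    include_global_config: bool = False

    def keeps_key_info(self, key, uuid):
        """True if per-key information about key for uuid is copied."""
        return key in self.keys and uuid not in self.exclude_key_info_for

    def keeps_config(self, uuid):
        """True if non-per-key information for uuid is copied."""
        return uuid not in self.exclude_config_for


spec = FilterSpec(
    keys={"KEY1"},
    exclude_key_info_for={"private-uuid"},
    exclude_config_for={"private-uuid"},
    include_global_config=True,
)
```

Keeping the per-key and config exclusions separate allows copying a remote's per-key information without its configuration, or the reverse.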

Could get more granular than this, eg only copying some metadata fields and not others, or description but not trust log, but I'd want to see a use case. A line has to be drawn somewhere or it just gets ridiculous, and the user might as well pull up internals and git-filter-branch and post-process the tree generated by this command.

So, a UI for these 3 or 4 things:

git-annex filter-branch --keys-from=path \
    --include-key-information-for=repo \
    --exclude-key-information-for=repo \
    --include-config-for=repo \
    --exclude-config-for=repo \
    --include-global-config \
    --exclude-global-config

Eg:

git-annex filter-branch --keys-from=. \
    --exclude-key-information-for=privateremote \
    --exclude-config-for=privateremote \
    --include-global-config

Comment by joey — Thu May 13 16:10:39 2021


comment 4

The other axis is, I guess, should it include past commits to the git-annex branch, or only the current data? I'm inclined toward only the current data. The only thing that uses past data really is git-annex log and it's just not worth the added time expense. And also git annex forget already throws away the past data.

There is the added wart of exported treeishes being grafted into the git-annex branch (to avoid them being lost in GC in some edge cases). It would need to do what git annex forget was recently fixed to do, and include those grafts when throwing away the rest of the history. (See 8e7dc958d20861a91562918e24e071f70d34cf5b)

Comment by joey — Thu May 13 16:29:41 2021


comment 5

The filtering of uuids from logs this command needs is very closely related to how the git-annex branch is filtered when dropping dead uuids and keys.

Annex.Branch.Transitions.dropDead could almost be used as-is, just providing it a trustmap that has the excluded uuids marked as dead.

But, it does not currently modify the trustLog, which makes sense for transitions, but for this the trust log needs to include only the desired uuids.

And, providing a trustmap does have the problem that, if a uuid is mentioned in the branch without being in uuid.log, it would not be in the trustmap, and so it would not be excluded. One way for that to happen is, well, using this command to copy only per-key info for a remote, but not config for a remote. Hmm. Using a filtering function, rather than a trustmap, would avoid this problem. But, dropDead does some processing to handle sameas-uuid pointing to a dead uuid, including a special case involving remoteLog.
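The difference between the two approaches can be illustrated with a small Python sketch. It is a simplified model, not the Haskell code: the log lines use an assumed "timestamp status uuid" layout, the trust map is built only from uuids that would be listed in uuid.log, and all function names are hypothetical.

```python
DEAD = "dead"


def uuid_of(line):
    """Extract the uuid from a simplified 'timestamp status uuid' line."""
    return line.split()[2]


def filter_with_trustmap(lines, trustmap):
    """Drop only the lines whose uuid the trust map marks as dead."""
    return [line for line in lines if trustmap.get(uuid_of(line)) != DEAD]


def filter_with_predicate(lines, keep_uuid):
    """Keep only the lines whose uuid the filtering function accepts."""
    return [line for line in lines if keep_uuid(uuid_of(line))]


# uuid-c appears in a per-key log but is not listed in uuid.log.
known_uuids = {"uuid-a", "uuid-b"}
included = {"uuid-a"}
trustmap = {u: DEAD for u in known_uuids if u not in included}

lines = [
    "1620000000s 1 uuid-a",
    "1620000001s 1 uuid-b",
    "1620000002s 1 uuid-c",
]

via_trustmap = filter_with_trustmap(lines, trustmap)
via_predicate = filter_with_predicate(lines, lambda u: u in included)
```

In this model `via_trustmap` still contains the uuid-c line, because that uuid never made it into the trust map, while `via_predicate` contains only the uuid-a line.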

Implementation plan:

Address above problems with dropDead, somehow, so it can be reused. (done; refactored to filterBranch)
Add a function (in Logs) from a key to all possible git-annex branch log files for that key. (done; keyLogFiles)
For each key seeked, run that function, query the branch to see which log files exist, and pass through dropDead to filter and populate the temporary index. This way, the command does not need to buffer the whole set of keys in memory.
Pass nonKeyLogFiles through dropDead as well.
Refactor regraftexports from Annex.Branch, and call it after constructing the filtered index.
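The per-key part of this plan could be sketched in Python roughly as follows. This is an illustration under assumptions, not the actual implementation: `key_log_files` is a hypothetical stand-in for keyLogFiles that omits hash directories and uses an assumed suffix list, the branch is modeled as a dict from path to a list of lines, and the filter is a plain per-line callable.

```python
# Assumed subset of per-key log suffixes, for illustration only.
KEY_LOG_SUFFIXES = (".log", ".log.web")


def key_log_files(key):
    """Hypothetical stand-in for keyLogFiles; hash directories omitted."""
    return [key + suffix for suffix in KEY_LOG_SUFFIXES]


def filtered_entries(keys, branch, keep_line):
    """Yield (path, kept_lines) for each existing per-key log.

    Keys are consumed one at a time, so the caller may pass a generator
    and the whole set of keys never has to be held in memory.
    """
    for key in keys:
        for path in key_log_files(key):
            lines = branch.get(path)
            if lines is None:
                continue  # this log file does not exist for the key
            kept = [line for line in lines if keep_line(line)]
            if kept:
                yield path, kept


# Hypothetical branch with location logs only.
branch = {
    "KEY1.log": ["1620000000s 1 uuid-a", "1620000001s 1 uuid-private"],
    "KEY2.log": ["1620000002s 1 uuid-private"],
}


def keep(line):
    """Drop lines that mention the excluded uuid."""
    return line.split()[2] != "uuid-private"


index = dict(filtered_entries(iter(["KEY1", "KEY2", "KEY3"]), branch, keep))
```

The resulting `index` plays the role of the temporary index; logs that end up empty after filtering are simply not added in this sketch.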

Comment by joey — Thu May 13 16:55:36 2021


comment 6

This is implemented.

Comment by joey — Mon May 17 18:15:55 2021
