I have several workflows that rely on regular key migrations, and I would love to explore some ways that migrating keys could be improved.
I see there has already been discussion about this: alternate keys for same content
I don't know how often this comes up for others, but it comes up a lot for me. I have several data sources that I regularly index and mirror by constructing keys from md5 and size and assembling a repo with the known filenames (gdrive, many software distribution sites, and others).
So, I have a queue-repo, and I have the flexibility of populating it later. I could even have a queue repo with just URL keys. Then I can handle ingestion and migration later.
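For the URL-keys variant, that is just something like this (the url and path are made up):

    # record a URL key without downloading anything; ingestion can happen later
    git annex addurl --relaxed --file=isos/example.iso "https://example.com/isos/example.iso"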
I would love to have a simple programmatic way of recording that one key is the authoritative key for another key, like for MD5 -> SHA256 migrations.
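The closest approximation I have today is stuffing it into key metadata, which git-annex itself doesn't interpret (the "successor" field name is just my own convention, and the keys are made up):

    # note by hand that an old MD5 key was superseded by a SHA256 key
    git annex metadata --key="MD5-s1048576--0123456789abcdef0123456789abcdef" \
        -s successor="SHA256-s1048576--0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef"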
There don't seem to be any really great solutions to the problem of obsolete keys. Merging will often re-introduce them, even if they have been excised. Marking them as dead still keeps them around, and doesn't preserve information about what key now represents the same object.
I have written helper scripts, and tools like git-annex-filter-branch are also very helpful. But I like having the flexibility of many repos that may not regularly be in sync with each other, and a consistent history.
This would break things for sure, but what if, during a migration, a symlink was made in the git-annex branch from the previous key to the migrated key? The union merge driver could defer to the upgraded or preferred backend. If an out-of-date repo tries syncing with an already-upgraded key, the merge driver can see that the migration for that key has already happened, merge the obsolete key entries, and overwrite it back to a symlink during the merge.
A less drastic approach might be to expand the location log format to indicate a canonical "successor" key, instead of the old key just being marked dead.
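To illustrate what I mean (this is a mock-up, not any real git-annex log format), the old key's location log line could grow a final field naming its successor:

    1317929100.012345s 0 e605dca6-446a-11e0-8b2a-002170d25c55 SHA256-s1048576--0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef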
It might seem like a lot of complexity, but it would also in my opinion make a more consistent and flexible data model.
Hi! If I understand you correctly, your problem is that you often migrate keys to another backend, and there are situations involving merges of repos far away from each other in history that cause merge conflicts, which results in the dead old pre-migration key being reintroduced?
I never use key backend migration and I don't fully understand your workflow. Could you provide a reproducible example of your problem (incl all commands)? This would help a lot.
There seem to be difficulties with both performance and with security in storing information in the git-annex branch to declare that one key is replaced by another one.
I wonder if there are any pain points that could be handled better without recording such information in the git-annex branch. What do your helper scripts do?
I wonder if it would suffice to have a way for git-annex to record that key A migrated to B, but not treat that as meaning that it should get B's content when it wants A, or vice-versa.
Instead, when a repository learns that A was elsewhere migrated to B, it could hardlink its content for A to B and update the location log for B to say it has a copy. The same as if
git-annex migrate
were run locally. (It could even hash the content and verify it got B.)

That wouldn't help if a special remote has the content of A, and git-annex wants to get the content of B.
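Roughly, the manual equivalent today would be something like this untested sketch (the keys are made up; run it from the top of the repository; it pokes at .git/annex/objects directly, so it's fragile):

    old="MD5-s1048576--0123456789abcdef0123456789abcdef"
    new="SHA256-s1048576--0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef"
    src=$(git annex contentlocation "$old")                      # where A's content currently lives
    dir=".git/annex/objects/$(git annex examinekey --format='${hashdirmixed}${key}/' "$new")"
    mkdir -p "$dir"
    ln "$src" "$dir$new"                                         # hardlink A's content into B's object path
    git annex setpresentkey "$new" "$(git config annex.uuid)" 1  # location log now says B is present here
    git annex fsck --key="$new"                                  # verify the content really hashes to B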
@joey
It isn't a huge problem, but I keep coming back to it. The only workflow I still use where this comes up is for my filesharing assets repo. I just ended up leaving it as MD5E, because much of it is downstream from gdrive shares, and I almost never have all of the content in one place at a time.
This is one of the scripts I sometimes use, although I wrote it a while ago, before I found out about git-annex-filter-branch: https://gist.github.com/unqueued/06b5a5c14daa8224a659c5610dce3132
But I mostly rely on splitting off subset repos with no history, processing them in some way, and then re-absorbing them back into a larger repo.
I actually started a repo that would track new builds for Microsoft Dev VMs: https://github.com/unqueued/official-microsoft-vms-annex
But for my bigger repos, I almost never have all of the data in the same place at the same time.
@nobodyinperson
Well, there aren't any conflicts, they just get silently reintroduced, which isn't the end of the world, especially if they get marked as dead. But they clutter the git-annex branch, and over time, with large repos, it may become a problem. There isn't any direct relationship between the previous key and the migrated key.
So, if I have my
linux_isos
repo, and I do git-annex-migrate on it, but say only isos for the year 2021 are in my specific repo at that moment, then the symlinks will be updated and the new SHA256 log files will be added to my git-annex branch. And if you sync with another repo that also has the same files under the old keys, those keys will still be in the repo, but just inaccessible.
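(Concretely, that migration step is just something along these lines:)

    # only files whose content is present here actually get new SHA256 keys;
    # everything else keeps its old MD5E key for now
    git annex migrate --backend=SHA256E .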
And I feel like there's enough information to efficiently track the lifecycle of a key.
I'm exhuming my old scripts and cleaning them up, but essentially, you can get everything you need to assemble an MD5E annex from a Google Drive share by doing
rclone lsjson -R --hash rclone-drive-remote:
And to get the keys, you could pipe it into something like this:
perl -MJSON::PP -0777 -ne '$j = decode_json($_); foreach $e (@{$j}) { next if $e->{"IsDir"} || !exists $e->{"Hashes"}; print "MD5-s" . $e->{"Size"} . "--" . $e->{"Hashes"}->{"MD5"} . "\t" . $e->{"Path"} . "\n"; }'
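Once that key<TAB>path list is saved somewhere (say keys.tsv, a name I'm making up here), building the tree is just a fromkey loop, no content needed:

    # stage each known key at its path without having the content locally
    while IFS=$'\t' read -r key path; do
        mkdir -p "$(dirname "$path")"
        git annex fromkey --force "$key" "$path"
    done < keys.tsv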
That's just part of a project I have with a Makefile that indexes, assembles and then optionally re-serves an rclone gdrive remote. I will try to post it later tonight. It was just a project I made for fun.
And there are plenty of other places where you can get enough info to assemble a repo ahead of time, and essentially turn it into a big queue.
You can find all sorts of interesting things to annex.
https://old.reddit.com/r/opendirectories sometimes has interesting stuff.
Here are some public Google Drive shares:
I've spent a while thinking about this and came up with the ideas at distributed migration.
I think that probably would handle your use case.
Whoa, thanks for implementing that Joey! Can't wait to give it a try!
FYI, one of the cases I was talking about before, where I repeatedly import keys in MD5E format, is that I construct an annex repo, set web urls, and deal with mirroring further down the pipeline.
Code isn't great, just something I threw together years ago: https://github.com/unqueued/annex-drive-share
Because it is gdrive, I can get MD5s and filenames with rclone, plus urls to use with the web remote.
The way I use it is to init a new annex repo (reusing the same uuid), and then absorb it into the primary downstream repo, overwriting filenames and letting sync bring any new keys into the primary repo. I considered using git subtree.
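Roughly like this (the remote name, path and uuid are made up):

    # in the freshly assembled repo, take over the queue repo's previous uuid
    git annex reinit 2b245e10-1a53-4433-8c9f-28a1b17b2a60

    # then, in the primary downstream repo, absorb it and let sync merge the git-annex branches
    git remote add drive-queue /path/to/drive-queue
    git annex sync drive-queue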
It does end up causing merge commits to build up in the git-annex branch, but I might want to run this on a server without sharing an entire repo.
It happens to make sense for me because I have an unlimited @edu gdrive account and it can work great for some workflows as an intermediate file store.