Recent comments posted to this site:

Even if the stuff with special remotes turned out to be too complicated to implement, git-annex migrate --update would be useful for some users. So it's worth implementing the mapping and then incrementally implementing these ideas.

Comment by joey Fri Dec 1 19:00:41 2023

I've spent a while thinking about this and came up with the ideas at distributed migration.

I think that probably would handle your use case.

Comment by joey Fri Dec 1 18:42:07 2023

About the idea of recording a checksum of the content of a URL or WORM key, without migrating to a SHA key: that does seem worth considering. (And maybe that was the original idea of this todo, really.)

If that were implemented, it would need to be possible to record more than one checksum for a given URL key, because different clones might get different content from the URL, and each would add its own checksum.

So, this would not be as strong an assurance that you're referring to a specific piece of data as using a SHA key. It would be useful to protect against bit rot, but not as a way to pin a file to a particular version. Which is often something one does want to do in a git repository!

I do think that implementing that would be a lot simpler. And it would only affect performance when verifying the content of URL or WORM keys, since it would then need to look up the checksum in the git-annex branch.
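
To illustrate, a per-key log of recorded checksums might look something like this (a hypothetical format in the style of git-annex's timestamped logs, not something git-annex currently writes):

    1701453620.123456s SHA256E-s1048576--259ec7f2...
    1701453655.654321s SHA256E-s1048576--8f3ac24b...

Two lines with different checksums would represent the case above, where two clones each got different content from the same URL.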

Comment by joey Fri Dec 1 18:00:20 2023

I did a small experiment to gauge how much the git repo size would grow if migration were recorded in log files in the git-annex branch.

In my experiment, I started with 1000 files using sha256. The size of the git objects (after repacking with git gc --aggressive) was 0.5 mb. I then migrated them to sha512, which increased the size of the git objects to 1.1 mb (after repacking).
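
Something along these lines reproduces the measurement (a sketch, not the exact commands used; file contents and sizes are arbitrary):

    # set up a test repo with 1000 annexed files using sha256
    git init sizetest && cd sizetest && git annex init
    for i in $(seq 1000); do echo "file $i" > "file$i"; done
    git annex add --backend=SHA256 .
    git commit -m 'add files'
    git gc --aggressive && du -sk .git/objects    # baseline size

    # migrate everything to sha512 and measure again
    git annex migrate --backend=SHA512 .
    git commit -m 'migrate to sha512'
    git gc --aggressive && du -sk .git/objects    # size after migration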

Then I recorded in the git-annex branch additional log files for each of the sha512 keys that contained the corresponding sha256 key. That grew the git objects to 1.4 mb after repacking.

This was a little disappointing. I'd hoped that repacking would avoid duplicating the sha256 keys, which both appear in the log files I wrote and are used as filenames. But the data I wrote to the logs is only 75 kb total, and the git objects grew by 4x that (0.3 mb).

I tried the same thing, except instead of separate log files I added to git one log file that contained pairs of sha256 and sha512 keys. That log file was 213 kb, and adding it to the git repo grew it by 102 kb. So there was some compression there, but less than I would have hoped, and not much better than just gzip -9 of the log file (113 kb). Of course, putting all the migration information in a single file like this would add a lot of complexity to accessing it.
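
For illustration, each line of that combined log might map an old key to its new key, something like (hypothetical values):

    SHA256-s1048576--259ec7f2... SHA512-s1048576--f7fbb982...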

So adding this information to the git-annex branch would involve at best around a 16% overhead, which is a surprising amount.

(It would be possible to make git-annex forget --drop-dead remove the information about old migrated keys if they later get marked as dead, and so regain the space.)

This is also rather redundant information to store in git, since most of the time when file foo has been migrated, the old key can be determined by looking at git log foo. Not always, of course, because foo might have been renamed after migration, for example.
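
For a locked (symlinked) annexed file, something like this recovers the old key from the history, since each version's symlink target encodes its key (a sketch; --follow handles renames for a single file):

    git log -p --follow -- foo | grep '\.git/annex/objects/'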

Another way to store migration information in the git-annex branch would be to graft in the pre-migration tree and the post-migration tree. Diffing those two trees would show what migrated, and most of the time this would use almost no additional space in git, because the user will have committed both those trees anyway, or something very close to them. But it would be more expensive to extract the migration information, and a local cache of migrations would need to be built up from examining those diffs.
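
A sketch of that extraction, where pre-tree and post-tree stand in for hypothetical refs to the two grafted trees:

    # each changed path pairs an old blob (symlink to the old key)
    # with a new blob (symlink to the new key)
    git diff-tree -r --no-renames <pre-tree> <post-tree>

Mapping the old and new blobs of each changed path back to keys would then give the migration pairs.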

Comment by joey Fri Dec 1 16:10:11 2023

How about having git annex info --fast skip this lookup step for remotes whose UUID it doesn't know yet?

git annex info can already be quite slow in the other steps it takes (counting files, disk space, etc.) in large repos, so it is not much of a surprise that it hangs for a while by default. But if --fast made it actually fast by staying completely offline (right?) and skipping the slow local counting steps, that would be logical.

Comment by nobodyinperson Fri Dec 1 11:50:25 2023

I've had an idea on this: Why not only update UUIDs on (manual) sync/fetch?

This would be in line with how git otherwise interacts with regular remotes, too: it always requires an explicit fetch to update their info.

To me it just violates the principle of least surprise to have git-annex try to reach remotes when running something as simple as info.

Comment by Atemu Fri Dec 1 10:21:09 2023

@joey

It isn't a huge problem, but I keep coming back to it. The only workflow I still use where this comes up is for my filesharing assets repo. I just ended up leaving it as MD5E, because much of it is downstream from gdrive shares, and I almost never have all of the content in one place at a time.

This is one of the scripts I sometimes use, although I wrote it a while ago, before I found out about git-annex-filter-branch: https://gist.github.com/unqueued/06b5a5c14daa8224a659c5610dce3132

But I mostly rely on splitting off subset repos with no history, processing them in some way, and then re-absorbing them back into a larger repo.

I actually started a repo that would track new builds for Microsoft Dev VMs: https://github.com/unqueued/official-microsoft-vms-annex

But for my bigger repos, I almost never have all of the data in the same place at the same time.

@nobodyinperson

Hi! If I understand you correctly, your problem is that you often migrate keys to another backend, and situations involving merges of repos that are far apart in history cause merge conflicts, which results in the dead old pre-migration keys being reintroduced?

Well, there aren't any conflicts; they just get silently reintroduced, which isn't the end of the world, especially if they get marked as dead. But they clutter the git-annex branch, and over time, with large repos, that may become a problem. There isn't any direct relationship between the previous key and the migrated key.

So, if I have my linux_isos repo and I run git-annex-migrate on it, but, say, only the isos for the year 2021 are in my specific repo at that moment, then those symlinks will be updated and the new sha256 log files will be added to my git-annex branch.

And if you sync with another repo that also has the same files under the old backend, the old keys will still be in the repo, just inaccessible.

And I feel like there's enough information to efficiently track the lifecycle of a key.

I'm exhuming my old scripts and cleaning them up, but essentially, you can get everything you need to assemble an MD5E annex from a Google Drive share by doing rclone lsjson -R --hash rclone-drive-remote:

And to get the keys, you could pipe it into something like this (-0777 makes perl slurp the whole JSON document at once):

    perl -MJSON::PP -0777 -ne 'my $j = decode_json($_); foreach my $e (@{$j}) { next if $e->{"IsDir"} || !exists $e->{"Hashes"}; print "MD5-s" . $e->{"Size"} . "--" . $e->{"Hashes"}->{"MD5"} . "\t" . $e->{"Path"} . "\n"; }'
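
Putting the pieces together, one illustrative way to turn that listing into annexed files is via git annex fromkey (a sketch; the remote name is from the example above, and --force is needed because the content is not present locally yet):

    # dump the listing and convert it to key<TAB>path pairs
    rclone lsjson -R --hash rclone-drive-remote: > listing.json
    perl -MJSON::PP -0777 -ne 'my $j = decode_json($_); foreach my $e (@{$j}) { next if $e->{"IsDir"} || !exists $e->{"Hashes"}; print "MD5-s" . $e->{"Size"} . "--" . $e->{"Hashes"}->{"MD5"} . "\t" . $e->{"Path"} . "\n"; }' listing.json > keys.tsv

    # inside an initialized git-annex repo, point files at those keys
    while IFS="$(printf '\t')" read -r key path; do
        mkdir -p "$(dirname "$path")"
        git annex fromkey --force "$key" "$path"
    done < keys.tsv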

That's just part of a project I have with a Makefile that indexes, assembles and then optionally re-serves an rclone gdrive remote. I will try to post it later tonight. It was just a project I made for fun.

And there are plenty of other places where you can get enough info to assemble a repo ahead of time, and essentially turn it into a big queue.

You can find all sorts of interesting things to annex.

https://old.reddit.com/r/opendirectories sometimes has interesting stuff.

Here are some public Google Drive shares:

Comment by unqueued Fri Dec 1 02:09:07 2023

I wonder if it would suffice to have a way for git-annex to record that key A migrated to key B, but not treat that as meaning that it should get B's content when it wants A, or vice versa.

Instead, when a repository learns that A was elsewhere migrated to B, it could hardlink its content for A to B and update the location log for B to say it has a copy, the same as if git-annex migrate were run locally. (It could even hash the content and verify it really got B.)
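
A manual approximation of that, using existing plumbing commands (a sketch: it glosses over details like object-directory permissions, and assumes $A holds the old key, $B the new key, and that A's content is present locally):

    src=$(git annex contentlocation "$A")    # where A's content lives
    rel=$(git annex examinekey --format='${hashdirmixed}${key}/${key}' "$B")
    mkdir -p "$(dirname ".git/annex/objects/$rel")"
    ln "$src" ".git/annex/objects/$rel"      # hardlink A's content into place as B
    git annex setpresentkey "$B" "$(git config annex.uuid)" 1
    git annex fsck --key="$B"                # re-hash to verify the content really is B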

That wouldn't help if a special remote has the content of A, and git-annex wants to get the content of B.

Comment by joey Thu Nov 30 20:55:02 2023

There seem to be difficulties with both performance and security in storing information in the git-annex branch to declare that one key is replaced by another.

I wonder if there are any pain points that could be handled better without recording such information in the git-annex branch. What do your helper scripts do?

Comment by joey Thu Nov 30 20:49:53 2023