Spent the past two weeks on the sqlite database improvements, which will be git-annex v8.
That cleaned up a significant amount of technical debt. I had made some bad
choices about encoding sqlite data early on, and the persistent library
turns out to make a dubious choice about how String is stored, which
sometimes prevents unicode surrogate code points from roundtripping.
On top of those problems, there were some missing indexes. And then, to
resolve the git add mess, I had to write a raw SQL query that used LIKE,
which was super ugly, slow, and not indexed.
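
To illustrate why a LIKE query of that kind is slow, here is a minimal sketch using Python's standard `sqlite3` module. The table `associated`, its columns, and the index `file_index` are hypothetical names chosen for the example, not git-annex's actual schema. It asks sqlite how it would run a LIKE with a leading wildcard versus an exact match on an indexed column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and index, for illustration only.
conn.execute("CREATE TABLE associated (key TEXT, file TEXT)")
conn.executemany(
    "INSERT INTO associated VALUES (?, ?)",
    [(f"KEY-{i}", f"dir/file{i}.dat") for i in range(1000)],
)
conn.execute("CREATE INDEX file_index ON associated (file)")


def plan(sql, params=()):
    """Return sqlite's description of how it would execute the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each row.
    return " ".join(row[-1] for row in rows)


# A pattern with a leading wildcard cannot use the index: full table scan.
like_plan = plan("SELECT key FROM associated WHERE file LIKE ?", ("%file42.dat",))
# An exact match on the indexed column can be answered by an index search.
exact_plan = plan("SELECT key FROM associated WHERE file = ?", ("dir/file42.dat",))

print(like_plan)
print(exact_plan)
```

The first plan reports a scan of the whole table, while the second reports a search using `file_index`. The exact wording of the plan text varies between sqlite versions.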
Really good to get all that resolved. And I have microbenchmarks that are good too; 10-25% speedup across the board for database operations.
The tricky thing was that, due to the encoding problem, both filenames and keys stored in the old sqlite databases can't be trusted to be valid. This ruled out a database migration because it could leave a repo with bad old data in it. Instead, the old databases have to be thrown away, and the upgrade has to somehow build new databases that contain all the necessary data. Seems a tall order, but luckily git-annex is a distributed system and so the databases are used as a local fast cache for information that can be looked up more slowly from git. Well, mostly. Sometimes the databases are used for data that has not yet been committed to git, or that is local to a single repo.
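
The roundtrip problem can be shown with an analogy in Python; this is not the Haskell persistent code itself. Python, like GHC, represents undecodable filename bytes as lone surrogate code points, and those cannot be encoded as strict UTF-8 text. The sketch below assumes a hypothetical `files` table and shows one common way to sidestep this class of problem, which is to store the raw bytes as a BLOB. It does not claim to be the encoding v8 uses.

```python
import sqlite3

# A filename whose bytes are not valid UTF-8 (Latin-1 encoded e-acute).
raw_name = b"caf\xe9.txt"

# Decoding with surrogateescape maps the bad byte to a lone surrogate.
name = raw_name.decode("utf-8", "surrogateescape")

# Strict UTF-8 encoding of that string fails, so storing it as UTF-8
# text cannot roundtrip.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Storing the original bytes as a BLOB preserves them exactly.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (name BLOB)")  # hypothetical table
conn.execute(
    "INSERT INTO files VALUES (?)",
    (name.encode("utf-8", "surrogateescape"),),
)
(stored,) = conn.execute("SELECT name FROM files").fetchone()
roundtripped = stored.decode("utf-8", "surrogateescape")

print(strict_ok, stored == raw_name, roundtripped == name)
```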
So I had to find solutions to a lot of hairy problems. In a couple cases, the solutions involve git-annex doing more work after the upgrade for a while, until it is able to fully regenerate the data that was stored in the old databases.
One nice thing about this approach is that, if I ever need to change the sqlite databases again, I can reuse the same code to delete the old and regenerate the new, rather than writing migration code specific to a given database change.
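
The general delete-and-regenerate pattern can be sketched as follows, again in Python rather than git-annex's Haskell. Everything here is an assumption made for illustration: the `open_cache` function, the `slow_lookup` callable standing in for the slower authoritative source, the table layout, and the use of sqlite's `user_version` pragma as the version marker. It also ignores the harder case mentioned above, where the database holds data that exists nowhere else.

```python
import os
import sqlite3

SCHEMA_VERSION = 2  # hypothetical version number for the cache layout


def open_cache(path, slow_lookup):
    """Open the cache database, regenerating it if its layout is outdated.

    slow_lookup is a hypothetical callable that yields (key, file) pairs
    from the authoritative but slower source.
    """
    if os.path.exists(path):
        conn = sqlite3.connect(path)
        (version,) = conn.execute("PRAGMA user_version").fetchone()
        if version == SCHEMA_VERSION:
            return conn  # cache is current, use it as-is
        conn.close()
        # Old contents are not trusted, so discard them instead of migrating.
        os.remove(path)

    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE associated (key TEXT, file TEXT)")
    conn.execute("CREATE INDEX file_index ON associated (file)")
    conn.executemany("INSERT INTO associated VALUES (?, ?)", slow_lookup())
    conn.execute(f"PRAGMA user_version = {SCHEMA_VERSION}")
    conn.commit()
    return conn
```

Because nothing is read from the old database, the same function keeps working for any later layout change: bump `SCHEMA_VERSION` and adjust the table definitions.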
Anyway, v8 is all ready to merge, but I'm inclined to sit on it for a month or two, to avoid upgrade fatigue. Also, I may find more ways to improve the database schema. Perhaps it would be worth it to do some normalization, and/or move everything into a single large database rather than the current smattering of unnormalized databases?
If rebuilding the database is an operation that will take some time, it might be nice to have deprecation warnings in the web app before the assistant eventually autoupgrades repos.
I use the assistant/web app to manage about 10 repos, some of which are on the larger side both in terms of disk space and number of files. I suspect that if the assistant kicked off an upgrade with a database rebuild on all of these at once, it would have a noticeable performance impact on my machine. If, after v8 is merged but before the assistant autoupgrades, the web app displayed a message like "This repository is using a version that will soon be upgraded, click here to learn more about v8 and consider upgrading", it would give folks (who don't read the devblog or release notes) a heads up and give them a chance to manually upgrade repositories one by one.
Hi Joey, do you think it would be possible to squeeze a solution for the missing symlinks for not-present unlocked files into v8?
Symlinks for missing unlocked files are the last thing missing from V5 for those of us simulating direct mode with adjusted branches and annex.thin. For reference: https://git-annex.branchable.com/devblog/day_601__v7_default/ and https://git-annex.branchable.com/todo/symlinks_for_not-present_unlocked_files/