todo/Incremental git annex sync --content --allgit-annexhttp://git-annex.branchable.com/todo/Incremental_git_annex_sync_--content_--all/git-annexikiwiki2023-10-25T18:44:59Zcomment 1http://git-annex.branchable.com/todo/Incremental_git_annex_sync_--content_--all/comment_1_9d812c6dd2fd7b4c255ea88580da4396/Lukey2021-03-02T22:09:45Z2021-01-19T16:27:07Z
<p>So I implemented this with a Bash script, see <a href="https://gist.github.com/Lukey3332/203ea6f30d48323e7bd1d05c16b5da9c">here</a>. It's a bit hacky, since <code>git annex sync</code> doesn't have <code>--batch</code>, and even if it did, <code>--batch</code> can't work with keys. Instead, it checks out a temporary branch, links the keys to sync into a directory, and then uses <code>--content-of=</code> to sync only the keys within that directory.</p>
<p>I'll use it with my repos and see if it works reliably.</p>
comment 2http://git-annex.branchable.com/todo/Incremental_git_annex_sync_--content_--all/comment_2_9fff9eb8a7a897c5aabe9878f4d0b23b/Lukey2021-07-15T18:40:13Z2021-07-15T18:40:13Z
<p>The script needed a few tweaks; you can find the updated version <a href="https://gist.github.com/Lukey3332/203ea6f30d48323e7bd1d05c16b5da9c">here</a>.</p>
<p>I have been using the script for half a year now and everything works fine. I even had a problem with failed transfers due to a bad SATA cable, and the script handled it properly. <code>git annex fsck --all</code> confirms that everything is alright.</p>
comment 3http://git-annex.branchable.com/todo/Incremental_git_annex_sync_--content_--all/comment_3_638f40462d4bb2a447350ce0a5dc7a92/joey2021-07-16T18:02:52Z2021-07-16T17:42:49Z
<p>Thank you for rewording, which should not have been necessary, but seems to
have helped my reading comprehension.</p>
<p>This does seem like a good idea! That diff should be fast. If the
location log changed, it needs to recheck preferred content against the
changed situation; if it didn't, we know preferred content will give
the same result as currently applies. Elegant.</p>
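<p>As a rough illustration of that decision, the following Python sketch takes the paths reported by a diff of the git-annex branch and picks out the keys whose location log changed. It assumes location logs are stored as <code>xxx/yyy/KEY.log</code> in the branch; the function name and the sample paths are made up for the example.</p>

```python
from pathlib import PurePosixPath


def keys_with_changed_location_log(changed_paths):
    """Return the keys whose location log appears in a branch diff.

    Assumption: location logs sit two hash directories deep and end in
    exactly ".log", so names like KEY.log.web or KEY.log.met do not match.
    """
    keys = set()
    for raw in changed_paths:
        path = PurePosixPath(raw)
        if len(path.parts) == 3 and path.name.endswith(".log"):
            keys.add(path.name[: -len(".log")])
    return keys


# Hypothetical diff output: one location log, one url log, one top-level log.
changed = [
    "f87/4d5/SHA256E-s10--aaaa.txt.log",
    "1c2/9ab/SHA256E-s20--bbbb.log.web",
    "uuid.log",
]
print(sorted(keys_with_changed_location_log(changed)))
```

<p>Only keys in the returned set would need their preferred content rechecked; every other key can be skipped.</p>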
<p>I suppose it needs to record the branch tip for each remote, because
different remotes can be synced at different times. It can record it
locally, in a hidden ref or something.</p>
<p>Your script checks for changes to the preferred-content.log etc
by storing a copy and comparing it with the current one. But since it knows
the old git-annex branch tip, it can just request a diff of those files
between the old and new shas, eg:</p>
<pre><code>git diff-tree refs/annex/last-sync/origin/git-annex..git-annex --name-only -- preferred-content.log required-content.log etc
</code></pre>
<p>If that outputs anything, the logs changed and the optimisation can't be
used.</p>
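<p>A minimal Python sketch of the same check, for illustration only. The ref name <code>refs/annex/last-sync/origin/git-annex</code> is the hypothetical one from the command above, the list of logs is deliberately incomplete (the "etc"), and the helper names are invented here.</p>

```python
import subprocess

# Only the two logs named above; a real check would cover more of them.
CONFIG_LOGS = ["preferred-content.log", "required-content.log"]


def diff_tree_command(old_ref, new_ref, logs=CONFIG_LOGS):
    """Build the git diff-tree invocation that lists changed config logs."""
    return ["git", "diff-tree", "--name-only", f"{old_ref}..{new_ref}", "--", *logs]


def optimisation_usable(diff_output):
    """The optimisation is usable only when the diff printed nothing."""
    return diff_output.strip() == ""


def check(repo_dir, old_ref="refs/annex/last-sync/origin/git-annex", new_ref="git-annex"):
    """Run the diff in repo_dir and report whether the optimisation applies."""
    result = subprocess.run(
        diff_tree_command(old_ref, new_ref),
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return optimisation_usable(result.stdout)
```

<p>Calling <code>check(".")</code> in a repository where that ref exists returns <code>False</code> as soon as any of the listed logs differs between the two commits.</p>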
<p>Weirdly, this will often make --all faster than not using --all, because it
will be able to quickly see there is nothing to do. It occurs to me that
the same method could be used to tell when a non-all sync is a no-op,
and so speed those up, although only in the case where there was a previous
--all sync. Or, it could record a tuple of (tree, git-annex branch), and
use that to speed up non-all syncs, at least of the variety that don't
operate on a specific list of files, but on a whole tree.</p>
comment 4http://git-annex.branchable.com/todo/Incremental_git_annex_sync_--content_--all/comment_4_c232e1e1cfcc47f70079f2d32c2b4633/joey2023-10-25T18:44:59Z2023-06-23T16:00:04Z
<p>My recent optimisations of <code>git-annex sync</code> with importtree remotes use a
similar diffing approach.</p>
<p><code>git-annex satisfy</code> syncs <code>--content</code> by default, so this optimisation would
be especially nice to have for it.</p>
comment 5http://git-annex.branchable.com/todo/Incremental_git_annex_sync_--content_--all/comment_5_e81719f23565579674249db5d0a883da/joey2023-10-25T18:44:59Z2023-10-24T17:26:53Z
<p>To implement this optimisation for a non-all sync, when
the tree being synced has changed, it ought to diff from the old
tree to the current tree, and sync those files. Preferred
content can vary depending on filename, and diffing like that will avoid
scanning every file in the whole tree.</p>
<p>And when there are location log changes, it needs to also sync files in the
tree that use keys whose location log changed, using the git-annex branch
diff to find those keys. (And presumably then using the keys database to get
back to the filenames.)</p>
<p>So, implementing an optimisation like this for a non-all sync has two
separate diffs which would have to be combined together somehow.</p>
<p>Doing that in constant memory would be hard. It seems that a bloom filter
cannot be used to check if a file was processed in the first diff and avoid
processing it again in the second diff, because a false positive would
skip a file whose location log did change. I think it would
need to use an on-disk structure, maybe (eg sqlite)?</p>
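<p>One way the on-disk idea could look, sketched in Python with the standard <code>sqlite3</code> module. This is an assumption about how the two diffs might be combined, not how git-annex does it; <code>merge_diffs</code> and the file names are hypothetical.</p>

```python
import sqlite3


def merge_diffs(tree_diff_files, location_log_files, db_path=":memory:"):
    """Yield each file exactly once, even if both diffs mention it.

    The set of already processed files lives in a sqlite table, so an
    exact membership test is possible without holding every name in a
    Python data structure. Pass a fresh temporary file as db_path to
    keep the table on disk.
    """
    db = sqlite3.connect(db_path)
    try:
        db.execute("CREATE TABLE IF NOT EXISTS seen (file TEXT PRIMARY KEY)")
        for source in (tree_diff_files, location_log_files):
            for name in source:
                cur = db.execute(
                    "INSERT OR IGNORE INTO seen (file) VALUES (?)", (name,)
                )
                # rowcount is 1 for a new row and 0 when the insert was ignored.
                if cur.rowcount == 1:
                    yield name
        db.commit()
    finally:
        db.close()


print(list(merge_diffs(["a.txt", "b.txt"], ["b.txt", "c.txt"])))
```

<p>Unlike a bloom filter, the primary key lookup never reports a file as seen when it was not, so a file whose location log changed is not skipped by mistake.</p>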
<p>None of which should prevent implementing this nice optimisation for --all.</p>