git-annex: design/assistant/blog http://git-annex.branchable.com/design/assistant/blog/ (ikiwiki, 2023-06-26T16:40:19Z)
day 651 a major release and a conference http://git-annex.branchable.com/devblog/day_651__a_major_release__and_a_conference/ 2023-06-26T16:40:19Z 2023-06-26T15:53:43Z
<p>Today I'm releasing git-annex 10.20230626. This release got delayed for 2
months due to making some breaking changes in how filenames with unusual
characters are quoted. So it has an unusual number of changes in it, and is
as major as a git-annex release gets without being a repository version
bump.</p>
<p>But I mostly wanted to announce that we are planning a joint git-annex/Datalad meeting!
In the first half of 2024, probably in Germany. If you would be interested in
attending, please fill out this
<a href="https://docs.google.com/forms/d/e/1FAIpQLSf0PVLyzicTnAIz8dN7PrT8runUl2W9QxSWCzZoEW3x5bD9lA/viewform">brief survey</a>.
While Datalad is aimed at scientists, and I look forward to spending time
with them, you do not have to be a scientist to attend; any interested
users are welcome.</p>
<p>Also, if you'd like to learn about git-annex in Germany this weekend,
Yann Büchau is hosting a
<a href="https://cfp.tuebix.org/tuebix-2023/talk/review/GWRP3UKE3VFKVDG8RNQ8ZZPCZPNZYYWM">workshop</a>.</p>
day 649-650 speeding up repeated imports http://git-annex.branchable.com/devblog/day_649-650__speeding_up_repeated_imports/ 2023-06-01T22:43:33Z 2023-06-01T22:43:33Z
<p>Importing trees from special remotes still feels a bit like a new feature,
although it was added to git-annex in 2019. I don't know if many people are
using it. I've had some complaints about it being slow when the remote
contains a large number of files (eg 100 thousand).</p>
<p>I've just finished speeding up repeated imports from a special remote a
lot, when the special remote contains a large number of files, and few or
no files have changed.</p>
<p>git-annex was spending a lot of time converting content identifiers to
keys. Each conversion took a database lookup, which was slow enough to
become painful in bulk.</p>
<p>I thought of a neat trick. Take the sha1 of a content identifier, and
create a git tree of the files in the special remote, using those sha1s as
the content of the files. Of course, that is not the actual content of any
file that git knows about. But it doesn't matter, because once git-annex
has those trees, it can diff the current tree to the tree from the previous
import. And that tells it which files have changed. Then it only has to do
database lookups for the changed files.</p>
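<p>To make the trick concrete, here is a minimal sketch in Python. It is not git-annex's code: a plain dictionary comparison stands in for git's tree diff, and the paths and content identifiers are made up for illustration. It only shows why hashing content identifiers lets the expensive lookups be limited to changed files.</p>

```python
import hashlib

def fingerprint_tree(content_ids):
    """Map each path to the sha1 hex digest of its content identifier."""
    return {path: hashlib.sha1(cid.encode("utf-8")).hexdigest()
            for path, cid in content_ids.items()}

def changed_paths(previous, current):
    """Paths that are new, or whose fingerprint differs from the previous import."""
    return sorted(path for path, digest in current.items()
                  if previous.get(path) != digest)

# Hypothetical content identifiers reported by a special remote on two imports.
first_import = {"a.txt": "cid-1", "b.txt": "cid-2", "c.txt": "cid-3"}
second_import = {"a.txt": "cid-1", "b.txt": "cid-2b", "c.txt": "cid-3", "d.txt": "cid-4"}

# Only these paths would need a content-identifier-to-key database lookup.
needs_lookup = changed_paths(fingerprint_tree(first_import),
                             fingerprint_tree(second_import))
print(needs_lookup)
```

<p>In this sketch only <code>b.txt</code> and <code>d.txt</code> need a lookup. Deleted files are left out for brevity; a real tree diff reports those too.</p>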
<p>This turned out to be one of the best results I've ever gotten from a
git-annex optimisation. It runs 60x faster, or even more when there are more files!</p>
<p>The moral is that git is really good at diffing trees fast, and so it's
worth using git diff whenever possible, even if the thing being diffed is
not a regular tree of files.</p>
<p>This work was sponsored by Mark Reidenbach and Lawrence Brogan
<a href="https://patreon.com/joeyh">on Patreon</a></p>
day 644-648 terminal escape sequences http://git-annex.branchable.com/devblog/day_644-648__terminal_escape_sequences/ 2023-04-12T19:03:10Z 2023-04-12T19:03:10Z
<p>Last weekend I watched a talk
<a href="https://www.youtube.com/watch?v=4kfDBNzStbs">"Houdini of the Terminal: The need for escaping"</a>
which shows several recent exploits of terminal emulators using escape
sequences. It was eye opening that security holes like that are still
being found, and also how severe some of the results can be. I was already
familiar with escape sequences as a potential security hole, but it never
seemed to make sense to have a program that was not a terminal emulator
guard against them. This talk made me think it can make sense for some
programs, as a defence in depth.</p>
<p>Now git does escape unusual characters when displaying filenames (most of
the time). But git-annex never has. So it seems it would be a good idea to
make git-annex follow git's lead on this. And git has a <code>core.quotePath</code>
setting which can be used to make it not escape unicode characters, so git-annex
should also support that.</p>
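<p>For readers who have not looked at how git quotes filenames, the following Python sketch approximates its C-style quoting. It is a simplified illustration, not git's or git-annex's implementation; the function name <code>quote_path</code> and the <code>quote_high_bytes</code> parameter (standing in for <code>core.quotePath</code>) are made up for this example.</p>

```python
# Short escapes for common control characters, plus the quote and backslash.
NAMED_ESCAPES = {
    0x07: b"\\a", 0x08: b"\\b", 0x09: b"\\t", 0x0A: b"\\n",
    0x0B: b"\\v", 0x0C: b"\\f", 0x0D: b"\\r",
    0x22: b'\\"', 0x5C: b"\\\\",
}

def quote_path(path: bytes, quote_high_bytes: bool = True) -> str:
    """Quote a filename roughly the way git does when displaying it.

    quote_high_bytes is a stand-in for core.quotePath: when True, bytes
    above 0x7F (such as UTF-8 sequences) are written as octal escapes.
    """
    out = bytearray()
    quoted = False
    for byte in path:
        if byte in NAMED_ESCAPES:
            out += NAMED_ESCAPES[byte]
            quoted = True
        elif byte < 0x20 or byte == 0x7F or (quote_high_bytes and byte >= 0x80):
            out += b"\\%03o" % byte  # three-digit octal escape
            quoted = True
        else:
            out.append(byte)
    text = out.decode("utf-8", errors="replace")
    # The name is only wrapped in double quotes if something was escaped.
    return '"' + text + '"' if quoted else text

print(quote_path(b"plain.txt"))
print(quote_path(b"\x1b[31mred.txt"))
print(quote_path("é.txt".encode("utf-8")))
print(quote_path("é.txt".encode("utf-8"), quote_high_bytes=False))
```

<p>The escape byte that starts a terminal escape sequence comes out as the harmless text <code>\033</code>, which is the point of the hardening.</p>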
<p>Implementing that was not very easy, because there are a vast number of
places where git-annex can display a filename. I had to check every error
message and warning message and other output in the whole code base to find
ones that displayed a filename. That took a while.</p>
<p>While doing that, I realized that there are some other ways a control
character could be stored in the git repository that would cause git-annex
to display it. It's possible for a git-annex key to have a control
character in its name. And a few other things stored in the git-annex
branch, like metadata, could also contain control characters.</p>
<p>I decided the best way to deal with those is not with some complex
escaping, but just by filtering out the control characters on output. In fact,
git-annex now filters out control characters in basically all its output.
There are two exceptions: in some cases filtering is not done when it's outputting
to a pipe, and commands like <code>git-annex find</code> that support <code>--format</code>
only do escaping when requested by the format.</p>
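<p>Filtering is simpler than escaping, as this small Python sketch suggests. Which characters count as control characters is an assumption here (the C0 range, DEL, and the C1 range), and <code>strip_control</code> is a hypothetical name, not a git-annex function.</p>

```python
def strip_control(text: str) -> str:
    """Drop C0 control characters, DEL, and C1 control characters."""
    return "".join(ch for ch in text
                   if not (ord(ch) < 0x20 or 0x7F <= ord(ch) <= 0x9F))

# A metadata value carrying a "clear screen" escape sequence.
print(strip_control("evil\x1b[2Jname"))
```

<p>The printable remainder of the sequence survives as ordinary text, but without the escape byte a terminal no longer interprets it.</p>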
<p>By the way, it turns out that git will display control characters in
the names of remotes or branches. Possibly in other situations too.
(I do wonder if a git remote that uses control characters in a branch
could be used to exploit a terminal emulator?) So git-annex has now gone
further than git in this area.</p>
<p>The resulting diff is 6500 lines, and I don't consider this an actual
security fix in git-annex, but only a hardening measure. So I won't be
hurrying out the next release for this.</p>
<p>This work was sponsored by Jake Vosloo, unqueued, Graham Spencer,
and Erik Bjäreholt <a href="https://patreon.com/joeyh">on Patreon</a></p>
day 643 adjusted view branches http://git-annex.branchable.com/devblog/day_643__adjusted_view_branches/ 2023-02-27T20:28:28Z 2023-02-27T20:28:28Z
<p>(Tap tap. Oh, this devblog is still on?)</p>
<p>View branches are a neat corner of git-annex that has remained kind of
obscure since I implemented them back in 2014. Not many improvements
have been made from back then until recently.</p>
<p>Today I implemented a longstanding todo, unifying view branches with
adjusted branches. The result is that you can enter an adjusted branch from
a view branch, or a view branch from an adjusted branch, and get what you
would probably expect.</p>
<p>For example, to sort your annexed files into directories by author and
year, and have all annexed files in the view be unlocked:</p>
<pre><code>git-annex adjust --unlock
git-annex view author=* year=*
</code></pre>
<p>Earlier this month, I addressed probably the main missing feature of view
branches, by making <code>git-annex sync</code> work in a view branch, updating it
with metadata and files pulled in from remotes. Although
there is room to make it faster
still.</p>
<p>Also, view branches can be made that include files that lack metadata.
Such files are put in a directory named <code>"_"</code>, and can be moved out of
there to other directories to set their metadata. For example:</p>
<pre><code>git-annex view author?=*
</code></pre>
<p>Views combine nicely with graphical file managers, and Yann Büchau
has recently built an
<a href="https://pypi.org/project/thunar-plugins/">integration with Thunar</a>
that supports most of these new features and can be seen in action in
<a href="https://fosstodon.org/@nobodyinperson/109836827575976439">this screencast</a>.</p>
<p>This work was sponsored by Lawrence Brogan, Erik Bjäreholt, and unqueued
<a href="https://patreon.com/joeyh">on Patreon</a></p>
day 642 cost model http://git-annex.branchable.com/devblog/day_642__cost_model/ 2021-11-08T20:21:08Z 2021-11-08T20:21:08Z
<p>Last Thursday I implemented <code>git-annex filter-process</code>, which you
can try enabling to make commands like <code>git add</code> and <code>git checkout</code>
faster when they operate on a lot of files.</p>
<pre><code>git config filter.annex.process 'git-annex filter-process'
</code></pre>
<p>On Friday, I benchmarked it
and was not surprised to find that it's slower in some cases than the
old smudge/clean filter interface, and faster in other cases. Still, good
to see actual numbers (see <a href="http://source.git-annex.branchable.com/?p=source.git;a=commitdiff;h=054c803f8d7cc43eb01fdf6141ab6572373c7d60">054c803f8d7cc43eb01fdf6141ab6572373c7d60</a>).
The surprising good news is that it only seems to make <code>git add</code> around 10%
slower when adding a large file (to the annex presumably). Although I
know I can speed that up, eventually.</p>
<p>Today, I used the benchmark results to build a cost model into git-annex,
so it knows when it would be faster to have filter.annex.process set or
unset, and temporarily unsets it when that seems best. It can only
do that when it's restaging pointer files, but that was the main problem
with setting filter.annex.process really.</p>
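<p>As a rough illustration of what such a cost model might look like, here is a Python sketch. Everything in it is an assumption made for the example: the two cost terms, the names, and especially the numbers, which are placeholders and not the benchmark results. It only shows the shape of the decision: a per-file cost for the old interface against a per-byte cost for the long-running process.</p>

```python
from dataclasses import dataclass

@dataclass
class FilterCosts:
    spawn_per_file: float  # seconds to start one clean/smudge process (placeholder)
    pipe_per_byte: float   # seconds per byte piped through filter-process (placeholder)

def prefer_filter_process(file_sizes, costs):
    """True if the long-running filter process is estimated to be cheaper."""
    per_file_total = len(file_sizes) * costs.spawn_per_file
    long_running_total = sum(file_sizes) * costs.pipe_per_byte
    return long_running_total < per_file_total

# Placeholder numbers chosen only to exercise both outcomes.
costs = FilterCosts(spawn_per_file=0.01, pipe_per_byte=1e-9)
many_small_files = [100] * 1000
one_huge_file = [10 * 1024 ** 3]

print(prefer_filter_process(many_small_files, costs))
print(prefer_filter_process(one_huge_file, costs))
```

<p>With these placeholder costs, many small files favour the long-running process and a single huge file does not, which matches the tradeoff described above.</p>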
<p>So I'm fairly close to wanting to enable it by default. But will probably
just wait until whenever v9 happens and do it then. Hopefully some people
will try it out in the meantime and perhaps I can refine the cost model.</p>
<hr />
<p>This work was sponsored by Jake Vosloo, Graham Spencer, and Dr. Land Raider
<a href="https://patreon.com/joeyh">on Patreon</a></p>
day 641 an alternative smudge filter http://git-annex.branchable.com/devblog/day_641__an_alternative_smudge_filter/ 2021-11-04T19:24:59Z 2021-11-03T20:07:03Z
<p>Would you rather that <code>git checkout</code> got a lot faster at checking out a lot
of files, and <code>git add</code> got a lot faster at adding a lot of small files, if
the tradeoff was that <code>git add</code> and <code>git commit -a</code> got slower at adding
large files to the annex than they are now?</p>
<p>Being able to make that choice is what I'm working on now. Of course,
we'd rather it were all fast, but due to
<a href="http://git-annex.branchable.com/todo/git_smudge_clean_interface_suboptiomal/">git smudge clean interface suboptiomal</a>, that is not possible
without improvements to git. But I seem to have a plan that will
work around enough of the problems to let that choice be made.</p>
<p>Today I've been laying the groundwork, by implementing git's
pkt-line interface, and the long-running filter process protocol.
Next step will be to add support for that in <code>git-annex smudge</code>,
so that users who want to can enable it with:</p>
<pre><code>git config filter.annex.process 'git-annex filter-process'
</code></pre>
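<p>The pkt-line framing itself is small: each packet is a four-digit hexadecimal length (which counts the four length bytes too) followed by the payload, and <code>0000</code> is a flush packet. The Python sketch below shows that framing with the opening lines git sends to a long-running filter. It is an illustration only, not git-annex's Haskell implementation, and the helper names are made up.</p>

```python
import io

FLUSH = b"0000"
MAX_PAYLOAD = 65516  # 65520-byte packet limit minus the 4-byte length header

def encode_pkt(payload: bytes) -> bytes:
    """Frame one payload as a pkt-line."""
    if len(payload) > MAX_PAYLOAD:
        raise ValueError("payload too large for one pkt-line")
    return b"%04x" % (len(payload) + 4) + payload

def read_pkt(stream):
    """Return the next payload, or None for a flush packet."""
    header = stream.read(4)
    if len(header) != 4:
        raise EOFError("truncated pkt-line header")
    length = int(header, 16)
    if length == 0:
        return None
    if length < 4:
        raise ValueError("special packet types are not handled in this sketch")
    payload = stream.read(length - 4)
    if len(payload) != length - 4:
        raise EOFError("truncated pkt-line payload")
    return payload

# The start of the handshake git sends to a long-running filter process.
handshake = encode_pkt(b"git-filter-client\n") + encode_pkt(b"version=2\n") + FLUSH
print(handshake)

stream = io.BytesIO(handshake)
packets = [read_pkt(stream), read_pkt(stream), read_pkt(stream)]
print(packets)
```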
<p>I can imagine that becoming enabled by default at some point in v9, if most
users prefer it over the current method. Which would still be available
by unsetting the config.</p>
<hr />
<p>Today's work was sponsored by Mark Reidenbach
<a href="https://patreon.com/joeyh">on Patreon</a></p>
day 640 finally dealt with clock skew http://git-annex.branchable.com/devblog/day_640__finally_dealt_with_clock_skew/ 2021-08-03T21:16:06Z 2021-08-03T21:16:06Z
<p>I've been unsatisfied with git-annex's handling of clock skew since day 1.
Since it relies on timestamps, it needs clocks to be synchronised across
users, at least to a reasonable extent. A clock in the far future or distant
past could potentially confuse git-annex a lot. Vector clocks felt like
the right kind of solution, but also wrong somehow.</p>
<p>I've finally cracked it! See git-annex branch clocks for the
details, but in summary, git-annex will be able to detect clock skew
and fall back to vector clocks, but will otherwise continue to use
timestamps for their benefits over vector clocks
(ie, having some idea about what order disconnected events actually occurred,
to the extent physics makes that possible).</p>
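<p>The general idea of using timestamps when the clock looks sane, and a logical counter when it does not, can be sketched as follows. This is a simplified, Lamport-style stand-in written for illustration, not the design described on the linked page; the function name and the one-second increment are assumptions.</p>

```python
def next_timestamp(local_clock: float, newest_logged: float) -> float:
    """Pick a timestamp for a new log entry.

    If the local clock is ahead of everything already logged, trust it.
    Otherwise the clock is presumably skewed, so fall back to logical
    ordering: one step past the newest timestamp seen so far.
    """
    if local_clock > newest_logged:
        return local_clock
    return newest_logged + 1.0

print(next_timestamp(2000.0, 1000.0))  # clock looks fine
print(next_timestamp(500.0, 1000.0))   # clock is behind the log
```

<p>Either way, the new entry sorts after the ones it was written on top of, even when the local clock is wrong.</p>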
<p>That is mostly implemented, only needs some more testing and cleanup before
merging.</p>
<hr />
<p>Today's work was sponsored by Graham Spencer
<a href="https://patreon.com/joeyh">on Patreon</a></p>