todogit-annexhttp://git-annex.branchable.com/todo/git-annexikiwiki2024-03-26T17:48:52Zexternal special remotes not using git-annex-remote in namehttp://git-annex.branchable.com/todo/external_special_remotes_not_using_git-annex-remote_in_name/2024-03-26T17:48:52Z2024-03-26T17:48:52Z
<p>rclone now supports being run as a git-annex special remote natively;
see <a href="https://github.com/rclone/rclone/pull/7654">https://github.com/rclone/rclone/pull/7654</a>. "rclone gitannex"
is the command to run. But git-annex needs a git-annex-remote-rclone or similar,
so they are shipping a git-annex-remote-rclone-builtin symlink to rclone,
and when run under that name it behaves as if "rclone gitannex" were run.</p>
<p>So in this case, the need for "git-annex-remote-foo" is complicating an
upstream project that has gone out of its way to support git-annex. Not ideal.</p>
<p>From the pull request, @dmcardle wrote:</p>
<blockquote><p>My taste would be to implement a more generic mechanism rather than adding a special case for rclone gitannex.
What if externaltype could be repeated, so that git annex initremote MyRemote type=external externaltype=rclone
externaltype=gitannex ... would cause git-annex to exec rclone with the additional gitannex arg?</p></blockquote>
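<p>But, that seems to present a security problem. Consider an attacker who runs
<code>git-annex initremote foo type=external autoenable=true externaltype=rm externaltype=/foo</code>.
Under that proposal, a repository that autoenables the remote would exec <code>rm</code> with <code>/foo</code> as its argument.</p>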
<p>My conclusion is that git-annex can't provide a generic way to run a different
command for an external special remote. Any such commands need to be
whitelisted in some way. And if they're whitelisted, it seems better to not
require the user to enter additional parameters at all.</p>
<p>So one way would be to make "git-annex initremote foo type=external externaltype=rclone-builtin"
run "rclone gitannex".</p>
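<p>A minimal sketch of that whitelist idea, in Python rather than git-annex's
own Haskell. The table and function names are hypothetical; only the
rclone-builtin to "rclone gitannex" mapping comes from the discussion above,
and anything not in the table falls back to the usual git-annex-remote-foo
naming.</p>

```python
# Hypothetical whitelist: externaltype values that map to a fixed command line.
# Only "rclone-builtin" -> "rclone gitannex" comes from the discussion above.
BUILTIN_EXTERNALS = {
    "rclone-builtin": ["rclone", "gitannex"],
}


def external_command(externaltype):
    """Return the argv to exec for a given externaltype value."""
    if externaltype in BUILTIN_EXTERNALS:
        # Copy so that callers cannot mutate the whitelist.
        return list(BUILTIN_EXTERNALS[externaltype])
    # Default behavior: a program named git-annex-remote-<externaltype>.
    return ["git-annex-remote-" + externaltype]
```

<p>With this shape, <code>externaltype=rm</code> can only ever run a program named
git-annex-remote-rm, never <code>rm</code> itself.</p>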
<p>Or, git-annex could add an internal rclone special remote, that is just
a wrapper around the external special remote, that makes it use
"rclone gitannex". "git-annex initremote foo type=rclone" --<a href="http://git-annex.branchable.com/users/joey/">Joey</a></p>
proving preferred content behaviorhttp://git-annex.branchable.com/todo/proving_preferred_content_behavior/2024-03-13T15:04:07Z2024-03-13T15:04:07Z
<p>Preferred content expressions can be complicated to write and reason about.
A complex expression can involve lots of repositories that can get into
different states, and needs to be written to avoid unwanted behavior.</p>
<p>It would be very handy to provide some way to prove things about behavior
of preferred content expressions, or a way to simulate the behavior of a
network of git-annex repositories with a given preferred content configuration.</p>
<h2>motivating examples</h2>
<p>For example, consider two repositories A and B. A is in group M and B is in
group N. A has preferred content <code>not inallgroup=N</code> and B has <code>not inallgroup=M</code>.</p>
<p>If A contains a file, then B will want to also get a copy. And things
stabilize there. But if the file is removed from A, then B also wants to
remove it. And once B has removed it, A wants a copy of it. And then B also
wants a copy of it. So the result is that the file got transferred twice,
to end up right back where we started.</p>
<p>The worst case of this is <code>not present</code>, where the file gets dropped and
transferred over and over again. The docs warn against using that one. But
they can't warn about every bad preferred content expression.</p>
<h2>balanced preferred content</h2>
<p>When <a href="http://git-annex.branchable.com/design/balanced_preferred_content/">balanced preferred content</a> is added, a whole new level of
complexity will exist in preferred content expressions, because now an
expression does not make a file be wanted by a single repository, but
shards the files among repositories in a group.</p>
<p>And up until this point preferred content expressions have behaved the same no
matter the sizes of the underlying repositories, but balanced preferred
content does take repository fullness into account, which further
complicates fully understanding the behavior.</p>
<p>Notice that <code>balanced()</code> (in the current design) is not stable when used
on its own, and has to be used as part of a larger expression to make it
stable, eg:</p>
<pre><code>(balanced(backup) and not (copies=backup:1)) or present
</code></pre>
<p>So perhaps <code>balanced()</code> should include the other checks in it,
to avoid the user shooting themselves in the foot. On the other
hand, if <code>balanced()</code> implicitly contains <code>present</code>, then <code>not balanced()</code>
would include <code>not present</code>, which is bad!</p>
<p>(For that matter, what does <code>not balanced()</code> even do currently?)</p>
<h2>proof</h2>
<p>What could be proved about a preferred content expression?</p>
<p>No idea really. Would be interesting to consider what formal methods can
do here. Could a SAT solver be used somehow for example?</p>
<h2>static analysis</h2>
<p>Clearly <code>not present</code> is a problematic preferred content expression. It
would be good if git-annex warned and/or refused to set such an expression
if it could detect it. Similarly <code>not groupwanted</code> could be detected as a
problem when the group's preferred content expression contains <code>present</code>.</p>
<p>Is there a more general-purpose and inexpensive way to detect such
problematic expressions, that can find problems such as the
<code>not inallgroup=N</code> example above?</p>
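<p>As a rough illustration of the simplest such check, here is a Python sketch
that walks a toy expression tree and reports whether <code>present</code> appears
under an odd number of <code>not</code>s. The tuple-based tree is a hypothetical
stand-in, not git-annex's real parser or matcher.</p>

```python
def negated_present(expr, negated=False):
    """Return True if `present` occurs under an odd number of `not`s.

    expr is a nested tuple: ("term", name), ("not", e),
    ("and", e1, e2, ...) or ("or", e1, e2, ...).
    """
    op = expr[0]
    if op == "term":
        return negated and expr[1] == "present"
    if op == "not":
        return negated_present(expr[1], not negated)
    if op in ("and", "or"):
        return any(negated_present(sub, negated) for sub in expr[1:])
    raise ValueError("unknown node: %r" % (op,))
```

<p>This is purely syntactic, so it can flag expressions that happen to be
harmless, and it does nothing for problems that span several repositories,
like the <code>not inallgroup=N</code> example.</p>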
<h2>simulation</h2>
<p>Simulation seems fairly straightforward, just simulate the network of
git-annex repositories with random files with different sizes and
metadata. Be sure to enforce invariants like numcopies the same as
git-annex does.</p>
<p>Since users can write preferred content expressions, this should be
targeted at being used by end users.</p>
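<p>A very small Python sketch of what the core loop of such a simulation might
look like, tracking one file across repositories and stopping when the set
of locations either stops changing or repeats. Everything here is an
assumption made for illustration: preferred content is modeled as plain
Python functions, repositories act in a fixed order, and numcopies is not
enforced, which a real simulation would need to do.</p>

```python
def simulate(wanted, initial, max_rounds=100):
    """Simulate where one file ends up.

    wanted maps repo name -> function(repo, present) -> bool, where
    present is the frozenset of repos currently holding the file.
    Returns ("stable", state), ("cycle", state) or ("unknown", state).
    """
    state = frozenset(initial)
    seen = {state}
    for _ in range(max_rounds):
        new = set(state)
        for repo in sorted(wanted):
            # Each repo gets or drops the file according to its expression.
            if wanted[repo](repo, frozenset(new)):
                new.add(repo)
            else:
                new.discard(repo)
        new = frozenset(new)
        if new == state:
            return ("stable", state)
        if new in seen:
            # Rounds are deterministic, so a repeated state means a loop.
            return ("cycle", new)
        seen.add(new)
        state = new
    return ("unknown", state)


def not_present(repo, present):
    """Models the `not present` expression."""
    return repo not in present


def present_only(repo, present):
    """Models the `present` expression."""
    return repo in present
```

<p>For instance, <code>simulate({"A": not_present}, [])</code> reports a cycle, while
<code>simulate({"A": present_only}, ["A"])</code> is stable.</p>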
track free space in repos via git-annex branchhttp://git-annex.branchable.com/todo/track_free_space_in_repos_via_git-annex_branch/2024-03-11T14:00:49Z2024-03-05T19:06:51Z
<p>If the total space available in a repository for annex objects is recorded
on the git-annex branch (by the user running a command probably, or perhaps
automatically), then it is possible to examine the git-annex branch and
tell how much free space a remote has available.</p>
<p>One use case is just to display it in <code>git-annex info</code>. But a more
compelling use case is <a href="http://git-annex.branchable.com/design/balanced_preferred_content/">balanced preferred content</a>, which needs a
way to tell when an object is too large to store on a repository, so that
it can be redirected to be stored on another repository in the same group.</p>
<p>This was actually a fairly common feature request early on in git-annex
and I probably should have thought about it more back then!</p>
<p><code>git-annex info</code> has recently started summing up the sizes of repositories
from location logs, and is well optimised. In my big repository, that takes
8.54 seconds of its total runtime.</p>
<p>Since info already knows the repo sizes, just adding a <code>git-annex maxsize
here 200gb</code> type of command would let it display the free space of all
repos that had a maxsize recorded, essentially for free.</p>
<p>But 8 seconds is rather a long time to block a <code>git-annex push</code>
type command. Which would be needed if any remote's preferred content
expression used the free space information.</p>
<p>Would it be possible to update incrementally from the previous git-annex
branch to the current one? That's essentially what <code>git-annex log
--sizesof</code> does for each commit on the git-annex branch, so could
imagine adapting that to store its state on disk, so it can resume
at a new git-annex branch commit.</p>
<p>Perhaps a less expensive implementation than <code>git-annex log --sizesof</code>
is possible, to get only the current sizes, if the past sizes are known at a
particular git-annex branch commit. We don't care about sizes at
intermediate points in time, which that command does calculate.</p>
<p>See <a href="http://git-annex.branchable.com/todo/info_--size-history/">info --size-history</a> for the subtleties that had to be handled.
In particular, comparing the previous git-annex branch commit to current may
yield lines that seem to indicate content was added to a repo, but in fact
that repo already had that content at the previous git-annex branch commit
and another log line was recorded elsewhere redundantly.
So it needs to look at the location log's value at the
previous commit in order to determine if a change to a log should be
counted.</p>
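<p>To make that rule concrete, here is a Python sketch that only counts a
change when the set of repositories containing a key actually differs
between the two commits. The data shapes are assumptions; how the old and
new location sets get extracted from the logs is left out.</p>

```python
def update_sizes(sizes, changes):
    """Apply location changes to a repo -> bytes mapping.

    changes is an iterable of (key_size, old_repos, new_repos), where the
    two sets hold the repos containing the key at the previous and the
    current git-annex branch commit.
    """
    updated = dict(sizes)
    for key_size, old_repos, new_repos in changes:
        # Repos that gained the key since the previous commit.
        for repo in new_repos - old_repos:
            updated[repo] = updated.get(repo, 0) + key_size
        # Repos that lost the key since the previous commit.
        for repo in old_repos - new_repos:
            updated[repo] = updated.get(repo, 0) - key_size
    return updated
```

<p>A redundant log line leaves the old and new sets equal, so it contributes
nothing.</p>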
<p>Worst case, that means querying the location log file for every single key.
If queried from git, that would be slow -- slower than <code>git-annex info</code>'s
streaming approach. If they were all cached in a sqlite database, it might
manage to be faster?</p>
<h2>incremental update via git diff</h2>
<p>Could <code>git diff -U1000000</code> be used and the patch parsed to get the complete
old and new location log? (Assuming no log file ever reaches a million
lines.) I tried this in my big repo, and even diffing from the first
git-annex branch commit to the last took 7.54 seconds.</p>
<p>Compare that with the method used by <code>git-annex info</code>'s size gathering, of
dumping out the content of all files on the branch with <code>git ls-tree -r
git-annex |awk '{print $3}'|git cat-file --batch --buffer</code>, which only
takes 3 seconds. So, this is not ideal when diffing to too old a point.</p>
<p>Diffing in my big repo to the git-annex branch from 2020 takes 4 seconds.<br />
... from 3 months ago takes 2 seconds.<br />
... from 1 week ago takes 1 second.</p>
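<p>For reference, a Python sketch of the patch parsing idea: with enough
context lines, each file in a unified diff has a single hunk covering the
whole file, so the old and new contents can be rebuilt from the line
prefixes. This assumes plain text logs and ordinary <code>git diff</code> output;
it is not what git-annex does today.</p>

```python
def split_full_context_patch(patch_text):
    """Yield (header, old_lines, new_lines) per file in a unified diff.

    Assumes the diff was produced with enough context (such as -U1000000)
    that each file has a single hunk covering the whole file.
    """
    header, old, new, in_hunk = None, [], [], False
    for line in patch_text.splitlines():
        if line.startswith("diff --git "):
            if header is not None:
                yield header, old, new
            header, old, new, in_hunk = line, [], [], False
        elif line.startswith("@@"):
            in_hunk = True
        elif not in_hunk or line.startswith("\\"):
            continue  # file headers, or "\ No newline at end of file"
        elif line.startswith("+"):
            new.append(line[1:])
        elif line.startswith("-"):
            old.append(line[1:])
        else:
            # Context line (leading space), present in both versions.
            old.append(line[1:])
            new.append(line[1:])
    if header is not None:
        yield header, old, new
```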
<h2>incremental update when merging git-annex branch</h2>
<p>When merging git-annex branch changes into .git/annex/index,
it already diffs between the branch and the index and uses <code>git cat-file</code>
to get both versions of the file in order to union merge them.</p>
<p>That's essentially the same information needed to do the incremental update
of the repo sizes. So could update sizes at the same time as merging the
git-annex branch. That would be essentially free!</p>
<p>Note that the use of <code>git cat-file</code> in union merge is not --buffer
streaming, so is slower than the patch parsing method that was discussed in
the previous section. So it might be possible to speed up git-annex branch
merging using patch parsing.</p>
<p>Note that Database.ContentIdentifier and Database.ImportFeed also update
by diffing from the old to new git-annex branch (with <code>git cat-file</code> to
read log files) so could also be sped up by being done at git-annex branch
merge time. Those are less expensive than diffing the location logs only
because the logs they diff are less often used, and the work is only
done when relevant commands are run.</p>
speed up VURL by avoiding redundant hashinghttp://git-annex.branchable.com/todo/speed_up_VURL_by_avoiding_redundant_hashing/2024-03-01T20:53:17Z2024-03-01T20:53:17Z
<p>When a VURL key has multiple equivalent keys that all use the same hash,
verifying the VURL key currently has to verify each equivalent key.
Usually that is done incrementally, so it only has to read the file once. But it
still does redundant work, updating each incremental verifier with each
chunk.</p>
<p>This could be improved by calculating a hash once, and then comparing it
with a hash value exposed by the Backend. That seems doable but will mean
extending the Backend interface, to expose the hash value and type.
--<a href="http://git-annex.branchable.com/users/joey/">Joey</a></p>
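<p>The idea, sketched in Python rather than against the real Backend
interface: compute each distinct hash once while reading the content, then
compare the result against the hash values of all the equivalent keys. The
(hash name, hex digest) pairs are a made-up representation of what the
Backend would expose, and the sketch assumes content is accepted when it
matches any one of the equivalent keys.</p>

```python
import hashlib


def verify_equivalent_keys(chunks, expected):
    """Check content against several (hash_name, hex_digest) pairs.

    Each distinct hash is computed once, no matter how many
    equivalent keys use it.
    """
    # One hasher per distinct hash name, not one per equivalent key.
    hashers = {name: hashlib.new(name) for name, _ in expected}
    for chunk in chunks:
        for hasher in hashers.values():
            hasher.update(chunk)
    digests = {name: h.hexdigest() for name, h in hashers.items()}
    return any(digests[name] == digest for name, digest in expected)
```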
migration to VURL by defaulthttp://git-annex.branchable.com/todo/migration_to_VURL_by_default/2024-03-05T19:06:51Z2024-02-29T21:53:19Z
<p><code>git-annex addurl --fast/--relaxed --verifiable</code> now uses VURL keys,
which is an improvement over the old, un-verifiable URL keys. But users
have to know to use it, and can have URL keys in their repository.</p>
<p>Note that old git-annex, when in a repo with VURL keys, is still able to
operate on them fine, even though it doesn't know what they are.
Only fsck warns about them. So --verifiable could become the default
reasonably soon. It's not necessary to wait for everyone to have the new
version of git-annex.</p>
<p>It might be good to nudge users to migrate their existing files to VURL
eventually. Could be a fsck warning about URL keys, at some point after
--verifiable becomes the default for addurl. Or could be a warning when
transferring the content between repositories that it's not possible to
verify it.</p>
<p>Of course if users want to continue to use their existing URL keys and
not be able to verify content, that's fine. Users can also choose to use
WORM keys if they really want to.</p>
<p>However, I don't think there's enough reason to want to use URL keys to add
configuration of which kind of keys addurl uses, once VURL is the default.
--<a href="http://git-annex.branchable.com/users/joey/">Joey</a></p>
<blockquote><p>One way this might cause trouble is that the current <code>git-annex registerurl</code>
and <code>unregisterurl</code> (and <code>fromkey</code>), when passed an url rather than a key,
generate an URL key. If that is changed to generate a VURL key, then
it might break some workflow, particularly one where an url was
registered as an URL key and is now being unregistered.</p></blockquote>
improve unused for special remoteshttp://git-annex.branchable.com/todo/improve_unused_for_special_remotes/2024-02-27T16:44:43Z2024-02-27T16:44:43Z
<p>A remote like the directory special remote can have objects that have not
been fully transferred to it by an interrupted copy, that linger until the
copy is re-run and the content gets fully sent to the remote. It would be
good if <code>git-annex unused</code> could find and clean up such things, like it
does for incomplete transfers into a git-annex repository.</p>
<p>In the directory special remote, these are files named "tmp/$key/$key".</p>
<p>This would need to be an extension to the remote interface to add an action
to find when a key has such a file, and an action to delete one of them.</p>
<p>A problem is that any such file might actually still be in the process
of being sent, perhaps from a different repository than the one where
<code>git-annex unused</code> is being run. So deleting such a file could cause that
transfer to fail. This problem seems unavoidable generally.</p>
<hr />
<p>It's also possible for a special remote to get keys stored in it which
git-annex does not know about. For example, in a temporary clone of the
git-annex repository, add a new file. Send it to the special remote. Then
delete the temporary clone.</p>
<p><code>git-annex unused --from</code> can't detect those keys, because it can only ask
the special remote about presence of keys that it knows about.</p>
<p>Might it be possible to solve both problems together? Eg, add an action
that has the special remote list all keys and partial keys present in it.
--<a href="http://git-annex.branchable.com/users/joey/">Joey</a></p>
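<p>If such a listing action existed, the rest would be simple set logic. A
Python sketch, where the listing format of (key, complete) pairs is purely
hypothetical:</p>

```python
def classify_remote_contents(listed, known_keys):
    """Split a remote's listing into unknown keys and partial transfers.

    listed is an iterable of (key, complete) pairs as a hypothetical
    "list everything" action might report them.
    """
    unknown, partial = set(), set()
    for key, complete in listed:
        if not complete:
            partial.add(key)
        elif key not in known_keys:
            unknown.add(key)
    return unknown, partial
```

<p>This does nothing about the race with transfers that are still in
progress; a partial key reported here may be one of those.</p>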
add --json-progress support in push and pullhttp://git-annex.branchable.com/todo/add_--json-progress_support_in_push_and_pull/2024-02-03T18:48:35Z2024-02-03T18:48:35Z
<p>The pull and push commands do not have --json-progress support. Please add.</p>
copy/move support for pushinsteadOf http://git-annex.branchable.com/todo/copy__47__move_support_for_pushinsteadOf_/yoh2024-02-02T16:27:19Z2024-02-02T16:27:19Z
<p>ATM</p>
<p>files didn't <code>datalad push</code> as they should have due to existing settings of <code>wanted</code>:</p>
<pre><code>❯ git annex wanted datasets.datalad.org
include=.datalad/* and (not metadata=distribution-restrictions=*)
❯ git annex find --not --in datasets.datalad.org .
crcns-2022-dataland.pdf
crcns-2022-dataland.png
crcns-2022-dataland.svg
❯ git annex metadata *
metadata crcns-2022-dataland.pdf
ok
metadata crcns-2022-dataland.png
ok
metadata crcns-2022-dataland.svg
ok
❯ git annex copy --auto --to datasets.datalad.org *
❯ git annex version
git-annex version: 10.20231227-1~ndall+1
</code></pre>
<p>so I was confused... the reason was</p>
<pre><code>❯ git annex copy --to datasets.datalad.org *
copy crcns-2022-dataland.pdf (to datasets.datalad.org...)
copying to non-ssh repo not supported
failed
copy crcns-2022-dataland.png (to datasets.datalad.org...)
copying to non-ssh repo not supported
failed
copy crcns-2022-dataland.svg (to datasets.datalad.org...)
copying to non-ssh repo not supported
failed
copy: 3 failed
</code></pre>
<p>whereas I have</p>
<pre><code class="shell">❯ git remote show -n datasets.datalad.org
* remote datasets.datalad.org
Fetch URL: https://datasets.datalad.org/datalad/artwork/.git
Push URL: falkor.dartmouth.edu:/srv/datasets.datalad.org/www/datalad/artwork/.git
...
</code></pre>
<p>and the use case is quite common for me and in particular for ReproNim/containers which is shared/adjusted in similar ways</p>
allow configuring assistant to add files lockedhttp://git-annex.branchable.com/todo/allow_configuring_assistant_to_add_files_locked/2024-01-18T17:11:33Z2024-01-18T17:11:33Z
<p>The assistant adds files unlocked, even when git-annex is otherwise
configured to add them locked.</p>
<p>There are good reasons for that different behavior to be the default,
but it would be worth having a config to override that.</p>
<p>Eg, when annex.assistant.honor.addunlocked is set, honor the
annex.addunlocked configuration, which defaults to adding files locked.
Or perhaps a better name would be annex.assistant.allowaddlocked.</p>
<p>See here for some motivating use cases
<a href="https://git-annex.branchable.com/forum/Is_there_a_way_to_have_assistant_add_files_locked__63__/">https://git-annex.branchable.com/forum/Is_there_a_way_to_have_assistant_add_files_locked__63__/</a></p>
support using Stateless OpenPGP instead of gpg for encryption methods other than encryption=sharedhttp://git-annex.branchable.com/todo/support_using_Stateless_OpenPGP_instead_of_gpg/2024-01-12T17:52:09Z2024-01-09T20:57:14Z
<p><a href="https://datatracker.ietf.org/doc/draft-dkg-openpgp-stateless-cli/">https://datatracker.ietf.org/doc/draft-dkg-openpgp-stateless-cli/</a> or "sop" is
a standard for a command-line interface for OpenPGP. There are several
implementations available in Debian already, like sqop (using Sequoia), gosop,
gpainless-cli, and hop from hopenpgp-tools (haskell).</p>
<p>It's possible to use this in a way that interoperates with gpg. For example:</p>
<pre><code>joey@darkstar:~>cat pw
hunter1
joey@darkstar:~>cat foo
Tue Jan 9 15:10:32 JEST 2024
joey@darkstar:~>sqop encrypt --with-password=pw < foo > bar
joey@darkstar:~>gpg --passphrase-file pw --decrypt bar
gpg: AES256.CFB encrypted session key
gpg: encrypted with 1 passphrase
Tue Jan 9 15:10:32 JEST 2024
</code></pre>
<p>That example uses symmetric encryption, which is what git-annex uses
for encryption=shared. So git-annex could use this or gpg to access the
same encrypted special remote.</p>
<p>Update: That's implemented now, when annex.shared-sop-command is configured
it will be used for encryption=shared special remotes. It interoperates
fine with using gpg, as long as the sop command uses a compatible profile
(setting annex.shared-sop-profile = rfc4880 is probably a good idea).</p>
<p>Leaving this todo open because there are other encryption schemes than
encryption=shared, for which using sop is not yet supported.</p>
<p>For the public key encryption used by the other encryption= schemes,
sop would be harder to use, because it does not define any location
to store private keys. Though it is possible to export gpg private keys
and use them with sop to encrypt/decrypt files, it wouldn't make sense
for git-annex to do that for the user. So there would need to be some way
to map from keyid= value to a file containing the key. Which could be as
simple as files named <code>.git/annex/sop/$keyid.sec</code>, which would be installed
by the user using whatever means they prefer.</p>
<p>Since sop also supports hardware-backed secret keys, and using one avoids
the problem of where to store the secret key file, it would be good to
support using those. This could be something like <code>keyid=@HARDWARE:xxx</code>
making <code>@HARDWARE:xxx</code> be passed to sop.</p>
<p>It seems that git-annex would need to prompt for the secret key's password
itself, since sop doesn't prompt, and feed it in via <code>--with-key-password</code>.
It can detect if a password is needed by trying the sop operation without a
password and checking for an exit code of 67.
See <a href="https://gitlab.com/dkg/openpgp-stateless-cli/-/issues/64">this issue on sop password prompting</a></p>
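<p>Roughly, the retry logic could look like this Python sketch. The process
runner and the password prompt are passed in as functions, so nothing here
depends on a particular sop implementation; those two callbacks, the
function names, and the assumption that the password is handed over as a
file name are all mine. The key file layout and the exit code 67 check are
taken from the text above.</p>

```python
import os

KEY_IS_PROTECTED = 67  # exit code mentioned in the text above


def secret_key_path(git_dir, keyid):
    """Path layout suggested above: .git/annex/sop/$keyid.sec"""
    return os.path.join(git_dir, "annex", "sop", keyid + ".sec")


def sop_decrypt(run, sop_cmd, key_file, ciphertext, get_password_file):
    """Try to decrypt without a password, retrying once if sop needs one.

    run(argv, stdin_bytes) -> (exit_code, stdout_bytes) is supplied by the
    caller, as is get_password_file(), which prompts the user and returns
    the name of a file holding the password.
    """
    code, out = run([sop_cmd, "decrypt", key_file], ciphertext)
    if code == KEY_IS_PROTECTED:
        # Only prompt when sop reports that the key is password protected.
        pwfile = get_password_file()
        argv = [sop_cmd, "decrypt", "--with-key-password=" + pwfile, key_file]
        code, out = run(argv, ciphertext)
    if code != 0:
        raise RuntimeError("sop decrypt failed with exit code %d" % code)
    return out
```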
<p>See also: <a href="http://git-annex.branchable.com/todo/whishlist__58___GPG_alternatives_like_AGE/">whishlist: GPG alternatives like AGE</a></p>