todo/Don't re-encrypt when key is already in .git/annex/tmp | git-annex | http://git-annex.branchable.com/todo/Don__39__t_re-encrypt_when_key_is_already_in_.git__47__annex__47__tmp/ | git-annex | ikiwiki | 2020-06-17T01:18:32Z
comment 1 | http://git-annex.branchable.com/todo/Don__39__t_re-encrypt_when_key_is_already_in_.git__47__annex__47__tmp/comment_1_0650918c86fca0554755aede19a12fd3/ | joey | 2020-06-17T01:18:32Z | 2020-02-17T15:33:31Z
<p>@lykos, what happens when git-annex-remote-googledrive tries
to resume in this situation and git-annex has written a different tmp file
than what it partially uploaded before?</p>
<p>I imagine it might resume after the last byte it sent before, and so
the uploaded file gets corrupted?</p>
<p>If so, there are two hard problems with this idea:</p>
<ol>
<li>If git-annex changes to reuse the same tmp file, then git-annex-remote-googledrive
will work with the new git-annex, but corrupt files when used with an old
git-annex.</li>
<li>If someone has two clones, and starts an upload in one, but it's
interrupted and then started later in the second clone, it would again
corrupt the file that gets uploaded. (This would also happen,
with a single clone, if git-annex unused gets used in between upload
attempts, and cleans up the tmp file.)</li>
</ol>
<p>The first could be dealt with by some protocol flag, but the second seems
rather intractable, if git-annex-remote-googledrive behaves as I
hypothesize it might. And even if git-annex-remote-googledrive behaves
better than that somehow, it's certainly likely that some other remote
would behave that way at some point.</p>
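<p>To make the hypothesized failure concrete, here is a minimal sketch of a
remote that resumes purely by byte offset. The function name
<code>naive_resume</code> and the byte strings are invented for
illustration; the two strings merely stand in for two different gpg outputs
of the same content, and nothing here is taken from any real remote.</p>

```python
def naive_resume(already_uploaded: bytes, new_tmp_file: bytes) -> bytes:
    """Resume by offset only: keep what the remote already has and
    append the rest of whatever tmp file is on disk now."""
    offset = len(already_uploaded)
    return already_uploaded + new_tmp_file[offset:]


# Stand-ins for two encryptions of the same content. Re-encrypting
# generally yields different bytes, so the two tmp files differ.
first_encryption = bytes(range(0, 32))
second_encryption = bytes(range(100, 132))

# The first upload was interrupted after 10 bytes.
partial = first_encryption[:10]

# The second attempt sees a different tmp file but resumes at byte 10.
result = naive_resume(partial, second_encryption)

# The stored object matches neither encryption, so it cannot be decrypted.
corrupted = result != first_encryption and result != second_encryption
```

<p>The same splice happens in the two-clone case, since each clone encrypts
on its own.</p>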
<hr />
<p>As to implementation details, I started investigating before thinking
about the above problem, so am leaving some notes here:</p>
<p>This would first require that the tmp file is written atomically,
otherwise an interruption in the wrong place would resume with a partial
file. (File size can't be used since gpg changes the file size with
compression etc.) Seems easy to implement: Make
Remote.Helper.Special.fileStorer write to a different tmp file and rename
it into place.</p>
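<p>The write-then-rename idea can be sketched as follows. This is Python
rather than git-annex's Haskell, and <code>write_atomically</code> is a
hypothetical helper, not the actual change to
<code>Remote.Helper.Special.fileStorer</code>. It assumes the partial file
lives in the same directory as the destination, so that the rename stays on
one filesystem.</p>

```python
import os
import tempfile


def write_atomically(dest: str, chunks) -> None:
    """Write chunks to a sibling temp file, then rename it over dest.

    A reader either sees no file at dest or a complete one, never a
    partially written one.
    """
    dest_dir = os.path.dirname(dest) or "."
    fd, partial = tempfile.mkstemp(dir=dest_dir, prefix=".partial.")
    try:
        with os.fdopen(fd, "wb") as out:
            for chunk in chunks:
                out.write(chunk)
            out.flush()
            os.fsync(out.fileno())  # make sure the data is on disk first
        os.replace(partial, dest)   # atomic rename within one filesystem
    except BaseException:
        # Interrupted or failed: drop the partial file, leave dest alone.
        try:
            os.unlink(partial)
        except FileNotFoundError:
            pass
        raise
```

<p>With this, the mere existence of the tmp file would indicate that it is
complete, which is the property the resume check would rely on.</p>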
<p>Internally, git-annex pipes the content from gpg, so it is only written to
a temp file when using a remote that operates on files, as the external
remotes do. Some builtin remotes don't. So resuming an upload to an
encrypted remote past the chunk level can't work in general.</p>
<p>There would need to be some way for the code that encrypts chunks
(or whole objects) to detect that it's being used with a remote that
operates on files, and then check if the tmp file already exists, and avoid
re-writing it. This would need some way to examine a <code>Storer</code> and tell
if it operates on files, which is not currently possible, so would need
some change to the data type.</p>
comment 2http://git-annex.branchable.com/todo/Don__39__t_re-encrypt_when_key_is_already_in_.git__47__annex__47__tmp/comment_2_aa6bf987de3dfc9f78eba30b4b2c8c16/lykos2020-06-17T01:18:32Z2020-02-27T13:12:28Z
<p>git-annex-remote-googledrive compares partial checksums after every transmitted chunk and before resuming a transfer, so it would not be affected by either of the problems you describe. However, you're probably right that some other remote might not behave that way.</p>
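<p>A rough sketch of that kind of check before resuming, for remotes that
want to do something similar. This is not the actual
git-annex-remote-googledrive code; <code>prefix_digest</code> and
<code>resume_offset</code> are hypothetical names, and it assumes the remote
has recorded how many bytes it uploaded and a SHA-256 digest of them.</p>

```python
import hashlib
import os


def prefix_digest(path: str, length: int) -> str:
    """SHA-256 hex digest of the first `length` bytes of the file."""
    digest = hashlib.sha256()
    remaining = length
    with open(path, "rb") as f:
        while remaining > 0:
            block = f.read(min(65536, remaining))
            if not block:
                break
            digest.update(block)
            remaining -= len(block)
    return digest.hexdigest()


def resume_offset(path: str, uploaded_bytes: int, uploaded_digest: str) -> int:
    """Return the offset to resume from, or 0 to start over.

    Resuming is only safe if the local file still begins with exactly
    the bytes that were uploaded before.
    """
    if uploaded_bytes > os.path.getsize(path):
        return 0
    if prefix_digest(path, uploaded_bytes) == uploaded_digest:
        return uploaded_bytes
    return 0
```

<p>If the tmp file was re-encrypted in between, the digests differ and the
upload restarts from the beginning instead of splicing two files together.</p>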
<p>My current workaround is to move the file to a tmp directory specific to the remote (and UUID) and when uploading, prefer files inside this directory. Downsides are that it can use slightly more disc space and that git-annex-remote-googledrive has to handle cleanup itself. But I think it's good enough. And the problem probably does not justify changes as big as apparently needed. Thanks for your thoughts on this!</p>
comment 3http://git-annex.branchable.com/todo/Don__39__t_re-encrypt_when_key_is_already_in_.git__47__annex__47__tmp/comment_3_3419d2603af199fd311f6898adef5ed3/joey2020-06-17T01:18:32Z2020-02-28T18:00:01Z
<p>There are probably some other special remotes that are similarly able
to resume and would safely deal with the problems I mentioned. rsync
comes to mind. But I'm inclined to agree with you that the scope of the
changes I found needed to support it in git-annex may not be warranted.</p>
<p>Using your own tmp file is a reasonable workaround.</p>
<p>git-annex actually has an internal concept of a "tmp work dir"
which is associated with a key and can contain whatever tmp files
might be needed to transfer that key. The nice thing about it is
that any time git-annex deletes a key's tmp file, it first deletes
its tmp work dir. The annoying thing about it is that the tmp file
has to exist (even if empty) as long as the tmp work dir does,
otherwise there's a risk the directory will never get cleaned up.</p>
<p>It's currently only used for downloads, eg with youtube-dl. But it would
probably also work for uploads. It might make sense to extend the protocol
to request git-annex tell what the tmp work dir is, and to make sure the
above invariant is satisfied. But if you would like to live a little
dangerously, just take the name of the tmp file, and prefix "work.".
Eg, for the tmp file ".git/annex/tmp/SHA256--x", use
".git/annex/tmp/work.SHA256--x/".</p>