I've implemented true resumable uploads in git-annex-remote-googledrive, which means that uploads can, just like downloads, be resumed at any point, even within a chunk. However, this currently does not work with encrypted files (or chunks), because GPG's output is non-deterministic. To make this feature usable with encrypted files, I propose not overwriting encrypted files that are already present in the tmp directory.
@lykos, what happens when git-annex-remote-googledrive tries to resume in this situation and git-annex has written a different tmp file than what it partially uploaded before?
I imagine it might resume after the last byte it sent before, and so the uploaded file gets corrupted?
If so, there are two hard problems with this idea:
The first could be dealt with by some protocol flag, but the second seems rather intractable, if git-annex-remote-googledrive behaves as I hypothesize it might. And even if git-annex-remote-googledrive somehow behaves better than that, it's quite likely that some other remote would behave that way at some point.
As to implementation details, I started investigating before thinking about the above problem, so am leaving some notes here:
This would first require that the tmp file is written atomically, otherwise an interruption in the wrong place would resume with a partial file. (File size can't be used since gpg changes the file size with compression etc.) Seems easy to implement: Make Remote.Helper.Special.fileStorer write to a different tmp file and rename it into place.
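To illustrate the write-then-rename idea in isolation, here is a minimal Python sketch. It is not git-annex code (fileStorer is Haskell), and the function name `write_atomically` and the `.partial.` prefix are made up for this example. It assumes the temporary file and the destination live on the same filesystem, so that the rename is atomic.

```python
import os
import tempfile


def write_atomically(dest_path, chunks):
    """Write an iterable of byte chunks so that dest_path either holds the
    complete content or is left untouched (hypothetical helper)."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Create the scratch file next to the destination so the final rename
    # stays on one filesystem.
    fd, scratch_path = tempfile.mkstemp(dir=dest_dir, prefix=".partial.")
    try:
        with os.fdopen(fd, "wb") as scratch:
            for chunk in chunks:
                scratch.write(chunk)
            scratch.flush()
            os.fsync(scratch.fileno())
        # Atomic on POSIX when source and destination share a filesystem.
        os.replace(scratch_path, dest_path)
    except BaseException:
        # An interruption leaves no partial file at dest_path.
        try:
            os.unlink(scratch_path)
        except FileNotFoundError:
            pass
        raise
```

With this shape, the mere existence of the tmp file tells a later run that it was written completely, which is the property that file size can't provide here.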
Internally, git-annex pipes the content from gpg, so it is only written to a temp file when using a remote that operates on files, as the external remotes do. Some builtin remotes don't. So resuming an upload to an encrypted remote past the chunk level can't work in general.
There would need to be some way for the code that encrypts chunks (or whole objects) to detect that it's being used with a remote that operates on files, and then check if the tmp file already exists, and avoid re-writing it. This would need some way to examine a Storer and tell if it operates on files, which is not currently possible, so would need some change to the data type.
git-annex-remote-googledrive compares partial checksums after every transmitted chunk and before resuming a transfer, so it would not be affected by either of the problems you describe. However, you're probably right that some other remote might not behave that way.
My current workaround is to move the file to a tmp directory specific to the remote (and UUID) and when uploading, prefer files inside this directory. Downsides are that it can use slightly more disc space and that git-annex-remote-googledrive has to handle cleanup itself. But I think it's good enough. And the problem probably does not justify changes as big as apparently needed. Thanks for your thoughts on this!
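For anyone wanting to do something similar in another remote, the workaround can be sketched in Python as below. The directory layout, the function names, and the parameters `stash_root`, `remote_uuid` and `key` are hypothetical and not the actual layout used by git-annex-remote-googledrive. The sketch assumes the stash directory is on the same filesystem as git-annex's tmp directory, so that the move is a plain rename.

```python
import os


def file_to_upload(annex_tmp_file, stash_root, remote_uuid, key):
    """Return the path that should be uploaded for this key.

    If an earlier, interrupted upload left a stashed copy, prefer it, so
    the bytes match what was partially uploaded. Otherwise move the file
    git-annex just wrote into the per-remote stash and use that.
    """
    stash_dir = os.path.join(stash_root, remote_uuid)
    stashed = os.path.join(stash_dir, key)
    if os.path.exists(stashed):
        return stashed
    os.makedirs(stash_dir, exist_ok=True)
    os.replace(annex_tmp_file, stashed)
    return stashed


def cleanup_after_upload(stashed):
    """The remote has to remove its own stash once the upload succeeded."""
    try:
        os.remove(stashed)
    except FileNotFoundError:
        pass
```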
There are probably some other special remotes that are similarly able to resume and would safely deal with the problems I mentioned. rsync comes to mind. But I'm inclined to agree with you that the scope of the changes I found needed to support it in git-annex may not be warranted.
Using your own tmp file is a reasonable workaround.
git-annex actually has an internal concept of a "tmp work dir" which is associated with a key and can contain whatever tmp files might be needed to transfer that key. The nice thing about it is that any time git-annex deletes a key's tmp file, it first deletes its tmp work dir. The annoying thing about it is that the tmp file has to exist (even if empty) as long as the tmp work dir does, otherwise there's a risk the directory will never get cleaned up.
It's currently only used for downloads, eg with youtube-dl. But it would probably also work for uploads. It might make sense to extend the protocol to request git-annex tell what the tmp work dir is, and to make sure the above invariant is satisfied. But if you would like to live a little dangerously, just take the name of the tmp file, and prefix "work.". Eg, for the tmp file ".git/annex/tmp/SHA256--x", use ".git/annex/tmp/work.SHA256--x/".
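A small Python sketch of that naming convention, for an external remote that wants to try it. The helper names are made up, and since this relies on an internal detail rather than a documented interface, it should be treated as an assumption that may change. The second helper also keeps the invariant that the tmp file exists (even if empty) for as long as the work dir does.

```python
import os


def tmp_work_dir(tmp_file):
    """Derive the work dir by prefixing "work." to the tmp file's name."""
    parent, name = os.path.split(tmp_file)
    return os.path.join(parent, "work." + name)


def ensure_tmp_work_dir(tmp_file):
    """Create the work dir, first making sure the tmp file exists so the
    directory can be cleaned up along with it."""
    parent = os.path.dirname(tmp_file)
    if parent:
        os.makedirs(parent, exist_ok=True)
    # Append mode creates the file if it is missing and never truncates
    # content that is already there.
    open(tmp_file, "ab").close()
    work_dir = tmp_work_dir(tmp_file)
    os.makedirs(work_dir, exist_ok=True)
    return work_dir
```

For example, `tmp_work_dir(".git/annex/tmp/SHA256--x")` gives `.git/annex/tmp/work.SHA256--x` on a POSIX system.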