This todo was originally about `git-annex copy --from --to`, which got implemented, but without the ability to stream content between the two remotes.
So this todo remains open, but is now only concerned with streaming an object that is being received from one remote out to another repository without first needing to buffer the whole object on disk.
git-annex's remote interface does not currently support that.
`retrieveKeyFile` stores the object into a file, and `storeKey` sends an object file to a remote.
The interface would need to be extended to support this, and so would the external special remote interface; git remotes and special remotes would then need to be updated to implement the new interface.
There's also a question of buffering. If an object is being received from one remote faster than it is being sent to the other remote, it has to be buffered somewhere (not in memory). Or the receive interface needs to include a way to get the sender to pause.
Receiving to a file, and sending from the same file as it grows, is one possibility, since that would handle buffering, and it might avoid needing to change the interfaces as much. It would still need a new interface, since the current one does not guarantee the file is written in-order.
A fifo is a possibility, but would certainly not work with remotes that don't write to the file in-order. Also, resuming a download would not work with a fifo; the sending remote wouldn't know where to resume from.
(Let's not discuss the behavior of copy --to when the file is not locally present here; there is plenty of other discussion of that in eg http://bugs.debian.org/671179)
git-annex's special remote API does not allow remote-to-remote transfers without spooling the content to a file on disk first. And it's not possible to do using rsync on either end, AFAICS. It would be possible in some other cases, but this would need to be implemented for each type of remote as a new API call.
Modern systems tend to have quite a large disk cache, so going via a temp file on disk quite possibly won't cause much actual disk IO, when the write and the read occur fairly close together.
The main benefit from streaming would probably be if it could run the download and the upload concurrently. But that would only be a benefit sometimes. With an asymmetric connection, saturating the uplink tends to swamp downloads. Also, if download is faster than upload, it would have to throttle downloads (which complicates the remote API much more), or buffer them to memory (which has its own complications).
Streaming the download to the upload would at best speed things up by a factor of 2. It would probably work nearly as well to upload the previously downloaded file while downloading the next file.
It would not be super hard to make `git annex copy --from --to` download a file, upload it, and then drop it, and parallelizing it with -J2 would keep the bandwidth of both the --from and --to remotes saturated pretty well. Although not perfectly, because two jobs could both be downloading while the uplink is left idle. To make it optimal, it would need to do the download and, when done, push the upload+drop into another queue of actions that is processed concurrently with other downloads.
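As a usage sketch, that would look something like this (remote names here are placeholders):

    git annex copy --from sourceremote --to destremote -J2 .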
And there is a complication with running that at the same time as eg `git annex get` of the same file. It would be surprising for get to succeed (because copy has already temporarily downloaded the file) and then have the file later get dropped. So, it seems that `copy --from --to` would need to stash the content away in a temp file somewhere instead of storing it in the annex proper.

Here's my use case (much simpler):
Three git repos:
- desktop: normal checkout, source of almost all annexed files, commits, etc. The only place I run git-annex commands. Not enough space to store all annexed files.
- main_external: bare git repo, stores all annexed file contents, but no file tree. Usually connected. Purpose: primary backups.
- old_external: like main_external, except connected only occasionally.
I periodically copy from desktop to main_external. That's all well and good.
The tricky part is when I plug in old_external and want to get everything on there. It's hard to get content onto old_external that is stored only on main_external. That's when I want to:
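Ie, something along these lines (a sketch of the transitive copy this todo asks for, using the repo names above):

    git annex copy --from main_external --to old_external .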
Note that this would not copy obsolete data (ie only referenced from old git commits) stored in old_external. I like that.
To work around the lack of that feature, I try to keep copies on desktop until I've had a chance to copy them to both external drives. It's good for numcopies, but I don't like trying to keep track of it, and I wish I could choose to let there be just one copy of things on main_external for replaceable data.
Agreed, it's kind of secondary.
yeah, i noticed that when writing my own special remote.
That is correct.
... and would fail for most, so there's little benefit there.
how about a socket or FIFO of some sort? i know those break a lot of semantics (e.g. `[ -f /tmp/fifo ]` fails in bash) but they could be a solution...

true. there are also in-memory files that could be used, although I don't think this would work across different process spaces.
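to illustrate the `[ -f ]` point above (the `-p` test is the one that does match a fifo):

    mkfifo /tmp/fifo
    [ -f /tmp/fifo ] || echo "not a regular file"
    [ -p /tmp/fifo ] && echo "but it is a fifo"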
for me, the main benefit would be to deal with low disk space conditions, which is quite common on my machines: i often cram the disk almost to capacity with good stuff i want to listen to later... git-annex allows me to freely remove stuff when i need the space, but it often means i am close to 99% capacity on the media drives i use.
that is true.
presented like that, it's true that the benefits of streaming are not good enough to justify the complexity - the only problem is large files and low local disk space... but maybe we can delegate that solution to the user: "free up at least enough space for one of those files you want to transfer".
[... -J magic stuff ...]
My thoughts exactly: actually copying the files to the local repo introduces all sorts of weird --numcopies nastiness and race conditions, it seems to me.
thanks for considering this!
A solution to this subproblem would transparently fall out of a facility for logically dropping files, which was briefly talked about a long time ago. Just mark the file as logically dropped. If the user `git annex get`s it while the copy-out is in progress, its status will change to "present", so copy will know not to physically delete it. (Of course there are race conditions involved, but I presume/hope that they're no worse than git-annex already has to deal with.)
@JasonWoof re your use case, you don't need transitive transfers. You can simply go to `old_external` (assuming that `old_external` has a `main_external` remote) and run a command like the one sketched below; the --branch option it uses was added in a recent git-annex release.
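Something along these lines (a sketch; assuming the annexed files you want are all referenced from the master branch):

    git annex get --branch master --from main_external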
One more use case: running git-annex inside a VirtualBox VM (Linux guest on Windows host), with the repo inside the VM, operating on files on the host through shared folders. The shared folder is too large to copy inside the VM. With the recent addition of importing from a directory special remote without downloading (thanks @joeyh for implementing that!), I can import the files into the repo, bypassing issues with the VirtualBox shared folder filesystem. But I can't upload the files to a cloud special remote without first `git annex get`ting them into the VM, and I unfortunately don't have space for two local copies of each file. So a direct transfer from a directory special remote to a cloud special remote would help a lot.

The alternative, of course, is to run git-annex directly on Windows, but I've run into speed issues with that, and using WSL means losing the ability to run VirtualBox, so running git-annex from a guest VM and operating on host files through the shared folder filesystem -- broken as that is -- seems like the best option right now.
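For reference, the import step mentioned above is something like this, if I have the syntax right (the remote name is a placeholder, and I believe --no-content is the option that skips downloading):

    git annex import master --from host-shared-folder --no-content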
`git-annex copy --from foo --to bar` is implemented now (and move too). But it does need to download a copy of each file in turn, before uploading it. Then the local copy can be deleted. Without streaming, that's the best that can be done. And I think it's probably good enough for most uses.
So this todo is only about streaming now.
Strictly speaking, it's possible to do better than `git-annex copy --from --to` currently does. When git-annex is used as a proxy to a P2P remote, it streams the P2P protocol from client to remote, and so needs no temp files.

So in a way, the P2P protocol is the real solution to this? Except special remotes don't use the P2P protocol.