I have this situation:

* `marcos`: home server, canonical repository with all my files (group=backup)
* `angela`: laptop, with a subset of the files (group=manual)
* `VHS`: backup external USB storage, should have a redundant copy of all files (group=manual)
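For context, a layout like this is expressed with git-annex's group and preferred-content commands; here is a rough sketch (the exact `wanted` expressions are assumptions, not taken from the original post):

```shell
# on marcos (canonical copy):
git annex group . backup
git annex wanted . standard   # the backup group's standard expression wants everything

# on angela and on VHS (content managed by hand):
git annex group . manual
git annex wanted . standard   # manual: keep what is present, don't pull in more
```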
directly connecting the external USB drive to `marcos` is annoying, so I usually connect it to `angela` instead, which doesn't have all the files.

This brings up the peculiar situation that I cannot actually back up all the files to `VHS` from `angela` without first copying them locally.
I have a few issues with that:
* it fails silently: if I try to copy to `VHS` and the file is not on `angela`, it silently fails:

        [997]anarcat@angela:mp3$ git annex drop nothere
        (recording state in git...)
        [998]anarcat@angela:mp3$ git annex copy --to VHS nothere
        [999]anarcat@angela:mp3$ git annex find --in VHS nothere
        [1001]anarcat@angela:mp3$ git annex list nothere
        here
        |VHS
        ||htcones
        |||marcos
        ||||web
        |||||bittorrent
        ||||||htconesdumb
        |||||||
        ___X___ nothere
        [1002]anarcat@angela:mp3$

  this shouldn't silently fail to copy: it should warn me that it can't find a file to copy, at least.
* it takes up more disk space: i need to download all the missing files locally before I can transfer them to `VHS`. here's the way I make sure files are transferred properly to `VHS`:

        git annex copy --to VHS --not --in VHS
        git annex get --not --in VHS
        git annex copy --to VHS --not --in VHS
        git annex drop --not --in 'here@{yesterday}'

  the last line is especially problematic, because it is not accurate...
* it's slower: i need to write files locally before I can transfer them. ideally, those files would be streamed, or at least I would need to buffer locally only one file at a time and not the whole batch.
Maybe I am missing something obvious here and there are other ways of doing this. I am running `6.20160902+gitgbc49d8a-1~ndall+1`.
I know I could set up `angela` to be in the `transfer` group, but then files I don't want would end up stored on `angela`: files that are missing from other remotes, for example. Even worse, some files I do want could become candidates for removal on `angela` because they have been propagated everywhere, whereas I have a select set of files (hence group=manual) that are present on `angela` and that I want to stay there.
It seems to me at least #1 above should be fixed: `copy` shouldn't succeed when it can't comply with the requested preferred content expression.

Somehow, I expected this to work, and maybe that's the core issue here:

    git annex copy --from marcos --to VHS nothere
Thanks for considering this! -- anarcat
(Let's not discuss the behavior of copy --to when the file is not locally present here; there is plenty of other discussion of that in eg http://bugs.debian.org/671179)
git-annex's special remote API does not allow remote-to-remote transfers without spooling it to a file on disk first. And it's not possible to do using rsync on either end, AFAICS. It would be possible in some other cases but this would need to be implemented for each type of remote as a new API call.
Modern systems tend to have quite a large disk cache, so it's quite possible that going via a temp file on disk is not going to use a lot of disk IO to write and read it when the read and write occur fairly close together.
The main benefit from streaming would probably be if it could run the download and the upload concurrently. But that would only be a benefit sometimes. With an asymmetric connection, saturating the uplink tends to swamp downloads. Also, if download is faster than upload, it would have to throttle downloads (which complicates the remote API much more), or buffer them to memory (which has its own complications).
Streaming the download to the upload would at best speed things up by a factor of 2. It would probably work nearly as well to upload the previously downloaded file while downloading the next file.
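That overlap can be sketched in plain shell. This is only an illustration with stand-in `download`/`upload` functions (local `cp` calls, not real git-annex transfers): file N is uploaded in the background while file N+1 is being downloaded, so at most one or two files are spooled locally at a time:

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"
mkdir src tmp dst
for f in a b c; do echo "data-$f" > "src/$f"; done

# stand-ins for transfers from the --from remote and to the --to remote
download() { cp "src/$1" "tmp/$1"; }
upload()   { cp "tmp/$1" "dst/$1" && rm "tmp/$1"; }

pid=
for f in a b c; do
    download "$f"                 # fetch the next file
    [ -n "$pid" ] && wait "$pid"  # let the previous upload finish first
    upload "$f" &                 # upload while the next download runs
    pid=$!
done
wait
```

With real remotes, the two stand-ins would be the actual transfer calls, and the temp copy of each file is deleted as soon as its upload completes.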
It would not be super hard to make `git annex copy --from --to` download a file, upload it, and then drop it, and parallelizing it with -J2 would keep both the --from and --to remotes' bandwidth saturated pretty well. Although not perfectly, because two jobs could both be downloading while the uplink is left idle. To make it optimal, it would need to do the download and, when done, push the upload+drop into another queue of actions that is processed concurrently with other downloads.
And there is a complication with running that at the same time as e.g. `git annex get` of the same file. It would be surprising for get to succeed (because copy has already temporarily downloaded the file) and then have the file later get dropped. So, it seems that `copy --from --to` would need to stash the content away in a temp file somewhere instead of storing it in the annex proper.

Here's my use case (much simpler):
Three git repos:
* desktop: normal checkout, source of almost all annexed files, commits, etc. The only place I run git annex commands. Not enough space to store all annexed files.
* main_external: bare git repo, stores all annexed file contents, but no file tree. Usually connected. Purpose: primary backups.
* old_external: like main_external, except connected only occasionally.
I periodically copy from desktop to main_external. That's all well and good.
The tricky part is when I plug in old_external and want to get everything on there. It's hard to get content onto old_external that is stored only on main_external. That's when I want to:
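The command itself was lost here; presumably it was a then-hypothetical transitive copy along these lines (a reconstruction, not the original text):

```shell
# hypothetical at the time: copy directly between two remotes, no local spool
git annex copy --from main_external --to old_external --not --in old_external
```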
Note that this would not copy obsolete data (ie only referenced from old git commits) stored in old_external. I like that.
To work around the lack of that feature, I try to keep copies on desktop until I've had a chance to copy them to both external drives. It's good for numcopies, but I don't like trying to keep track of it, and I wish I could choose to let there be just one copy of things on main_external for replaceable data.
Agreed, it's kind of secondary.
yeah, i noticed that when writing my own special remote.
That is correct.
... and would fail for most, so there's little benefit there.
how about a socket or FIFO of some sort? i know those break a lot of semantics (e.g. `[ -f /tmp/fifo ]` fails in bash) but they could be a solution...

true. there are also in-memory files that could be used, although I don't think this would work across different process spaces.
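A plain-shell illustration of the FIFO idea (stand-in writer and reader, not git-annex internals): the "download" writes into a named pipe while the "upload" reads from it concurrently, so the content never lands in a regular file -- which also demonstrates the `[ -f ]` breakage mentioned above:

```shell
#!/bin/sh
set -e
dir=$(mktemp -d)
mkfifo "$dir/stream"

# stand-in "download": writes into the FIFO in the background
printf 'annexed file contents\n' > "$dir/stream" &

# stand-in "upload": reads from the FIFO as the data arrives
cat "$dir/stream" > "$dir/uploaded"
wait

[ -f "$dir/stream" ] || echo "not a regular file"   # prints: not a regular file
rm -r "$dir"
```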
for me, the main benefit would be to deal with low disk space conditions, which is quite common on my machines: i often cram the disk almost to capacity with good stuff i want to listen to later... git-annex allows me to freely remove stuff when i need the space, but it often means i am close to 99% capacity on the media drives i use.
that is true.
presented like that, it's true that the benefits of streaming are not good enough to justify the complexity - the only problem is large files and low local disk space... but maybe we can delegate that solution to the user: "free up at least enough space for one of those files you want to transfer".
[... -J magic stuff ...]
My thoughts exactly: actually copying the files to the local repo introduces all sorts of weird --numcopies nastiness and race conditions, it seems to me.
thanks for considering this!
A solution to this subproblem would transparently fall out of a facility for logically dropping files, which was briefly talked about a long time ago. Just mark the file as logically dropped. If the user `git annex get`s it while the copy-out is in progress, its status will change to "present", so copy will know not to physically delete it.

(Of course there are race conditions involved, but I presume/hope that they're no worse than git-annex already has to deal with.)
@JasonWoof re your use case, you don't need transitive transfers. You can simply go to `old_external` and run:

(Assuming that `old_external` has a `main_external` remote.)

This --branch option was added in a recent git-annex release.
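The command was elided above; given the mention of --branch (a file-matching option that lets a bare repo operate on the files of a branch), it was presumably something like this reconstruction:

```shell
# inside old_external (bare), which has main_external as a git remote
git annex get --branch master --from main_external
```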
One more use case: running git-annex inside a VirtualBox VM (Linux guest on Windows host), with the repo inside the VM, operating on files on the host through shared folders. The shared folder is too large to copy inside the VM. With the recent addition of importing from a directory special remote without downloading (thanks @joeyh for implementing that!), I can import the files into the repo, bypassing issues with the VirtualBox shared folder filesystem. But I can't upload the files to a cloud special remote without first `git annex get`ting them into the VM, and I unfortunately don't have space for two local copies of each file. So a direct transfer from a directory special remote to a cloud special remote would help a lot.

The alternative, of course, is to run git-annex directly on Windows, but I've run into speed issues with that, and using WSL means losing the ability to run VirtualBox, so running git-annex from a guest VM and operating on host files through the shared folder filesystem -- broken as that is -- seems like the best option right now.
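A sketch of the transfers involved, with assumed remote names (`hostfiles` for the directory special remote on the shared folder, `cloud` for the cloud special remote -- these names are illustrative, not from the original comment):

```shell
# importing without downloading works:
git annex import master --from hostfiles --no-content

# but getting content to the cloud currently needs a local copy first:
git annex get somefile
git annex copy --to cloud somefile
git annex drop somefile
```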
`git-annex copy --from foo --to bar` is implemented now (and move too).

But it does need to download a copy of each file in turn, before uploading it. Then the local copy can be deleted. Without streaming, that's the best that can be done. And I think it's probably good enough for most uses.
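So the original example from the top of this todo now works, e.g.:

```shell
git annex copy --from marcos --to VHS nothere
# or in bulk:
git annex copy --from marcos --to VHS --not --in VHS
```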
So this todo is only about streaming now.