git-annex can transfer data to and from configured git remotes. Normally those remotes are ordinary git repositories (bare or non-bare; local or remote) that store the file contents in their own git-annex directory.
But git-annex also extends git's concept of remotes with these special types of remotes, which can be used by git-annex to store and retrieve the content of files.
- adb (for Android devices)
- Amazon Glacier
- bittorrent
- bup
- ddar
- directory
- gcrypt (encrypted git repositories!)
- git-lfs
- hook
- rsync
- S3 (Amazon S3, and other compatible services)
- tahoe
- tor
- web
- webdav
- git
- httpalso
- borg
- rclone
The above special remotes are built into git-annex, and can be used to tie git-annex into many cloud services.
Here are specific instructions for using git-annex with various services:
- Amazon Glacier
- Amazon S3
- Backblaze B2
- Box.com
- Ceph
- chef-vault
- Dropbox
- FTP
- Flickr
- Freenet and Siacoin Skynet
- Google Cloud Storage
- Google Drive
- hubiC
- IMAP
- Internet Archive via S3
- ipfs
- Jottacloud
- Mega
- Microsoft Azure Blob Storage
- Microsoft OneDrive
- NNCP
- OpenDrive
- Openstack Swift / Rackspace cloud files / Memset Memstore
- OwnCloud
- pCloud
- QingStor
- SFTP
- SkyDrive
- smb / sftp
- Usenet
- Yandex Disk
If a service is not mentioned above, it's worth checking whether rclone supports it; if so, you can use the rclone special remote.
Want to add support for something else? Write your own!
Ways to use special remotes
There are many use cases for a special remote. You could use it as a backup. You could use it to archive files offline on a drive with encryption enabled, so that if the drive is stolen your data is not. You could git annex move --to specialremote large files when your local drive is getting full, and git annex move the files back when free space is available again. You could have one repository copy files to a special remote, and then git annex get them in another repository, to transfer files between computers that do not communicate directly.
None of these use cases is tied to a particular special remote type. Most special remotes can be used in all of these ways, and others. For most purposes, it doesn't matter what underlying transport the special remote uses.
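For instance, the drive-is-full workflow above might look like this (the remote name `mys3` and the filename are illustrative):

```shell
# Move a large file's content to the special remote, freeing local space
$ git annex move --to mys3 bigfile.iso

# Later, when local space is available again, move it back
$ git annex move --from mys3 bigfile.iso
```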
Setting up a special remote
To initialize a new special remote, use git-annex initremote. See the documentation for the special remote you want to use for details about configuration and examples of how to initremote it.
Once a special remote has been initialized, other clones of the repository can also enable it, by using git-annex enableremote with the same name that was used to initialize it. (Run the command without any name to get a list of available special remotes.)
Initializing or enabling a special remote adds it as a remote of your git repository.
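As a concrete sketch, using the directory special remote (the remote name and path are illustrative):

```shell
# In one clone: create the special remote
$ git annex initremote usbdrive type=directory directory=/media/usb encryption=none

# In another clone of the same repository: enable it by name
$ git annex enableremote usbdrive

# With no name given, list the special remotes that can be enabled
$ git annex enableremote
```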
Storing a git repository in a special remote
Most special remotes do not include a clone of the git repository by default, so you can't use commands like git push and git pull with them. (There are some exceptions, like git-lfs.)
But it is possible to store a git repository in many special remotes, using the git-remote-annex command. This involves configuring the remote with an "annex::" url. It's even possible to git clone from a special remote using such an url. See the documentation of git-remote-annex for details.
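A sketch of what that can look like (the uuid and parameters here are illustrative; see git-remote-annex for the exact url format):

```shell
# Clone directly from a directory special remote via an annex:: url
$ git clone 'annex::358ff77e-0000-0000-0000-000000000001?type=directory&directory=/media/usb&encryption=none' myrepo

# Or add such a url as a remote of an existing repository
$ git remote add usbdrive 'annex::358ff77e-0000-0000-0000-000000000001?type=directory&directory=/media/usb&encryption=none'
$ git push usbdrive main
```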
Unused content on special remotes
Over time, special remotes can accumulate file content that is no longer referred to by files in git. Normally, unused content in the current repository is found by running git annex unused. To detect unused content on special remotes, instead use git annex unused --from. Example:
$ git annex unused --from mys3
unused mys3 (checking for unused data...)
Some annexed data on mys3 is not used by any files in this repository.
NUMBER KEY
1 WORM-s3-m1301674316--foo
(To see where data was previously used, try: git log --stat -S'KEY')
(To remove unwanted data: git-annex dropunused --from mys3 NUMBER)
$ git annex dropunused --from mys3 1
dropunused 12948 (from mys3...) ok
Removing special remotes
Like git remotes, a special remote can be removed from your repository by using git remote remove. Note that this does not delete the special remote, or prevent other repositories from enabling or using it.
Testing special remotes
To make sure that a special remote is working correctly, you can use the git annex testremote command. This expects you to have set up the remote as usual, and it then runs a lot of tests, using random data. It's particularly useful to test new implementations of special remotes.
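For example, to exercise a configured remote named `mys3` (name illustrative):

```shell
# Runs many store/retrieve/remove tests against the remote, using random data
$ git annex testremote mys3
```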
Similar to a JBOD (Just A Bunch Of Disks), this would be Just A Bunch Of Files. I already have a NAS with a file structure conducive to serving media to my TV. However, it's not currently capable of running git-annex locally. It would be great to be able to tell annex the path to a file there as a remote, much like a web remote from "git annex addurl". That way I can safely drop all the files I took with me on my trip, while annex still verifies and counts the files on the NAS as a location.
There are some interesting things to figure out for this to be efficient. For example, SHAs of the files. Maybe store that in a metadata file in the directory of the files? Or perhaps use the WORM backend by default?
Would it be possible to support Rapidshare as a new special remote? They offer unlimited storage for 6-10€ per month. It would be great for larger backups. Their API can be found here: http://images.rapidshare.com/apidoc.txt
Is there any chance a special remote that functions like a hybrid of 'web' and 'hook'? At least in theory, it should be relatively simple, since it would only support 'get' and the only meaningful parameters to pass would be the URL and the output file name.
Maybe make it something like git config annex.myprogram-webhook 'myprogram $ANNEX_URL $ANNEX_FILE', and fetching could work by adding a --handler or --type parameter to addurl.
The use case here is anywhere that simple 'fetch the file over HTTP/FTP/etc' isn't workable - maybe it's on rapidshare and you need to use plowshare to download it; maybe it's a youtube video and you want to use youtube-dl, maybe it's a chapter of a manga and you want to turn it into a CBZ file when you fetch it.
Sorry if it is RTFM... If I have multiple original (reachable) remotes, how could I establish my preference for which one to be used in any given location?
usecase: if I clone a repository within amazon cloud instance -- I would now prefer if this (or all -- user-wide configuration somehow?) repository 'get's load from URLs originating in the cloud of this zone (e.g. having us-east-1.s3.amazonaws.com/ in their URLs).
This should be implemented with costs.
I refer you to: http://git-annex.branchable.com/design/assistant/blog/day_213__costs/
This has been implemented in the assistant, so if you use that, changing priority should be as simple as changing the order of the remotes on the web interface. Whichever remote is highest on the list, is the one your client will fetch from.
Otherwise, you can set remote.<name>.annex-cost to appropriate values. See also the documentation for remote.<name>.annex-cost-command, which allows your own code to calculate costs.
Thank you -- that is nice!
Could costs be presented in 'whereis' and 'status' commands? e.g. like we know APT repositories' priorities from apt-cache policy -- currently I do not see them (at least in 4.20130501... updating to sid's 0521 now)
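Both settings mentioned above live in git config, so the cloud-zone preference could be sketched like this (remote names and values are illustrative; a lower cost means the remote is preferred):

```shell
# Prefer the in-zone S3 remote over a distant one
$ git config remote.s3-us-east-1.annex-cost 150
$ git config remote.s3-eu-west-1.annex-cost 250

# Or let your own script compute the cost dynamically
$ git config remote.s3-us-east-1.annex-cost-command '/usr/local/bin/annex-cost.sh'
```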
Is there any remote which would not only compress during transfer (I believe rsync does that, right?) but also store objects compressed?
I thought bup would do both -- but it seems that git annex receives data uncompressed from a bup remote, and bup remote requires ssh access.
In my case I want to make publicly available files which are binary blobs which could be compressed very well. It would be a pity if I waste storage on my end and also incur significant traffic, which could be avoided if data load was transferred compressed. May be HTTP compression (http://en.wikipedia.org/wiki/HTTP_compression) could somehow be used efficiently for this purpose (not sure if load then originally could already reside in a compressed form to avoid server time to re-compress it)?
ha -- apparently it is trivial to configure apache to serve pre-compressed files (e.g. see http://stackoverflow.com/questions/75482/how-can-i-pre-compress-files-with-mod-deflate-in-apache-2-x) and they arrive compressed to client with
Content-Encoding: gzip
but unfortunately git-annex doesn't like those (fails to "verify") -- do you think this could be implemented for web "special remotes"? That would be really nice -- then I could store such content on another website, and addurl links to the compressed content
All special remotes store files compressed when you enable encryption. Not otherwise, though.
As far as the web special remote and pre-compressed files: files are downloaded from the web using wget or (if wget is not available) curl. So if you can make it work with those commands, it should work.
FWIW -- eh -- unfortunately it seems not that transparent. wget seems to not support decompression at all; curl can, with an explicit --compressed, BUT it doesn't distinguish a url to a "natively" .gz file from pre-compressed content. And I am not sure if it is possible to reliably distinguish the two urls. In the case of obtaining the pre-compressed file from my sample apache server, the only difference in the http response header is that it gets a "compound" ETag: compare ETag: "3acb0e-17b38-4dd5343744660" (for directly asking zeros100.gz) vs "3acb0e-17b38-4dd5343744660;4dd5344e1537e" (requesting zeros100), where the portion past ";" I guess signals the caching tag for gzipping, but I'm not exactly sure on that since it seems to not be part of the standard. Also, for zeros100 I am getting "TCN: choice"... once again, not sure if that is in any way reliably indicative for my purpose. So I guess there is no good way ATM via the Content-Type request.
Is there a unit test or integration test to check for the behavior of a special remote implementation and/or validity?
I don't speak Haskell, so maybe there are some in the source but maybe I wouldn't recognize, so I haven't checked. If there are any tests how should I use it?
Thank you, Bence
Hi Joey,
I am thinking about using google drive as an encrypted backup for my important files. However, I fear that if all my git annex repositories are unrecoverable that the encrypted data on the special remote will not help me much. Assuming I have backed up my gpg key I still get a bunch of decrypted files but the folder structure is lost. Would it be possible to implement something like a safety feature that also uploads an (encrypted) tar of all symlinks (pointing to the respective encrypted files) of the (current/or master-branch) git working tree?
I am almost sure this is already implementable using hooks however I could not find information on which types of hooks are available. I am looking for one that is triggered once after all copy/move operations to a special remote are finished. Can you point me in the right direction?
Marek
@donkeyicydragon one way to accomplish that would be to just tar up .git -- excluding .git/annex/objects -- and add that to git-annex like any other file. You could make a git post-commit hook that does that, but that seems overboard.
Or, you could just make a git clone of your repo to a local removable drive, and use that as a local backup.
I'm using git annex assistant to auto backup my pictures off-site to glacier. The files in glacier are encrypted. However, if I lose my main machine, I've also lost the encryption key, which makes my off-site backup useless. I figured I could fix this by creating a manual mode remote on a usb drive that I keep on my keychain. I figured this would replicate the encryption key (as I might want to pull down files from the full backup glacier remote), but would not replicate the files themselves, as I have more pictures than space on the usb drive.
However, it seems the that the new remote is configured to only talk to my main machine and not glacier; the encryption key is not in the .git/ directory. How do I ensure that I've got an off-site copy of the glacier encryption key?
Thanks, Craig
@craig, all of git-annex's information about a special remote is stored in the git-annex branch in git, so any clone of the git repository is sufficient to back that up. You can run git annex enableremote in a clone to enable an existing special remote.
The only catch is that, if you have chosen to initremote a special remote using a gpg key, with keyid=whatever, you'll of course also need that gpg key to use it. If you run git annex info $myremote, it will tell you, among other things, any gpg keys that are used by that remote.
Cool, thanks. I see the gpg key in remote.log in the git-annex branch, so it's saved, which is the thing I care about most. I'm now sure I could recover my data in a DR scenario. However, I seem to be missing something with enableremote and how this is all supposed to work.
My main repo is ~/local/pics and here's the result of git annex info glacier:
I used git annex assistant to create a manual mode remote on my usb key. This created a annex-pics directory on the usb key with a bare repo.
I then did a git clone from the bare repo into a tmp dir:
But when I enable the glacier remote, which I'd have to do in a DR scenario, I get an error:
It knows about the remote, but hasn't assigned a name to it:
Doing a git annex info on the uuid does something, but I'm not clear what it does:
An enable remote on the uuid doesn't work either:
I feel like I'm missing a step. What am I missing?
Thanks, Craig
@craig, this can be slightly confusing, since git-annex enableremote uses the same name that you used when creating the remote in the first place with git-annex initremote... which might be different from the name used for that remote in some repository or other, and from the description shown in git annex info.
Since every remote listed by git annex info is apparently a regular git repo, not a special remote, with the exception of the glacier one, process of deduction suggests that the "gitannexpics" special remote is the same as the glacier one.
I've made some changes now, so git annex enableremote will list the uuid and description, along with the name used by enableremote, and will accept any one of those things to specify which remote to enable.
Backblaze B2, with unlimited storage at $.005/GB/mo, seems to be a great option for a special remote. Is it feasible to add support for it? I'd love to contribute financially.
https://www.backblaze.com/b2
@openmedi git-annex doesn't currently keep track of how much space it's using on a special remote. It's actually quite a difficult problem to do that in general, since multiple distributed clones of a repository can be uploading to the same special remote at the same time.
If it runs out of space and transfers fail, git-annex will handle the failures semi-gracefully, which is to say nothing will stop it from trying again or trying to send other data, but it will certainly be aware that files are not reaching the special remote.
If a particular storage service has a way to check free space, it would not be hard to make git-annex's special remote implementation check it and avoid trying transfers when it's full.
I'm trying to write a remote (for smb:// support via GNOME's Gvfs), and I can't seem to find a way to change an existing special remote's parameters.
Even when marked as "dead" (the closest to deleting a remote that I could find), it still blocks subsequent annex initremote calls with the same name.
Also, ideally I'd want to reuse the same name and keep the same UUID (e.g. when the backend is moved/renamed). Though of course there are situations where a new UUID would be wanted as well... (I guess that could depend on whether the remote is currently "dead" or not?)
Hmm, I just found that annex enableremote accepts parameters to be modified; that should work for now.
Though I still wonder about situations where one wants to add a new empty remote with a new UUID, but reuse the old name...
@grawity yes, enableremote is the way to change configuration of an existing special remote.
The special remote names are a bit funky; to keep the user from needing to enter in a long uuid when enabling a particular special remote, a name has to be recorded for the remote, and that becomes shared across clones of that repository, in a way that the names of git remotes are not normally. (Normally, my "origin" might be your "upstream" etc.)
While it could ignore dead remotes when initializing a new remote with an existing name, then if the old remote got brought back from the dead, there would be a naming conflict. So, I think it's best to not go down that path, to avoid the undead horrors lurking there.
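For example, changing a parameter of an existing special remote with enableremote might look like this (the remote name and parameter are illustrative):

```shell
# Update a configuration parameter of an existing special remote in place
$ git annex enableremote mysmb directory=/new/mount/point
```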
Looks like it's not possible to set the annex cost of the "web" and "bittorrent" special remotes.
This doesn't seem to work:
[remote "web"] annex-cost = 500
Perhaps, because these remotes don't actually exist.
Setting the annex cost for a webdav remote works, but it is incremented by 50 for some reason.
@bec.watson, better to open a bug report for this kind of problem.
Seems that the AWS library that git-annex is using does not use V4 authorization yet. Work is in progress: https://github.com/aristidb/aws/pull/199
As to the endpoint hostname, that must be a special case for China, so I've made git-annex aware of that special case.
Hey,
I'm setting up some git repos as part of a FAI [0] build, and part of this process is running "git annex initremote" to add an rsync remote.
git annex is trying to connect to the remote, but as this an automated build there are no ssh keys available, and warnings etc are spat out. Is it possible to instruct initremote to not connect to the remote as part of this process?
Cheers, Andrew
[0] http://fai-project.org
@andrew while it might be possible in theory to set up a rsync special remote that's usable without connecting to the rsync server yet, there are other types of special remotes that do need to connect to the server. S3 comes to mind; it needs to either create a bucket or check if an existing bucket is already being used as a git-annex remote.
So, I don't think this can be supported generally across all special remote implementations. We could try to support it for specific ones like rsync. I don't actually see anywhere in the code for the rsync special remote where initremote will connect to the remote server though.
Hi,
So while writing the globus special remote (git-annex-remote-globus) I often redirect my logs to annex. Nevertheless these logs are always logged out in the console, them being INFO, ERROR, DEBUG and I would like to control that. Is there a way to disable console logging of logs sent back to git annex?
Something like ANNEX_LOG_LEVEL=self.annex.ERROR
Thanks !!
Regards Giulia
@giuly.ippoliti this is not the best place to ask.. external special remote protocol is a better place to discuss special remote implementation.
Anyway, git-annex's --quiet option will shut up the INFO. DEBUG is only ever displayed when you use the --debug option. ERROR should only be used if you have a problem that the user is going to care about seeing.
I have a few special remotes configured in my git annex-ed repo. As far as git is concerned, these are just ordinary remotes (they have entries in the .git/config file). As a result, when I do something like git fetch --all, it still tries to fetch from them. This of course fails (since they aren't actually git repositories). It isn't the end of the world, but it causes some extra noise in my output (I use magit in emacs for most of my git workflow, so I have to drill down a bit to see what actually failed and then dismiss it as an expected failure).
It'd be nice if I could just prevent git from fetching from these special remotes. Is there a clever way I can set remote.<name>.fetch in the local config so as to leave these remotes alone when fetching?
@Dan, set remote.<name>.skipFetchAll to true. Or make a remote group containing the remotes you do want to fetch from.
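A sketch of that setting (the remote name is illustrative):

```shell
# Make git fetch --all skip this special remote
$ git config remote.mys3.skipFetchAll true
```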
I mean, I made it do so already, but since this is not a bug tracking system, but an increasingly too long set of disjoint comments, I didn't want to follow up here to say that.
Please, if you have some question about special remotes, post it in the forum. If you have an improvement, post a todo item. If you have a problem, file a bug report. If you're commenting on this page, please think about all the subsequent visitors who will see a huge comment thread and how they will react.
git-annex-dead remotename
I've tried git-annex-dead; it removes the remotes from command outputs, but the name is still taken. This isn't great if I want to delete and reconfigure the same remote.
I can always just call it remote2, but that's unaesthetic.
See git-annex-renameremote. (And an old post on the topic, but that's before git-annex-renameremote was added.)
It would be possible for initremote to allow reusing a name that is used for an old remote that has been marked as dead. The reason that is not allowed, I think, is that there is some footgun potential: if the old remote somehow was lost and then was found, it would either be hard to unmark it as dead, or, if that was done, there would be 2 remotes with the same name, which would complicate using enableremote.
So it seems best to rename the old one before creating the new one. That way it's still got a unique name.
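That rename-then-reuse workflow might look like this (remote names and parameters are illustrative):

```shell
# Rename the dead remote out of the way, then reuse its old name
$ git annex renameremote mys3 mys3-old
$ git annex initremote mys3 type=S3 encryption=shared
```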
@joey While inspecting my special remote (rsync, encryption=hybrid) I noticed that a couple files ended up in the same folder tree, is that normal or is something wrong? Obfuscated tree output below.
@gaknuyardi that is expected; they are hash directories. You can see the same effect in the .git/annex/objects/ hash directories when there are enough objects.