A walkthrough of the basic features of git-annex.
- creating a repository
- adding a remote
- adding files
- renaming files
- getting file content
- transferring files: When things go wrong
- removing files
- removing files: When things go wrong
- modifying annexed files
- using ssh remotes
- using special remotes
- moving file content between repositories
- quiet please: When git-annex seems to skip files
- using tags and branches
- unused data
- fsck: verifying your data
- fsck: when things go wrong
- automatically managing content
Creating a repository is straightforward. Just give git-annex a description of the repository.
    # mkdir ~/annex
    # cd ~/annex
    # git init
    # git annex init "my laptop"
Like any other git repository, git-annex repositories have remotes. Let's start by adding a USB drive as a remote.
    # sudo mount /media/usb
    # cd /media/usb
    # git clone ~/annex
    # cd annex
    # git annex init "portable USB drive"
    # git remote add laptop ~/annex
    # cd ~/annex
    # git remote add usbdrive /media/usb/annex
This is all standard ad-hoc distributed git repository setup. The only git-annex specific part is telling it the name of the new repository created on the USB drive.
Notice that both repos are set up as remotes of one another. This lets either get annexed files from the other. You'll want to do that even if you are using git in a more centralized fashion.
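The mutual-remote layout itself needs nothing beyond plain git. Here is a minimal sketch of it, using temporary directories as hypothetical stand-ins for ~/annex and /media/usb/annex (the `git annex init` steps are left out):

```shell
#!/bin/sh
# Sketch: two plain git repositories that list each other as remotes.
# The temporary paths are assumptions standing in for the laptop and USB drive.
set -e
work=$(mktemp -d)
laptop="$work/laptop"
usb="$work/usb"

git init -q "$laptop"
# Cloning an empty repository works; git only warns about it.
git clone -q "$laptop" "$usb" 2>/dev/null

# Name each repository as a remote of the other.
git -C "$usb" remote add laptop "$laptop"
git -C "$laptop" remote add usbdrive "$usb"

git -C "$laptop" remote -v
git -C "$usb" remote -v
```

The clone also keeps its default `origin` remote; the extra `laptop` name just mirrors the naming used in this walkthrough.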
    # cd ~/annex
    # cp /tmp/big_file .
    # cp /tmp/debian.iso .
    # git annex add .
    add big_file (checksum...) ok
    add debian.iso (checksum...) ok
    # git commit -a -m added
When you add a file to the annex and commit it, only a symlink to
the content is committed to git. The content itself is stored in
.git/annex/ (or in direct mode the file
is left as-is).
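To picture what "only a symlink is committed" means, here is a simplified model in plain shell. It is a sketch, not git-annex's real on-disk layout: the `objects/` directory is a hypothetical stand-in for .git/annex/, the key follows the `SHA256-s<size>--<hash>` shape seen in the transcripts of this walkthrough, and `sha256sum` from GNU coreutils is assumed to be available.

```shell
#!/bin/sh
# Simplified model of annexing a file: store the content under a key derived
# from its size and checksum, then leave a symlink in its place.
# objects/ is a hypothetical stand-in for .git/annex/.
set -e
work=$(mktemp -d)
cd "$work"
printf 'pretend this is a big file\n' > big_file

size=$(wc -c < big_file | tr -d ' ')
sum=$(sha256sum big_file | cut -d' ' -f1)
key="SHA256-s${size}--${sum}"

mkdir -p objects
mv big_file "objects/$key"
chmod a-w "objects/$key"        # the stored content is write-protected
ln -s "objects/$key" big_file   # the work tree keeps only a symlink

readlink big_file
cat big_file
```

Git would then track only the small symlink, while the content stays outside git's own object store.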
    # cd ~/annex
    # git mv big_file my_cool_big_file
    # mkdir iso
    # git mv debian.iso iso/
    # git commit -m moved
You can use any normal git operations to move files around, or even make copies or delete them.
Notice that, since annexed files are represented by symlinks, the symlink will break when the file is moved into a subdirectory. But, git-annex will fix this up for you when you commit -- it has a pre-commit hook that watches for and corrects broken symlinks.
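The breakage is ordinary relative-symlink behavior and can be reproduced without git-annex. The sketch below uses hypothetical paths (`store/content` stands in for the annexed content) and repairs the link by hand, which is roughly the kind of fixup the pre-commit hook automates:

```shell
#!/bin/sh
# Sketch: a relative symlink breaks when moved one directory deeper,
# and is repaired by re-pointing it. All paths are hypothetical.
set -e
work=$(mktemp -d)
cd "$work"
mkdir -p store iso
echo "image data" > store/content
ln -s store/content debian.iso

mv debian.iso iso/
# iso/debian.iso now resolves to iso/store/content, which does not exist.
broken=no
[ -e iso/debian.iso ] || broken=yes
echo "broken after move: $broken"

# Repair: point the link one level up.
rm iso/debian.iso
ln -s ../store/content iso/debian.iso
cat iso/debian.iso
```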
(Note that if a repository is in direct mode, you can't run normal git
commands in it. Instead, just move the files using non-git commands, and
git annex add and
git annex sync.)
A repository does not always have all annexed file contents available. When you need the content of a file, you can use "git annex get" to make it available.
We can use this to copy everything in the laptop's annex to the USB drive.
    # cd /media/usb/annex
    # git annex sync laptop
    # git annex get .
    get my_cool_big_file (from laptop...) ok
    get iso/debian.iso (from laptop...) ok
Notice that in the previous example,
git annex sync was used. This lets git-annex know what has changed in the other
repositories like the laptop, and so it knows about the files present there and can get them.
Let's look at what the sync command does in more detail:
    # cd /media/usb/annex
    # git annex sync
    commit
    nothing to commit (working directory clean)
    ok
    pull laptop
    ok
    push laptop
    ok
After you run sync, the git repository will be updated with all changes made to its remotes, and any changes in the git repository will be pushed out to its remotes, where a sync will get them. This is especially useful when using git in a distributed fashion, without a central bare repository. See sync for details.
Note that git annex sync only syncs the metadata about your
files that is stored in git. It does not sync the contents of files, which
are managed by git-annex. To do that, you can use
git annex sync --content.
After a while, you'll have several annexes, with different file contents. You don't have to try to keep all that straight; git-annex does location tracking for you. If you ask it to get a file and the drive or file server is not accessible, it will let you know what it needs to get it:
    # git annex get video/hackity_hack_and_kaxxt.mov
    get video/hackity_hack_and_kaxxt.mov (not available)
      Unable to access these remotes: usbdrive, server
      Try making some of these repositories available:
        5863d8c0-d9a9-11df-adb2-af51e6559a49 -- my home file server
        58d84e8a-d9ae-11df-a1aa-ab9aa8c00826 -- portable USB drive
        ca20064c-dbb5-11df-b2fe-002170d25c55 -- backup SATA drive
    failed
    # sudo mount /media/usb
    # git annex get video/hackity_hack_and_kaxxt.mov
    get video/hackity_hack_and_kaxxt.mov (from usbdrive...) ok
When you're using git-annex you can
git rm a file just like you usually
would with git. Just like with git, this removes the file from your work
tree, but it does not remove the file's content from the git repository.
If you check the file back out, or revert the removal, you can get it back.
Git-annex adds the ability to remove the content of a file from your local repository to save space. This is called "dropping" the file.
You can always drop files safely. Git-annex checks that some other repository still has the file before removing it.
    # git annex drop iso/debian.iso
    drop iso/debian.iso ok
Once dropped, the file will still appear in your work tree as a broken symlink.
You can use
git annex get as usual to get this file back to your local repository.
Before dropping a file, git-annex wants to be able to look at other remotes, and verify that they still have a file. After all, it could have been dropped from them too. If the remotes are not mounted/available, you'll see something like this.
    # git annex drop important_file other.iso
    drop important_file (unsafe)
      Could only verify the existence of 0 out of 1 necessary copies
      Unable to access these remotes: usbdrive
      Try making some of these repositories available:
        58d84e8a-d9ae-11df-a1aa-ab9aa8c00826 -- portable USB drive
        ca20064c-dbb5-11df-b2fe-002170d25c55 -- backup SATA drive
      (Use --force to override this check, or adjust numcopies.)
    failed
    drop other.iso (unsafe)
      Could only verify the existence of 0 out of 1 necessary copies
      No other repository is known to contain the file.
      (Use --force to override this check, or adjust numcopies.)
    failed
Here you might --force it to drop
important_file if you trust your backup.
other.iso looks to have never been copied to anywhere else, so if
it's something you want to hold onto, you'd need to transfer it to
some other repository before dropping it.
Normally, the content of files in the annex is prevented from being modified. (Unless your repository is using direct mode.)
That's a good thing: it might be the only copy, and you wouldn't want to lose it in a fumblefingered mistake.
    # echo oops > my_cool_big_file
    bash: my_cool_big_file: Permission denied
In order to modify a file, it should first be unlocked.
    # git annex unlock my_cool_big_file
    unlock my_cool_big_file (copying...) ok
That replaces the symlink that normally points at its content with a copy of the content. You can then modify the file like any regular file. Because it is a regular file.
(If you decide you don't need to modify the file after all, or want to discard
modifications, just use
git annex lock.)
When you git commit, git-annex's pre-commit hook will automatically
notice that you are committing an unlocked file, and add its new content
to the annex. The file will be replaced with a symlink to the new content,
and this symlink is what gets committed to git in the end.
    # echo "now smaller, but even cooler" > my_cool_big_file
    # git commit my_cool_big_file -m "changed an annexed file"
    add my_cool_big_file ok
    [master 64cda67] changed an annexed file
     1 files changed, 1 insertions(+), 1 deletions(-)
There is one problem with using
git commit like this: Git wants to first
stage the entire contents of the file in its index. That can be slow for
big files (sorta why git-annex exists in the first place). So, the
automatic handling on commit is a nice safety feature, since it prevents
the file content being accidentally committed into git. But when working with
big files, it's faster to explicitly add them to the annex yourself:
    # echo "now smaller, but even cooler yet" > my_cool_big_file
    # git annex add my_cool_big_file
    add my_cool_big_file ok
    # git commit my_cool_big_file -m "changed an annexed file"
So far in this walkthrough, git-annex has been used with a remote repository on a USB drive. But it can also be used with a git remote that is truly remote, a host accessed by ssh.
Say you have a desktop on the same network as your laptop and want to clone the laptop's annex to it:
    desktop# git clone ssh://mylaptop/home/me/annex ~/annex
    desktop# cd ~/annex
    desktop# git annex init "my desktop"
Now you can get files, and they will be transferred over ssh:
    desktop# git annex get my_cool_big_file
    get my_cool_big_file (getting UUID for origin...) (from origin...)
    SHA256-s86050597--6ae2688bc533437766a48aa19f2c06be14d1bab9c70b468af445d4f07b65f41e 100% 2159 2.1KB/s 00:00
    ok
When you drop files, git-annex will ssh over to the remote and make sure the file's content is still there before removing it locally:
    desktop# git annex drop my_cool_big_file
    drop my_cool_big_file (checking origin..) ok
Note that normally git-annex prefers to use non-ssh remotes, like
a USB drive, before ssh remotes. They are assumed to be faster/cheaper to
access, if available. There is an annex-cost setting you can configure in
.git/config to adjust which repositories it prefers. See
the man page for details.
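As a hedged illustration of the cost setting, the sketch below writes per-remote costs into a scratch repository's .git/config using plain `git config`. It assumes the key is spelled `remote.<name>.annex-cost` as described in the git-annex man page; the `desktop` remote, its URL, and the numeric values are made up for the example.

```shell
#!/bin/sh
# Sketch: record a per-remote cost in a repository's .git/config.
# The remote names, URLs, and cost values are illustrative assumptions.
set -e
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" remote add usbdrive /media/usb/annex
git -C "$repo" remote add desktop ssh://mydesktop/home/me/annex

# A lower cost means the remote is tried first.
git -C "$repo" config remote.usbdrive.annex-cost 50
git -C "$repo" config remote.desktop.annex-cost 300

git -C "$repo" config --get-regexp 'annex-cost$'
```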
Also, note that you need full shell access for this to work -- git-annex needs to be able to ssh in and run commands. Or at least, your shell needs to be able to run the git-annex-shell command.
For details on setting up ssh remotes, see the centralized git repository tutorial.
We've seen above that git-annex can be used to store files in regular git remotes, accessed either via ssh, or on a removable drive. But git-annex can also store files in Amazon S3, Glacier, on a rsync server, in WebDAV, or even pull files down from the web and bittorrent. This and much more is made possible by special remotes.
These are not normal git repositories; indeed the git repository is not stored on a special remote. But git-annex can store the contents of files in special remotes, and operate on them much as it would on any other remote. Bonus: Files stored on special remotes can easily be encrypted!
All you need to get started using a special remote is to initialize it.
This is done using the
git annex initremote command, which needs to be
passed different parameters depending on the type of special remote.
Some special remotes also need things like passwords to be set in environment variables. Don't worry -- it will prompt if you leave anything off. So feel free to make any kind of special remote instead of the S3 remote used in this example.
    # export AWS_ACCESS_KEY_ID="somethingotherthanthis"
    # export AWS_SECRET_ACCESS_KEY="s3kr1t"
    # git annex initremote mys3 type=S3 chunk=1MiB encryption=shared
    initremote mys3 (shared encryption) (checking bucket) (creating bucket in US) ok
Now you can store files on the newly initialized special remote.
    # git annex copy my_cool_big_file --to mys3
    copy my_cool_big_file (to mys3...) ok
Once you've initialized a special remote in one repository, you can enable use of the same special remote in other clones of the repository. If the mys3 remote above was initialized on your laptop, you'll also want to enable it on your desktop.
To do so, first get git-annex in sync (so it knows about
the special remote that was added in the other repository), and then
git annex enableremote.
    desktop# git annex sync
    desktop# export AWS_ACCESS_KEY_ID="somethingotherthanthis"
    desktop# export AWS_SECRET_ACCESS_KEY="s3kr1t"
    desktop# git annex enableremote mys3
    enableremote mys3 (checking bucket) ok
And now you can download files from the special remote:
    desktop# git annex get my_cool_big_file --from mys3
    get my_cool_big_file (from mys3...) ok
This has only scratched the surface of what can be done with special remotes.
Often you will want to move some file contents from a repository to some
other one. For example, your laptop's disk is getting full; time to move
some files to an external disk before moving another file from a file
server to your laptop. Doing that by hand (by using
git annex get and
git annex drop) is possible, but a bit of a pain.
git annex move
makes it very easy.
    # git annex move my_cool_big_file --to usbdrive
    move my_cool_big_file (to usbdrive...) ok
    # git annex move video/hackity_hack_and_kaxxt.mov --from fileserver
    move video/hackity_hack_and_kaxxt.mov (from fileserver...)
    SHA256-s86050597--6ae2688bc533437766a48aa19f2c06be14d1bab9c70b468af445d4f07b65f41e 100% 82MB 199.1KB/s 07:02
    ok
One behavior of git-annex is sometimes confusing at first, but it turns out to be useful once you get to know it.
    # git annex drop *
    #
Why didn't git-annex seem to do anything despite being asked to drop all the files? Because it checked them all, and none of them are present.
Most git-annex commands will behave this way when they're able to quickly check that nothing needs to be done about a file.
Running a git-annex command without specifying any file name will make git-annex look for files in the current directory and its subdirectories. So, we can add all new files to the annex easily:
    # echo hi > subdir/subsubdir/newfile
    # git annex add
    add subdir/subsubdir/newfile ok
When doing this kind of thing, having nothing shown for files that it doesn't need to act on is useful because it prevents swamping you with output. You only see the files it finds it does need to act on.
So remember: If git-annex seems to not do anything when you tell it to, it's not being lazy -- It's checked that nothing needs to be done to get to the state you asked for!
Like git, git-annex hangs on to every old version of a file (by default), so you can make tags and branches, and can check them out later to look at the old files.
    # git tag 1.0
    # rm -f my_cool_big_file
    # git commit -a -m deleted
    # git checkout 1.0
    # cat my_cool_big_file
    yay! old version still here
Of course, when you
git checkout an old branch, some old versions of
files may not be locally available, and may be stored in some other
repository. You can use
git annex get to get them as usual.
It's possible for data to accumulate in the annex that no files in any
branch point to anymore. One way it can happen is if you
git rm a file
without first calling
git annex drop. And, when you modify an annexed
file, the old content of the file remains in the annex. Another way is when
migrating between key-value backends.
This might be historical data you want to preserve, so git-annex defaults to preserving it. So from time to time, you may want to check for such data:
    # git annex unused
    unused . (checking for unused data...)
      Some annexed data is no longer used by any files in the repository.
        NUMBER  KEY
        1       SHA256-s86050597--6ae2688bc533437766a48aa19f2c06be14d1bab9c70b468af445d4f07b65f41e
        2       SHA1-s14--f1358ec1873d57350e3dc62054dc232bc93c2bd1
      (To see where data was previously used, try: git log --stat -S'KEY')
      (To remove unwanted data: git-annex dropunused NUMBER)
    ok
After running git annex unused, you can follow the instructions to examine
the history of files that used the data, and if you decide you don't need that
data anymore, you can easily remove it from your local repository.
    # git annex dropunused 1
    dropunused 1 ok
Hint: To drop a lot of unused data, use a command like this:
    # git annex dropunused 1-1000
Rather than removing the data, you can instead send it to other repositories:
    # git annex copy --unused --to backup
    # git annex move --unused --to archive
You can use the fsck subcommand to check for problems in your data. What can be checked depends on the key-value backend you've used for the data. For example, when you use the SHA1 backend, fsck will verify that the checksums of your files are good. Fsck also checks that the numcopies setting is satisfied for all files.
    # git annex fsck
    fsck some_file (checksum...) ok
    fsck my_cool_big_file (checksum...) ok
    ...
You can also specify the files to check. This is particularly useful if you're using sha1 and don't want to spend a long time checksumming everything.
    # git annex fsck my_cool_big_file
    fsck my_cool_big_file (checksum...) ok
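The checksum part of such a check can be pictured with a small shell sketch. It assumes a key of the form `SHA256-s<size>--<hash>`, as seen in the transcripts of this walkthrough, and GNU `sha256sum`; `verify_key` is a hypothetical helper written for this example, not a git-annex command, and it says nothing about how fsck is actually implemented.

```shell
#!/bin/sh
# Sketch of a checksum check: compare a file's SHA256 with the hash
# embedded in a key of the form SHA256-s<size>--<hash>.
# verify_key is a hypothetical helper, not part of git-annex.
set -e

verify_key() {
    key=$1
    file=$2
    expected=${key##*--}                          # text after the last "--"
    actual=$(sha256sum "$file" | cut -d' ' -f1)
    [ "$expected" = "$actual" ]
}

work=$(mktemp -d)
printf 'some content\n' > "$work/data"
key="SHA256-s13--$(sha256sum "$work/data" | cut -d' ' -f1)"

if verify_key "$key" "$work/data"; then echo "ok"; fi

# Simulate corruption, then check again.
printf 'bit rot\n' >> "$work/data"
if verify_key "$key" "$work/data"; then echo "ok"; else echo "failed"; fi
```

A backend whose keys carry no checksum would leave nothing for a check like this to compare, which is why what fsck can verify depends on the backend.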
If you have a large repo, you may want to check it in smaller steps. You may start and continue an aborted or time-limited check.
    # git annex fsck -S <optional-directory> --time-limit=1m
    fsck some_file (checksum...) ok
    fsck my_cool_big_file (checksum...) ok
      Time limit (1m) reached!
    # git annex fsck -m <optional-directory>
    fsck my_other_big_file (checksum...) ok
    ...
Use --incremental to start the incremental check. Use
--more to continue the started check where it left
off. Note that the progress of
fsck is saved after every
1000 files or 5 minutes, or when
--time-limit occurs. Some
files may be checked again when
git-annex exits abnormally
(e.g. with Ctrl+C) and the check is restarted.
Fsck never deletes possibly bad data; instead it will be moved to
.git/annex/bad/ for you to recover. Here is a sample of what fsck
might say about a badly messed up annex:
    # git annex fsck
    fsck my_cool_big_file (checksum...)
    git-annex: Bad file content; moved to .git/annex/bad/SHA1:7da006579dd64330eb2456001fd01948430572f2
    git-annex: ** No known copies exist of my_cool_big_file
    failed
    fsck important_file
    git-annex: Only 1 of 2 copies exist. Run git annex get somewhere else to back it up.
    failed
    git-annex: 2 failed
git-annex can be configured to require that more than one copy of a file exists, as a simple backup for your data. This is controlled by the numcopies setting, which defaults to 1 copy. Let's change that to require 2 copies, and send a copy of every file to a USB drive.
    # git annex numcopies 2
    # git annex copy . --to usbdrive
Now when we try to
git annex drop a file, it will verify that it
knows of 2 other repositories that have a copy before removing its
content from the current repository.
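The idea behind that verification can be modeled in a few lines of shell. This is only a toy model: `can_drop` and the directory layout are hypothetical, with plain directories standing in for other repositories, and nothing here reflects git-annex internals.

```shell
#!/bin/sh
# Toy model of the check made before dropping: count how many other
# locations hold a file and compare with the required number of copies.
# can_drop and the directories are hypothetical, not git-annex internals.
set -e

can_drop() {
    needed=$1; name=$2; shift 2
    found=0
    for repo in "$@"; do
        if [ -e "$repo/$name" ]; then found=$((found + 1)); fi
    done
    [ "$found" -ge "$needed" ]
}

work=$(mktemp -d)
mkdir "$work/usbdrive" "$work/server"
touch "$work/usbdrive/big_file"

# Only one other copy exists, so a drop requiring 2 copies is refused.
if can_drop 2 big_file "$work/usbdrive" "$work/server"; then
    echo "drop ok"
else
    echo "drop refused"
fi

# Once a second copy exists elsewhere, the drop is allowed.
touch "$work/server/big_file"
if can_drop 2 big_file "$work/usbdrive" "$work/server"; then
    echo "drop ok"
fi
```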
The numcopies setting used above is the global default. You can also vary the number of copies needed, depending on the file name. So, if you want 3 copies of all your flac files, but only 1 copy of oggs:
    # echo "*.ogg annex.numcopies=1" >> .gitattributes
    # echo "*.flac annex.numcopies=3" >> .gitattributes
Or, you might want to make a directory for important stuff, and configure it so anything put in there is backed up more thoroughly:
    # mkdir important_stuff
    # echo "* annex.numcopies=3" > important_stuff/.gitattributes
For more details about the numcopies setting, see copies.
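Because these are ordinary git attributes, plain `git check-attr` can show which value applies to a given path. The following sketch uses a scratch repository and made-up file names; the files do not need to exist for the attribute lookup to work.

```shell
#!/bin/sh
# Sketch: ask git which annex.numcopies value applies to a path.
# The file names are illustrative; only plain git is used here.
set -e
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
echo "*.ogg annex.numcopies=1" >> .gitattributes
echo "*.flac annex.numcopies=3" >> .gitattributes
mkdir important_stuff
echo "* annex.numcopies=3" > important_stuff/.gitattributes

git check-attr annex.numcopies -- song.flac song.ogg important_stuff/notes.txt
```

Each output line names the path, the attribute, and the value git resolved for it, which makes it easy to confirm a pattern matches what you intended.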
Once you have multiple repositories, and have perhaps configured numcopies, any given file can have many more copies than is needed, or perhaps fewer than you would like. How to manage this?
The whereis subcommand can be used to see how many copies of a file are known, but then you have to decide what to get or drop. In this example, there are perhaps not enough copies of the first file, and too many of the second file.
    # cd /media/usbdrive
    # git annex whereis
    whereis my_cool_big_file (1 copy)
        0c443de8-e644-11df-acbf-f7cd7ca6210d -- laptop
    whereis other_file (3 copies)
        0c443de8-e644-11df-acbf-f7cd7ca6210d -- laptop
        62b39bbe-4149-11e0-af01-bb89245a1e61 -- usb drive [here]
        7570b02e-15e9-11e0-adf0-9f3f94cb2eaa -- backup drive
What would be handy is automated versions of get and drop that only get a file if there are not yet enough copies of it, or only drop a file if there are too many copies. Well, these exist: just use the --auto option.
    # git annex get --auto --numcopies=2
    get my_cool_big_file (from laptop...) ok
    # git annex drop --auto --numcopies=2
    drop other_file ok
With two quick commands, git-annex was able to decide for you how to work toward having two copies of your files.
    # git annex whereis
    whereis my_cool_big_file (2 copies)
        0c443de8-e644-11df-acbf-f7cd7ca6210d -- laptop
        62b39bbe-4149-11e0-af01-bb89245a1e61 -- usb drive [here]
    whereis other_file (2 copies)
        0c443de8-e644-11df-acbf-f7cd7ca6210d -- laptop
        7570b02e-15e9-11e0-adf0-9f3f94cb2eaa -- backup drive
The --auto option can also be used with the copy command; again, this lets git-annex decide whether to actually copy content.
The above shows how to use --auto to manage content based on the number of copies. It's also possible to configure, on a per-repository basis, which content is desired. Then --auto also takes that into account; see preferred content for details.
So ends the walkthrough. By now you should be able to use git-annex.
Want more? See tips for lots more features and advice.