how it works

This page gives a high-level view of how git-annex works. For a detailed low-level view, see the man page and internals.

You do not need to read this page to get started with using git-annex. The walkthrough provides step-by-step examples, and workflow discusses different ways you can use git-annex.

Still reading? Ok. Git's man page calls it "a stupid content tracker". With git-annex, git is instead "a stupid filename and metadata tracker". The contents of annexed files are not stored in git; only the names of the files and some other metadata remain there.

The contents of the files are kept by git-annex in a distributed key/value store consisting of every clone of a given git repository. That's a fancy way to say that git-annex stores the actual file content somewhere under .git/annex/. (See internals for details.)

That was the values; what about the keys? Well, a key is calculated for a given file when it's first added into git-annex. Normally this uses a hash of its contents, but various backends can produce different sorts of keys. The file that gets checked into git is just a symlink to the key under .git/annex/. If the content of a file is modified, that produces a different key (and the symlink is changed).
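
As a rough illustration of the paragraph above, the following Python sketch derives a content-based key. The key shape (backend name, size, hash, optional extension) is modelled on the example key quoted in the comments further down this page. The function name `sha256e_style_key` and the simplified extension handling are assumptions made for illustration, not git-annex's actual code.

```python
import hashlib
import os


def sha256e_style_key(content: bytes, filename: str = "") -> str:
    """Build a key shaped like SHA256E-s<size>--<sha256 hex><extension>.

    Hypothetical helper: real backends have their own rules (for example
    about which extensions are kept), which this sketch ignores.
    """
    digest = hashlib.sha256(content).hexdigest()
    extension = os.path.splitext(filename)[1]  # "" when there is no extension
    return f"SHA256E-s{len(content)}--{digest}{extension}"


if __name__ == "__main__":
    print(sha256e_style_key(b""))
    print(sha256e_style_key(b"version 1", "notes.txt"))
    print(sha256e_style_key(b"version 2", "notes.txt"))
```

Because the key is derived from the content, the two "notes.txt" versions get different keys, which is why the symlink has to change when a file is modified.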

A file's content can be transferred from one repository to another by git-annex. Which repositories contain a given value is tracked by git-annex (see location tracking). It stores this tracking information in a separate branch, named "git-annex". All you ever do with the "git-annex" branch is push/pull it around between repositories, to sync up git-annex's view of the world.
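
To make the idea of location tracking more concrete, here is a toy model in Python. It is only a sketch: the data structure, the function names (`record`, `merge`, `whereis`) and the "newest record wins" merge rule are assumptions chosen for illustration, and the real on-disk format is described on the location tracking and internals pages. Merging two views stands in for pushing and pulling the "git-annex" branch.

```python
from typing import Dict, List, Tuple

# key -> {repository name: (timestamp, content present?)}
LocationLog = Dict[str, Dict[str, Tuple[float, bool]]]


def record(log: LocationLog, key: str, repo: str, present: bool, timestamp: float) -> None:
    """Note that `repo` gained or lost the value for `key` at `timestamp`."""
    log.setdefault(key, {})[repo] = (timestamp, present)


def merge(ours: LocationLog, theirs: LocationLog) -> LocationLog:
    """Combine two views; for each (key, repo) pair the newest record wins."""
    merged: LocationLog = {key: dict(repos) for key, repos in ours.items()}
    for key, repos in theirs.items():
        target = merged.setdefault(key, {})
        for repo, entry in repos.items():
            if repo not in target or entry[0] > target[repo][0]:
                target[repo] = entry
    return merged


def whereis(log: LocationLog, key: str) -> List[str]:
    """List the repositories believed to hold the value for `key`."""
    return sorted(repo for repo, (_, present) in log.get(key, {}).items() if present)
```

In this model, each clone keeps its own view, and exchanging views is enough for every clone to learn where a value can be fetched from.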

That's really all there is to it. Oh, there are special remotes that let values be stored in places other than git repositories (anything from Amazon S3 to a USB key), and there's a pile of commands listed in the man page to handle moving the values around and managing them. But if you grok the description above, you can see through all that. It's really just symlinks, keys, values, and a git-annex branch to store additional metadata.
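
The "symlinks, keys, values" part can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how git-annex is implemented: `toy_add` and `toy_drop` are hypothetical helpers, and the flat layout directly under .git/annex/ is a simplification of the real directory structure (see internals).

```python
import hashlib
import os
import tempfile
from pathlib import Path


def toy_add(repo: Path, name: str, content: bytes) -> str:
    """Store `content` under a hash-based key and leave a symlink at `name`."""
    key = f"SHA256-s{len(content)}--{hashlib.sha256(content).hexdigest()}"
    value_path = repo / ".git" / "annex" / key  # flat toy layout, not the real one
    value_path.parent.mkdir(parents=True, exist_ok=True)
    value_path.write_bytes(content)
    link = repo / name
    # A relative symlink, so the link stays valid if the repository is moved.
    link.symlink_to(os.path.relpath(value_path, link.parent))
    return key


def toy_drop(repo: Path, key: str) -> None:
    """Remove the value from this repository; the symlink is left dangling."""
    (repo / ".git" / "annex" / key).unlink()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        key = toy_add(repo, "photo.jpg", b"pretend this is a large file")
        print(os.readlink(repo / "photo.jpg"))  # relative path to the key
        print((repo / "photo.jpg").exists())    # True: the value is present
        toy_drop(repo, key)
        print((repo / "photo.jpg").exists())    # False: broken symlink
```

The symlink is what would be checked into git. Whether it resolves depends only on whether the value for its key is present in this particular repository.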

Next: install or walkthrough

minor suggestion

The contents of large files are not stored in git, only the names of the files and some other metadata remain there.

Would this read better to the newbie as:

The contents of 'annexed' files are not stored in git, only the names of the files and some other metadata remain there.

First time for me, the note about large files made me think that maybe annex operated on files above a certain size.

Comment by Nigel — Sat Aug 10 14:31:31 2013

clarification about what is moved / stored and where

Just to support Nigel's comment; it's good to be precise and clear about what happens to the files from the start. I've sent a similar suggestion to the mailing list.

Comment by Matthew — Mon Feb 10 12:53:44 2014

What do you mean by "git-annex" branch?

Branches usually mean different versions of your repo, not entirely different content! Why are branches being used for this?

Comment by G.nius.ck — Sat Jan 9 16:04:29 2016

comment 4

A git branch can be used to name any tree ref. In this case we're using "git-annex" as the name of a branch that is not connected to the rest of the content in the repository.

Comment by joey — Mon Jan 11 16:21:00 2016

So where are the files stored?

This is like a 1000ft overview, but doesn't actually say where the files are actually stored or how they're synchronized.

Does one need to set up a samba, sftp, or AWS bucket to contain the large files? Does a clone of the repo pull down all of the large files, or just the files in the working directory that's checked out? Are files transferred via direct connection to other repos (ex the same SSH tunnel that git uses, http, etc) or is there a UDP p2p layer like syncthing or bittorrent that might struggle with certain NAT situations?

The sentence "A file's content can be transferred from one repository to another by git-annex. Which repositories contain a given value is tracked by git-annex (see location tracking)." makes it sound like the old versions of the large files only exist on computers that checked out those copies. Does this mean old versions of a file might be lost forever if a single clone is deleted and temporarily unavailable if clones that contain those revisions of the file are offline?

Is there a way to ensure that a clone has all copies of all of the files (for example, when using git with a central trusted server)?

Comment by git-annex.branchable.com — Wed Aug 18 20:19:25 2021

comment 6

This is like a 1000ft overview, but doesn't actually say where the files are actually stored or how they're synchronized.

It does: "[...] That's a fancy way to say that git-annex stores the actual file content somewhere under .git/annex/. (See internals for details.)".
When using the SHA256E backend (the default), a file's content will end up under, for example, .git/annex/f87/4d5/SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.

Does one need to set up a samba, sftp, or AWS bucket to contain the large files?

No.

Does a clone of the repo pull down all of the large files, or just the files in the working directory that's checked out?

No and no. You decide which large files are stored where, either manually (via git-annex-get/git-annex-drop/git-annex-copy/git-annex-move) or automatically (via git annex sync --content/git-annex-wanted/git-annex-preferred-content). If a file's content is not present, you will just see a broken symlink.

Are files transferred via direct connection to other repos (ex the same SSH tunnel that git uses, http, etc) or is there a UDP p2p layer like syncthing or bittorrent that might struggle with certain NAT situations?

Yes and no. git-annex is very flexible; it can also communicate via Tor.

The sentence "A file's content can be transferred from one repository to another by git-annex. Which repositories contain a given value is tracked by git-annex (see location tracking)." makes it sound like the old versions of the large files only exist on computers that checked out those copies. Does this mean old versions of a file might be lost forever if a single clone is deleted and temporarily unavailable if clones that contain those revisions of the file are offline?

Yes and yes. If you don't copy the file elsewhere (with the commands mentioned above) before deleting the repo, that version is lost.

Is there a way to ensure that a clone has all copies of all of the files (for example, when using git with a central trusted server)?

Yes: git annex get --all, or git annex wanted server anything.

I strongly suggest you create a throwaway repository and try things out.

Comment by Lukey — Wed Aug 18 21:07:11 2021
