internals

In the world of git, we're not scared of internal implementation details, and sometimes we like to dive in and tweak things by hand. Here's some documentation to that end.

The .git/ directory

`.git/annex/objects/aa/bb/*/*`

This is where locally available file contents are actually stored. Files added to the annex get a symlink or pointer file checked into git, that points to the file content.

First there are two levels of directories used for hashing, to prevent too many things ending up in any one directory. See hashing for details.

Each subdirectory has the name of a key in one of the key-value backends. The file inside also has the name of the key. This two-level structure is used because it allows the write bit to be removed from the subdirectories as well as from the files. That prevents accidentally deleting or changing the file contents. See lockdown for details.
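As a rough illustration of this layout, the following Python sketch lists the keys whose content is present locally by walking the objects directory. The function name `list_local_keys` is made up for this example, and it relies only on the directory shape described above; it is not how git-annex itself enumerates keys.

```python
from pathlib import Path


def list_local_keys(objects_dir):
    """Yield (key, path) for each object file found under objects_dir.

    Assumes the layout objects_dir/<hash1>/<hash2>/<key>/<key>.
    """
    for path in sorted(Path(objects_dir).glob("*/*/*/*")):
        # The key directory and the file inside it share the key's name.
        if path.is_file() and path.parent.name == path.name:
            yield path.name, path
```

For example, `list(list_local_keys(".git/annex/objects"))` run from the top of a repository would return the locally present keys, assuming the default location of the annex.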

`.git/annex/tmp/`

This directory contains partially transferred objects.

`.git/annex/othertmp/`

This is a temp directory for miscellaneous other temp files.

`.git/annex/bad/`

git-annex fsck puts any bad objects it finds in here.

`.git/annex/transfers/`

Contains information files for uploads and downloads that are in progress, as well as any that have failed. Used especially by the assistant. It is safe to delete these files.

`.git/annex/ssh/`

ssh connection caching files are written in here. It is safe to delete these files.

`.git/annex/index`

This is a git index file which git-annex uses to stage files when preparing commits to the git-annex branch.

It's pretty safe to delete this file if git-annex is not currently running. It will be re-created as necessary.

`.git/annex/journal/`

git-annex uses this to journal changes to the git-annex branch, before committing a set of changes.

The git-annex branch

This branch is managed by git-annex, with the contents listed below.

This branch is not connected to your master or other branches. It is used for internal tracking of information about git-annex repositories and annexed objects.

The files stored in this branch are all designed to be auto-merged by simply concatenating them together. So each line has a timestamp, to allow the most recent information to be identified.

`uuid.log`

Records the UUIDs of known repositories, and associates them with a description of the repository. This allows git-annex to display something more useful than a UUID when it refers to a repository that does not have a configured git remote pointing at it.

The file format is simply one line per repository, with the uuid followed by a space and then the description, followed by a timestamp. Example:

e605dca6-446a-11e0-8b2a-002170d25c55 laptop timestamp=1317929189.157237s
26339d22-446b-11e0-9101-002170d25c55 usb disk timestamp=1317929330.769997s
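A minimal Python sketch of reading this format follows. The function name `parse_uuid_log` is invented for the example. The sketch assumes that when a UUID appears on several lines, the line with the newest timestamp is the one that matters.

```python
def parse_uuid_log(text):
    """Map each UUID to its most recently recorded description."""
    newest = {}  # uuid -> (timestamp, description)
    for line in text.splitlines():
        if not line.strip():
            continue
        uuid, _, rest = line.partition(" ")
        # Descriptions may contain spaces, so split on the trailing field.
        desc, sep, stamp = rest.rpartition(" timestamp=")
        if not sep:
            continue  # no timestamp field; skipped in this sketch
        ts = float(stamp.rstrip("s"))
        if uuid not in newest or ts > newest[uuid][0]:
            newest[uuid] = (ts, desc)
    return {uuid: desc for uuid, (ts, desc) in newest.items()}
```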

`numcopies.log`

Records the global numcopies setting.

The file format is simply a timestamp followed by a number.

`mincopies.log`

Records the global mincopies setting.

The file format is simply a timestamp followed by a number.

`config.log`

Records global configuration settings, which can be overridden by values in .git/config.

The file format is a timestamp, followed by the name of the configuration, followed by the value. For example:

1317929189.157237s annex.autocommit false

`remote.log`

Holds persistent configuration settings for special remotes such as Amazon S3.

The file format is one line per remote, starting with the uuid of the remote, followed by a space, and then a series of var=value pairs, each separated by whitespace, and finally a timestamp.

Special remotes that are autoenabled have autoenable=true here.

Encrypted special remotes store their encryption key here, in the "cipher" value. It is base64 encoded, and unless shared encryption is used, is encrypted to one or more gpg keys. The first 256 bytes of the cipher are used as the HMAC SHA1 encryption key, to encrypt filenames stored on the special remote. The remainder of the cipher is used as a gpg symmetric encryption key, to encrypt the content of files stored on the special remote.
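The var=value layout described above can be read with a few lines of Python. This is only a sketch: the function name `parse_remote_log_line` is made up, and it assumes values contain no whitespace, which this page does not state.

```python
def parse_remote_log_line(line):
    """Split one remote.log line into (uuid, settings)."""
    tokens = line.split()
    if not tokens:
        raise ValueError("empty line")
    uuid, fields = tokens[0], tokens[1:]
    settings = {}
    for field in fields:
        # Split on the first "=" only, so base64 padding in a value survives.
        name, sep, value = field.partition("=")
        if sep:
            settings[name] = value
    return uuid, settings
```

With this sketch the trailing timestamp ends up in `settings["timestamp"]`, and an autoenabled remote has `settings["autoenable"] == "true"`.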

`trust.log`

Records the trust information for repositories. Does not exist unless trust values are configured.

The file format is one line per repository, with the uuid followed by a space, then one of 1 (trusted), 0 (untrusted), ? (semi-trusted), or X (dead), and finally a timestamp.

Example:

e605dca6-446a-11e0-8b2a-002170d25c55 1 timestamp=1317929189.157237s
26339d22-446b-11e0-9101-002170d25c55 ? timestamp=1317929330.769997s

Repositories not listed are semi-trusted.
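To make the symbols and the default concrete, here is a small Python sketch. The names `TRUST_LEVELS`, `parse_trust_log` and `trust_level` are hypothetical, and picking the newest line per UUID is an assumption based on the timestamp rule for this branch.

```python
TRUST_LEVELS = {"1": "trusted", "0": "untrusted", "?": "semi-trusted", "X": "dead"}


def parse_trust_log(text):
    """Map each listed UUID to the trust level on its newest line."""
    newest = {}  # uuid -> (timestamp, level)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3 or not fields[2].startswith("timestamp="):
            continue
        uuid, symbol, stamp = fields
        if symbol not in TRUST_LEVELS:
            continue  # unknown symbol; ignored in this sketch
        ts = float(stamp[len("timestamp="):].rstrip("s"))
        if uuid not in newest or ts > newest[uuid][0]:
            newest[uuid] = (ts, TRUST_LEVELS[symbol])
    return {uuid: level for uuid, (ts, level) in newest.items()}


def trust_level(levels, uuid):
    """Repositories that are not listed count as semi-trusted."""
    return levels.get(uuid, "semi-trusted")
```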

`group.log`

Used to group repositories together.

The file format is one line per repository, with the uuid followed by a space, and then a space-separated list of groups this repository is part of, and finally a timestamp.

`preferred-content.log`

Used to indicate which repositories prefer to contain which file contents.

The file format is one line per repository, with the uuid followed by a space, then a boolean expression, and finally a timestamp.

Files matching the expression are preferred to be retained in the repository, while files not matching it are preferred to be stored somewhere else.

`required-content.log`

Used to indicate which repositories are required to contain which file contents.

File format is identical to preferred-content.log.

`group-preferred-content.log`

Contains standard preferred content settings for groups. (Overriding or supplementing the ones built into git-annex.)

The file format is one line per group, starting with a timestamp, then a space, then the group name followed by a space and then the preferred content expression.

`maxsize.log`

Records the maximum combined size of annexed files that can be stored in a repository.

The file format is a timestamp, followed by the UUID of a repository, followed by the size in bytes. For example:

1317929189.157237s e605dca6-446a-11e0-8b2a-002170d25c55 100000000000

`export.log`

Tracks what trees have been exported to special remotes by git-annex-export(1).

Each line starts with a timestamp, then the uuid of the repository that exported to the special remote, followed by a colon (:) and the uuid of the special remote. Then, separated by spaces, the SHA of the tree that was exported, and optionally any number of subsequent SHAs, of trees that have started to be exported but whose export is not yet complete.

In order to record the beginning of the first export, where nothing has been exported yet, the SHA of the exported tree can be the empty tree (eg 4b825dc642cb6eb9a060e54bf8d69288fbee4904).

For example:

1317929100.012345s e605dca6-446a-11e0-8b2a-002170d25c55:26339d22-446b-11e0-9101-002170d25c55 4b825dc642cb6eb9a060e54bf8d69288fbee4904 bb08b1abd207aeecccbc7060e523b011d80cb35b
1317929100.012345s e605dca6-446a-11e0-8b2a-002170d25c55:26339d22-446b-11e0-9101-002170d25c55 bb08b1abd207aeecccbc7060e523b011d80cb35b 
1317929189.157237s e605dca6-446a-11e0-8b2a-002170d25c55:26339d22-446b-11e0-9101-002170d25c55 bb08b1abd207aeecccbc7060e523b011d80cb35b 7c7af825782b7c8706039b855c72709993542be4
1317923000.251111s e605dca6-446a-11e0-8b2a-002170d25c55:26339d22-446b-11e0-9101-002170d25c55 7c7af825782b7c8706039b855c72709993542be4

(The trees are also grafted into the git-annex branch, at export.tree, to prevent git from garbage collecting them. However, the head of the git-annex branch should never contain such a grafted-in tree; the grafted tree is removed in the same commit that updates export.log.)
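Going by the export.log line format described above, a single line can be taken apart as in this Python sketch. The function name and the dictionary keys are invented for the example.

```python
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"


def parse_export_log_line(line):
    """Split one export.log line into its documented parts."""
    stamp, participants, exported, *incomplete = line.split()
    exporter, _, remote = participants.partition(":")
    return {
        "timestamp": float(stamp.rstrip("s")),
        "exporter": exporter,            # repository that ran the export
        "remote": remote,                # special remote exported to
        "exported_tree": exported,       # last tree fully exported
        "incomplete_trees": incomplete,  # exports started but not finished
    }
```

A line whose `exported_tree` equals `EMPTY_TREE` and whose `incomplete_trees` is non-empty then records the start of a first export.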

`aaa/bbb/*.log`

These log files record location tracking information for file contents. These are placed in two levels of subdirectories for hashing. See hashing for details.

The name of the key is the filename, and the content consists of a timestamp, either 1 (present) or 0 (not present) or X (dead), and the UUID of the repository that has or lacks the file content.

Example:

1287290776.765152s 1 e605dca6-446a-11e0-8b2a-002170d25c55
1287290767.478634s 0 26339d22-446b-11e0-9101-002170d25c55
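The following Python sketch shows how such a log could be reduced to the set of repositories currently believed to hold the content, including after two versions of the log have been merged by concatenation. The function name `current_locations` is hypothetical. The sketch assumes the newest line for each UUID wins, and says nothing about how git-annex breaks ties.

```python
def current_locations(log_text):
    """Return the UUIDs whose newest line marks the content as present."""
    newest = {}  # uuid -> (timestamp, status)
    for line in log_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        stamp, status, uuid = fields
        ts = float(stamp.rstrip("s"))
        if uuid not in newest or ts > newest[uuid][0]:
            newest[uuid] = (ts, status)
    # "0" (not present) and "X" (dead) both mean no usable copy there.
    return {uuid for uuid, (ts, status) in newest.items() if status == "1"}
```

Because only the newest line per UUID is consulted, `current_locations(ours + theirs)` gives the same answer whichever order the two logs are concatenated in.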

`aaa/bbb/*.log.web`

These log files record urls used by the web special remote and sometimes by other remotes. Their format is similar to the location tracking files, but with urls rather than UUIDs.

`aaa/bbb/*.log.ek`

These log files record other keys that are equivalent to the key used in the filename. This is currently used for the VURL backend. Their format is similar to the location tracking files, but with keys rather than UUIDs.

`aaa/bbb/*.log.rmt`

These log files are used by remotes that need to record their own state about keys. Each remote can store one line of data about a key, in its own format.

Note that only the most recently set state about a key is seen by remotes using this. The `.log.rmet` files documented below do not have this limitation.

Example:

1287290776.765152s e605dca6-446a-11e0-8b2a-002170d25c55 blah blah
1287290767.478634s 26339d22-446b-11e0-9101-002170d25c55 foo=bar

`aaa/bbb/*.log.met`

These log files are used to store arbitrary metadata about keys. Each key can have any number of metadata fields. Each field has a set of values.

Lines are timestamped, and record when values are added (field +value), but also when values are removed (field -value). Removed values are retained in the log so that when merging an old line that sets a value that was later unset, the value is not accidentally added back.

For example:

1287290776.765152s tag +foo +bar author +joey
1291237510.141453s tag -bar +baz

The value can be completely arbitrary data, although it's typically reasonably short. If the value contains any whitespace (including \r or \n), it will be base64 encoded. Base64 encoded values are indicated by prefixing them with "!".
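The add and remove semantics can be sketched in Python by replaying the lines in timestamp order. The names `decode_value` and `current_metadata` are made up for this example. It assumes that field names never start with "+" or "-", and that the "!" marker comes directly after the "+" or "-" sign; neither detail is spelled out above.

```python
import base64


def decode_value(token):
    """Undo the base64 encoding of values marked with a leading "!"."""
    if token.startswith("!"):
        return base64.b64decode(token[1:]).decode("utf-8", errors="replace")
    return token


def current_metadata(log_text):
    """Replay a metadata log and return {field: set of current values}."""
    entries = []
    for line in log_text.splitlines():
        tokens = line.split()
        if len(tokens) >= 2:
            entries.append((float(tokens[0].rstrip("s")), tokens[1:]))
    metadata = {}
    # Apply older lines first so that newer adds and removes win.
    for _, tokens in sorted(entries, key=lambda entry: entry[0]):
        field = None
        for token in tokens:
            if token[0] not in "+-":
                field = token  # a field name; following tokens are its values
            elif field is not None:
                values = metadata.setdefault(field, set())
                if token[0] == "+":
                    values.add(decode_value(token[1:]))
                else:
                    values.discard(decode_value(token[1:]))
    return {field: values for field, values in metadata.items() if values}
```

Applied to the two example lines above, this yields the values foo and baz for tag, and joey for author.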

`aaa/bbb/*.log.rmet`

These log files store per-remote metadata about keys. This metadata is only used by the remote.

Format is the same as the metadata log files above, but each metadata key is prefixed with "uuid:" to indicate the remote it belongs to.

For example:

1287290776.765152s e605dca6-446a-11e0-8b2a-002170d25c55:foo +bar
1287290776.765152s 26339d22-446b-11e0-9101-002170d25c55:x +1
1291237510.141453s 26339d22-446b-11e0-9101-002170d25c55:x -1 26339d22-446b-11e0-9101-002170d25c55:x +2

`aaa/bbb/*.log.cid`

These log files store per-remote content identifiers for keys. A given key may have any number of content identifiers.

The format is a timestamp, followed by the UUID of the remote, followed by the content identifiers which are separated by colons. If a content identifier contains a colon or \r or \n, it will be base64 encoded. Base64 encoded values are indicated by prefixing them with "!".

1287290776.765152s e605dca6-446a-11e0-8b2a-002170d25c55 5248916:5250378
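A line in this format could be decoded along these lines in Python. The function name `parse_cid_log_line` is hypothetical, and returning the identifiers as bytes is a choice made for this sketch, since the page does not say what encoding they use.

```python
import base64


def parse_cid_log_line(line):
    """Return (timestamp, uuid, [content identifiers as bytes])."""
    stamp, uuid, cids = line.rstrip("\n").split(" ", 2)
    decoded = []
    for cid in cids.split(":"):
        if cid.startswith("!"):
            # "!" marks an identifier that was base64 encoded.
            decoded.append(base64.b64decode(cid[1:]))
        else:
            decoded.append(cid.encode())
    return float(stamp.rstrip("s")), uuid, decoded
```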

`aaa/bbb/*.log.cnk`

These log files are used when objects are stored in chunked form on remotes. They record the size(s) of the chunks, and the number of chunks.

For example, this logs that a remote has an object stored using both 9 chunks of 1 mb size, and 1 chunk of 10 mb size.

1287290776.765152s e605dca6-446a-11e0-8b2a-002170d25c55:10240 9
1287290776.765153s e605dca6-446a-11e0-8b2a-002170d25c55:102400 1

(When those chunks are removed from the remote, the 9 is changed to 0.)
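As a sketch of how these lines might be interpreted, the Python below collects, for each remote and chunk size, the chunk count from the newest line, and leaves out entries whose count has dropped to 0. The function name `parse_chunk_log` is invented, and the chunk size field is treated as an opaque integer.

```python
def parse_chunk_log(text):
    """Return {(uuid, chunk_size): chunk_count} for chunk sets still stored."""
    newest = {}  # (uuid, chunk_size) -> (timestamp, chunk_count)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        stamp, target, count = fields
        uuid, sep, size = target.rpartition(":")
        if not sep:
            continue
        ts = float(stamp.rstrip("s"))
        key = (uuid, int(size))
        if key not in newest or ts > newest[key][0]:
            newest[key] = (ts, int(count))
    # A count of 0 records that the chunks were removed from the remote.
    return {key: count for key, (ts, count) in newest.items() if count > 0}
```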

`proxy.log`

Used to record what repositories are accessible via a proxy.

Each line starts with a timestamp, then the UUID of the repository that can serve as a proxy, and then a list of the remotes that it can proxy to, separated by spaces.

Each remote in the list consists of a repository's UUID, followed by a colon (:) and then a remote name.

For example:

1317929100.012345s e605dca6-446a-11e0-8b2a-002170d25c55 26339d22-446b-11e0-9101-002170d25c55:foo c076460c-2290-11ef-be53-b7f0d194c863:bar

`cluster.log`

Used to record the UUIDs of clusters, and the UUIDs of the nodes comprising each cluster.

Each line starts with a timestamp, then the UUID of the cluster, followed by a list of the UUIDs of its nodes, separated by spaces.

For example:

1317929100.012345s 5b070cc8-29b8-11ef-80e1-0fd524be241b 5c0c97d2-29b8-11ef-b1d2-5f3d1c80940d 5c40375e-29b8-11ef-814d-872959d2c013

`schedule.log`

Used to record scheduled events, such as periodic fscks.

The file format is simply one line per repository, with the uuid followed by a space and then its schedule, followed by a timestamp.

There can be multiple events in the schedule, separated by "; ".

The format of the scheduled events is the same as described in git-annex-schedule.

Example:

42bf2035-0636-461d-a367-49e9dfd361dd fsck self 30m every day at any time; fsck 4b3ebc86-0faf-4892-83c5-ce00cbe30f0a 1h every year at any time timestamp=1385646997.053162s
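The line above can be split into its events with a short Python sketch. The function name `parse_schedule_log_line` is made up, and the events are left as plain strings because their syntax is documented in git-annex-schedule rather than here.

```python
def parse_schedule_log_line(line):
    """Return (uuid, [scheduled events], timestamp) for one schedule.log line."""
    uuid, _, rest = line.strip().partition(" ")
    schedule, sep, stamp = rest.rpartition(" timestamp=")
    if not sep:
        raise ValueError("no timestamp field found")
    # Events are separated by "; " within the schedule.
    events = [event for event in schedule.split("; ") if event]
    return uuid, events, float(stamp.rstrip("s"))
```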

`activity.log`

Used to record the times of activities, such as fscks.

Example:

42bf2035-0636-461d-a367-49e9dfd361dd Fsck timestamp=1422387398.30395s

`transitions.log`

Used to record transitions, e.g. by git annex forget.

Each line of the file is a transition, followed by a timestamp.

Example:

ForgetGitHistory 1387325539.685136s
ForgetDeadRemotes 1387325539.685136s

`difference.log`

Used when a repository has fundamental differences from other repositories, that should prevent merging.

Example:

e605dca6-446a-11e0-8b2a-002170d25c55 [ObjectHashLower] timestamp=1422387398.30395s

`multicast.log`

Records uftp public key fingerprints, for use by git-annex-multicast.

`migrate.tree/old` and `migrate.tree/new`

These are used to record migrations done by git-annex migrate. By diffing between the two, the old and new keys can be determined. This lets migrations be recorded while using a minimum of space in the git repository. The filenames in these trees have no connection to the names of actual annexed files.

These trees are recorded in history of the git-annex branch, but the head of the git-annex branch will never contain them.

Other internals documentation

git-remote-annex documents how git repositories are stored on special remotes when using git with "annex::" urls.

tmp missing

There's no information about .git/annex/tmp here.

Comment by stoile — Sat Nov 16 12:16:48 2013

comment 2

.git/annex/tmp is not very interesting. It's a temporary file directory. When transferring a key's content, git-annex uses a stable filename, which allows resuming interrupted downloads, or cleaning up aborted downloads with git annex unused.

Comment by joeyh.name — Sat Nov 16 17:23:02 2013

comment 3

Some documentation that would be nice having added in the appropriate places:

Exactly how is the index file being used? Is this just a copy of the git index file that we would get when checking out the git-annex branch? Why is the file kept around? Which sort of operations are done on the index, e.g. when doing git annex add?
For all time-stamped data-structures: Exactly which significance does the time-stamp have? E.g., for uuid.log, is this the date the name was changed?

Comment by zardoz — Mon Sep 15 10:34:36 2014

.git/annex/tmp third-party use?

Can .git/annex/tmp be used by third-party software to import stuff into git-annex? The idea here would be to accept uploads from a web form in .git/annex/tmp, then move them into place in the proper location once the upload is complete (then do a git-annex-add or let the assistant import it). --anarcat

Comment by anarcat [id.koumbit.net] — Tue Jun 9 20:21:39 2015

comment 1

It's ok to put files in .git/annex/tmp if they're formatted as git-annex key filenames. Of course, you should avoid overwriting the content of files already there.

Files not formatted as keys should be kept out of .git/annex/tmp; it's ok to put them in .git/annex/misctmp.

Comment by joey — Tue Jun 9 20:25:03 2015

.git/annex/misctmp very large

Why is .git/annex/misctmp so large? Currently I use git annex to manage pytorch models. Basically I have a large number (1500 folders) of 4 kilobyte files, and some files are bigger. misctmp occupies 6.2 GB. Is that ok?

PS. Sorry if I write this here, but I failed to post to the forum.

Comment by arseny-n — Wed Jan 17 13:04:52 2018

comment 7

@arseny-n the misctmp directory does not normally contain anything, or only temp files in use by the currently running git-annex process for a short amount of time. The only way I know of that files can pile up in it is when you kill the git-annex process while it's using such a file.

It's always safe to delete the files in misctmp as long as git-annex is not running. Also, the names of the files should give a pretty good clue about what git-annex was using the file for. For example "jlog" files are used for staging the journal.

Comment by joey — Thu Feb 22 16:56:13 2018

comment 8

"Each subdirectory has the name of a key in one of the key-value backends. The file inside also has the name of the key." -- is it necessary for the file inside to also have the name of the key? Repeating the already long key name leads to very long symlink targets. Could the file inside just be 'f.txt' (or whatever the extension is)?

Also, the terms "key" and "name of the key" are used in various places; are these the same thing?

Comment by Ilya_Shlyakhter — Wed Sep 19 16:07:44 2018

representing unlocked state of files

Is the locked/unlocked state of a file represented somewhere? Or does git-annex just assume that any file whose contents start with /annex/objects/ is a pointer file?

Comment by Ilya_Shlyakhter — Thu Sep 19 18:02:31 2019

duplicate objects?

Do I understand correctly that in .git/annex/objects dir there should be no duplicates? Here follows a run of 'rdfind' done in the objects dir:

$ rdfind .
Now scanning ".", found 12874 files.
Now have 12874 files in total.
Removed 0 files due to nonunique device and inode.
Total size is 75579281486 bytes or 70 GiB
Removed 8376 files due to unique sizes from list.4498 files left.
Now eliminating candidates based on first bytes:removed 68 files from list.4430 files left.
Now eliminating candidates based on last bytes:removed 66 files from list.4364 files left.
Now eliminating candidates based on sha1 checksum:removed 0 files from list.4364 files left.
It seems like you have 4364 files that are not unique
Totally, 10 GiB can be reduced.
Now making results file results.txt

And here is an example pair of dupes (excerpt from the abovementioned 'results.txt'):

DUPTYPE_FIRST_OCCURRENCE 2073 3 86558 26 21057567 1 ./53/zv/SHA256E-s86558--e79a0891bb94fc9212ce2f28178fe84591c5fb24c07b5239d367099118e12ede.jpg/SHA256E-s86558--e79a0891bb94fc9212ce2f28178fe84591c5fb24c07b5239d367099118e12ede.jpg
DUPTYPE_WITHIN_SAME_TREE -2073 3 86558 26 1080608 1 ./7w/w2/SHA256E-s86558--e79a0891bb94fc9212ce2f28178fe84591c5fb24c07b5239d367099118e12ede.56.jpeg/SHA256E-s86558--e79a0891bb94fc9212ce2f28178fe84591c5fb24c07b5239d367099118e12ede.56.jpeg

Any clues?

Thank you

Comment by atrent — Sat Nov 30 14:04:17 2019

same contents with different keys

@atrent -- some backends (like SHA256E) base the key not just on object contents, but also on part of its filename (the extension). So the same content can exist with two different keys. In your example, the same content exists in one file ending with .jpg and in another ending with .56.jpeg. (This is done to give the annexed contents the same extension as the original file had before annexing, to avoid confusing some programs.) There are also backends like WORM and URL, not based on checksums, that could lead to different keys with the same contents. The same contents could also be added under different backends (see also git-annex-migrate). Finally, there is the theoretical possibility of hash collisions.

Comment by Ilya_Shlyakhter — Sat Nov 30 16:51:58 2019

no collisions

I can confirm that these are not collisions: these identical files are the same photos with different names, shame on Dropbox syncing from my smartphone. I was actually hoping to dedupe through git-annex ;-)

Some more questions/suggestions/conversation-starters:

I suppose I can dedup them with rdfind (i.e., hardlinking identical files), do you foresee any side effects?
may I change the hash function of git-annex to something not depending on filenames? (I suppose so, I'll have a look at the docs)
if I can change the hash function can I regenerate the whole annex without re-creating it? (again I'll have a look at docs)

Thanks

Comment by atrent — Sat Nov 30 20:37:00 2019

comment 13

git-annex-migrate to a backend not ending in E (e.g. SHA256 not SHA256E), then git-annex-unused to drop the old keys.

Comment by Ilya_Shlyakhter — Sat Nov 30 21:11:53 2019

hardlinking identical files in annex may break invariants

P.S. Re: hardlinking identical files -- git-annex keeps track of inodes where contents is stored, so deleting a file might make that info stale. Also, dropping one key will drop another key's contents without updating location tracking info. And dropping then getting files would lead to two separate copies again. So I wouldn't recommend that.

Comment by Ilya_Shlyakhter — Sat Nov 30 21:36:38 2019

migrating...

I'm git-annex-migrating (to SHA256) now, thank you for all suggestions!

Comment by atrent — Sat Nov 30 22:30:06 2019

Re: representing unlocked state of files

Starting with "/annex/objects" is not enough, the remainder of the first line of the file has to parse as a git-annex key for it to be a pointer file.

Comment by joey — Thu Feb 20 20:23:41 2020

why othertmp to be on the same file system?

Joey, you said

.git/annex/othertmp has to be on the same filesystem as the work tree and git repository.

Is that because atomic/quick renames are important for adjusted branch unlocked mode? Asking because in datalad we have that "reckless" mode where we symlink an entire .git/annex from the original clone, e.g. to have a quick throwaway clone to access the data, possibly even without modifying the state of any annexed key.

Or how else are we stepping on the shovel here?

Comment by yarikoptic — Tue Dec 13 14:15:28 2022

re: why othertmp to be on the same file system?

I've audited the code and the only place I could find where it did not work to have othertmp on a different filesystem is in the bittorrent special remote when it downloads a torrent file. But that also failed when .git/annex/tmp was on a different filesystem! (Since it was moving between the two directories.) I've fixed that.

It's still best to keep things on the same filesystem because cross-filesystem moves can be expensive and it sometimes falls back to less ideal behavior in other ways too when operating across filesystems. Also of course, you avoid being the one who gets to find and report breakage like the above.

Comment by joey — Tue Dec 20 18:39:35 2022
