internalsgit-annexhttp://git-annex.branchable.com/internals/git-annexikiwiki2022-12-20T19:18:17Ztmp missinghttp://git-annex.branchable.com/internals/comment_1_4b8ed353dca4f484b3b6eb463fa02fd8/stoile2013-11-27T22:47:37Z2013-11-16T12:16:48Z
There's no information about .git/annex/tmp here.
comment 2http://git-annex.branchable.com/internals/comment_2_c19232d5cc4976c2e5b014aef6e8d9ec/joeyh.name2013-11-27T22:47:37Z2013-11-16T17:23:02Z
.git/annex/tmp is not very interesting. It's a temporary file directory. When transferring a key's content, git-annex uses a stable filename, which allows resuming interrupted downloads, or cleaning up aborted downloads with <code>git annex unused</code>.
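<p>As an illustration only (this is not git-annex's code), resuming from a stable temporary filename can be sketched roughly as follows. The helper name <code>resume_transfer</code>, the in-memory <code>source</code> bytes, the chunk size, and the key-like filename are all assumptions made for the example.</p>

```python
import os
import tempfile


def resume_transfer(source: bytes, tmp_path: str, chunk_size: int = 4) -> None:
    """Append the missing tail of `source` to `tmp_path`.

    Hypothetical helper: a real transfer would read from a remote,
    not from an in-memory bytes object.
    """
    # Whatever is already in the temp file is treated as a valid prefix.
    done = os.path.getsize(tmp_path) if os.path.exists(tmp_path) else 0
    with open(tmp_path, "ab") as out:
        for start in range(done, len(source), chunk_size):
            out.write(source[start:start + chunk_size])


# The temp file is named after the key, so a retry finds the partial data.
workdir = tempfile.mkdtemp()
tmp_file = os.path.join(workdir, "SHA256-s11--example")  # made-up key name
payload = b"hello world"

with open(tmp_file, "wb") as f:
    f.write(payload[:5])  # simulate an interrupted transfer

resume_transfer(payload, tmp_file)  # second attempt continues at byte 5
with open(tmp_file, "rb") as f:
    result = f.read()
```

<p>Because the name depends only on the key, a later attempt can locate the partial file without any extra bookkeeping, which is presumably also what makes leftover files easy to identify afterwards.</p>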
comment 3http://git-annex.branchable.com/internals/comment_3_5a26ee5aab274f321a4ea6f8527f53bd/zardoz2014-09-15T10:34:36Z2014-09-15T10:34:36Z
<p>Some documentation that would be nice to have added in the appropriate places:</p>
<ul>
<li><p>Exactly how is the index file being used? Is this just a copy of the git index file that we would get when checking out the git-annex branch? Why is the file kept around? Which sort of operations are done on the index, e.g. when doing git annex add?</p></li>
<li><p>For all time-stamped data structures: what exactly is the significance of the timestamp? E.g., for uuid.log, is this the date the name was changed?</p></li>
</ul>
.git/annex/tmp third-party use?http://git-annex.branchable.com/internals/comment_4_81293b180fb09105ec158fdfef73d249/anarcat [id.koumbit.net]2015-06-09T20:21:39Z2015-06-09T20:21:39Z
Can <code>.git/annex/tmp</code> be used by third-party software to import stuff into git-annex? The idea here would be to accept uploads from a web form in <code>.git/annex/tmp</code>, then move each file into its proper location once the upload is complete (then do a git-annex-add or let the assistant import it). --<a href="http://git-annex.branchable.com/users/anarcat/">anarcat</a>
comment 1http://git-annex.branchable.com/internals/comment_5_354012b6a9ac11160eb926234d38051f/joey2015-06-09T21:54:08Z2015-06-09T20:25:03Z
<p>It's ok to put files in .git/annex/tmp if they're formatted as git-annex
key filenames. Of course, you should avoid overwriting the content of files
already there.</p>
<p>Files not formatted as keys should be kept out of .git/annex/tmp; it's ok
to put them in .git/annex/misctmp.</p>
.git/annex/misctmp very largehttp://git-annex.branchable.com/internals/comment_7_7e40f744f9ac7f0403df9d1a2162a516/arseny-n2018-01-17T13:04:52Z2018-01-17T13:04:52Z
<p>Why is .git/annex/misctmp so large? I currently use git-annex to manage PyTorch models:
I have a large number (1500 folders) of 4-kilobyte files, and some files are bigger.
misctmp occupies 6.2 GB. Is that OK?</p>
<p>PS. Sorry for writing this here; I failed to post to the forum.</p>
comment 7http://git-annex.branchable.com/internals/comment_7_9c82a2878f3feb1b2a95662ed25b234b/joey2018-02-22T16:59:55Z2018-02-22T16:56:13Z
<p>@arseny-n the misctmp directory does not normally contain anything, or
only temp files in use by the currently running git-annex process for a
short amount of time. The only way I know of that files can pile up
in it is when you kill the git-annex process while it's using such a file.</p>
<p>It's always safe to delete the files in misctmp as long as git-annex is
not running. Also, the names of the files should give a pretty good clue
about what git-annex was using the file for. For example, "jlog" files are
used for staging the journal.</p>
comment 8http://git-annex.branchable.com/internals/comment_8_9dccdd3a9556ceef54e318cd5c8a50ad/Ilya_Shlyakhter2018-09-19T16:07:44Z2018-09-19T16:07:44Z
<p>"Each subdirectory has the name of a key in one of the key-value backends. The file inside also has the name of the key." -- is it necessary for the file inside to also have the name of the key? Repeating the already long key name leads to very long symlink targets. Could the file inside just be 'f.txt' (or whatever the extension is)?</p>
<p>Also, the terms "key" and "name of the key" are used in various places; are these the same thing?</p>
representing unlocked state of fileshttp://git-annex.branchable.com/internals/comment_9_40442b012886ad698f448c262f0d7f4c/Ilya_Shlyakhter2020-06-17T01:18:32Z2019-09-19T18:02:31Z
Is the locked/unlocked state of a file represented somewhere? Or does git-annex just assume that any file whose contents start with /annex/objects/ is a pointer file?
duplicate objects?http://git-annex.branchable.com/internals/comment_10_c4298babd96b2596bd4f6ad828212c92/atrent2020-06-17T01:18:32Z2019-11-30T14:04:17Z
<p>Do I understand correctly that in .git/annex/objects dir there should be no duplicates?
Here follows a run of 'rdfind' done in the objects dir:</p>
<pre><code>$ rdfind .
Now scanning ".", found 12874 files.
Now have 12874 files in total.
Removed 0 files due to nonunique device and inode.
Total size is 75579281486 bytes or 70 GiB
Removed 8376 files due to unique sizes from list.4498 files left.
Now eliminating candidates based on first bytes:removed 68 files from list.4430 files left.
Now eliminating candidates based on last bytes:removed 66 files from list.4364 files left.
Now eliminating candidates based on sha1 checksum:removed 0 files from list.4364 files left.
It seems like you have 4364 files that are not unique
Totally, 10 GiB can be reduced.
Now making results file results.txt
</code></pre>
<p>And here is an example pair of dupes (excerpt from the above-mentioned 'results.txt'):</p>
<pre><code>DUPTYPE_FIRST_OCCURRENCE 2073 3 86558 26 21057567 1 ./53/zv/SHA256E-s86558--e79a0891bb94fc9212ce2f28178fe84591c5fb24c07b5239d367099118e12ede.jpg/SHA256E-s86558--e79a0891bb94fc9212ce2f28178fe84591c5fb24c07b5239d367099118e12ede.jpg
DUPTYPE_WITHIN_SAME_TREE -2073 3 86558 26 1080608 1 ./7w/w2/SHA256E-s86558--e79a0891bb94fc9212ce2f28178fe84591c5fb24c07b5239d367099118e12ede.56.jpeg/SHA256E-s86558--e79a0891bb94fc9212ce2f28178fe84591c5fb24c07b5239d367099118e12ede.56.jpeg
</code></pre>
<p>Any clues?</p>
<p>Thank you</p>
same contents with different keyshttp://git-annex.branchable.com/internals/comment_11_9758bb3a17f63b4dcf51742ea482dbe9/Ilya_Shlyakhter2020-06-17T01:18:32Z2019-11-30T16:51:58Z
@atrent -- some <a href="http://git-annex.branchable.com/backends/">backends</a> (like SHA256E) base the key not just on the object's contents, but also on part of its filename (the extension). So the same content can exist with two different keys. In your example, the same contents exist in one file ending with .jpg and in another ending with .56.jpeg. (This is done to give the annexed contents the same extension as the original file had before annexing, to avoid confusing some programs.) There are also backends like WORM and URL, not based on checksums, that could lead to different keys with the same contents. The same contents could also have been added under different backends (see also <a href="http://git-annex.branchable.com/git-annex-migrate/"><code>git-annex-migrate</code></a>). Finally, there is the theoretical possibility of hash collisions.
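<p>To make the point about extensions concrete, here is a minimal sketch that splits the two key names from the rdfind output into checksum and extension. The helper <code>split_sha256e_key</code> is hypothetical, and the parsing assumes the simple <code>BACKEND-sSIZE--DIGEST.EXT</code> shape seen in that output rather than the full key grammar.</p>

```python
from typing import Tuple


def split_sha256e_key(key: str) -> Tuple[str, str]:
    """Return (digest, extension) for a SHA256E-style key name.

    Hypothetical helper; assumes the name after "--" is the hex digest
    followed by "." and the preserved extension.
    """
    name = key.split("--", 1)[1]
    digest, _, extension = name.partition(".")
    return digest, extension


# The two key names reported as duplicates in the rdfind excerpt.
key_a = "SHA256E-s86558--e79a0891bb94fc9212ce2f28178fe84591c5fb24c07b5239d367099118e12ede.jpg"
key_b = "SHA256E-s86558--e79a0891bb94fc9212ce2f28178fe84591c5fb24c07b5239d367099118e12ede.56.jpeg"

digest_a, ext_a = split_sha256e_key(key_a)
digest_b, ext_b = split_sha256e_key(key_b)

# Same checksum, different extension: two distinct keys for one content.
same_content = digest_a == digest_b
distinct_keys = key_a != key_b
```

<p>Both keys carry the same digest and size, so the stored bytes are identical, yet the keys differ only in the extension part.</p>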
no collisionshttp://git-annex.branchable.com/internals/comment_12_f0325cefa5cd53a5a897046606137cef/atrent2020-06-17T01:18:32Z2019-11-30T20:37:00Z
<p>I can confirm that these are not collisions: these identical files are the same photos with different names, shame on Dropbox syncing from my smartphone. I was actually hoping to dedupe through git-annex <img src="http://git-annex.branchable.com/smileys/smile4.png" alt=";-)" /></p>
<p>Some more questions/suggestions/conversation-starters:</p>
<ul>
<li><p>I suppose I can dedup them with rdfind (i.e., hardlinking identical files), do you foresee any side effects?</p></li>
<li><p>may I change the hash function of git-annex to something not depending on filenames? (I suppose so, I'll have a look at the docs)</p></li>
<li><p>if I can change the hash function can I regenerate the whole annex without re-creating it? (again I'll have a look at docs)</p></li>
</ul>
<p>Thanks</p>
comment 13http://git-annex.branchable.com/internals/comment_13_e45b6fa035a30703618448a0f764f935/Ilya_Shlyakhter2020-06-17T01:18:32Z2019-11-30T21:11:53Z
Use <a href="http://git-annex.branchable.com/git-annex-migrate/">git-annex-migrate</a> to move to a backend not ending in E (e.g. SHA256 rather than SHA256E), then <a href="http://git-annex.branchable.com/git-annex-unused/">git-annex-unused</a> to find the old keys so they can be dropped.
hardlinking identical files in annex may break invariantshttp://git-annex.branchable.com/internals/comment_14_3f62751c2dd041f4ead1c6580ea5eec1/Ilya_Shlyakhter2020-06-17T01:18:32Z2019-11-30T21:36:38Z
<p>P.S. Re: hardlinking identical files -- git-annex keeps track of inodes where contents is stored, so deleting a file might make that info stale. Also, dropping one key will drop another key's contents without updating <a href="http://git-annex.branchable.com/location_tracking/">location tracking</a> info. And dropping then getting files would lead to two separate copies again. So I wouldn't recommend that.</p>
<p>See also <a href="http://git-annex.branchable.com/tips/local_caching_of_annexed_files/">local caching of annexed files</a>.</p>
migrating...http://git-annex.branchable.com/internals/comment_15_c3d12d14e4d044f39829c5d92f523655/atrent2020-06-17T01:18:32Z2019-11-30T22:30:06Z
I'm git-annex-migrating (to SHA256) now, thank you for all suggestions!
Re: representing unlocked state of fileshttp://git-annex.branchable.com/internals/comment_16_2455c898d6c77a5437a2c1532144bb8a/joey2020-06-17T01:18:32Z2020-02-20T20:23:41Z
<p>Starting with "/annex/objects" is not enough; the remainder of
the first line of the file has to parse as a git-annex key
for it to be a pointer file.</p>
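<p>A rough sketch of that two-step check, for illustration only. The function name <code>looks_like_pointer</code> and the regular expression are assumptions: the pattern is a simplified stand-in for git-annex's real key parser, which is stricter and is not reproduced here.</p>

```python
import re

POINTER_PREFIX = "/annex/objects/"

# Simplified key pattern (an assumption, not git-annex's parser):
# a backend name, optional numeric fields such as -s<size>, then "--"
# followed by a non-empty name without whitespace.
KEY_RE = re.compile(r"[A-Z0-9_]+(-[A-Za-z][0-9]+)*--\S+")


def looks_like_pointer(content: str) -> bool:
    """Check the prefix first, then try to parse the rest as a key."""
    first_line = content.split("\n", 1)[0]
    if not first_line.startswith(POINTER_PREFIX):
        return False
    remainder = first_line[len(POINTER_PREFIX):]
    return KEY_RE.fullmatch(remainder) is not None


pointer = "/annex/objects/SHA256E-s86558--e79a0891bb94fc9212ce2f28178fe84591c5fb24c07b5239d367099118e12ede.jpg\n"
not_a_key = "/annex/objects/ some notes about the annex\n"
plain_text = "hello\n"
```

<p>With this sketch, <code>pointer</code> passes both steps, while <code>not_a_key</code> has the right prefix but fails the key check, which is the case the prefix test alone would get wrong.</p>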
why othertmp to be on the same file system?http://git-annex.branchable.com/internals/comment_17_df13b7e66963a6d2673e49f52afb978a/yarikoptic2022-12-13T14:15:28Z2022-12-13T14:15:28Z
<p>Joey, you said</p>
<blockquote><p>.git/annex/othertmp has to be on the same filesystem as the work tree and git repository.</p></blockquote>
<p>Is that because atomic/quick renames are important for adjusted-branch unlocked mode? I'm asking because in DataLad we have that "wreckless" mode where we symlink an entire <code>.git/annex</code> from the original clone, e.g. to have a quick throwaway clone for accessing the data, possibly without even modifying the state of any annexed key.</p>
<p>Or how else are we stepping on the shovel here?</p>
re: why othertmp to be on the same file system?http://git-annex.branchable.com/internals/comment_18_1adce7945940b9c384c2383261388dd9/joey2022-12-20T19:18:17Z2022-12-20T18:39:35Z
<p>I've audited the code and the only place I could find where it did not work
to have othertmp on a different filesystem is in the bittorrent special
remote when it downloads a torrent file. But that also failed when
<code>.git/annex/tmp</code> was on a different filesystem! (Since it was moving between
the two directories.) I've fixed that.</p>
<p>It's still best to keep things on the same filesystem because
cross-filesystem moves can be expensive and it sometimes falls back to less
ideal behavior in other ways too when operating across filesystems. Also
of course, you avoid being the one who gets to find and report breakage
like the above.</p>
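<p>One way to check whether two directories share a filesystem is to compare their device IDs. The sketch below is not part of git-annex; the helper name <code>same_filesystem</code> and the demo directory names are assumptions for the example.</p>

```python
import os
import tempfile


def same_filesystem(path_a: str, path_b: str) -> bool:
    """True when both paths report the same device ID.

    os.stat follows symlinks, so a symlinked directory is judged by
    where its target actually lives.
    """
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


# Demo with two directories created side by side; in practice the paths
# would be the work tree and the annex temp directory.
base = tempfile.mkdtemp()
dir_a = os.path.join(base, "worktree")
dir_b = os.path.join(base, "othertmp")
os.mkdir(dir_a)
os.mkdir(dir_b)

on_same_fs = same_filesystem(dir_a, dir_b)
```

<p>When the device IDs differ, a move between the two directories cannot be a plain rename and has to copy the data instead, which is the expensive case mentioned above.</p>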