projects/datalad/bugs-done/Too difficult if not impossible to explicitly add/keep file under git (not annex) in v6 without employing .gitattributes | yoh | http://git-annex.branchable.com/projects/datalad/bugs-done/Too_difficult_if_not_impossible_to_explicitly_add__47__keep_file_under_git___40__not_annex__41___in_v6_without_employing_.gitattributes/ | git-annex | ikiwiki | 2023-01-05T17:30:31Z
comment 1 | http://git-annex.branchable.com/projects/datalad/bugs-done/Too_difficult_if_not_impossible_to_explicitly_add__47__keep_file_under_git___40__not_annex__41___in_v6_without_employing_.gitattributes/comment_1_045f90b5693de55958c6ad1b825cbc9e/ | joey | 2023-01-05T17:30:31Z | 2017-01-30T16:07:31Z
<blockquote><p>Even if -c annex.largefiles=nothing is used with git add, then git commit commits file into annex.</p></blockquote>
<p>I don't reproduce that problem; as long as the file is staged with that -c,
git commit commits what's staged.</p>
comment 2 | http://git-annex.branchable.com/projects/datalad/bugs-done/Too_difficult_if_not_impossible_to_explicitly_add__47__keep_file_under_git___40__not_annex__41___in_v6_without_employing_.gitattributes/comment_2_ead7e348547da0598a82be8d064c0c08/ | joey | 2023-01-05T17:30:31Z | 2017-01-30T16:11:04Z
<p>What's really going on with <code>git status</code>:</p>
<ul>
<li>It runs <code>git annex smudge --clean</code></li>
<li>That ingests the file into the annex, but does not change what's staged at all.</li>
<li>When you run <code>git diff</code>, it again cleans the file and displays
the difference between what is staged (the unannexed file) and the
cleaned version (the annexed file).</li>
<li>Notice that if you run <code>git annex add file12 -c annex.largefiles=nothing</code>
after this, the file gets staged as a non-annexed file, as you'd expect.</li>
</ul>
<p>So the only problem I see here is <code>git annex smudge --clean</code>
is ingesting a file into the annex when <code>git status</code> is not going to
update the working tree to use the pointer to that file. But the clean
hook interface doesn't provide it any way to know why git is asking for a
file to be cleaned, so it has to always do that.</p>
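<p>The way git calls a clean filter from <code>git status</code> can be observed with plain git and a logging stand-in for the filter. This is only a sketch: the filter name <code>logclean</code> and the log location are made up for illustration, and the stand-in passes content through unchanged instead of ingesting anything the way <code>git annex smudge --clean</code> does.</p>

```shell
# Sketch with plain git only; "logclean" is a made-up filter name.
demo=$(mktemp -d) && cd "$demo" || exit 1
git init -q .
mkdir -p .git/info

# The clean command copies content through unchanged (cat) and logs each call.
git config filter.logclean.clean "echo called >> '$demo/.git/clean.log'; cat"
echo 'some filter=logclean' > .git/info/attributes

echo some > some
git add some                                    # clean runs while staging
calls_after_add=$(($(wc -l < .git/clean.log)))

touch -t 200001010000 some                      # change the mtime, not the content
git status --short                              # git re-cleans the file to compare it
calls_after_status=$(($(wc -l < .git/clean.log)))

echo "clean calls: $calls_after_add after add, $calls_after_status after status"
```

<p>The filter only receives the file content on stdin, which is why it cannot tell a <code>git status</code> call apart from a <code>git add</code> call.</p>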
<p>If this is a small file suitable for being checked into git, the overhead of
having a copy of it in the annex shouldn't matter much, and annex.thin
will even make that copy be a hard link.</p>
<p>I suppose one way this could be improved is for <code>git annex smudge --clean</code>
to check if a file was checked into git as a non-annexed file before,
and then avoid cleaning it at all. But then if someone had a non-annexed
file and it got big and they wanted to add it annexed, such a change
would cause a problem.</p>
<p>Or git's smudge/clean interface could be improved so that the clean
filter can know why it's being called, and so avoid ingesting files
unnecessarily. IIRC my patch to improve the git interface did do that, but
unfortunately it's stalled getting accepted into git.</p>
comment 3 | http://git-annex.branchable.com/projects/datalad/bugs-done/Too_difficult_if_not_impossible_to_explicitly_add__47__keep_file_under_git___40__not_annex__41___in_v6_without_employing_.gitattributes/comment_3_ff0746f08600fb9729b8e1773317a52d/ | yarikoptic | 2023-01-05T17:30:31Z | 2017-02-06T20:46:29Z
<p>thought about it...
I guess it would then be a limitation, sorry -- a feature, of v6 mode indeed that all instructions on what files should go into annex vs git should always reside within .gitattributes, to be in effect. Although very powerful, might be limiting in case of a naive user. Any chance you would pick up on pushing your suggested changes forward (i.e. keep nagging like I do I guess) within git? <img src="http://git-annex.branchable.com/smileys/smile4.png" alt=";)" /></p>
comment 4 | http://git-annex.branchable.com/projects/datalad/bugs-done/Too_difficult_if_not_impossible_to_explicitly_add__47__keep_file_under_git___40__not_annex__41___in_v6_without_employing_.gitattributes/comment_4_598622df322c56da71320d1b15e0348f/ | benjamin.poldrack | 2023-01-05T17:30:31Z | 2017-02-09T09:05:34Z
<p>I want to point out an issue that seems to have gone unnoticed:
While we (datalad) might be able to deal with this effect regarding git status in terms of space, it has another consequence. As the scripts show, it also leads to the file being marked as modified if it is rewritten (with the very same content), and therefore leaves the repository dirty. To me this is a pretty weird effect, and I don't yet understand what causes it.</p>
comment 5 | http://git-annex.branchable.com/projects/datalad/bugs-done/Too_difficult_if_not_impossible_to_explicitly_add__47__keep_file_under_git___40__not_annex__41___in_v6_without_employing_.gitattributes/comment_5_9707d9dda7ebb2c94c71f1ea2f99064d/ | db48x | 2023-01-05T17:30:31Z | 2017-02-14T21:10:13Z
I've got a repository that has a small file that gets updated regularly. I've given up trying to keep it unannexed and just started unlocking it every time I run the automated script that updates it.
comment 6 | http://git-annex.branchable.com/projects/datalad/bugs-done/Too_difficult_if_not_impossible_to_explicitly_add__47__keep_file_under_git___40__not_annex__41___in_v6_without_employing_.gitattributes/comment_6_56c8ddb604b54b8a5cfd077225971582/ | joey | 2023-01-05T17:30:31Z | 2017-03-02T17:58:22Z
<p>The several comments after my analysis above seem to still be under the
impression that it's somehow hard to keep files stored in git, not annexed
in v6 mode. But as my analysis shows, that is not what is happening at all
in the test case given.</p>
<p>If you have a situation where that problem occurs, please share it.
Otherwise, I will close this bug report.</p>
comment 7 | http://git-annex.branchable.com/projects/datalad/bugs-done/Too_difficult_if_not_impossible_to_explicitly_add__47__keep_file_under_git___40__not_annex__41___in_v6_without_employing_.gitattributes/comment_7_6b2a763f3d07f0cdc035008638e08194/ | benjamin.poldrack | 2023-01-05T17:30:31Z | 2017-03-27T09:03:10Z
<p>You are right, if we stage that file again with the annex.largefiles option, it stays in git.
But keeping a file in git is still troublesome. For example: take a repository with an annexed file and a file in git, clone it, and then call git annex init --version=6 on the clone.
This leads to a dirty repository, where git status as well as git annex status state that the file in git has unstaged modifications.
I'm not sure whether this is actually related, but I guess it is. Causing fresh clones to be dirty is at least a strange consequence of having files directly in git.</p>
<p><a href="http://git-annex.branchable.com/users/ben/">ben</a></p>
@ben: what? | http://git-annex.branchable.com/projects/datalad/bugs-done/Too_difficult_if_not_impossible_to_explicitly_add__47__keep_file_under_git___40__not_annex__41___in_v6_without_employing_.gitattributes/comment_8_5aa63bf0d8e4145974702a86b36bd435/ | joey | 2023-01-05T17:30:31Z | 2017-04-07T16:37:31Z
<p>I don't know what you mean with <code>git annex init</code> in a v6 repo somehow
doing something to a non-annexed file. That would be extremely surprising
if it happened, and it does not happen when I try to follow your
instructions.</p>
<pre><code>joey@darkstar:~/tmp/v62>git annex init --version=6
init (merging origin/git-annex into git-annex...)
(recording state in git...)
ok
(recording state in git...)
joey@darkstar:~/tmp/v62>ls
x y@
joey@darkstar:~/tmp/v62>git annex get y
get y (from origin...) (checksum...) ok
(recording state in git...)
joey@darkstar:~/tmp/v62>git status
On branch master
Your branch is up-to-date with 'origin/master'.
nothing to commit, working tree clean
</code></pre>
@joey: Sorry ... | http://git-annex.branchable.com/projects/datalad/bugs-done/Too_difficult_if_not_impossible_to_explicitly_add__47__keep_file_under_git___40__not_annex__41___in_v6_without_employing_.gitattributes/comment_9_5a22839b8dc11965a879dd2654bd5d60/ | benjamin.poldrack | 2023-01-05T17:30:31Z | 2017-09-14T12:00:09Z
<p>... I somehow managed to miss your response. Now that a somewhat related topic is emerging again with datalad, I looked into this one again.
I reproduced what I described before, but I noticed that it involves a kind of implicit upgrade from a v5 to a v6 repository.</p>
<p>First, let's have v5 repo with a file in git and a file in annex:</p>
<pre><code>ben@tree /tmp % mkdir origin
ben@tree /tmp % cd origin
ben@tree /tmp/origin % git init
Initialized empty Git repository in /tmp/origin/.git/
ben@tree /tmp/origin % git annex init
init ok
(recording state in git...)
ben@tree /tmp/origin % echo some > some
ben@tree /tmp/origin % git add some
ben@tree /tmp/origin % echo something different > annex
ben@tree /tmp/origin % git annex add annex
add annex ok
(recording state in git...)
ben@tree /tmp/origin % git commit -m "initial"
[master (root-commit) 8b96354] initial
2 files changed, 2 insertions(+)
create mode 120000 annex
create mode 100644 some
ben@tree /tmp/origin % ll
total 376
drwxr-xr-x 3 ben ben 4096 Sep 14 13:34 .
drwxrwxrwt 24 root root 364544 Sep 14 13:33 ..
lrwxrwxrwx 1 ben ben 180 Sep 14 13:34 annex -> .git/annex/objects/g7/4P/SHA256E-s20--b6105173f468fc7afa866aa469220cd56e5200db590be89922239a38631379c9/SHA256E-s20--b6105173f468fc7afa866aa469220cd56e5200db590be89922239a38631379c9
drwxr-xr-x 9 ben ben 4096 Sep 14 13:34 .git
-rw-r--r-- 1 ben ben 5 Sep 14 13:34 some
ben@tree /tmp/origin % git ls-files
annex
some
ben@tree /tmp/origin % git annex find
annex
</code></pre>
<p>Now, clone this repository:</p>
<pre><code>ben@tree /tmp/origin % cd ..
ben@tree /tmp % git clone origin cloned
Cloning into 'cloned'...
done.
ben@tree /tmp % cd cloned
</code></pre>
<p>And annex-init as a v6 repository:</p>
<pre><code>ben@tree /tmp/cloned % git annex init --version=6
init (merging origin/git-annex into git-annex...)
(recording state in git...)
(scanning for unlocked files...)
ok
(recording state in git...)
ben@tree /tmp/cloned % git status
On branch master
Your branch is up-to-date with 'origin/master'.
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git checkout -- <file>..." to discard changes in working directory)
modified: some
no changes added to commit (use "git add" and/or "git commit -a")
</code></pre>
<p>This kind of "implicit" upgrade might not be a common use case, but the result seems to be a bit weird nonetheless.</p>
comment 10 | http://git-annex.branchable.com/projects/datalad/bugs-done/Too_difficult_if_not_impossible_to_explicitly_add__47__keep_file_under_git___40__not_annex__41___in_v6_without_employing_.gitattributes/comment_10_5eb0622b326a8094b06f5f0de627e288/ | joey | 2023-01-05T17:30:31Z | 2018-08-09T19:52:50Z
<p>Ben, I can reproduce that, but the file appearing modified in git status
is a known problem documented in todo/smudge. It's one
of the primary reasons that v6 mode remains experimental.</p>
<p>While <code>git commit -a</code> in that clone does cause the file to be converted
from git to annex, touching the file and committing has the same effect. If
you want to juggle annexed and non-annexed files in a v6 repository without
letting annex.largefiles tell git-annex what to do, you have to manually
tell it what to do every time the file is staged. When you <code>git commit -a</code>,
you stage the file and so you need to include <code>-c annex.largefiles=nothing</code>
to keep it from transitioning to the annex.</p>
<p>I think it might make sense to get v6 working to the point that it's
non-experimental before worrying about such a marginal edge case as this.</p>
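<p>Passing the override on every staging operation can be wrapped in small shell helpers. This is a minimal sketch, not part of git-annex: the function names <code>stage_in_git</code> and <code>commit_all_in_git</code> and the <code>GIT</code> variable are hypothetical, while <code>-c annex.largefiles=nothing</code> is the option discussed in this thread.</p>

```shell
# GIT can be overridden, e.g. with a stub for a dry run (hypothetical convention).
: "${GIT:=git}"

# Stage the given files as ordinary git files, for this invocation only.
stage_in_git() {
    "$GIT" -c annex.largefiles=nothing add -- "$@"
}

# Commit all modified tracked files without letting them move to the annex.
# Extra arguments (such as -m "message") are passed through to git commit.
commit_all_in_git() {
    "$GIT" -c annex.largefiles=nothing commit -a "$@"
}
```

<p>Because <code>-c</code> applies to a single invocation, nothing is written to the repository configuration, so a later plain <code>git commit -a</code> is not covered by these helpers.</p>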
comment 11 | http://git-annex.branchable.com/projects/datalad/bugs-done/Too_difficult_if_not_impossible_to_explicitly_add__47__keep_file_under_git___40__not_annex__41___in_v6_without_employing_.gitattributes/comment_11_a3b5e2047185bbd3972ada07ddb79172/ | joey | 2023-01-05T17:30:31Z | 2018-08-27T17:20:19Z
<p>Actually the "some" in-git file showing as modified in status is not the
same as the git-annex get/drop showing modified in status problem. I've
fixed the latter and the former still happens.</p>
<p>What's happening is, git runs the clean filter, I think because this
is a fresh clone and it's not cleaned it yet. That looks at
annex.largefiles (lack of) configuration and concludes this file belongs in
the annex, so it ingests it. Rest follows the same as what I described in
comment #2.</p>
<p>So yeah, that's a real problem, you clone a repo and there are suddenly
changes that mess up the painstakingly set up in-annex/in-git division.
(Note that it does not need to involve an upgrade.)</p>
<p>Only fix I can imagine is my old idea:</p>
<blockquote><p>I suppose one way this could be improved is for git annex smudge --clean to
check if a file was checked into git as a non-annexed file before, and then
avoid cleaning it at all. But then if someone had a non-annexed file and it got
big and they wanted to add it annexed, such a change would cause a problem.</p></blockquote>
<p>I suppose we could get around that problem with a new git-annex command
that converts a non-annexed file to an annexed file. Without a command,
this would also work: mv the file to a temp name, followed by git-annex add,
and then git mv back.</p>
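<p>The manual conversion (move aside, <code>git-annex add</code>, <code>git mv</code> back) could be scripted roughly as follows. This is a sketch under assumptions: the helper name <code>annex_existing_file</code>, the <code>.to-annex</code> suffix and the overridable <code>GIT</code> variable are hypothetical, and the sequence has not been checked against a particular git or git-annex version.</p>

```shell
# GIT can be overridden, e.g. with a stub for a dry run (hypothetical convention).
: "${GIT:=git}"

# Hypothetical helper: turn a file that is stored in git into an annexed file.
annex_existing_file() {
    file=$1
    tmp="$file.to-annex"               # temp name is an arbitrary choice
    if [ -e "$tmp" ]; then
        echo "refusing to overwrite $tmp" >&2
        return 1
    fi
    mv "$file" "$tmp" &&               # 1. move the file to a temp name
    "$GIT" annex add "$tmp" &&         # 2. add it to the annex under that name
    "$GIT" mv "$tmp" "$file"           # 3. move it back to the original name
}
```

<p>The steps are chained with <code>&amp;&amp;</code> so that a failure in the middle leaves the file under its temp name instead of continuing.</p>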
comment 12 | http://git-annex.branchable.com/projects/datalad/bugs-done/Too_difficult_if_not_impossible_to_explicitly_add__47__keep_file_under_git___40__not_annex__41___in_v6_without_employing_.gitattributes/comment_12_db64ae67b8aa974bd6d00aad005058ac/ | joey | 2023-01-05T17:30:31Z | 2018-08-27T18:08:24Z
<p>Hmm, I think if annex.largefiles is configured, it should be honored. So,
the ways to convert documented on
<a href="https://git-annex.branchable.com/tips/largefiles/">https://git-annex.branchable.com/tips/largefiles/</a> will work
with only a minor modification.</p>
<p>If annex.largefiles is not configured, it can check if the file was annexed
or not before, and maintain the status quo.</p>
<p>At least to start with I am only going to do it in the clean filter, so
<code>git annex add</code> behavior won't change, it defaults to annexing. It may make
sense to make <code>git annex add</code> consistent with <code>git add</code>, but that would
also need to affect v5 repositories, probably, and I don't want to entangle
a v6 bug fix with a v5 behavior change. Also, the risk of a
<code>git annex add</code> with the wrong annex.largefiles temp setting
is much smaller than the risk of forgetting to temp set annex.largefiles
when running <code>git commit -a</code>. Also, I seem to remember another todo item
discussing making this change to <code>git annex add</code> and would need to revisit
the thinking in that.</p>
<p>Note that a .gitattributes file might only configure largefiles for eg,
"*.c" and not for other files, thus implicitly accepting the default
largefiles for the rest. Should then <code>Makefile</code> be treated as having
largefiles configured or not? I lean toward treating it as not configured,
because then when the user temporarily overrides largefiles to add <code>Makefile</code>
to git, a modification won't accidentally go to the annex.</p>
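<p>The rule proposed in this comment can be modelled as a small decision function. This is only a model of the reasoning, not git-annex code: the name <code>clean_decision</code> and its string arguments are hypothetical, and sending new files to the annex when nothing is configured is an assumption based on the default mentioned above.</p>

```shell
# Hypothetical model of the proposed clean filter decision.
#   $1  annex.largefiles for this file: "match", "nomatch" or "unset"
#   $2  how the file is currently stored: "annexed", "git" or "new"
clean_decision() {
    case $1 in
        match)   echo annex ;;         # configured: honor the expression
        nomatch) echo git ;;
        unset)
            case $2 in
                git) echo git ;;       # maintain the status quo
                *)   echo annex ;;     # annexed stays annexed; new files: assumed default
            esac ;;
    esac
}

# Makefile in a repo whose .gitattributes only covers "*.c" counts as unset,
# so a Makefile that is already stored in git stays in git.
clean_decision unset git
```

<p>Treating the <code>Makefile</code> case as unset is what keeps a later modification out of the annex once the file has been added to git with a temporary override.</p>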
comment 13 | http://git-annex.branchable.com/projects/datalad/bugs-done/Too_difficult_if_not_impossible_to_explicitly_add__47__keep_file_under_git___40__not_annex__41___in_v6_without_employing_.gitattributes/comment_13_5f11262639d60ef63a44de09b191e0b6/ | joey | 2023-01-05T17:30:31Z | 2018-08-28T14:02:37Z
<p>However, that leaves the case where .gitattributes configures
annex.largefiles, but that's been overridden for a file to add it to git,
and then the repo is cloned and initted with --version=6 (or upgraded).</p>
<p>Turns out that calling git status before enabling the smudge filter
prevents git from getting confused about the file being modified in this
case.</p>
<p>In the fresh clone, git has not populated the index with stat info
yet, and so it later runs the clean filter on the file, and that
respects the largefiles configuration, so the way the file is
stored in git is not taken into account.</p>
<p>Worked around this by adding a <code>git status</code> call to the v6
initialization/upgrade.</p>
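<p>With a git-annex version that predates this workaround, the same ordering could be approximated by hand in a fresh clone. This is a sketch only: the wrapper name <code>init_v6</code> and the overridable <code>GIT</code> variable are hypothetical, and it assumes that running <code>git status</code> first has the effect described above.</p>

```shell
# GIT can be overridden, e.g. with a stub for a dry run (hypothetical convention).
: "${GIT:=git}"

# Hypothetical wrapper: run git status before the smudge filter is enabled.
init_v6() {
    "$GIT" status >/dev/null &&        # let git record stat info in the index first
    "$GIT" annex init --version=6      # then initialize the repository as v6
}
```

<p>It is meant to be run from inside the clone, before any other git-annex command has been used there.</p>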
git annex init --version=6 leaves repo dirty | http://git-annex.branchable.com/projects/datalad/bugs-done/Too_difficult_if_not_impossible_to_explicitly_add__47__keep_file_under_git___40__not_annex__41___in_v6_without_employing_.gitattributes/comment_14_cfab68df294195498fe71daa376e9f68/ | michael.hanke | 2023-01-05T17:30:31Z | 2018-11-18T17:00:21Z
<p>With this command in an empty directory, I can more or less reliably cause the issue to happen (again)</p>
<p><code>sudo rm -rf .git .datalad .gitattributes * ; git clone https://github.com/datalad/testrepo--minimalds.git . && git annex init --version=6; git status</code></p>
<p>using 6.20181011+git124-g94aa0e2f6-1~ndall+1 or 6.20180913-1</p>
<p>Full story is here: <a href="https://github.com/datalad/datalad/issues/2998">#datalad/2998</a></p>
comment 15 | http://git-annex.branchable.com/projects/datalad/bugs-done/Too_difficult_if_not_impossible_to_explicitly_add__47__keep_file_under_git___40__not_annex__41___in_v6_without_employing_.gitattributes/comment_15_e42f03c3e880dcd71d57bc2f6292d95d/ | joey | 2023-01-05T17:30:31Z | 2018-11-19T17:17:07Z
<p>6.20180913-1 is from before the problem was fixed.</p>
<p>I don't reproduce the problem with your script, using the current version of git-annex.</p>