Please describe the problem.
Even if -c annex.largefiles=nothing is used with git add, a subsequent git commit commits the file into the annex.
This script demonstrates initial finding: http://www.onerussian.com/tmp/ga.sh
Even if we use -c annex.largefiles=nothing with both add and commit, a subsequent git status migrates that file into the annex (which is even weirder).
Adjusted script demonstrates it: http://www.onerussian.com/tmp/ga-2.sh
What steps will reproduce the problem?
see above
What version of git-annex are you using? On what operating system?
6.20170101+gitg93d69b1-1~ndall+1
Please provide any additional information below.
For completeness here is the output of the 2nd script run:
$> /tmp/ga-2.sh
++ mktemp -d
+ d=/home/yoh/.tmp/tmp.WXTv2cULD0
+ echo 'directory: /home/yoh/.tmp/tmp.WXTv2cULD0'
directory: /home/yoh/.tmp/tmp.WXTv2cULD0
+ cd /home/yoh/.tmp/tmp.WXTv2cULD0
+ git init
Initialized empty Git repository in /tmp/tmp.WXTv2cULD0/.git/
+ git annex init --version=6
init ok
(recording state in git...)
+ sed -i -e 's,pre-commit ,pre-commit --debug ,g' .git/hooks/pre-commit
+ echo 'I: creating a file'
I: creating a file
+ echo whatever
+ git -c annex.largefiles=nothing add file5
+ git annex lookupkey file5
+ echo 'not in annex as it should'
not in annex as it should
+ git annex find
+ ls -lR .git/annex/objects
ls: cannot access '.git/annex/objects': No such file or directory
+ :
+ git -c annex.largefiles=nothing commit -m sdf
[2017-01-27 09:43:27.569648245] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","diff","--cached","--name-only","-z","--diff-filter=ACMRT","--","."]
[2017-01-27 09:43:27.576498829] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","symbolic-ref","-q","HEAD"]
[2017-01-27 09:43:27.581039152] process done ExitSuccess
[2017-01-27 09:43:27.581134039] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","show-ref","refs/heads/master"]
[2017-01-27 09:43:27.585093046] process done ExitFailure 1
[master (root-commit) 9ec3fe6] sdf
1 file changed, 1 insertion(+)
create mode 100644 file5
+ ls -lR .git/annex/objects
ls: cannot access '.git/annex/objects': No such file or directory
+ :
+ echo 'I: before git status'
I: before git status
+ ls -lR .git/annex/objects
ls: cannot access '.git/annex/objects': No such file or directory
+ :
+ git status
On branch master
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git checkout -- <file>..." to discard changes in working directory)
modified: file5
no changes added to commit (use "git add" and/or "git commit -a")
+ echo 'I: after git status'
I: after git status
+ ls -lR .git/annex/objects
.git/annex/objects:
total 0
drwx------ 3 yoh yoh 60 Jan 27 09:43 XF
.git/annex/objects/XF:
total 0
drwx------ 3 yoh yoh 60 Jan 27 09:43 pp
.git/annex/objects/XF/pp:
total 0
dr-x------ 2 yoh yoh 60 Jan 27 09:43 SHA256E-s9--cd293be6cea034bd45a0352775a219ef5dc7825ce55d1f7dae9762d80ce64411
.git/annex/objects/XF/pp/SHA256E-s9--cd293be6cea034bd45a0352775a219ef5dc7825ce55d1f7dae9762d80ce64411:
total 4
-rw------- 1 yoh yoh 9 Jan 27 09:43 SHA256E-s9--cd293be6cea034bd45a0352775a219ef5dc7825ce55d1f7dae9762d80ce64411
+ git diff
diff --git a/file5 b/file5
index 982793c..8fdffc0 100644
--- a/file5
+++ b/file5
@@ -1 +1 @@
-whatever
+/annex/objects/SHA256E-s9--cd293be6cea034bd45a0352775a219ef5dc7825ce55d1f7dae9762d80ce64411
Have you had any luck using git-annex before? (Sometimes we get tired of reading bug reports all day and a lil' positive end note does wonders)
yeap
done; clean filter defaults to preserving git/annex state of file. --Joey
I don't reproduce that problem; as long as the file is staged with that -c, git commit commits what's staged.
What's really going on with git status: it runs git annex smudge --clean on the file. With git diff, it again cleans the file and displays the difference between what is staged (the unannexed file) and the cleaned version (the annexed file).
Run git annex add file12 -c annex.largefiles=nothing after this, and the file gets staged as a non-annexed file, as you'd expect.
So the only problem I see here is that git annex smudge --clean is ingesting a file into the annex when git status is not going to update the working tree to use the pointer to that file. But the clean hook interface doesn't provide it any way to know why git is asking for a file to be cleaned, so it has to always do that.
If this is a small file suitable for being checked into git, the overhead of having a copy of it in the annex shouldn't matter much, and annex.thin will even make that copy be a hard link.
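The point that git status itself triggers the clean filter can be seen with plain git, without git-annex. The following is only a sketch: the "demo" filter, its script path, and the log file are hypothetical names made up for illustration, and the filter merely records each call and passes the content through unchanged.

```shell
#!/bin/sh
# Sketch: show that `git status` re-runs a clean filter on a file whose
# cached stat info is stale. Uses plain git and a hypothetical "demo"
# filter; this is NOT git-annex's own filter.
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q

log="$repo/.git/clean.log"
: > "$log"

# Pass-through clean filter that only records each invocation.
cat > "$repo/.git/demo-clean.sh" <<EOF
#!/bin/sh
echo "clean \$1" >> "$log"
cat
EOF
chmod +x "$repo/.git/demo-clean.sh"
git config filter.demo.clean "$repo/.git/demo-clean.sh %f"
echo 'file5 filter=demo' > .gitattributes

echo whatever > file5
git add file5
calls_after_add=$(($(wc -l < "$log")))

# Change only the mtime, so git can no longer trust the index entry
# and has to re-clean the file to compare it with what is staged.
touch -t 200101010000 file5
git status --short
calls_after_status=$(($(wc -l < "$log")))

echo "clean filter calls: after add=$calls_after_add, after status=$calls_after_status"
```

The filter is given only the file name and the content on stdin, so nothing tells it whether the call came from git add, git status, or git diff.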
I suppose one way this could be improved is for git annex smudge --clean to check if a file was checked into git as a non-annexed file before, and then avoid cleaning it at all. But then if someone had a non-annexed file and it got big and they wanted to add it annexed, such a change would cause a problem.
Or git's smudge/clean interface could be improved so that the clean filter can know why it's being called, and so avoid ingesting files unnecessarily. IIRC my patch to improve the git interface did do that, but unfortunately it's stalled getting accepted into git.
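As a rough illustration of what "check if a file was checked into git as a non-annexed file before" could mean, the staged blob can be inspected for a pointer. This is a sketch and not git-annex's implementation: staged_kind is a hypothetical helper, and it assumes pointer files begin with /annex/objects/, as in the git diff output in the report above.

```shell
#!/bin/sh
# Sketch: classify the staged version of a path as an annex pointer or
# as regular git content. `staged_kind` is a hypothetical helper name.
# Paths are taken relative to the repository root.
set -e

staged_kind() {
    # Not in the index at all.
    git ls-files --error-unmatch -- "$1" >/dev/null 2>&1 || { echo untracked; return 0; }
    # "/annex/objects/" is 15 bytes long.
    prefix=$(git cat-file blob ":$1" | head -c 15)
    case "$prefix" in
        /annex/objects/) echo annexed ;;
        *) echo git ;;
    esac
}

# Demo in a throwaway repository (plain git, no git-annex involved).
repo=$(mktemp -d)
cd "$repo"
git init -q
echo whatever > small.txt
# Pointer content copied from the diff shown in this report.
echo '/annex/objects/SHA256E-s9--cd293be6cea034bd45a0352775a219ef5dc7825ce55d1f7dae9762d80ce64411' > big.bin
git add small.txt big.bin

echo "small.txt: $(staged_kind small.txt)"
echo "big.bin: $(staged_kind big.bin)"
```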
thought about it... I guess it would then be a limitation, sorry -- a feature, of v6 mode indeed that all instructions on what files should go into annex vs git must always reside within .gitattributes to be in effect. Although very powerful, that might be limiting for a naive user. Any chance you would pick up on pushing your suggested changes forward (i.e. keep nagging like I do I guess) within git?
I want to point out an issue that seems to have gone unnoticed: while we (datalad) might be able to deal with this effect of git status in terms of space, it has another consequence. As the scripts show, it also leads to the file being marked as modified if it is rewritten (with the very same content), and therefore leaves the repository dirty. To me this is a pretty weird effect, and I don't yet understand what causes it.
The several comments after my analysis above seem to still be under the impression that it's somehow hard to keep files stored in git, not annexed in v6 mode. But as my analysis shows, that is not what is happening at all in the test case given.
If you have a situation where that problem occurs, please share it. Otherwise, I will close this bug report.
You are right, if we stage that file again with the annex.largefiles option, it stays in git. But keeping a file in git is still troublesome. For example: take a repository with an annexed file and a file in git, clone it, and then call git annex init --version=6 on the clone. This leads to a dirty repository, where git status as well as git annex status state that the file in git has unstaged modifications. I'm not sure whether this is actually related, but I guess it is. Causing fresh clones to be dirty is at least a strange consequence of having files directly in git.
ben
I don't know what you mean with git annex init in a v6 repo somehow doing something to a non-annexed file. That would be extremely surprising if it happened, and it does not happen when I try to follow your instructions.
... I somehow managed to miss your response. Now, since a somewhat related topic is emerging again with datalad, I looked into this one again. I reproduced what I described before, but I noticed that it involves a kind of implicit upgrade from a v5 to a v6 repository.
First, let's have a v5 repo with a file in git and a file in annex:
Now, clone this repository:
And annex-init as a v6 repository:
This kind of "implicit" upgrade might not be a common use case, but the result seems to be a bit weird nonetheless.
Ben, I can reproduce that, but the file appearing modified in git status is a known problem documented in smudge. It's one of the primary reasons that v6 mode remains experimental.
While git commit -a in that clone does cause the file to be converted from git to annex, touching the file and committing has the same effect. If you want to juggle annexed and non-annexed files in a v6 repository without letting annex.largefiles tell git-annex what to do, you have to manually tell it what to do every time the file is staged. When you git commit -a, you stage the file and so you need to include -c annex.largefiles=nothing to keep it from transitioning to the annex.
I think it might make sense to get v6 working to the point that it's non-experimental before worrying about such a marginal edge case as this.
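Since the override has to be repeated on every command that stages the file, a small wrapper can make it harder to forget. This is only a sketch; git_keep_in_git is a hypothetical helper name, not something git or git-annex provides.

```shell
#!/bin/sh
# Sketch: run any git command with annex.largefiles overridden to
# "nothing" for that single invocation. `git_keep_in_git` is a
# hypothetical helper name.
git_keep_in_git() {
    git -c annex.largefiles=nothing "$@"
}

# The override is visible to the command being run:
git_keep_in_git config --get annex.largefiles

# Typical use, mirroring the commands discussed in this thread:
#   git_keep_in_git add file5
#   git_keep_in_git commit -a -m "update file5"
```

The setting applies only to that one invocation, so a plain git commit -a run later falls back to the repository's configured or default behavior.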
Actually the "some" in-git file showing as modified in status is not the same as the git-annex get/drop showing modified in status problem. I've fixed the latter and the former still happens.
What's happening is, git runs the clean filter, I think because this is a fresh clone and it's not cleaned it yet. That looks at annex.largefiles (lack of) configuration and concludes this file belongs in the annex, so it ingests it. Rest follows the same as what I described in comment #2.
So yeah, that's a real problem, you clone a repo and there are suddenly changes that mess up the painstakingly set up in-annex/in-git division. (Note that it does not need to involve an upgrade.)
Only fix I can imagine is my old idea:
I suppose we could get around that problem with a new git-annex command that converts a non-annexed file to an annexed file. Without a command, this would also work: mv the file to a temp name, followed by git-annex add, and then git mv back.
Hmm, I think if annex.largefiles is configured, it should be honored. So, the ways to convert documented on https://git-annex.branchable.com/tips/largefiles/ will work with only a minor modification.
If annex.largefiles is not configured, it can check if the file was annexed or not before, and maintain the status quo.
At least to start with I am only going to do it in the clean filter, so git annex add behavior won't change; it defaults to annexing. It may make sense to make git annex add consistent with git add, but that would also need to affect v5 repositories, probably, and I don't want to entangle a v6 bug fix with a v5 behavior change. Also, the risk of a git annex add with the wrong annex.largefiles temp setting is much smaller than the risk of forgetting to temp set annex.largefiles when running git commit -a. Also, I seem to remember another todo item discussing making this change to git annex add and would need to revisit the thinking in that.
Note that a .gitattributes file might only configure largefiles for eg, "*.c" and not for other files, thus implicitly accepting the default largefiles for the rest. Should then Makefile be treated as having largefiles configured or not? I lean toward treating it as not configured, because then when the user temporarily overrides largefiles to add Makefile to git, a modification won't accidentally go to the annex.
However, that leaves the case where .gitattributes configures annex.largefiles, but that's been overridden for a file to add it to git, and then the repo is cloned and initted with --version=6 (or upgraded).
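For the "*.c" versus Makefile question above, git can report per path whether .gitattributes sets the attribute at all. The sketch below uses plain git check-attr in a throwaway repository; the attribute value nothing is just an example. It covers only .gitattributes, not an annex.largefiles value set through git config.

```shell
#!/bin/sh
# Sketch: ask git whether .gitattributes sets annex.largefiles for a
# given path. The queried files do not need to exist.
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q

# Example configuration that only covers C sources.
echo '*.c annex.largefiles=nothing' > .gitattributes

c_attr=$(git check-attr annex.largefiles -- foo.c)
mk_attr=$(git check-attr annex.largefiles -- Makefile)

echo "$c_attr"
echo "$mk_attr"
```

Here foo.c reports the value nothing while Makefile reports unspecified, which matches the "not configured" reading proposed for Makefile.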
Turns out that calling git status before enabling the smudge filter prevents git from getting confused about the file being modified in this case.
In the fresh clone, git has not populated the index with stat info yet, and so it later runs the clean filter on the file, and that respects the largefiles configuration, so the way the file is stored in git is not taken into account.
Worked around this by adding a git status call to the v6 initialization/upgrade.
With this command in an empty directory, I can more or less reliably cause the issue to happen (again):
sudo rm -rf .git .datalad .gitattributes * ; git clone https://github.com/datalad/testrepo--minimalds.git . && git annex init --version=6; git status
using 6.20181011+git124-g94aa0e2f6-1~ndall+1 or 6.20180913-1
Full story is here: #datalad/2998
6.20180913-1 is from before the problem was fixed.
I don't reproduce the problem with your script, using the current version of git-annex.