If an unlocked file (as known to git-annex) is dropped locally, it is still present on the file system as a regular file. A tool unaware of git-annex could therefore happily append to it without realizing that the file contains no data, only a git-annex link. git commit would then silently (!!!) commit such a change.
Here is a reproducer:
#!/bin/bash
cd "$(mktemp -d ${TMPDIR:-/tmp}/dl-XXXXXXX)"
set -eu
git init
git annex init
set -x
git annex addurl --file 123 http://onerussian.com/tmp/123
git annex unlock 123
git commit -m 'commit 123 unlocked' 123
git annex drop 123
cat 123
# git annex knows that the content is gone!
git annex list
echo "more crap" >> 123
git commit -m 'Added crap' 123
# how probable it is that the user DOES want gitlink on top of that file?
cat 123
git show
which, with git-annex 10.20220127+git47-g9f9b1488e-1~ndall+1, produces:
...
+ git annex list
here
|web
||bittorrent
|||
_X_ 123
+ echo 'more crap'
+ git commit -m 'Added crap' 123
[master fdcea88] Added crap
1 file changed, 1 insertion(+)
+ cat 123
/annex/objects/SHA256E-s4--181210f8f9c779c26da1d9b2075bde0127302ee0e3fca38c9a83f5b1dd8e5d3b
more crap
+ git show
commit fdcea88dcfcaf823eebfe78734f30b81531240a8 (HEAD -> master)
Author: Yaroslav Halchenko <debian@onerussian.com>
Date: Fri Feb 18 14:43:32 2022 -0500
Added crap
diff --git a/123 b/123
index 1c0106d..ef5ca34 100644
--- a/123
+++ b/123
@@ -1 +1,2 @@
/annex/objects/SHA256E-s4--181210f8f9c779c26da1d9b2075bde0127302ee0e3fca38c9a83f5b1dd8e5d3b
+more crap
Although this could be considered a user error, I feel that git-annex could add a guard: if, while smudging, the file was not locally present and the beginning of its content is the previous git link, then something went awry. At least issuing a warning (if possible) would be appropriate and could help prevent data loss, where a file that was expected to grow is instead trimmed and its previous content possibly dropped.
Maybe the situation is even more "dire" because git-annex still considers this file "annexed" (according to git annex list), although it is not present locally.
Actually, the format of pointer files allows additional content to be present after the key and a newline.
That is to allow for future expansion. Note that git-lfs does the same and IIRC puts something additional in there, perhaps including a comment like "This file's content is not present; this is only a pointer."
So, appending to the link file keeps it a working link, just a link with more crap at the end:
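A minimal sketch of why the link keeps working, assuming that only the first line of the file names the key and everything after it is ignored. The helper name pointer_key is hypothetical and is not a git-annex command; the key is the one from the output above.

```shell
#!/bin/bash
# Hypothetical helper (not part of git-annex): print the key named on the
# first line of a pointer file. ASSUMPTION: the first line has the form
# /annex/objects/KEY and any later lines are ignored.
pointer_key() {
    local first=''
    # Accept a first line with or without a trailing newline.
    IFS= read -r first < "$1" || [ -n "$first" ] || return 1
    case "$first" in
        /annex/objects/?*) printf '%s\n' "${first#/annex/objects/}" ;;
        *) return 1 ;;   # first line is not an annex link
    esac
}

# Recreate the appended-to file from the reproducer in a temporary file.
demo_file=$(mktemp)
printf '%s\n' \
    '/annex/objects/SHA256E-s4--181210f8f9c779c26da1d9b2075bde0127302ee0e3fca38c9a83f5b1dd8e5d3b' \
    'more crap' > "$demo_file"
pointer_key "$demo_file"
rm -f "$demo_file"
```

Under that assumption the appended line does not change which key the file points to, which is why the content can still be retrieved.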
So how is this data loss? The annexed content is still available; git-annex get will still work. The content you appended to the link file didn't go where you intended it to, but it is checked into git too.
If git-annex warned about this when smudging a file, it would still get added to git; it can't prevent that. And the warning would mess with any intentional use of this.
Hm, well, indeed, data will not be lost per se. But data will be "spread out" (some under git-annex, some as ignored content in the git link file). The history of modifications would still be in git, so it remains possible to reconstruct the ultimately intended file content (an especially fun exercise for the user if there were modifications to both the git-annex'ed version of the file and the git-committed text portion of the link file). I could come up with some obscure scenario (e.g. force-dropping the annexed content of some prior state of the file because I have a newer one) where recovering the original order of such "spread outs" would be impossible, though.
Is there really any intentional use of this? If not, then IMHO the UX would be better if such use were prevented (error out) rather than allowed, since it could lead to confusion and require quite an archaeological expedition to recover the data in the originally intended "all annexed in one file" form. If there is an intended use, then I guess there is indeed no way to keep users from shooting themselves in the foot?
If it errored, git would ignore the failure and still add the file content to git.
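This is general git behavior for clean/smudge filters that are not marked as required, and it can be seen without git-annex. The sketch below uses a made-up filter named "failing" whose clean command is simply `false`; the filter name, the file name, and the temporary repository are all assumptions for illustration.

```shell
#!/bin/bash
# Throwaway repository with a clean filter that always fails.
demo=$(mktemp -d)
git init -q "$demo"
# "failing" is a hypothetical filter name; `false` exits non-zero immediately.
git -C "$demo" config filter.failing.clean false
printf 'file filter=failing\n' > "$demo/.gitattributes"
echo 'some content' > "$demo/file"

# git reports the filter failure but still stages the unfiltered content,
# because filter.failing.required is not set.
git -C "$demo" add file
git -C "$demo" show :file
```

Setting filter.<driver>.required to true is what makes git treat such a failure as an error; that is shown here only to explain the default, not as a suggestion for how git-annex should be configured.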
One thing that could be done is to come up with some kind of format for the lines after the annex link in the pointer file, to be used in whatever future expansion might happen. Then it could warn if there is data there not matching that format. And then add the file to the annex, to at least keep that data out of the git repository.
I've now specified a format for pointer files, which is designed to allow detecting accidental appends.
And git-annex will now treat a pointer file that has been appended to as no longer being a pointer file.
So, for example:
Since the file is not a valid pointer file after being appended to, git add does what it would do with any file, in this case adding the content to the annex.
So at least it keeps the possibly large appended content out of git now. I think that's the most important thing. Detecting and warning about pointer files that are not valid due to appends should be easy from here.
I think this is as far as git-annex can go toward preventing foot shooting here.
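As a rough sketch of what such detection could look like outside git-annex: the function below flags a file whose first line is an annex link but whose later lines do not fit the expected format. The rule used here (every later line must also begin with "/annex/") is only an assumption for illustration, as is the name looks_appended; the authoritative format is whatever the git-annex pointer file documentation specifies.

```shell
#!/bin/bash
# Hypothetical check, not a git-annex command. Returns 0 when FILE starts
# with an annex link but contains later lines that break the assumed rule.
# ASSUMPTION: every line after the first must also begin with "/annex/".
looks_appended() {
    local line='' n=0
    while IFS= read -r line || [ -n "$line" ]; do
        n=$((n + 1))
        if [ "$n" -eq 1 ]; then
            case "$line" in
                /annex/objects/?*) continue ;;  # first line is a link
                *) return 1 ;;                  # not a pointer file at all
            esac
        fi
        case "$line" in
            /annex/*) ;;                        # fits the assumed format
            *) return 0 ;;                      # stray data after the link
        esac
    done < "$1"
    return 1                                    # clean pointer, or empty file
}
```

It could be called on each file about to be committed, for example `looks_appended 123 && echo "warning: 123 has data appended after its annex link" >&2`.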