Please describe the problem.
The goal was to have files unlocked (no symlinks) and still gain an "efficient" clone which, through hardlinks, would not consume more space. But if annex.thin is set in the clone, we get a copy within the clone rather than a hard link to the original.
What steps will reproduce the problem?
#!/bin/bash
export PS4='> '
set -x
set -eu
cd "$(mktemp -d ${TMPDIR:-/tmp}/dl-XXXXXXX)"
git annex version | head -n 1
(
mkdir src
cd src
git init
git annex init
git commit --allow-empty -m empty
git annex adjust --unlock
git config annex.thin true
touch 123
git annex add 123
git commit -m 123 123
)
git clone --shared src dest
(
cd dest
# without it it would stall in "get" call
# https://git-annex.branchable.com/bugs/get_is_stuck_unless_a_clone_was_previously_explicitly___34__annex_init__34__ed/
git annex init
git config annex.thin true
git annex get 123
)
: are inodes the same?
ls -lLi {dest,src}/{123,.git/annex/objects/*/*/*/*}
Please provide any additional information below.
$> bash check-annex-thin-hardlink.sh
> set -eu
>> mktemp -d /home/yoh/.tmp/dl-XXXXXXX
> cd /home/yoh/.tmp/dl-je8JzWt
> git annex version
> head -n 1
git-annex version: 8.20200810+git47-g27329f0bb-1~ndall+1
> mkdir src
> cd src
> git init
Initialized empty Git repository in /home/yoh/.tmp/dl-je8JzWt/src/.git/
> git annex init
init (scanning for unlocked files...)
ok
(recording state in git...)
> git commit --allow-empty -m empty
[master (root-commit) be447d9] empty
> git annex adjust --unlock
adjust
Switched to branch 'adjusted/master(unlocked)'
ok
> git config annex.thin true
> touch 123
> git annex add 123
add 123
ok
(recording state in git...)
> git commit -m 123 123
[adjusted/master(unlocked) c5db915] 123
1 file changed, 1 insertion(+)
create mode 100644 123
> git clone --shared src dest
Cloning into 'dest'...
done.
> cd dest
> git annex init
init (merging origin/git-annex into git-annex...)
(recording state in git...)
(scanning for unlocked files...)
Repository was cloned with --shared; setting annex.hardlink=true and making repository untrusted.
ok
(recording state in git...)
> git config annex.thin true
> git annex get 123
get 123 (from origin...)
(checksum...) ok
(recording state in git...)
> : are inodes the 'same?'
> ls -lLi dest/123 dest/.git/annex/objects/pX/ZJ/SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855/SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 src/123 src/.git/annex/objects/pX/ZJ/SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855/SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
49711952 -rw------- 2 yoh yoh 0 Aug 25 11:04 dest/123
49711952 -rw------- 2 yoh yoh 0 Aug 25 11:04 dest/.git/annex/objects/pX/ZJ/SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855/SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
49711742 -rw------- 2 yoh yoh 0 Aug 25 11:04 src/123
49711742 -rw------- 2 yoh yoh 0 Aug 25 11:04 src/.git/annex/objects/pX/ZJ/SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855/SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
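The inode comparison above can also be scripted rather than read off the listing. This is a minimal sketch, assuming a shell whose test builtin supports -ef (bash and dash do); the helper name same_inode and the throwaway files are illustrative only and not part of the original script.

```shell
# same_inode is a hypothetical helper: it succeeds when both paths refer
# to the same file (same device and inode), i.e. they are hard links.
same_inode() {
    [ "$1" -ef "$2" ]
}

# Self-contained demo on throwaway files instead of a real annex.
demo=$(mktemp -d)
echo data > "$demo/object"
ln "$demo/object" "$demo/hardlinked"   # shares the inode
cp "$demo/object" "$demo/copied"       # gets a new inode

same_inode "$demo/object" "$demo/hardlinked" && echo "hardlinked: same inode"
same_inode "$demo/object" "$demo/copied" || echo "copied: different inode"
```

In the layout of the script above, the equivalent check would be same_inode dest/123 src/123, which would fail for the listing shown since the inode numbers differ (49711952 vs 49711742).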
Done. It seems that my documentation improvement is the only thing that needed to be done for this. --Joey
I believe this is by design. See ecd0684bf (avoid hard linking object from other repository when annex.thin is set, 2016-01-13).
That is the relevant commit indeed.
We have two different things that want to hard link a file for different purposes. The purposes are not compatible. So one of them has to win out.
If annex.hardlink and annex.thin were both allowed to hard link the same file, then a file could get a link count of more than 2. That would prevent using the method that annex.thin relies on to avoid two unlocked files with the same content ending up hard linked together, with the very undesirable result that an edit to one file would also edit the other file.
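The link count argument can be illustrated with plain files, outside git-annex. This is only a sketch, assuming a POSIX shell and a filesystem that supports hard links; the file names merely mimic the roles discussed here (the origin's object, the clone's object, the clone's unlocked work tree file), and the link_count helper is hypothetical.

```shell
# link_count prints the number of hard links to a file, taken from the
# second column of `ls -ld` (avoids GNU/BSD differences in stat).
link_count() {
    ls -ld "$1" | awk '{print $2}'
}

work=$(mktemp -d)
echo content > "$work/origin-object"
link_count "$work/origin-object"            # prints 1

# What annex.hardlink would do: link the clone's object to the origin's.
ln "$work/origin-object" "$work/clone-object"
link_count "$work/clone-object"             # prints 2

# What annex.thin would then do: link the work tree file to the object.
ln "$work/clone-object" "$work/clone-worktree-file"
link_count "$work/clone-worktree-file"      # prints 3
```

With both kinds of link in place the count reaches 3, which is the situation the explanation above says would defeat a check based on link count.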
Now, I could imagine using some other thing than link count to avoid that situation, e.g. keeping track of when two files have the same content. But there's a second similar problem: If a file were unlocked in one repo and that repo's annex/objects also hard linked to its remote's annex/objects, then editing the file in the first repo would corrupt the object content in the other repo.
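The corruption risk is a general property of hard links and can be shown with plain files. A minimal sketch, assuming a POSIX shell; the file names are illustrative stand-ins for the remote's object and the local unlocked file, not real git-annex paths.

```shell
work=$(mktemp -d)
echo original > "$work/remote-object"
ln "$work/remote-object" "$work/local-unlocked-file"

# An in-place edit (here an append) writes to the shared inode...
echo edited >> "$work/local-unlocked-file"

# ...so the other name shows the change too: this prints both lines.
cat "$work/remote-object"
```

Tools that write a new file and rename it over the old one would break the link instead of changing the shared content, so whether the other name is affected depends on how the edit is made.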
So I think the only other possible way for it to work would be for whichever of the two caused a hard link to be made first to win, rather than annex.thin always winning as it does now. So, after a git annex get in the clone hard linked the file (annex.hardlink wins), git annex unlock would copy the file (annex.thin loses). While a git add would hard link the file to annex/objects (annex.thin wins), resulting in git annex copy --to origin having to copy the file (annex.hardlink loses).
But that would not change the behavior in your test case, since it needs to get the file (annex.hardlink wins) before unlocking it (annex.thin loses).
So the benefit of making that change seems small to nonexistent as far as this bug report goes. Behavior becomes less consistent, it has to work harder to enforce the link count invariant, and it doesn't actually change the test case. The only real benefit would be when some files are not unlocked, but annex.thin is set, since then the locked files could be hardlinked.
I added a note to the annex.hardlink docs about annex.thin winning and am inclined to only do that.
Thank you Joey for looking into it! I only wish that all file systems were CoW so we could quickly "clone/get" any amount of data without jeopardizing data as is the case with hardlinks and also that software did not follow symlinks all the way into the .git/annex/objects (which remains the issue even with browsers etc).
But unfortunately that is not the case, and that is why we breed all these workarounds to fulfill the use cases. For a "consumer" who does not intend to modify any of the original hardlinked files, do you see any other way to get a "quick" clone of a heavy (GBs or TBs) repository without somehow "marrying" shared+thin? Maybe for data protection there could be some "read-only-thin" mode which would make all annexed files read-only (in both the original and the clone repositories), thus making it "safe" (somewhat) to have inodes with more than 2 hardlinks?
git annex adjust --lock seems like it would accomplish that. Unless symlinks don't work for some reason on this filesystem, but I don't think there are any where hard links work and symlinks don't.