export tree bug when two files with the same content both should be removed
git-annex http://git-annex.branchable.com/bugs/Versioned_S3_tree_does_not_unexport_git_objects/ git-annex ikiwiki 2022-11-09T19:51:20Z
comment 1 http://git-annex.branchable.com/bugs/Versioned_S3_tree_does_not_unexport_git_objects/comment_1_70428ebc4027538253edc483dc5cb971/ joey 2022-11-09T19:18:34Z 2022-11-09T18:16:29Z
<p>I tried making a repository with just 2 files, one in git and one in git-annex,
and am unable to reproduce the bug. Here is
what <code>git-annex export master --to remote --debug</code> showed when
exporting a tree that deleted file "foo" which was a git object:</p>
<pre><code>[2022-11-09 14:16:09.513485291] (Remote.S3) String to sign: "DELETE\n/foo\n\nhost:t-a9b9d406-30e5-41cc-a74c-c5d83b2953fb.s3.amazonaws.com\nx-amz-content-sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855\nx-amz-date:20221109T181609Z\n\nhost;x-amz-content-sha256;x-amz-date\ne3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
[2022-11-09 14:16:09.513608493] (Remote.S3) Host: "t-a9b9d406-30e5-41cc-a74c-c5d83b2953fb.s3.amazonaws.com"
[2022-11-09 14:16:09.513697445] (Remote.S3) Path: "/foo"
[2022-11-09 14:16:09.513748086] (Remote.S3) Query string: ""
[2022-11-09 14:16:09.513829584] (Remote.S3) Header: (redacted -- JEH)
[2022-11-09 14:16:09.687814925] (Remote.S3) Response status: Status {statusCode = 204, statusMessage = "No Content"}
</code></pre>
<p>The S3 console showed that the file was deleted from the bucket.
And as far as the S3 remote implementation is concerned, there should
not be anything different between a git object and a git-annex object.
At the level of the S3 remote both have a git-annex key that it deletes
in the same way.</p>
<p>In your log, the only thing it does with the file is export it, but it does
not later unexport it:</p>
<pre><code>$ grep baseline/pet/sub-000101_ses-baseline_rec-MLEM_pet.json Versioned_S3_tree_does_not_unexport_git_objects.mdwn
Git files in the 1.0.0 tag are still present in the S3 1.0.1 export. sub-000101/ses-baseline/pet/sub-000101_ses-baseline_rec-MLEM_pet.json is an example file not present in 1.0.1 that is still present on S3.
export s3-PUBLIC sub-000101/ses-baseline/pet/sub-000101_ses-baseline_rec-MLEM_pet.json
$
</code></pre>
<p>If the bug is that it's somehow failing to try to unexport the file,
that should happen independently of the special remote type, so would also
happen with a directory special remote. So I tried that:</p>
<pre><code>$ git clone https://github.com/openneurodatasets/ds001705
$ cd ds001705
$ git-annex get --branch=tags/1.0.0
$ git-annex get --branch=tags/1.0.1
$ mkdir ../d
$ git-annex initremote d type=directory directory=../d encryption=none exporttree=yes
$ git-annex export 1.0.0 --to d
$ ls ../d/sub-000101/ses-baseline/pet/sub-000101_ses-baseline_rec-MLEM_pet.json
../d/sub-000101/ses-baseline/pet/sub-000101_ses-baseline_rec-MLEM_pet.json
$ git-annex export 1.0.1 --to d
$ ls ../d/sub-000101/ses-baseline/pet/sub-000101_ses-baseline_rec-MLEM_pet.json
../d/sub-000101/ses-baseline/pet/sub-000101_ses-baseline_rec-MLEM_pet.json
</code></pre>
<p>Ok, so the file is still present after exporting 1.0.1 to a directory
special remote too. This has nothing to do with S3 or versioning at all.</p>
comment 2 http://git-annex.branchable.com/bugs/Versioned_S3_tree_does_not_unexport_git_objects/comment_2_059b9beb31d9cbc97ea4a59f47d2e63d/ joey 2022-11-09T19:18:34Z 2022-11-09T19:15:59Z
<p>Interestingly, it does not happen in simpler situations:</p>
<pre><code>$ git checkout 1.0.0
$ git-annex export 1.0.0 --to d
$ git rm sub-000101/ses-baseline/pet/sub-000101_ses-baseline_rec-MLEM_pet.json
$ git commit -m removed
$ git-annex export HEAD --to d
unexport d sub-000101/ses-baseline/pet/sub-000101_ses-baseline_rec-MLEM_pet.json ok
$ ls ../d/sub-000101/ses-baseline/pet/sub-000101_ses-baseline_rec-MLEM_pet.json
ls: cannot access '../d/sub-000101/ses-baseline/pet/sub-000101_ses-baseline_rec-MLEM_pet.json': No such file or directory
</code></pre>
<p>So something about the diff between 1.0.0 and 1.0.1 is causing the
bug.</p>
comment 3 http://git-annex.branchable.com/bugs/Versioned_S3_tree_does_not_unexport_git_objects/comment_3_176cbc137afb5cf8841ff9114b111fef/ joey 2022-11-09T19:46:22Z 2022-11-09T19:41:33Z
<p>Ok, the bug is due to 2 files that have the same content.</p>
<pre><code>sub-000101/ses-baseline/pet/sub-000101_ses-baseline_rec-MLEM_pet.json
sub-000101/ses-displaced/pet/sub-000101_ses-displaced_rec-MLEM_pet.json
</code></pre>
<p>Both files get deleted between 1.0.0 and 1.0.1. But the bug makes it pick
only one of the two files to unexport, because it's using a map from key to
file, and the second file overwrites the first in the map.</p>
<p>So this would also presumably affect annexed files when two have the same
content and are being deleted.</p>
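<p>To illustrate the key-to-file collision described above, here is a minimal
sketch in Python. It is not git-annex's actual code (git-annex is written in
Haskell), and the key string, the function names, and the multimap variant are
all assumptions made for illustration. It only shows how a map holding one file
per key loses a file when two removed files share a key, and one possible way
to avoid that.</p>

```python
from collections import defaultdict

# Hypothetical input: (key, path) pairs for files removed between two trees.
# "KEY-abc" is a made-up placeholder; both files have identical content,
# so they share the same key.
removed = [
    ("KEY-abc", "sub-000101/ses-baseline/pet/sub-000101_ses-baseline_rec-MLEM_pet.json"),
    ("KEY-abc", "sub-000101/ses-displaced/pet/sub-000101_ses-displaced_rec-MLEM_pet.json"),
]


def files_to_unexport_single(pairs):
    """One file per key: a later pair with the same key replaces the earlier one."""
    by_key = {key: path for key, path in pairs}
    return sorted(by_key.values())


def files_to_unexport_multimap(pairs):
    """Keep every path for each key, so duplicate content loses nothing."""
    by_key = defaultdict(list)
    for key, path in pairs:
        by_key[key].append(path)
    return sorted(path for paths in by_key.values() for path in paths)


if __name__ == "__main__":
    # Only the ses-displaced file survives in the single-valued map.
    print(files_to_unexport_single(removed))
    # Both files are listed when the map holds a list of paths per key.
    print(files_to_unexport_multimap(removed))
```

<p>With the single-valued map only the ses-displaced file would be unexported,
which matches the ses-baseline file being left behind on the remote.</p>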
comment 4 http://git-annex.branchable.com/bugs/Versioned_S3_tree_does_not_unexport_git_objects/comment_4_a6490af0427bbe4363bea824d55a7593/ nell 2022-11-09T19:51:20Z 2022-11-09T19:51:20Z
Nice, thank you for looking into this, Joey! That explains why it appeared to affect only git files: we tend to have identical duplicate metadata files much more often than duplicate image files within datasets.