Please describe the problem.

I've been managing my about 50k photos (around 72k files) using git-annex. Recently I've decided to add a directory remote using importtree/exporttree in order to synchronize it with Nextcloud. To start slowly, I've first exported the files, then let Nextcloud upload them and then imported the files again. During the import, some files were re-imported, I think Nextcloud re-synced those files due to some errors that occurred during uploading. I've then compared the freshly create git branch to the current branch and found that while those re-imported files were not different, some other files had differences. On closer inspection, I found that these files have been replaced by identically sized files, leading, e.g., to the following diff:

diff --git a/Profiles/Sony DSC-RX100M5A Camera Portrait.dcp b/Profiles/Sony DSC-RX100M5A Camera Portrait.dcp
index 918bbfcc8ff..8c163b32ae2 120000
--- a/Profiles/Sony DSC-RX100M5A Camera Portrait.dcp    
+++ b/Profiles/Sony DSC-RX100M5A Camera Portrait.dcp    
@@ -1 +1 @@
-../.git/annex/objects/Pj/xF/SHA256E-s278110--e4a0dcd12fa02840605dbc4f7c6e258ff3002faf3fc3dc2ca262e8bbf1cf542f.dcp/SHA256E-s278110--e4a0dcd12fa02840605dbc4f7c6e258ff3002faf3fc3dc2ca262e8bbf1cf542f.dcp
\ No newline at end of file
+../.git/annex/objects/5f/Kg/SHA256E-s278110--82041bcc454f77ee5b4a55522178cdc9d5aa2d78b03176236add7e28dfac2b63.dcp/SHA256E-s278110--82041bcc454f77ee5b4a55522178cdc9d5aa2d78b03176236add7e28dfac2b63.dcp
\ No newline at end of file

These are camera profiles, and the file it has been replaced with is a different camera profile in the same directory. I've checked in the exported directory and the two camera files have the correct SHA256 hash there. To be precise, the file there has hash e4a0dcd12fa02840605dbc4f7c6e258ff3002faf3fc3dc2ca262e8bbf1cf542f while the commit created during import claims it has the hash 82041bcc454f77ee5b4a55522178cdc9d5aa2d78b03176236add7e28dfac2b63. I have similar changes in Audacity project files that I also have in the same repository where 8 out of 22 files have been replaced by others. Of these 22 files, 20 have the same size. I have seen one case where two files were replaced by the same target file, apart from that from a manual inspection, the hashes of the files in the diff seem to be disjoint.

The problem remains when running the import command again and when merging the remote, there are actually duplicates in the affected directories. I've rolled back the merge using git reset --merge. I assume touching the affected files in the export and re-importing could solve the problem, but I would like to preserve any state that could help with debugging for now as I would really like to know the root cause of this to avoid that this happens again.

What steps will reproduce the problem?

I have tried reproducing the problem using the set of camera profiles in a new git-annex repository, but without any luck. I've also tried generating equally-sized random files, again without any luck. In theory, the steps to reproduce should be the following, but the resulting diff is always empty:

mkdir annex-test
cd annex-test/
git init .
git annex init
mkdir ../d
git annex initremote d type=directory directory=../d exporttree=yes importtree=yes encryption=none
# create some files
for i in $(seq 32); do dd if=/dev/urandom of=file${i}.test bs=200k count=10; done
git annex add .
git commit -m "initial commit"
git annex export master --to d
git annex import master --from d
git diff d/master

What version of git-annex are you using? On what operating system?

git-annex version: 10.20220222-g1c4b0b4c2
build flags: Assistant Webapp Pairing Inotify DBus DesktopNotify TorrentParser MagicMime Feeds Testsuite S3 WebDAV
dependency versions: aws-0.22 bloomfilter-2.0.1.0 cryptonite-0.29 DAV-1.3.4 feed-1.3.2.0 ghc-9.0.2 http-client-0.7.10 persistent-sqlite-2.13.0.3 torrent-10000.1.1 uuid-1.3.15 yesod-1.6.1.2
key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 BLAKE2B256E BLAKE2B256 BLAKE2B512E BLAKE2B512 BLAKE2B160E BLAKE2B160 BLAKE2B224E BLAKE2B224 BLAKE2B384E BLAKE2B384 BLAKE2BP512E BLAKE2BP512 BLAKE2S256E BLAKE2S256 BLAKE2S160E BLAKE2S160 BLAKE2S224E BLAKE2S224 BLAKE2SP256E BLAKE2SP256 BLAKE2SP224E BLAKE2SP224 SHA1E SHA1 MD5E MD5 WORM URL X*
remote types: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar git-lfs httpalso borg hook external
operating system: linux x86_64
supported repository versions: 8 9 10
upgrade supported from repository versions: 0 1 2 3 4 5 6 7 8 9 10
local repository version: 8

I'm using Arch Linux.

Please provide any additional information below.

Unfortunately, I don't have the output of git-annex anymore. However, the used commands, in particular for setting up the remote, exporting and importing were identical to those listed above (apart from the name of the directory and the name of the remote). I'm also quite certain that the files that are now listed as different were not listed in the output of the import command.

Have you had any luck using git-annex before? (Sometimes we get tired of reading bug reports all day and a lil' positive end note does wonders)

I've been successfully using git-annex to manage the growing collection of photos I take for several years now. Synchronizing the collection across several disks while files are renamed and some disks are only rarely synchronized and some contain only a subset of the photos wouldn't have been possible without git-annex. Also, using git-annex gives me the safety that the original photos will never be modified. There is nothing better than using git-annex to verify that all photos have been copied unmodified from a memory card before wiping it.

fixed --Joey