Please describe the problem.
I've been managing my about 50k photos (around 72k files) using git-annex. Recently I've decided to add a directory remote using importtree/exporttree in order to synchronize it with Nextcloud. To start slowly, I've first exported the files, then let Nextcloud upload them and then imported the files again. During the import, some files were re-imported, I think Nextcloud re-synced those files due to some errors that occurred during uploading. I've then compared the freshly create git branch to the current branch and found that while those re-imported files were not different, some other files had differences. On closer inspection, I found that these files have been replaced by identically sized files, leading, e.g., to the following diff:
diff --git a/Profiles/Sony DSC-RX100M5A Camera Portrait.dcp b/Profiles/Sony DSC-RX100M5A Camera Portrait.dcp
index 918bbfcc8ff..8c163b32ae2 120000
--- a/Profiles/Sony DSC-RX100M5A Camera Portrait.dcp
+++ b/Profiles/Sony DSC-RX100M5A Camera Portrait.dcp
@@ -1 +1 @@
-../.git/annex/objects/Pj/xF/SHA256E-s278110--e4a0dcd12fa02840605dbc4f7c6e258ff3002faf3fc3dc2ca262e8bbf1cf542f.dcp/SHA256E-s278110--e4a0dcd12fa02840605dbc4f7c6e258ff3002faf3fc3dc2ca262e8bbf1cf542f.dcp
\ No newline at end of file
+../.git/annex/objects/5f/Kg/SHA256E-s278110--82041bcc454f77ee5b4a55522178cdc9d5aa2d78b03176236add7e28dfac2b63.dcp/SHA256E-s278110--82041bcc454f77ee5b4a55522178cdc9d5aa2d78b03176236add7e28dfac2b63.dcp
\ No newline at end of file
These are camera profiles, and the file it has been replaced with is a different camera profile in the same directory. I've checked in the exported directory and the two camera files have the correct SHA256 hash there. To be precise, the file there has hash e4a0dcd12fa02840605dbc4f7c6e258ff3002faf3fc3dc2ca262e8bbf1cf542f
while the commit created during import claims it has the hash 82041bcc454f77ee5b4a55522178cdc9d5aa2d78b03176236add7e28dfac2b63
. I have similar changes in Audacity project files that I also have in the same repository where 8 out of 22 files have been replaced by others. Of these 22 files, 20 have the same size. I have seen one case where two files were replaced by the same target file, apart from that from a manual inspection, the hashes of the files in the diff seem to be disjoint.
The problem remains when running the import command again and when merging the remote, there are actually duplicates in the affected directories. I've rolled back the merge using git reset --merge
. I assume touching the affected files in the export and re-importing could solve the problem, but I would like to preserve any state that could help with debugging for now as I would really like to know the root cause of this to avoid that this happens again.
What steps will reproduce the problem?
I have tried reproducing the problem using the set of camera profiles in a new git-annex repository, but without any luck. I've also tried generating equally-sized random files, again without any luck. In theory, the steps to reproduce should be the following, but the resulting diff is always empty:
mkdir annex-test
cd annex-test/
git init .
git annex init
mkdir ../d
git annex initremote d type=directory directory=../d exporttree=yes importtree=yes encryption=none
# create some files
for i in $(seq 32); do dd if=/dev/urandom of=file${i}.test bs=200k count=10; done
git annex add .
git commit -m "initial commit"
git annex export master --to d
git annex import master --from d
git diff d/master
What version of git-annex are you using? On what operating system?
git-annex version: 10.20220222-g1c4b0b4c2
build flags: Assistant Webapp Pairing Inotify DBus DesktopNotify TorrentParser MagicMime Feeds Testsuite S3 WebDAV
dependency versions: aws-0.22 bloomfilter-2.0.1.0 cryptonite-0.29 DAV-1.3.4 feed-1.3.2.0 ghc-9.0.2 http-client-0.7.10 persistent-sqlite-2.13.0.3 torrent-10000.1.1 uuid-1.3.15 yesod-1.6.1.2
key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 BLAKE2B256E BLAKE2B256 BLAKE2B512E BLAKE2B512 BLAKE2B160E BLAKE2B160 BLAKE2B224E BLAKE2B224 BLAKE2B384E BLAKE2B384 BLAKE2BP512E BLAKE2BP512 BLAKE2S256E BLAKE2S256 BLAKE2S160E BLAKE2S160 BLAKE2S224E BLAKE2S224 BLAKE2SP256E BLAKE2SP256 BLAKE2SP224E BLAKE2SP224 SHA1E SHA1 MD5E MD5 WORM URL X*
remote types: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar git-lfs httpalso borg hook external
operating system: linux x86_64
supported repository versions: 8 9 10
upgrade supported from repository versions: 0 1 2 3 4 5 6 7 8 9 10
local repository version: 8
I'm using Arch Linux.
Please provide any additional information below.
Unfortunately, I don't have the output of git-annex anymore. However, the used commands, in particular for setting up the remote, exporting and importing were identical to those listed above (apart from the name of the directory and the name of the remote). I'm also quite certain that the files that are now listed as different were not listed in the output of the import command.
Have you had any luck using git-annex before? (Sometimes we get tired of reading bug reports all day and a lil' positive end note does wonders)
I've been successfully using git-annex to manage the growing collection of photos I take for several years now. Synchronizing the collection across several disks while files are renamed and some disks are only rarely synchronized and some contain only a subset of the photos wouldn't have been possible without git-annex. Also, using git-annex gives me the safety that the original photos will never be modified. There is nothing better than using git-annex to verify that all photos have been copied unmodified from a memory card before wiping it.
The directory special remote use the size and mtime as the ContentIdentifier of a file. It used to use inode number as well, but some filesystems make up new inodes on each mount so that was dropped.
So when two files have the same size and mtime, it will import one of them, and then when it comes to the other one, see that it already has that ContentIdentifier and use the content it already imported.
For example:
Perhaps NextCloud is for some reason giving multiple files the same mtime?
This is rather unlikely to happen naturally, especially on modern filesystems that have high resolution mtimes. Even "echo 1> 1; echo 2>2" generates files with 2 different mtimes. To get the same mtime, the two files have to be written by the same process at very close to the same time.
Probably that's why NextCloud managed to get the same mtime, although why it would be rewriting the content of the file when the content had not changed I don't know and perhaps that's a bug of some kind in it.
So this makes me think that [[!73df633a6215faa093e4a7524c6d328aa988aed1 ]] needs to be reverted and the inode number used again. But that did fix a bug of a sort, that importing from a fat filesystem after it was remounted would re-import everything. So I suppose the best that can be done is to make it configurable whether inodes should be ignored or not when importing from a directory special remote. And presumably default to not ignoring them.
done
First of all, thank you very much for the quick fix!
I don't think that Nextcloud rewrote these files as otherwise git-annex would have re-imported them instead of recognizing the exported files. Using stat, I confirm that the replaced files have indeed the exactly same modify time -
Modify: 2021-12-19 22:47:35.988712668 +0100
. I think this is the time when I copied these files from a different directory, as this does not correspond to the modify time of the original files (that I still have). Quite interestingly, the access time of both original files is also the exactly same time, which supports this hypothesis. I can reproduce that copying a couple of small files using cp gives me target files with exactly identical modify times (in fact, all time stamps stat displays are identical). Steps to reproduce:For me, this reproduces both on ext4 and on tmpfs.
Makes sense, cp is just the kind of program that can write 2 files at very close to the same time, and when small enough, get the same mtime.