Please describe the problem.
git annex export --fast main --to <remote> deletes existing files on an rsync ssh remote. My mental model was that --fast usually instructs git-annex not to make (slow) network connections.
My use case for a fast export is to add an existing ssh-accessible non-git directory on an HPC system as a potential data source for a git-annex repository. The repository has additional information like how to retrieve the files from a third party, while the directory on HPC only has the files (which were already downloaded without git-annex involvement). My plan was to add the directory as an exporttree remote, make git-annex think that the current main branch's tree should be available there via the fast export, and then do a git annex fsck --from <remote> to discover what's actually there. Obviously it is very undesirable to lose those files on export then.
From what I understand I could hack around this if I graft the tree into the git-annex branch and write export.log myself, but I am wondering if I am just encountering a bug and this should work the way I wanted it to.
What steps will reproduce the problem?
- Create a git-annex repository and add some files
- Create a plain directory with (a subset of) the same filenames in the repository
- Add this directory as an rsync export remote: git annex initremote <remote> type=rsync rsyncurl=<host>:<path> exporttree=yes encryption=none
- git annex export --fast main --to <remote>
- Observe files being deleted on the remote
What version of git-annex are you using? On what operating system?
git-annex version: 10.20260115-ge8de977f1d5b5ac57cfe7a0c66d4e1c3ff337af1
build flags: Assistant Webapp Inotify DBus DesktopNotify TorrentParser MagicMime Benchmark Feeds Testsuite S3 WebDAV Servant OsPath
dependency versions: aws-0.25.2 bloomfilter-2.0.1.3 crypton-1.0.4 DAV-1.3.4 feed-1.3.2.1 ghc-9.10.3 http-client-0.7.19 torrent-10000.1.3 uuid-1.3.16 yesod-1.6.2.1
key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 BLAKE2B256E BLAKE2B256 BLAKE2B512E BLAKE2B512 BLAKE2B160E BLAKE2B160 BLAKE2B224E BLAKE2B224 BLAKE2B384E BLAKE2B384 BLAKE2BP512E BLAKE2BP512 BLAKE2S256E BLAKE2S256 BLAKE2S160E BLAKE2S160 BLAKE2S224E BLAKE2S224 BLAKE2SP256E BLAKE2SP256 BLAKE2SP224E BLAKE2SP224 SHA1E SHA1 MD5E MD5 WORM URL GITBUNDLE GITMANIFEST VURL X*
remote types: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar git-lfs httpalso borg rclone hook external compute mask
operating system: linux x86_64
supported repository versions: 8 9 10
upgrade supported from repository versions: 0 1 2 3 4 5 6 7 8 9 10
local repository version: 10
Please provide any additional information below.
I don't reproduce this:
Nothing is exported by export --fast (which matches its documentation), and examining the files in the remote's directory, none are deleted or overwritten.
When I later run git-annex push, all files in the tree get exported. In cases where the remote's directory already contained a file with the same name, it is overwritten. That is as expected.
You could do that with a special remote also configured with importtree=yes. No need to do anything special, just import from the remote, and git-annex will learn what files are on it.
Unfortunately, importtree is not supported by the rsync special remote.
Your use case sounds like it might be one that importtree only remotes would support.
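As a rough sketch of that import-based workflow: this assumes a special remote type that does support importtree (the directory special remote is used here, since rsync does not) and a branch named main. The function name import_and_compare and its two arguments are hypothetical placeholders, not anything from this report.

```shell
#!/bin/sh
# Sketch only. Assumes git-annex is installed and the current directory
# is a git-annex repository with a branch named main.
# import_and_compare and its arguments are hypothetical.
import_and_compare() {
    remote=$1   # name for the new special remote
    dir=$2      # existing directory that already holds the files

    # Set the directory up as a remote that supports both export and import.
    git annex initremote "$remote" type=directory directory="$dir" \
        exporttree=yes importtree=yes encryption=none || return 1

    # Import what is currently in the directory. This updates the remote
    # tracking branch $remote/main; main itself is left alone.
    git annex import main --from "$remote" || return 1

    # Compare the known-good tree with what was found on the remote.
    git diff --stat main "$remote/main"
}
```

Nothing is merged in this sketch, so the imported tree can be inspected before deciding what to do with it.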
Sorry, I must have misunderstood the failure condition (the full project also involves an external special remote with a custom URL scheme and an external backend, which I have now ruled out as the cause). The deletion does not actually happen on the first export, but on a later re-export once something has changed. I have been encountering this after migrating some keys:
This seems to not only happen with a migrate, but also if the file-as-tracked-in-git changes in other ways.
The unexport makes some sense given that the git-tracked-file has changed, but since in my case it is only a backend migration the content is still the same. I think this unexport shouldn't happen at all with --fast though.
That this single file change also removes the other files in the same subdirectory, regardless of whether they are present, is very surprising.
I do not trust the content that is stored in this directory on the HPC system. I am able to reproduce some of the newer files directly from the third-party they were downloaded from, but for older files I get very weird slightly different data. That's why I don't want to import anything from there. Instead, I am building this DataLad Dataset which records the requests necessary to fetch each file from the third-party (via datalad-cds' URLs) and then I am planning to download them all, record their checksums, and fsck the "legacy" data that we have in this non-version-controlled directory. In the end I also want to be able to populate this directory from the dataset with files downloaded via git-annex.
git annex copy --to for export remotes would be nice for that, but as of now I would probably rsync them over myself and update git-annex' content tracking for the export remote myself.
I don't think importtree or even importtree-only would be the right tool for this.
That's what an import does do. You would end up with an imported tree which you could diff with your known correct tree, and see what's different there, and for all files that are stored on the remote correctly, git-annex get would be able to get them from there.
Reproduced that behavior.
What is happening here is that empty directories on the rsync special remote get cleaned up in a separate step after unexport of a file. It is unexporting subdir/test1.bin. And in this situation, due to the use of export --fast, no files have been sent to the export remote yet. So as far as git-annex is concerned, subdir/ there is an empty directory, and so it removes it.
Now, since subdir/test1.bin never did get sent to the remote, its old version does not actually need to be unexported before the new version is sent. Which would have avoided the cleanup and so avoided the problem. (Although I think there are probably good reasons for that unexport to be done, involving multi-writer situations. I would need to refresh my memory about some complicated stuff to say for sure.)
But, the same thing can happen in other ways. For example, consider:
That also deletes any other files that a third party has written to newdir/ on the remote. And in this case, it really does need to unexport newdir/foo.
Note that the directory special remote does not behave the same way; it doesn't need the separate step to remove "empty" directories, and it just cleans up empty directories after removing a file from the export. But rsync does not have a way to delete a directory only when it's empty, which is why git-annex does the separate step to identify and remove empty directories. (From git-annex's perspective.) Also, the adb and webdav special remotes behave the same as rsync.
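The difference between the two cleanup strategies can be illustrated with plain shell tools, outside git-annex. This is only a sketch: the function names cleanup_if_really_empty and cleanup_if_believed_empty are hypothetical, and rmdir and rm -r merely stand in for whatever each special remote actually does.

```shell
#!/bin/sh
# Hypothetical illustration only; this is not git-annex code.

# Directory-remote style: rmdir only succeeds when the directory is
# actually empty, so files written by a third party survive.
cleanup_if_really_empty() {
    rmdir "$1" 2>/dev/null || true
}

# Style described above for rsync: the caller has concluded from its own
# records that the directory is empty, and removes it with its contents.
cleanup_if_believed_empty() {
    rm -r "$1"
}

# Two directories, each holding a file the caller does not know about.
demo=$(mktemp -d)
mkdir "$demo/a" "$demo/b"
touch "$demo/a/third-party-file" "$demo/b/third-party-file"

cleanup_if_really_empty "$demo/a"     # a/ and its file are kept
cleanup_if_believed_empty "$demo/b"   # b/ and its file are removed
```

The second function loses the third-party file because the emptiness check happens in the caller's bookkeeping rather than on the remote's filesystem.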
I don't know that git-annex documents anywhere that an exporttree remote avoids deleting files added to the remote by third parties. I did find it surprising that files with names that git-annex doesn't even know about get deleted in this case. On the other hand, if git-annex is told to export a tree containing file foo, that is going to overwrite any foo written to the remote by a third party, and I think that is expected behavior.
Also note that importtree remotes don't have this problem. Including avoiding export overwriting files written by third parties.
Actually, it is possible to get rsync to delete a directory when it's empty, but preserve it otherwise. So I have implemented that.
The other remotes that I mentioned will still have this behavior. And at least in the case of webdav, I doubt it can be made to only delete empty directories.
Also note that the documentation is clear about this at the API level:
I get the impression that there is quite a bit of complexity with export remotes that makes it dangerous to not let them be managed by git-annex only, and changing that sounds rather complicated. Thanks for looking into it and making some improvements.
Ohh! Thanks for spelling it out. This sounds way more convenient than what I planned with fsck'ing the remote. Populating the directory with rsync and then importing again doesn't sound like too much overhead, should be fine.
Given that, I think I would be happy with import support for rsync.
Ok, I've tagged the todos about import support from rsync, and hopefully that will be able to get implemented.
As for this bug, it seems that at least documentation improvements are needed in order to close it. I have also fixed the adb special remote to avoid the behavior, which leaves webdav and any external special remotes that might have the behavior.