Recent comments posted to this site:

Same issue

I can't follow the link as it appears to be broken, so I couldn't check whether there was a fix for the high RAM usage.

My borg repo is only about 100 GB, but it has a large amount of files:

local annex keys: 850128
local annex size: 117.05 gigabytes
annexed files in working tree: 1197089

The first sync action after my first borg backup (only one snapshot) is currently using about 26 GB of my 32 GB of RAM and possibly still climbing.
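
For reference, this is roughly the kind of setup that triggers it for me (paths and remote name here are placeholders, not my actual configuration):

borg init --encryption=none /path/to/borgrepo
borg create /path/to/borgrepo::snapshot1 ~/annexrepo
cd ~/annexrepo
git annex initremote borg type=borg borgrepo=/path/to/borgrepo
git annex sync borg    # this first scan of the borg repository is where the RAM usage climbs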

Comment by nadir
comment 2
Ideally there should be no locking for the entire duration of get, since there could be hundreds of clients trying to get that file. If locking is needed to update the git-annex branch, it would be better for the changes to be journalled and flushed at once, or for the lock to be held only for the git-annex branch edit (and it is better anyway to debounce multiple operations).
Comment by yarikoptic
comment 7

I get the impression that there is quite a bit of complexity with export remotes that makes it dangerous to not let them be managed by git-annex only, and changing that sounds rather complicated. Thanks for looking into it and making some improvements.

    I am planning to download them all, record their checksums

    That's what an import does do. You would end up with an imported tree which you could diff with your known correct tree, and see what's different there, and for all files that are stored on the remote correctly, git-annex get would be able to get them from there.

Ohh! Thanks for spelling it out. This sounds way more convenient than what I planned with fsck'ing the remote. Populating the directory with rsync and then importing again doesn't sound like too much overhead, so that should be fine.

Given that, I think I would be happy with import support for rsync.
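
For the record, the workflow I have in mind would then look roughly like this (remote name and URL are made up, and this assumes rsync gains importtree support; today the same flow already works with e.g. a directory special remote):

git annex initremote hpc type=rsync rsyncurl=user@hpc:/data/legacy exporttree=yes importtree=yes encryption=none
git annex import main --from hpc    # record what is actually on the remote, with checksums
git diff main hpc/main              # compare against the known-good tree
git annex get --from hpc            # fetch the files that are stored there correctly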

Comment by matrss
comment 6

Actually, it is possible to get rsync to delete a directory when it's empty, but preserve it otherwise. So I have implemented that.

The other remotes that I mentioned will still have this behavior. And at least in the case of webdav, I doubt it can be made to only delete empty directories.

Also note that the documentation is clear about this at the API level:

`REMOVEEXPORTDIRECTORY Directory`
[...]
Typically the directory will be empty, but it could possibly contain
files or other directories, and it's ok to remove those.
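
For an external special remote, that part of the protocol is a simple exchange; something like this (directory name made up):

REMOVEEXPORTDIRECTORY subdir
REMOVEEXPORTDIRECTORY-SUCCESS

The remote replies `REMOVEEXPORTDIRECTORY-SUCCESS` (including when the directory did not exist) or `REMOVEEXPORTDIRECTORY-FAILURE`.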
Comment by joey
comment 5

Reproduced that behavior.

What is happening here is that empty directories on the rsync special remote get cleaned up in a separate step after unexport of a file. It is unexporting subdir/test1.bin. And in this situation, due to the use of export --fast, no files have been sent to the export remote yet. So as far as git-annex is concerned, subdir/ there is an empty directory, and so it removes it.

Now, since subdir/test1.bin never did get sent to the remote, its old version does not actually need to be unexported before the new version is sent. Which would have avoided the cleanup and so avoided the problem. (Although I think there are probably good reasons for that unexport to be done, involving multi-writer situations. I would need to refresh my memory about some complicated stuff to say for sure.)

But, the same thing can happen in other ways. For example, consider:

mkdir newdir
touch newdir/foo
git-annex add newdir/foo
git commit -m add
git-annex export master --to rsync
git rm newdir/foo
git commit -m rm
git-annex export master --to rsync

That also deletes any other files that a third party has written to newdir/ on the remote. And in this case, it really does need to unexport newdir/foo.

Note that the directory special remote does not behave the same way; it doesn't need the separate step to remove "empty" directories, and it just cleans up empty directories after removing a file from the export. But rsync does not have a way to delete a directory only when it's empty, which is why git-annex does the separate step to identify and remove empty directories. (From git-annex's perspective.) Also, the adb and webdav special remotes behave the same as rsync.
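
The behavior wanted here is essentially that of a plain rmdir, which only succeeds when the directory is empty:

rmdir subdir    # removes subdir if it is empty, fails (leaving it alone) otherwise

The directory special remote can do exactly that locally; over rsync, git-annex instead has to decide for itself which directories it considers empty and remove them in the separate pass.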

I don't know that git-annex documents anywhere that an exporttree remote avoids deleting files added to the remote by third parties. I did find it surprising that files with names that git-annex doesn't even know about get deleted in this case. On the other hand, if git-annex is told to export a tree containing file foo, that is going to overwrite any foo written to the remote by a third party, and I think that is expected behavior.

Also note that importtree remotes don't have this problem; that includes avoiding export overwriting files written by third parties.
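
A rough sketch of what that looks like (remote name and path made up), using a directory special remote configured for both import and export, so that changes made on the remote by a third party are picked up rather than silently overwritten:

git annex initremote mirror type=directory directory=/srv/mirror exporttree=yes importtree=yes encryption=none
git annex import main --from mirror    # learn about anything a third party changed on the remote
git annex export main --to mirror      # export, now that those changes are known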

Comment by joey
comment 4

    I am planning to download them all, record their checksums

That's what an import does do. You would end up with an imported tree which you could diff with your known correct tree, and see what's different there, and for all files that are stored on the remote correctly, git-annex get would be able to get them from there.

Comment by joey
comment 3

    You could do that with a special remote also configured with importtree=yes. No need to do anything special, just import from the remote, and git-annex will learn what files are on it. [...] Your use case sounds like it might be one that importtree only remotes would support.

I do not trust the content that is stored in this directory on the HPC system. I am able to reproduce some of the newer files directly from the third party they were downloaded from, but for older files I get very weird, slightly different data. That's why I don't want to import anything from there. Instead, I am building this DataLad dataset which records the requests necessary to fetch each file from the third party (via datalad-cds' URLs). I am then planning to download them all, record their checksums, and fsck the "legacy" data that we have in this non-version-controlled directory. In the end I also want to be able to populate this directory from the dataset with files downloaded via git-annex. git annex copy --to for export remotes would be nice for that, but as of now I would probably rsync them over and update git-annex's content tracking for the export remote myself.

I don't think importtree or even importtree-only would be the right tool for this.

Comment by matrss
comment 2

Sorry, I must have misunderstood the failure condition (the full project also involves an external special remote with a custom URL scheme and an external backend, which I have now ruled out as the cause). It is actually not the first export, but rather a later re-export once something has changed. Where I have been encountering it is after migrating some keys:

icg149@icg1911:~/Playground$ datalad create test-export-repo
create(ok): /home/icg149/Playground/test-export-repo (dataset)
icg149@icg1911:~/Playground$ cd test-export-repo
icg149@icg1911:~/Playground/test-export-repo$ mkdir subdir
icg149@icg1911:~/Playground/test-export-repo$ head -c 10K /dev/urandom > subdir/test1.bin
icg149@icg1911:~/Playground/test-export-repo$ head -c 10K /dev/urandom > subdir/test2.bin
icg149@icg1911:~/Playground/test-export-repo$ head -c 10K /dev/urandom > subdir/test3.bin
icg149@icg1911:~/Playground/test-export-repo$ datalad save
add(ok): subdir/test1.bin (file)
add(ok): subdir/test2.bin (file)
add(ok): subdir/test3.bin (file)
save(ok): . (dataset)
action summary:
  add (ok: 3)
  save (ok: 1)
icg149@icg1911:~/Playground/test-export-repo$ mkdir -p ../test-export-dir/subdir
icg149@icg1911:~/Playground/test-export-repo$ cp subdir/test*.bin ../test-export-dir/subdir
icg149@icg1911:~/Playground/test-export-repo$ git annex initremote rsync type=rsync rsyncurl=../test-export-dir exporttree=yes encryption=none autoenable=true
initremote rsync ok
(recording state in git...)
icg149@icg1911:~/Playground/test-export-repo$ git annex drop --force subdir/test2.bin
drop subdir/test2.bin ok
(recording state in git...)
icg149@icg1911:~/Playground/test-export-repo$ git annex export --fast main --to rsync
(recording state in git...)
icg149@icg1911:~/Playground/test-export-repo$ tree ../test-export-dir/
../test-export-dir/
└── subdir
    ├── test1.bin
    ├── test2.bin
    └── test3.bin

2 directories, 3 files
icg149@icg1911:~/Playground/test-export-repo$ git annex migrate --backend SHA256E subdir/test1.bin
migrate subdir/test1.bin (checksum...) (checksum...) ok
(recording state in git...)
icg149@icg1911:~/Playground/test-export-repo$ datalad save
save(ok): . (dataset)
icg149@icg1911:~/Playground/test-export-repo$ git annex export --fast main --to rsync
unexport rsync subdir/test1.bin ok
(recording state in git...)
icg149@icg1911:~/Playground/test-export-repo$ tree ../test-export-dir/
../test-export-dir/

0 directories, 0 files

This seems to happen not only with a migrate, but also if the file-as-tracked-in-git changes in other ways.

The unexport makes some sense given that the git-tracked file has changed, but since in my case it is only a backend migration, the content is still the same. I think this unexport shouldn't happen at all with --fast, though.
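
Just to illustrate that (commands only as a sketch): the migrate changes nothing but the key the file is tracked under, so the content on disk is identical before and after:

git annex lookupkey subdir/test1.bin    # shows a different key after the migrate
sha256sum subdir/test1.bin              # same checksum before and after the migrate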

That this single file change also removes the other files in the same subdirectory, regardless of whether they are present or not, is very surprising.

Comment by matrss