Recent comments posted to this site:

comment 1

This is due to the assistant not supporting submodules. Nothing has ever been done to make it support them.

When git check-ignore --stdin is passed a path in a submodule, it exits. We can see this happen near the top of the log:

    fatal: Pathspec 'code/containers/.codespellrc' is in submodule 'code/containers'

    git check-ignore EOF: user error

The subsequent "resource vanished (Broken pipe)" errors occur each time git-annex tries to talk to git check-ignore after that.

Indeed, looking at the source code to check-ignore, if it's passed a path inside a submodule, it errors out, and so won't be listening to stdin for any more paths:

    joey@darkstar:~/tmp/t>git check-ignore --stdin
    r/x
    fatal: Pathspec 'r/x' is in submodule 'r'
    - exit 128

And I was able to reproduce this by having a submodule with a file in it, and starting the assistant.

In some cases, the assistant still added files despite check-ignore having crashed. (It will even add gitignored files when check-ignore has crashed.) In other cases it did not. The problem probably extends beyond check-ignore to staging files as well. Eg, "git add submodule/foo bar" will error out on the file in the submodule and never get to the point of adding the second file.
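For illustration, a minimal sketch of that git add failure mode, assuming a checkout that already contains a submodule (the names "sub", "foo", and "bar" are hypothetical):

    # "sub" is a submodule; "bar" is an ordinary untracked file next to it
    touch sub/foo bar
    git add sub/foo bar
    # fatal: Pathspec 'sub/foo' is in submodule 'sub'
    # git add aborts on the submodule path, so bar is never staged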

Fixing this would need an inexpensive way to query git about whether a file is in a submodule. Passing the files that the assistant gathers through git ls-files --modified --others might be the only way to do that.

Using that at all efficiently would need some other changes, because it would have to come before the ignore check, which currently runs for each file event. The ignore check would need to be moved to the point where a set of files has been gathered, so ls-files can be run once on the whole set.
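A rough sketch of what that batching could look like from the command line (gathered.txt is a hypothetical list of paths collected from file events, and this assumes ls-files simply omits paths that are inside a submodule rather than erroring out the way check-ignore does):

    # run ls-files once over the whole gathered batch, then do the ignore
    # check on the survivors in one pass
    git ls-files --modified --others -z -- $(cat gathered.txt) \
        | git check-ignore -z --stdin
    # output: the subset of surviving paths that are gitignored; paths
    # containing whitespace would need NUL-separated handling throughout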

Comment by joey
comment 4

    ideally there should be no locking for the entire duration of get since there could be hundreds of clients trying to get that file

It's somewhat more complex than that, but git-annex's locking does take concurrency into account.

The transfer locking specifically is there to avoid issues like git-annex get of the same file being run in the same repo in two different terminals. So it intentionally does not allow concurrency, except in this particular case where multiple clients are downloading.
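For concreteness, the situation the transfer lock guards against (the file name is made up):

    # same repository, two terminals
    git annex get big.iso    # terminal 1: starts the download
    git annex get big.iso    # terminal 2: does not start a second download of the
                             # same key; it reports the transfer is already in progress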

Comment by joey
comment 3

Reproducer worked for me.

This seems specific to using a local git remote; it will not happen over ssh.

In Remote.Git, copyFromRemote calls runTransfer in the remote repository. That should be alwaysRunTransfer, which is what is usually used when git-annex is running as a server to send files, and which avoids this problem.

Comment by joey
Same issue

I can't follow the link as it appears to be broken, so I couldn't check whether there was a fix for the high RAM usage.

My borg repo is only about 100 GB, but it contains a large number of files:

    local annex keys: 850128
    local annex size: 117.05 gigabytes
    annexed files in working tree: 1197089

The first sync action after my first borg backup (only one snapshot) is currently using about 26 GB of my 32 GB of RAM and possibly still climbing.

Comment by nadir
comment 2
ideally there should be no locking for the entire duration of get since there could be hundreds of clients trying to get that file. If locking is needed to update the git-annex branch, it had better be journalled + flushed at once, or locked only for the git-annex branch edit (in any case, it is better to debounce multiple operations)
Comment by yarikoptic
comment 7

I get the impression that there is quite a bit of complexity with export remotes that makes it dangerous to let anything other than git-annex manage them, and changing that sounds rather complicated. Thanks for looking into it and making some improvements.

    I am planning to download them all, record their checksums

    That's what an import does do. You would end up with an imported tree which you could diff with your known correct tree to see what's different there, and for all files that are stored on the remote correctly, git-annex get would be able to get them from there.

Ohh! Thanks for spelling it out. This sounds way more convenient than what I planned with fsck'ing the remote. Populating the directory with rsync and then importing again doesn't sound like too much overhead, should be fine.
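A sketch of how that could look, assuming the rsync remote's content is first copied into a local directory and a directory special remote with importtree enabled is set up over it (the remote name "copy" and the paths are illustrative):

    rsync -a remotehost:/path/to/export/ /mnt/copy/
    git annex initremote copy type=directory directory=/mnt/copy \
        encryption=none exporttree=yes importtree=yes
    git annex import master --from copy    # downloads and checksums what is there
    git diff master copy/master            # compare against the known-correct tree
    git annex get --from copy              # fetch whatever the copy holds intact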

Given that, I think I would be happy with import support for rsync.

Comment by matrss
comment 6

Actually, it is possible to get rsync to delete a directory when it's empty, but preserve it otherwise. So I have implemented that.

The other remotes that I mentioned will still have this behavior. And at least in the case of webdav, I doubt it can be made to only delete empty directories.

Also note that the documentation is clear about this at the API level:

`REMOVEEXPORTDIRECTORY Directory`
[...]
Typically the directory will be empty, but it could possibly contain
files or other directories, and it's ok to remove those.
Comment by joey
comment 5

Reproduced that behavior.

What is happening here is that empty directories on the rsync special remote get cleaned up in a separate step after unexport of a file. It is unexporting subdir/test1.bin. And in this situation, due to the use of export --fast, no files have been sent to the export remote yet. So as far as git-annex is concerned, subdir/ there is an empty directory, and so it removes it.

Now, since subdir/test1.bin never did get sent to the remote, its old version does not actually need to be unexported before the new version is sent, which would have avoided the cleanup and so avoided the problem. (Although I think there are probably good reasons for that unexport to be done, involving multi-writer situations; I would need to refresh my memory about some complicated stuff to say for sure.)

But, the same thing can happen in other ways. For example, consider:

    mkdir newdir
    touch newdir/foo
    git-annex add newdir/foo
    git commit -m add
    git-annex export master --to rsync
    git rm newdir/foo
    git commit -m rm
    git-annex export master --to rsync

That also deletes any other files that a third party has written to newdir/ on the remote. And in this case, it really does need to unexport newdir/foo.

Note that the directory special remote does not behave the same way; it doesn't need the separate step to remove "empty" directories, and instead just cleans up empty directories after removing a file from the export. But rsync does not have a way to delete a directory only when it's empty, which is why git-annex does the separate step to identify and remove directories that are, from git-annex's perspective, empty. Also, the adb and webdav special remotes behave the same as rsync.

I don't know that git-annex documents anywhere that an exporttree remote avoids deleting files added to the remote by third parties. I did find it surprising that files with names git-annex doesn't even know about get deleted in this case. On the other hand, if git-annex is told to export a tree containing file foo, that is going to overwrite any foo written to the remote by a third party, and I think that is expected behavior.

Also note that importtree remotes don't have this problem; they even avoid export overwriting files written by third parties.

Comment by joey