Recent comments posted to this site:
This is due to the assistant not supporting submodules. Nothing has ever been done to make it support them.
When git check-ignore --stdin is passed a path in a submodule, it exits.
We can see this happen near the top of the log:
fatal: Pathspec 'code/containers/.codespellrc' is in submodule 'code/containers'
git check-ignore EOF: user error
The subsequent "resource vanished (Broken pipe)" errors occur each time git-annex tries to talk to git check-ignore.
Indeed, looking at the source code of check-ignore, if it's passed a path inside a submodule, it errors out, and so won't be listening on stdin for any more paths:
joey@darkstar:~/tmp/t>git check-ignore --stdin
r/x
fatal: Pathspec 'r/x' is in submodule 'r'
- exit 128
And I was able to reproduce this by having a submodule with a file in it, and starting the assistant.
In some cases, the assistant still added files despite check-ignore having crashed. (It will even add gitignored files when check-ignore has crashed.) In other cases not. The problem probably extends beyond check-ignore to also staging files. Eg, "git add submodule/foo bar" will error out on the file in the submodule and not ever get to the point of adding the second file.
Fixing this would need an inexpensive way to query git about whether a file is in a submodule. Passing the files that the assistant gathers through git ls-files --modified --others might be the only way to do that.
Using that at all efficiently would need some other changes, because it needs to come before the ignore check, which it currently does for each file event. The ignore check would need to be moved to the point where a set of files has been gathered, so ls-files can be run once on the set of files.
ideally there should be no locking for the entire duration of get since there could be hundreds of clients trying to get that file
It's somewhat more complex than that, but git-annex's locking does take concurrency into account.
The transfer locking specifically is there to avoid problems like git-annex get of the same file being run in the same repo in 2 different terminals. So it intentionally does not allow concurrency, except in this particular case where multiple clients are downloading.
Reproducer worked for me.
This seems specific to using a local git remote; it will not happen over ssh.
In Remote.Git, copyFromRemote calls runTransfer in the remote repository.
That should be alwaysRunTransfer, as is usually used when git-annex is running as a server to send files, which avoids this problem.
I can't follow the link as it appears to be broken, so I couldn't check whether there was a fix for the high RAM usage.
My borg repo is only about 100 GB, but it has a large amount of files:
local annex keys: 850128
local annex size: 117.05 gigabytes
annexed files in working tree: 1197089
The first sync action after my first borg backup (only one snapshot) is currently using about 26 GB of my 32 GB of RAM and possibly still climbing.
get since there could be hundreds of clients trying to get that file. If locking is needed to update git-annex branch, better be journalled + flushed at once or locked only for git-annex branch edit (anyways better to debounce multiple operations)
I get the impression that there is quite a bit of complexity with export remotes that makes it dangerous to not let them be managed by git-annex only, and changing that sounds rather complicated. Thanks for looking into it and making some improvements.
I am planning to download them all, record their checksums
That's what an import does do. You would end up with an imported tree which you could diff with your known correct tree to see what's different, and for all files that are stored on the remote correctly, git-annex get would be able to get them from there.
Ohh! Thanks for spelling it out. This sounds way more convenient than what I planned with fsck'ing the remote. Populating the directory with rsync and then importing again doesn't sound like too much overhead, should be fine.
Given that, I think I would be happy with import support for rsync.
Actually, it is possible to get rsync to delete a directory when it's empty, but preserve it otherwise. So I have implemented that.
The other remotes that I mentioned will still have this behavior. And at least in the case of webdav, I doubt it can be made to only delete empty directories.
Also note that the documentation is clear about this at the API level:
`REMOVEEXPORTDIRECTORY Directory`
[...]
Typically the directory will be empty, but it could possibly contain
files or other directories, and it's ok to remove those.
Reproduced that behavior.
What is happening here is that empty directories on the rsync special
remote get cleaned up in a separate step after unexport of a file. It is
unexporting subdir/test1.bin. And in this situation, due to the use of
export --fast, no files have been sent to the export remote yet. So as
far as git-annex is concerned, subdir/ there is an empty directory, and
so it removes it.
Now, since subdir/test1.bin never did get sent to the remote, its old version
does not actually need to be unexported before the new version is sent. Which
would have avoided the cleanup and so avoided the problem. (Although I think
there are probably good reasons for that unexport to be done, involving
multi-writer situations. I would need to refresh my memory about some
complicated stuff to say for sure.)
But, the same thing can happen in other ways. For example, consider:
mkdir newdir
touch newdir/foo
git-annex add newdir/foo
git commit -m add
git-annex export master --to rsync
git rm newdir/foo
git commit -m rm
git-annex export master --to rsync
That also deletes any other files that a third party has written to
newdir/ on the remote. And in this case, it really does need to
unexport newdir/foo.
Note that the directory special remote does not behave the same way: it doesn't need a separate step, because it cleans up empty directories as part of removing a file from the export. But rsync has no way to delete a directory only when it's empty, which is why git-annex does the separate step of identifying and removing directories that are empty (from git-annex's perspective). The adb and webdav special remotes behave the same as rsync.
I don't know that git-annex documents anywhere that an exporttree remote
avoids deleting files added to the remote by third parties. I did find it
surprising that files with names that git-annex doesn't even know about get
deleted in this case. On the other hand, if git-annex is told to export a tree
containing file foo, that is going to overwrite any foo written to the
remote by a third party, and I think that is expected behavior.
Also note that importtree remotes don't have this problem, and they even avoid export overwriting files written by third parties.