Recent comments posted to this site:
I get the impression that there is quite a bit of complexity with export remotes that makes it dangerous not to let them be managed exclusively by git-annex, and changing that sounds rather complicated. Thanks for looking into it and making some improvements.
> I am planning to download them all, record their checksums
That's what an import does do.. You would end up with an imported tree which you could diff with your known correct tree, and see what's different there, and for all files that are stored on the remote correctly, git-annex get would be able to get them from there.
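For illustration, here is a rough sketch of what that workflow could look like. It uses a directory special remote, since that type already supports importtree (the rsync special remote does not yet, which is what this request is about); the remote name and path are made up:

    # point an import/export remote at the untrusted directory
    git annex initremote hpc type=directory directory=/path/to/legacy-data \
        exporttree=yes importtree=yes encryption=none
    # import builds a tree of what is actually stored on the remote
    git annex import main --from hpc
    # compare the imported tree with the known-correct branch
    git diff main hpc/main
    # for files whose stored copy matches, content can be fetched from there
    git annex get --from hpc .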
Ohh! Thanks for spelling it out. This sounds way more convenient than what I planned with fsck'ing the remote. Populating the directory with rsync and then importing again doesn't sound like too much overhead, should be fine.
Given that, I think I would be happy with import support for rsync.
Actually, it is possible to get rsync to delete a directory when it's empty, but preserve it otherwise. So I have implemented that.
The other remotes that I mentioned will still have this behavior. And at least in the case of webdav, I doubt it can be made to only delete empty directories.
Also note that the documentation is clear about this at the API level:
    REMOVEEXPORTDIRECTORY Directory
    [...]
    Typically the directory will be empty, but it could possibly contain
    files or other directories, and it's ok to remove those.
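I haven't dug into exactly how git-annex implements this, but here is a standalone sketch of the rsync behavior it can rely on: sync from a source tree that omits the directory, while excluding the directory's contents so --delete cannot touch them. rsync then removes the directory if it is empty, and otherwise refuses to delete the non-empty directory and leaves it (and any third-party files in it) in place. All paths here are hypothetical:

    mkdir -p src dst/subdir
    touch dst/subdir/third-party-file
    # subdir is eligible for deletion, but its contents are excluded,
    # so they are protected from --delete; rsync cannot delete the
    # non-empty directory and leaves it alone
    rsync -r --delete --include='/subdir/' --exclude='*' src/ dst/
    # once subdir is empty, the same command removes it
    rm dst/subdir/third-party-file
    rsync -r --delete --include='/subdir/' --exclude='*' src/ dst/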
Reproduced that behavior.
What is happening here is that empty directories on the rsync special
remote get cleaned up in a separate step after unexport of a file. It is
unexporting subdir/test1.bin. And in this situation, due to the use of
export --fast, no files have been sent to the export remote yet. So as
far as git-annex is concerned, subdir/ there is an empty directory, and
so it removes it.
Now, since subdir/test1.bin never did get sent to the remote, its old version
does not actually need to be unexported before the new version is sent. Which
would have avoided the cleanup and so avoided the problem. (Although I think
there are probably good reasons for that unexport to be done, involving
multi-writer situations. I would need to refresh my memory about some
complicated stuff to say for sure.)
But, the same thing can happen in other ways. For example, consider:
    mkdir newdir
    touch newdir/foo
    git-annex add newdir/foo
    git commit -m add
    git-annex export master --to rsync
    git rm newdir/foo
    git commit -m rm
    git-annex export master --to rsync
That also deletes any other files that a third party has written to
newdir/ on the remote. And in this case, it really does need to
unexport newdir/foo.
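To make the third-party scenario concrete: if, between the two exports, someone writes a file directly into newdir/ on the remote (the path below is hypothetical), the second export's cleanup of newdir/ takes that file with it:

    # third party writes straight into the export location
    touch /path/to/rsync-export/newdir/bar
    # the second "git-annex export master --to rsync" above then
    # unexports newdir/foo and removes newdir/, bar included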
Note that the directory special remote does not behave the same way; it doesn't need a separate step to remove "empty" directories, it simply cleans up empty directories after removing a file from the export. But rsync does not have a way to delete a directory only when it's empty, which is why git-annex does the separate step to identify and remove directories that are empty from git-annex's perspective. Also, the adb and webdav special remotes behave the same as rsync.
I don't know that git-annex documents anywhere that an exporttree remote
avoids deleting files added to the remote by third parties. I did find it
surprising that files with names that git-annex doesn't even know about get
deleted in this case. On the other hand, if git-annex is told to export a tree
containing file foo, that is going to overwrite any foo written to the
remote by a third party, and I think that is expected behavior.
Also note that importtree remotes don't have this problem, including avoiding an export overwriting files written by third parties.
> I am planning to download them all, record their checksums
>
> That's what an import does do.. You would end up with an imported tree
> which you could diff with your known correct tree, and see what's different
> there, and for all files that are stored on the remote correctly,
> git-annex get would be able to get them from there.

> You could do that with a special remote also configured with importtree=yes. No need to do anything special, just import from the remote, and git-annex will learn what files are on it. [...] Your use case sounds like it might be one that importtree only remotes would support.
I do not trust the content that is stored in this directory on the HPC system. I am able to reproduce some of the newer files directly from the third party they were downloaded from, but for older files I get very weird, slightly different data. That's why I don't want to import anything from there. Instead, I am building this DataLad Dataset which records the requests necessary to fetch each file from the third party (via datalad-cds' URLs), and then I am planning to download them all, record their checksums, and fsck the "legacy" data that we have in this non-version-controlled directory. In the end I also want to be able to populate this directory from the dataset with files downloaded via git-annex. git annex copy --to for export remotes would be nice for that, but as of now I would probably rsync the files over myself and then update git-annex's content tracking for the export remote by hand.
I don't think importtree or even importtree-only would be the right tool for this.
Sorry, I must have misunderstood the failure condition (the full project also involves an external special remote with a custom URL scheme and an external backend, which I have now ruled out as the cause). It is actually not the first export, but rather a later re-export once something has changed. Where I have been encountering this is after migrating some keys:
    icg149@icg1911:~/Playground$ datalad create test-export-repo
    create(ok): /home/icg149/Playground/test-export-repo (dataset)
    icg149@icg1911:~/Playground$ cd test-export-repo
    icg149@icg1911:~/Playground/test-export-repo$ mkdir subdir
    icg149@icg1911:~/Playground/test-export-repo$ head -c 10K /dev/urandom > subdir/test1.bin
    icg149@icg1911:~/Playground/test-export-repo$ head -c 10K /dev/urandom > subdir/test2.bin
    icg149@icg1911:~/Playground/test-export-repo$ head -c 10K /dev/urandom > subdir/test3.bin
    icg149@icg1911:~/Playground/test-export-repo$ datalad save
    add(ok): subdir/test1.bin (file)
    add(ok): subdir/test2.bin (file)
    add(ok): subdir/test3.bin (file)
    save(ok): . (dataset)
    action summary:
      add (ok: 3)
      save (ok: 1)
    icg149@icg1911:~/Playground/test-export-repo$ mkdir -p ../test-export-dir/subdir
    icg149@icg1911:~/Playground/test-export-repo$ cp subdir/test*.bin ../test-export-dir/subdir
    icg149@icg1911:~/Playground/test-export-repo$ git annex initremote rsync type=rsync rsyncurl=../test-export-dir exporttree=yes encryption=none autoenable=true
    initremote rsync ok
    (recording state in git...)
    icg149@icg1911:~/Playground/test-export-repo$ git annex drop --force subdir/test2.bin
    drop subdir/test2.bin ok
    (recording state in git...)
    icg149@icg1911:~/Playground/test-export-repo$ git annex export --fast main --to rsync
    (recording state in git...)
    icg149@icg1911:~/Playground/test-export-repo$ tree ../test-export-dir/
    ../test-export-dir/
    └── subdir
        ├── test1.bin
        ├── test2.bin
        └── test3.bin

    2 directories, 3 files
    icg149@icg1911:~/Playground/test-export-repo$ git annex migrate --backend SHA256E subdir/test1.bin
    migrate subdir/test1.bin (checksum...) (checksum...) ok
    (recording state in git...)
    icg149@icg1911:~/Playground/test-export-repo$ datalad save
    save(ok): . (dataset)
    icg149@icg1911:~/Playground/test-export-repo$ git annex export --fast main --to rsync
    unexport rsync subdir/test1.bin ok
    (recording state in git...)
    icg149@icg1911:~/Playground/test-export-repo$ tree ../test-export-dir/
    ../test-export-dir/

    0 directories, 0 files
This seems to happen not only with a migrate, but also if the file as tracked in git changes in other ways.
The unexport makes some sense given that the git-tracked file has changed, but since in my case it is only a backend migration, the content is still the same. I think this unexport shouldn't happen at all with --fast, though.
That this single file change also removes the other files in the same subdirectory, regardless of whether they are present or not, is very surprising.
Added a --cpus option, and avoided a high -J value increasing the number of capabilities beyond the available number of CPUs.
I think you might want to use something like:

    git-annex p2phttp --cpus=2 --jobs=100
Seems I was misremembering details of how ghc's "capabilities" work. From its manual:
> Each capability can run one Haskell thread at a time, so the number of capabilities is equal to the number of Haskell threads that can run physically in parallel. A capability is animated by one or more OS threads; the runtime manages a pool of OS threads for each capability, so that if a Haskell thread makes a foreign call (see Multi-threading and the FFI) another OS thread can take over that capability.
Currently git-annex raises the number of capabilities to the -J value.
Probably the thread pool starts at 2 threads to have one spare preallocated for the first FFI call, explaining why each -J doubles the number of OS threads.
I think it would make sense to have a separate option that controls the number
of capabilities. Then you could set that to eg 2, and set a large -J value,
in order to have git-annex p2phttp allow serving a large number of concurrent
requests, threaded on only 2 cores.
Also, it does not seem to make sense for the default number of capabilities, with a high -J value, to exceed the number of cores. As you noticed, each capability uses some FDs, for eventfd, eventpoll, and I'm not sure what else.
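One rough way to observe the effect on a Linux system, assuming a repository already set up for serving over p2phttp (the option values are just the example numbers from above):

    # start the server with few capabilities but many job slots
    git-annex p2phttp --cpus=2 --jobs=100 &
    pid=$!
    sleep 2
    # number of OS threads the runtime has spawned
    ls /proc/$pid/task | wc -l
    # open file descriptors (eventfd/eventpoll entries show up per capability)
    ls /proc/$pid/fd | wc -l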