I was looking into why `annex get` and `annex copy` between local clones on NFS mounts don't utilize NFS 4.2's server-side copy feature, which would be pretty relevant for a setup like ours (institutional compute cluster; big datalad datasets). This seems to boil down to not calling `copy_file_range`. However, `cp` generally does call `copy_file_range`, so that seemed confusing. Turns out the culprit is `--reflink=always`, which does not work as expected on ZFS. `--reflink=auto` does, though. A summary of how that is can be found here: https://github.com/openzfs/zfs/pull/13392#issuecomment-1742172842
I am not sure why annex insists on `always` rather than `auto`, so I'm not sure whether the solution actually would be to change that. Reading old issues, it seems the reason was to let annex handle the fallback itself, which is exactly the problem in the ZFS case. Changing (back?) to `auto` may be an issue in other cases - I don't know. If annex's fallback when `cp --reflink=always` fails would end up calling `copy_file_range`, that would still solve the issue, though, as NFS would then end up performing a server-side copy rather than transferring big files back and forth.
Just to be clear: it's specifically ZFS via NFS on Linux that's the issue here.
P.S.: Didn't want to call this a bug, mostly b/c the "real bug" isn't in annex exactly (see link above), so putting it here.
The reason for `--reflink=always` is that git-annex wants it to fail when reflink is not supported and the copy is going to be slow. Then it falls back to copying the file itself, which allows an interrupted copy of a large file to be resumed, rather than restarted from the beginning as `cp` would do when it's not making a reflink.
So, at first it seemed to me that the solution would need to involve git-annex using `copy_file_range` itself. But git-annex would like to checksum the file as it's copying it (unless annex.verify is disabled), in order to avoid needing to re-read it to hash it after the fact, which would double the disk IO in many cases. Using `copy_file_range` by default would prevent git-annex from doing that.
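To make the checksumming point concrete, here is a minimal sketch of hashing while copying through userspace, which is exactly what an in-kernel copy would bypass. This is not git-annex's actual code; it assumes the cryptonite package, and `copyAndHash` is a hypothetical name:

```haskell
import Crypto.Hash (Context, Digest, SHA256, hashFinalize, hashInit, hashUpdate)
import qualified Data.ByteString as B
import System.IO (IOMode (..), withBinaryFile)

-- Copy src to dst through userspace, feeding each chunk into the hash
-- as it passes through, so the file never needs a second read to verify.
copyAndHash :: FilePath -> FilePath -> IO (Digest SHA256)
copyAndHash src dst =
  withBinaryFile src ReadMode $ \hIn ->
    withBinaryFile dst WriteMode $ \hOut ->
      let loop ctx = do
            chunk <- B.hGetSome hIn (64 * 1024) -- 64 KiB buffers
            if B.null chunk
              then pure (hashFinalize ctx) -- EOF: digest is ready
              else do
                B.hPut hOut chunk
                loop (hashUpdate ctx chunk)
       in loop hashInit
```

With `copy_file_range` the data never enters user space, so there is nothing to feed the hash, and verification would require a second full read.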
So it needs to either be probed, or be a config setting. And whichever way git-annex determines it, it may as well use `cp --reflink=auto` then rather than using `copy_file_range` itself. I'd certainly rather avoid a config setting if I can. But if this is specific to NFS on ZFS, I don't know what would be a good way to probe for that? Or is this happening on NFS when not on ZFS as well?
The problem is indeed specific to the combination of NFS and ZFS. Not sure how to properly probe for that. But since this is relevant only for copies across ZFS over NFS4.2 mounts, a config option would be OK, I think. It's not that something isn't working w/o it, it's just unnecessarily slow.
A config setting may be unnecessary. If git-annex tried to use `copy_file_range` itself, that would fail with EOPNOTSUPP or EXDEV when not supported. Then git-annex could use `cp --reflink=always` as a fallback.
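A rough sketch of that probe-and-fallback idea, assuming a hand-rolled FFI binding to the glibc `copy_file_range` wrapper (this is not git-annex code; `kernelCopy` and `fastCopy` are hypothetical names, and EINTR handling is omitted):

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

import Foreign.C.Error (eOPNOTSUPP, eXDEV, getErrno, throwErrno)
import Foreign.C.Types (CInt (..), CSize (..), CSsize (..), CUInt (..))
import Foreign.Ptr (Ptr, nullPtr)
import System.IO (IOMode (..), openBinaryFile)
import System.Posix.IO (closeFd, handleToFd)
import System.Posix.Types (COff, Fd (..))
import System.Process (callProcess)

-- Raw binding to copy_file_range(2) via the glibc wrapper. Passing
-- nullPtr for the offsets makes the kernel use and advance each fd's
-- own file offset.
foreign import ccall unsafe "copy_file_range"
  c_copy_file_range
    :: CInt -> Ptr COff -> CInt -> Ptr COff -> CSize -> CUInt -> IO CSsize

-- Copy everything from one fd to another inside the kernel. Returns
-- False when the syscall is unsupported here, so the caller can fall back.
kernelCopy :: Fd -> Fd -> IO Bool
kernelCopy (Fd src) (Fd dst) = loop
  where
    chunk = 128 * 1024 * 1024 -- ask for up to 128 MiB per call
    loop = do
      n <- c_copy_file_range src nullPtr dst nullPtr chunk 0
      if n > 0
        then loop -- partial copy; keep going
        else
          if n == 0
            then pure True -- EOF reached, copy complete
            else do
              errno <- getErrno
              if errno == eOPNOTSUPP || errno == eXDEV
                then pure False -- probe failed; fall back
                else throwErrno "copy_file_range"

-- Try the in-kernel copy first; fall back to cp when it is unsupported.
fastCopy :: FilePath -> FilePath -> IO ()
fastCopy src dst = do
  fdIn <- handleToFd =<< openBinaryFile src ReadMode
  fdOut <- handleToFd =<< openBinaryFile dst WriteMode
  ok <- kernelCopy fdIn fdOut
  closeFd fdIn
  closeFd fdOut
  if ok then pure () else callProcess "cp" ["--reflink=always", src, dst]
```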
However, `copy_file_range` is not necessarily inexpensive. Depending on the filesystem, it can still need to read and write the whole file. So when using it, git-annex would need to poll the size of the file in order to update the progress bar. Or it could call the syscall repeatedly on chunks of the file, but on e.g. NFS that would add a lot of syscalls, so probably more overhead.
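The polling variant could look something like this sketch (`withProgressPolling` is a hypothetical helper, not git-annex code; the copy itself is passed in opaquely):

```haskell
import Control.Concurrent (forkIO, killThread, threadDelay)
import Control.Exception (finally)
import Control.Monad (forever)
import System.Posix.Files (fileSize, getFileStatus)

-- Run a copy action while a forked thread samples the destination
-- file's size twice a second and feeds it to a progress callback.
withProgressPolling :: FilePath -> (Integer -> IO ()) -> IO a -> IO a
withProgressPolling dst report copyAction = do
  poller <- forkIO . forever $ do
    sz <- fileSize <$> getFileStatus dst
    report (fromIntegral sz)
    threadDelay 500000 -- half a second between stat calls
  copyAction `finally` killThread poller
```

That keeps the copy itself down to a handful of big syscalls, at the cost of the progress display only being as accurate as whatever size stat reports while the copy is in flight.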
Also, it seems likely to me that you would certainly want to turn off annex.verify along with using `copy_file_range`, which is already a manual config setting. So a second config setting would be no big deal.

As to other filesystems, I found this comment with an overview as of 2022: https://github.com/openzfs/zfs/discussions/4237#discussioncomment-3579635
For btrfs, `copy_file_range` does reflinking, so there's no benefit to using it over what git-annex does now.
Testing on ext4, `cp --reflink=auto` used `copy_file_range` in a copy on the same filesystem (it tried it cross-filesystem, but that failed and it had to fall back to a regular copy). So does `cp` with no options. On a SSD, with big enough files (4 GB or so), I did see noticeable performance improvements.

If git-annex did `copy_file_range` in chunks on ext4, it could read each chunk after it was written to the destination file, and get it from the page cache. But that would still copy the content of the file into user space. So the savings from using `copy_file_range` with annex.verify set on ext4 seem like they would only be in avoiding the userspace-to-kernel transfer, with the kernel-to-userspace transfer still needed.

That overview also notes that, on NFS, `copy_file_range` can do a CoW copy when the underlying filesystem supports it. So with NFS on btrfs or zfs, a single `copy_file_range` call could result in no more work than a reflink, optimally efficient. If git-annex did `copy_file_range` on each chunk in order to display a progress bar, that would be a lot of syscalls in flight over the network, so noticeably slower.

All of this is making me lean toward a config setting that enables `copy_file_range`, without progress bars, and that is intended to be used with annex.verify disabled in order to get optimal performance.
Implemented `remote.<name>.annex-fastcopy` and `annex.fastcopy` config settings.

These do just run `cp --reflink=auto`. So when a copy is interrupted, it won't resume where it left off. It would be possible to improve that, by making git-annex use `copy_file_range` itself, starting at the point where it was interrupted; at least for keys where isStableKey is True, since git-annex won't be verifying when using this. But for the NFS use case, if the server is doing a server-side CoW copy, it's atomic anyway, so there's no need to handle resuming.

Since I don't currently know of a haskell binding to `copy_file_range`, and since it's a low-level OS-specific thing, I have punted on handling resuming. This could be revisited; I just don't see the benefit in the current use case, and wanted to start from something relatively simple.
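(Enabling this presumably looks like `git config remote.<name>.annex-fastcopy true` for a particular remote, or `git config annex.fastcopy true` for all of them, together with `git config annex.verify false` to get the full benefit.)

If resuming were ever revisited, a minimal sketch of the offset-based approach might look like the following. This is hypothetical code, not what was implemented; it assumes the glibc `copy_file_range` wrapper as in the earlier sketch, `resumeCopy` is a made-up name, and EINTR and concurrent-writer issues are ignored:

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

import Foreign.C.Error (throwErrnoIfMinus1)
import Foreign.C.Types (CInt (..), CSize (..), CSsize (..), CUInt (..))
import Foreign.Marshal.Alloc (alloca)
import Foreign.Ptr (Ptr)
import Foreign.Storable (poke)
import System.IO (IOMode (..), hFileSize, openBinaryFile)
import System.Posix.IO (closeFd, handleToFd)
import System.Posix.Types (COff, Fd (..))

foreign import ccall unsafe "copy_file_range"
  c_copy_file_range
    :: CInt -> Ptr COff -> CInt -> Ptr COff -> CSize -> CUInt -> IO CSsize

-- Resume an interrupted copy: start both offsets at the byte count the
-- destination already has, and let the kernel advance them from there.
resumeCopy :: FilePath -> FilePath -> IO ()
resumeCopy src dst = do
  fdIn <- handleToFd =<< openBinaryFile src ReadMode
  hOut <- openBinaryFile dst ReadWriteMode -- keeps the partial data
  done <- hFileSize hOut -- bytes already copied before the interruption
  fdOut <- handleToFd hOut
  alloca $ \offIn -> alloca $ \offOut -> do
    poke offIn (fromIntegral done)
    poke offOut (fromIntegral done)
    let loop = do
          n <- throwErrnoIfMinus1 "copy_file_range" $
                 c_copy_file_range (unFd fdIn) offIn (unFd fdOut) offOut chunk 0
          if n == 0 then pure () else loop -- n == 0 means EOF
    loop
  closeFd fdIn
  closeFd fdOut
  where
    chunk = 128 * 1024 * 1024 -- up to 128 MiB per call
    unFd (Fd fd) = fd
```

As noted above, this would only be valid for keys where isStableKey is True, since the partial destination content has to be a prefix of the final content for resuming to make sense.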