Add an option to give git-annex a path to a RAM disk, and an option to set the maximum space to be used there. git-annex often knows the size of the files it is downloading, since the size is part of the key, so it can determine in advance whether a temp file of that size would fit on the RAM disk. One could instead symlink .git/annex/tmp/ to a RAM disk, but that could cause memory overflow if a large file is transferred.
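A minimal sketch of that size check, assuming a hypothetical pair of settings for the RAM disk path and the space cap (the option semantics and the fits_on_ramdisk helper are made up for illustration; Python is only used to show the logic):

    import os

    def fits_on_ramdisk(ramdisk_path, key_size, max_use):
        """Return True if a temp file of key_size bytes should go on the RAM disk.

        ramdisk_path - mount point of the RAM disk (the proposed path option)
        key_size     - file size taken from the key, when the key records one
        max_use      - the proposed cap on space git-annex may use there
        """
        st = os.statvfs(ramdisk_path)
        free = st.f_bavail * st.f_frsize        # space actually free right now
        return key_size <= min(max_use, free)   # respect both the cap and free space

    # Keys that don't record a size, or files that don't fit, would fall back
    # to the usual .git/annex/tmp location.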
Related: keep git-annex branch checked out?, transitive transfers
What benefit would that give?
When the transfer is complete, the file will be moved over to .git/annex/objects. On the same file system, that's a simple operation; across file systems, it's effectively a copy.

In both cases, the file gets written to disk once. In the original case, it's up to the operating system when to start writing the data to disk (that is, unless the file is flushed by git-annex, which I don't have reason to assume it does). With a RAM disk in between, the file would be copied only once it has been transferred completely (and then needs to be moved once more so it doesn't show up as an incomplete file at its final location). With the original setup, if the operating system has RAM to spare, it can do roughly that already (not start writing until the file is closed); when it's under pressure, it will flush the file out as soon as possible.
Is there any performance issue you see that'd be solved by using the RAM disk? If so, that might be indicative of something git-annex can do without starting to mount things around (e.g. removing any syncs/flushes that sneaked into the tempfile saving process, or using fallocate to tell the OS of the size to come).
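To illustrate the fallocate suggestion, a sketch (the helper name is made up; os.posix_fallocate is the portable wrapper over the underlying fallocate/posix_fallocate call):

    import os

    def open_preallocated_temp(tmpfile, key_size):
        """Open a temp file and reserve its final size up front.

        Telling the OS the size to come lets it pick contiguous blocks and
        report ENOSPC before the download starts rather than halfway through.
        (open_preallocated_temp is illustrative, not git-annex's actual code.)
        """
        fd = os.open(tmpfile, os.O_CREAT | os.O_WRONLY, 0o600)
        if key_size > 0:
            os.posix_fallocate(fd, 0, key_size)  # reserve key_size bytes from offset 0
        return fd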
I was thinking of cases where the content doesn't simply get moved into .git/annex/objects: e.g. when chunking is used along with parallel downloads, chunks might go into separate temp files before being merged. I was also thinking of use cases from "let external remotes declare support for named pipes", like git-annex-cat, where the key's contents are processed but not saved.

The chunks case should fold into the original one if git-annex merges the chunks using ioctl_ficlonerange, but admittedly that is a) not portable (but neither is mounting a RAM disk) and b) will only work on some file systems.
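For the chunk-merge idea, a sketch of what cloning one chunk into the destination could look like (Linux-only; the FICLONERANGE value and the clone_chunk helper are written out here purely for illustration, and only reflink-capable file systems such as btrfs or XFS accept the ioctl):

    import fcntl, os, struct

    FICLONERANGE = 0x4020940D   # _IOW(0x94, 13, struct file_clone_range) on Linux

    def clone_chunk(dest_fd, chunk_path, dest_offset):
        """Reflink one downloaded chunk into the destination file at dest_offset.

        Offsets and lengths generally need to be block-aligned (the last chunk,
        ending at EOF, is the usual exception); a plain read/write copy would be
        the fallback when the ioctl fails.  clone_chunk is a sketch, not
        git-annex's merge code.
        """
        src_fd = os.open(chunk_path, os.O_RDONLY)
        try:
            length = os.fstat(src_fd).st_size
            # struct file_clone_range { s64 src_fd; u64 src_offset; u64 src_length; u64 dest_offset; }
            fcntl.ioctl(dest_fd, FICLONERANGE,
                        struct.pack("qQQQ", src_fd, 0, length, dest_offset))
            return length
        finally:
            os.close(src_fd)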
I don't understand the applications in named pipes well enough to comment there (will have to read up a bit).
But more generally, I'd gut-feeling-expect that if everything is properly advertised (possibly by an fcntl, but RWH_WRITE_LIFE_SHORT doesn't quite seem to be it) and no fsyncs are sent (as eatmydata arranges), any file should behave like that until a file system action forces it to be committed to disk -- or until the kernel decides it'd better use that RAM for something else, but that's a decision it is probably best placed to make.
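For reference, that kind of advertising would look roughly like this (F_SET_RW_HINT and RWH_WRITE_LIFE_SHORT are Linux fcntl constants whose numeric values are hard-coded here for illustration; as noted above, the hint mostly steers device-side data placement, so it may well not be the right knob):

    import fcntl, struct

    F_SET_RW_HINT = 1036         # F_LINUX_SPECIFIC_BASE (1024) + 12, Linux >= 4.13
    RWH_WRITE_LIFE_SHORT = 2     # "data written here is expected to be short-lived"

    def hint_short_lived(fd):
        """Advertise that writes to fd are expected to be short-lived.

        The kernel forwards this mainly to the storage device for data
        placement; by itself it does not keep the page cache from flushing,
        which is why it doesn't quite fit the use case discussed here.
        """
        fcntl.fcntl(fd, F_SET_RW_HINT, struct.pack("Q", RWH_WRITE_LIFE_SHORT))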
I'm not sure the approach of screening (and possibly patching) data producers to not fsync (on some systems, closing might be an issue too, and that's where it gets more complex) is better than putting things on a RAM disk; I just think it's an alternative worth exploring.
When git-annex downloads chunks, it downloads one chunk at a time (no parallel downloads of chunks of the same key) to either a temp file or a memory buffer, decrypts if necessary, and then appends the chunk to the destination file.
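In pseudocode terms, that flow is roughly the following (the names are illustrative, not git-annex's internals):

    def download_key(chunk_sources, destination, decrypt=None):
        """Fetch chunks strictly one after another and append each to the destination."""
        with open(destination, "ab") as out:
            for fetch in chunk_sources:     # no parallel downloads of one key's chunks
                data = fetch()              # one chunk, via a temp file or memory buffer
                if decrypt is not None:
                    data = decrypt(data)    # decrypt if necessary
                out.write(data)             # append the chunk to the destination file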
Since chunks are often stored entirely in ram, the chunk size is typically a small fraction of ram. It seems unlikely to me that the kernel would often decide to unnecessarily flush a small write to a temp file out to disk and drop it from the cache when the very next operation after writing the file is reading it back in.
chrysn's analysis seems right.
Also, this smells of premature optimisation, and tying it to features that have not even been agreed on, let alone implemented, makes it kind of super low priority?