Add an option to give git-annex a path to a RAM disk, and an option to set the maximum space to be used there. git-annex often knows the size of the files it is downloading, since the size is part of the key, so it can determine in advance whether a temp file of that size would fit on the RAM disk. One could instead symlink .git/annex/tmp/ to a RAM disk, but that could cause memory overflow if a large file is transferred.
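A minimal sketch of that size check, assuming a hypothetical pair of settings for the RAM disk path and the space cap (the option semantics and the fits_on_ramdisk helper are made up for illustration; Python is only used to show the logic):

    import os

    def fits_on_ramdisk(ramdisk_path, key_size, max_use):
        """Return True if a temp file of key_size bytes should go on the RAM disk.

        ramdisk_path - mount point of the RAM disk (the proposed path option)
        key_size     - file size taken from the key, when the key records one
        max_use      - the proposed cap on space git-annex may use there
        """
        st = os.statvfs(ramdisk_path)
        free = st.f_bavail * st.f_frsize        # space actually free right now
        return key_size <= min(max_use, free)   # respect both the cap and free space

    # Keys that don't record a size, or files that don't fit, would fall back
    # to the usual .git/annex/tmp location.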
Related: keep git-annex branch checked out?, transitive transfers
What benefit would that give?
When the transfer is complete, the file will be moved over to .git/annex/objects. On the same file system, that's a simple operation; across file systems, it's effectively a copy.

In both cases, the file gets written to disk once. In the original case, it's up to the operating system when to start writing the data to disk (that is, unless the file is flushed by git-annex, which I don't have reason to assume it does). With a RAM disk in between, the file would be copied only once it has been transferred completely (and then needs to be moved once more so it doesn't show up as an incomplete file at its final location). With the original setup, if the operating system has RAM to spare, it can do roughly that already (not start writing until the file is closed); when it's under pressure, it will flush the file out as soon as possible.
Is there any performance issue you see that'd be solved by using the RAM disk? If so, that might be indicative of something git-annex can do without starting to mount things around (e.g. removing any syncs/flushes that sneaked into the tempfile saving process, or using fallocate to tell the OS of the size to come).
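To illustrate the fallocate suggestion, a sketch (the helper name is made up; os.posix_fallocate is the portable wrapper over the underlying fallocate/posix_fallocate call):

    import os

    def open_preallocated_temp(tmpfile, key_size):
        """Open a temp file and reserve its final size up front.

        Telling the OS the size to come lets it pick contiguous blocks and
        report ENOSPC before the download starts rather than halfway through.
        (open_preallocated_temp is illustrative, not git-annex's actual code.)
        """
        fd = os.open(tmpfile, os.O_CREAT | os.O_WRONLY, 0o600)
        if key_size > 0:
            os.posix_fallocate(fd, 0, key_size)  # reserve key_size bytes from offset 0
        return fd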
I was thinking of cases where the content doesn't simply get moved into .git/annex/objects: e.g. when chunking is used along with parallel downloads, chunks might go into separate temp files before being merged. I was also thinking of use cases from "let external remotes declare support for named pipes", like git-annex-cat, where the key's contents are processed but not saved.

The chunks case should fold into the original one if git-annex merges the chunks using ioctl_ficlonerange, but admittedly that is a) not portable (but neither is mounting a RAM disk) and b) will only work on some file systems.
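For the chunk-merge idea, a sketch of what cloning one chunk into the destination could look like (Linux-only; the FICLONERANGE value and the clone_chunk helper are written out here purely for illustration, and only reflink-capable file systems such as btrfs or XFS accept the ioctl):

    import fcntl, os, struct

    FICLONERANGE = 0x4020940D   # _IOW(0x94, 13, struct file_clone_range) on Linux

    def clone_chunk(dest_fd, chunk_path, dest_offset):
        """Reflink one downloaded chunk into the destination file at dest_offset.

        Offsets and lengths generally need to be block-aligned (the last chunk,
        ending at EOF, is the usual exception); a plain read/write copy would be
        the fallback when the ioctl fails.  clone_chunk is a sketch, not
        git-annex's merge code.
        """
        src_fd = os.open(chunk_path, os.O_RDONLY)
        try:
            length = os.fstat(src_fd).st_size
            # struct file_clone_range { s64 src_fd; u64 src_offset; u64 src_length; u64 dest_offset; }
            fcntl.ioctl(dest_fd, FICLONERANGE,
                        struct.pack("qQQQ", src_fd, 0, length, dest_offset))
            return length
        finally:
            os.close(src_fd)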
I don't understand the applications in named pipes well enough to comment there (will have to read up a bit).
But more generally, I'd gut-feeling-expect that if everything is properly advertised (possibly by an fcntl, but RWH_WRITE_LIFE_SHORT doesn't quite seem to be it) and no fsyncs are sent (as eatmydata arranges), any file should behave like that until a file system action forces it to be committed to disk -- or until the kernel decides it'd better use that RAM for something else, but that's a decision it is probably best placed to make.
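For reference, that kind of advertising would look roughly like this (F_SET_RW_HINT and RWH_WRITE_LIFE_SHORT are Linux fcntl constants whose numeric values are hard-coded here for illustration; as noted above, the hint mostly steers device-side data placement, so it may well not be the right knob):

    import fcntl, struct

    F_SET_RW_HINT = 1036         # F_LINUX_SPECIFIC_BASE (1024) + 12, Linux >= 4.13
    RWH_WRITE_LIFE_SHORT = 2     # "data written here is expected to be short-lived"

    def hint_short_lived(fd):
        """Advertise that writes to fd are expected to be short-lived.

        The kernel forwards this mainly to the storage device for data
        placement; by itself it does not keep the page cache from flushing,
        which is why it doesn't quite fit the use case discussed here.
        """
        fcntl.fcntl(fd, F_SET_RW_HINT, struct.pack("Q", RWH_WRITE_LIFE_SHORT))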
I'm not sure the approach of screening (and possibly patching) data producers to not fsync (on some systems, closing might be an issue too, and that's where it gets more complex) is better than putting things on a RAM disk; I just think it's an alternative worth exploring.
When git-annex downloads chunks, it downloads one chunk at a time (no parallel downloads of chunks of the same key) to either a temp file or a memory buffer, decrypts if necessary, and then appends the chunk to the destination file.
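In pseudocode terms, that flow is roughly the following (the names are illustrative, not git-annex's internals):

    def download_key(chunk_sources, destination, decrypt=None):
        """Fetch chunks strictly one after another and append each to the destination."""
        with open(destination, "ab") as out:
            for fetch in chunk_sources:     # no parallel downloads of one key's chunks
                data = fetch()              # one chunk, via a temp file or memory buffer
                if decrypt is not None:
                    data = decrypt(data)    # decrypt if necessary
                out.write(data)             # append the chunk to the destination file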
Since chunks are often stored entirely in ram, the chunk size is typically a small fraction of ram. It seems unlikely to me that the kernel would often decide to unnecessarily flush a small write to a temp file out to disk and drop it from the cache when the very next operation after writing the file is reading it back in.
chrysn's analysis seems right.
Also, this smells of premature optimisation, and tying it to features that have not even been agreed on, let alone implemented, makes it kind of super low priority?