Some parts of git-annex wait for an exclusive lock, and once they take it, hold it while performing an operation. Now consider what happens if that git-annex process is suspended. Another git-annex process that is running and waits to take the same exclusive lock (or a shared lock of the same file) will stall forever, until the suspended process is resumed.
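
For illustration, this is roughly the pattern, as a minimal sketch using the unix package's fcntl interface (not git-annex's actual lock code; the openFd signature shown is the one from unix versions before 2.8):

    import System.IO (SeekMode(AbsoluteSeek))
    import System.Posix.IO
    import System.Posix.Types (Fd)

    -- Take an exclusive lock on a lock file, blocking until it's available.
    takeExclusiveLock :: FilePath -> IO Fd
    takeExclusiveLock f = do
        fd <- openFd f ReadWrite (Just 0o666) defaultFileFlags
        -- fcntl F_SETLKW blocks until the write lock can be taken. If the
        -- process currently holding the lock is suspended, this blocks
        -- until that process is resumed and drops the lock.
        waitToSetLock fd (WriteLock, AbsoluteSeek, 0, 0)
        return fd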

These time windows tend to be small, but may not always be.

Is there any better way git-annex could handle this? Is it a significant problem at all? I don't think I've ever seen it happen, but I rarely ^Z git-annex either. How do other programs handle this, if at all? --Joey


Would it be better for the second git-annex process, rather than hanging indefinitely, to time out after a few seconds?

But how many seconds? What if the system is under heavy load?

What could be done is to update the lock file's mtime after successfully taking the lock. Then, as long as the mtime is advancing, some other process is actively using the lock, and it's ok for our process to wait longer.

(Updating the mtime would be a problem when locking annex object files in v9 and earlier. Luckily, that locking is not done with a blocking lock anyway.)

If the lock file's mtime is being checked, the process that is holding the lock could periodically update the mtime. A background thread could manage that. If that's done every ten seconds, then an mtime more than 20 seconds old indicates that the lock is held by a suspended process. So git-annex would stall for up to 20-30 seconds before erroring out when a lock is held by a suspended process. That seems acceptable; it doesn't need to deal with this situation instantly, it just needs to not block indefinitely. And updating the mtime every 10 seconds should not be too much IO.
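
The holder side of that could be a small wrapper like this sketch (withMtimeUpdater is a made-up name, not an existing git-annex function):

    import Control.Concurrent (forkIO, killThread, threadDelay)
    import Control.Exception (bracket)
    import Control.Monad (forever)
    import System.Posix.Files (touchFile)

    -- Run an action with a background thread that bumps the given file's
    -- mtime every 10 seconds. If the whole process gets suspended, the
    -- thread stops too, so waiters will see the mtime stop advancing.
    withMtimeUpdater :: FilePath -> IO a -> IO a
    withMtimeUpdater f a = bracket (forkIO updater) killThread (const a)
      where
        updater = forever $ do
            touchFile f
            threadDelay (10 * 1000000)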

When an old version of git-annex has the lock held, it won't be updating the mtime. So if it takes longer than 10 seconds to do the operation with the lock held, a new version may complain that it's suspended when it's really not. This could be avoided by checking what process holds the lock, and whether it's suspended. But probably 10 seconds is enough time for all the operations that git-annex currently takes a blocking lock for to finish, and if so we don't need to worry about this situation?

Unfortunately not: importKeys takes an exclusive lock and holds it while downloading all the content! This seems like a bug though, because it can cause other git-annex processes that are eg storing content in a remote to block for a long time.

Another one is Database.Export.writeLockDbWhile, which takes an exclusive lock while running eg, Command.Export.changeExport, which may sometimes need to do a lot of work.

Another one is Annex.Queue.flush, which probably mostly runs in under 10 seconds, but maybe not always, and when annex.queuesize is changed, could surely take longer.

To avoid problems when old versions of git-annex are also being used, it could update and check the mtime of a different file than the lock file.

Start by trying to take the lock for up to 10 seconds. If it takes the lock, create the mtime file and start a thread that updates the mtime every 10 seconds until the lock is closed, and delete the mtime file before closing the lock handle.
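
A sketch of that holder-side sequence, reusing the withMtimeUpdater sketch from above (the function name and the way the mtime file path is passed in are made up; the 10 second bounded lock attempt is shown in the next sketch):

    import Control.Exception (bracket, bracket_)
    import System.Directory (removeFile)
    import System.Posix.IO (closeFd)
    import System.Posix.Types (Fd)

    -- Take the lock, maintain the mtime file while it's held, and clean
    -- up in the reverse order: stop the updater, delete the mtime file,
    -- then close the lock handle.
    withHeldLock :: IO Fd -> FilePath -> IO a -> IO a
    withHeldLock takelock mtimefile a =
        bracket takelock closeFd $ \_fd ->
            bracket_ (writeFile mtimefile "") (removeFile mtimefile) $
                withMtimeUpdater mtimefile a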

When it times out taking the lock: if the mtime file does not exist, an old git-annex has the lock; if the mtime file does exist, check whether its timestamp has advanced; if it has not, a new git-annex has the lock and is suspended, and it can error out.
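
And a sketch of the waiting side (made-up names; it also assumes the bounded lock attempt can really be interrupted after 10 seconds, which in practice probably means polling a non-blocking lock attempt rather than wrapping a blocked fcntl call in System.Timeout.timeout):

    import System.Directory (doesFileExist, getModificationTime)
    import System.Timeout (timeout)

    waitLock :: FilePath -> IO a -> IO a
    waitLock mtimefile takelock = go Nothing
      where
        go oldmtime = do
            r <- timeout (10 * 1000000) takelock
            case r of
                Just v -> return v
                Nothing -> do
                    exists <- doesFileExist mtimefile
                    if not exists
                        -- No mtime file: assume an old git-annex holds
                        -- the lock, and fall back to waiting.
                        then go Nothing
                        else do
                            mtime <- getModificationTime mtimefile
                            if Just mtime == oldmtime
                                -- The mtime stopped advancing, so the
                                -- holder appears to be suspended.
                                then ioError (userError "waiting on a lock held by a suspended git-annex process")
                                else go (Just mtime)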

Oops: There's a race in the method above; a timeout may occur right when the other process has taken the lock, but has not updated the mtime file yet. Then that process would incorrectly be treated as an old git-annex process.

So: To support old git-annex, it seems it will need to check, when the lock is held, what process has the lock. And then check if that process is suspended or not. Which means looking in /proc. Ugh.
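
On Linux, that check could look something like this sketch, using fcntl F_GETLK to find the process holding a conflicting lock and then reading its state from /proc/<pid>/stat (state 'T' means stopped):

    import System.IO (SeekMode(AbsoluteSeek))
    import System.Posix.IO (getLock, LockRequest(WriteLock))
    import System.Posix.Types (Fd, ProcessID)

    -- Find a process holding a conflicting lock, and check if it's stopped.
    lockHeldBySuspendedProcess :: Fd -> IO Bool
    lockHeldBySuspendedProcess fd = do
        l <- getLock fd (WriteLock, AbsoluteSeek, 0, 0)
        case l of
            Nothing -> return False
            Just (pid, _) -> processIsStopped pid

    processIsStopped :: ProcessID -> IO Bool
    processIsStopped pid = do
        stat <- readFile ("/proc/" ++ show pid ++ "/stat")
        -- The state field comes after the parenthesized command name,
        -- which can itself contain spaces, so parse from the last ')'.
        let afterComm = reverse (takeWhile (/= ')') (reverse stat))
        return $ case dropWhile (== ' ') afterComm of
            (state:_) -> state `elem` "Tt"
            _ -> False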

Or: Change to checking lock mtimes only in git-annex v11.