I can put git-annex fsck in a loop to check a large directory like this: -S starts an incremental check, -m continues the started incremental check, and &>> appends all output (both stdout and stderr) to the fsck.log file.
$ git-annex fsck -S large-directory --from remote-repo --time-limit=60s &>>~/log/fsck.log
#...
#...
#...
$ while (sleep 10); do
git-annex fsck -m large-directory --from remote-repo --time-limit=1h &>>~/log/fsck.log
#...
#...
#...
done;
I need the loop because the connection to remote-repo fails after some time (or because of a remote server error) and needs a reconnect; after that, everything is OK.
Suppose I have many large directories; it would be faster to check them if I could run the checks in parallel. They contain many small files, which do not take much bandwidth but need more I/O and network communication.
I know that the progress of fsck is stored in a database (currently updated after every 1000 files, every 5 minutes, or when --time-limit is reached), but is the checked directory (large-directory) taken into account when starting/storing the progress?
Is the checked directory/path in the primary-key? Or is it much more complicated?
If I could start checking many directories at the same time, fsck would finish much faster (think about thousands of small icon files). Is it just me, or could somebody else profit from this?
(This is not a feature request; I would like to know whether anybody needs this, and whether it is possible at all.)
Thanks, parhuzamos
This is actually something that got worse when fsck changed to using a database, rather than its old sticky-bit based hack. Parallel fscks used to work perfectly; they could even be run on the same directories without processing files redundantly.
Now, it's safe to run multiple fsck processes in parallel. However, since the database is only occasionally updated, if the two fscks are working on the same directory, one won't know that the other has already fscked a file, and they'll tend to do redundant work.
It'll work fine if you give concurrent fscks different directories or sets of files to work on. The git-annex key (ie, symlink target of a file) is the primary-key.
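A rough sketch of the "different directories" approach, reusing the options from the question. This is not something git-annex ships. The names fsck_one, fsck_parallel, REMOTE, LOG_DIR, MAX_TRIES, RETRY_DELAY and FSCK_CMD are made up for illustration. It assumes the incremental pass was already started with -S as in the question, and that an exit status of 0 means the directory is finished:

```shell
#!/bin/sh
# Sketch only: one retrying "fsck -m" worker per directory, run in parallel.
# All names below (fsck_one, fsck_parallel, REMOTE, LOG_DIR, MAX_TRIES,
# RETRY_DELAY, FSCK_CMD) are hypothetical, not part of git-annex.
# Assumes an incremental pass was already started with "fsck -S".

REMOTE="${REMOTE:-remote-repo}"
LOG_DIR="${LOG_DIR:-$HOME/log}"
MAX_TRIES="${MAX_TRIES:-100}"
RETRY_DELAY="${RETRY_DELAY:-10}"
FSCK_CMD="${FSCK_CMD:-git-annex}"

# Continue the incremental check of one directory, retrying after failures
# (for example a dropped connection). The body is a subshell, so its
# variables stay private to each worker.
fsck_one() (
    dir="$1"
    # One log file per directory, so parallel output is not interleaved.
    log="$LOG_DIR/fsck-$(printf '%s' "$dir" | tr '/' '_').log"
    tries=0
    while [ "$tries" -lt "$MAX_TRIES" ]; do
        # ">>file 2>&1" is the portable spelling of bash's "&>>file".
        if "$FSCK_CMD" fsck -m "$dir" --from "$REMOTE" --time-limit=1h >>"$log" 2>&1; then
            exit 0  # assumption: status 0 means this directory is done
        fi
        tries=$((tries + 1))
        sleep "$RETRY_DELAY"
    done
    exit 1  # gave up after MAX_TRIES attempts
)

# Start one background worker per directory, then wait for all of them.
fsck_parallel() {
    mkdir -p "$LOG_DIR"
    for dir in "$@"; do
        fsck_one "$dir" &
    done
    wait
}

# Run only when directories are given on the command line.
if [ "$#" -gt 0 ]; then
    fsck_parallel "$@"
fi
```

Saved as, say, parallel-fsck.sh (a hypothetical file name), it would be invoked as: sh parallel-fsck.sh large-directory-1 large-directory-2. The retry cap is there because a non-zero exit status can also mean that fsck found a problem, in which case retrying forever would not help.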
Also, I'd at some point like to make git-annex fsck -J work. With concurrent fsck jobs running in the same process, it could easily divide work up among them. The only tricky part is that the output of the concurrent jobs would be scrambled and interleaved.