Context

I'm currently using git annex reinject --known to deduplicate directories containing dozens of huge files (roughly 4 to 13 GB each). Let's focus on one example big file.

The file being reinjected is not yet present in .git/annex/objects; it will be once git annex reinject --known completes. It lives on a different filesystem from the repository, but on the same disk. This might be important.

Time taken to process one file

The reinject runs in the background on a server and produces a log that shows how much time passes between steps.

It looks like:

reinject my_big_file.dv (7 minutes pass) (checksum...) (20 minutes pass) ok

my_big_file.dv is 8.7 GB.

With the USB2 bandwidth available, reading that file takes between 7 and 12 minutes.

What happens?

  • 7 minutes is a reasonable time to read the whole file
  • after "checksum..." appears, 20 minutes pass, which is a reasonable time to move the file to the partition containing the git-annex repository ... or to read it twice?

This looks "mostly reasonable", perhaps a little long.
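
To put numbers on those two guesses, here is a rough back-of-the-envelope sketch of the throughput each timing implies. It assumes "8.7G" means 8700 MB (decimal units, which may not match what the tool reports), and the names fileSizeMB and throughput are just illustrative helpers.

```haskell
-- Rough throughput implied by the timings seen in the log.
-- Assumption: the file size is taken as 8700 MB (decimal units).
fileSizeMB :: Double
fileSizeMB = 8700

-- Average throughput in MB/s when `passes` full passes over the file
-- (reads or writes) complete in `minutes` minutes.
throughput :: Double -> Double -> Double
throughput passes minutes = passes * fileSizeMB / (minutes * 60)

main :: IO ()
main = do
  putStrLn $ "one read in 7 min:    " ++ show (throughput 1 7)  ++ " MB/s"
  putStrLn $ "one read in 12 min:   " ++ show (throughput 1 12) ++ " MB/s"
  putStrLn $ "one pass in 20 min:   " ++ show (throughput 1 20) ++ " MB/s"
  putStrLn $ "two passes in 20 min: " ++ show (throughput 2 20) ++ " MB/s"
```

Under that assumption, a single read in 7 to 12 minutes corresponds to roughly 12 to 21 MB/s. Twenty minutes for a single pass would be about 7.25 MB/s, slower than either, while two passes in 20 minutes (a read plus a write on the same disk, or two reads) would be about 14.5 MB/s, which falls inside the observed range.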

Source code in Hash.hs says:

mstat <- liftIO $ catchMaybeIO $ getFileStatus file
case (mstat, fast) of
    (Just stat, False) -> do
        filesize <- liftIO $ getFileSize' file stat
        showAction "checksum"
        check <$> hashFile hash file filesize
    _ -> return True

I expected "checksum..." to appear before the checksum is actually computed, and the source code appears to confirm that (I'm trying to compensate for my ignorance of Haskell with what I know of OCaml, C#, pure functions, closures, and functional and reactive programming in general).
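
My reading of the do block can be checked with a tiny standalone sketch. Everything here is a hypothetical stand-in (showActionDemo, hashFileDemo, verify, and the toy byte-sum "hash" are not git-annex's real functions); it only illustrates that, in a do block, the action that prints the message runs before the action that computes the hash.

```haskell
import Data.IORef (IORef, newIORef, modifyIORef, readIORef)

-- Hypothetical stand-in for showAction: records the message in a log.
showActionDemo :: IORef [String] -> String -> IO ()
showActionDemo logRef s = modifyIORef logRef (++ ["(" ++ s ++ "...)"])

-- Hypothetical stand-in for hashFile: a toy "hash" (sum of character
-- codes) that records when it starts and when it finishes.
hashFileDemo :: IORef [String] -> String -> IO Int
hashFileDemo logRef contents = do
  modifyIORef logRef (++ ["hashing started"])
  let h = sum (map fromEnum contents)
  -- Force the sum before logging completion.
  h `seq` modifyIORef logRef (++ ["hashing finished"])
  return h

-- Same shape as the excerpt: show the action first, then hash and compare.
verify :: IORef [String] -> String -> Int -> IO Bool
verify logRef contents expected = do
  showActionDemo logRef "checksum"
  (== expected) <$> hashFileDemo logRef contents

main :: IO ()
main = do
  logRef <- newIORef []
  ok <- verify logRef "abc" 294  -- 97 + 98 + 99
  readIORef logRef >>= mapM_ putStrLn
  print ok
```

In this sketch the log always has "(checksum...)" first, then the two hashing entries. That only says something about ordering inside this one function; it says nothing about what git-annex does before it gets there.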

Questions

  • Is it true that the checksum is computed after "checksum..." appears?
  • Why do 7 minutes pass before "checksum..." appears? What happens during that time?
  • What happens in the 20 minutes after "checksum..." appears and before "ok"?