I'm currently using git annex reinject --known to deduplicate directories containing dozens of huge (4 to 13 GB) files.
Let's focus on one example big file.
The file being reinjected is not already available in .git/annex/objects. It will be after git annex reinject --known completes.
The file being reinjected is on a different filesystem on the same disk. This might be important.
Time taken to process one file.
It's done in the background on a server and yields a log that shows how much time passes.
It looks like:
reinject my_big_file.dv (7 minutes pass) (checksum...) (20 minutes pass) ok
my_big_file.dv is 8.7G big.
With the USB2 bandwidth available, reading that file can take between 7 and 12 minutes.
- 7 minutes is a reasonable time to read the whole file
- after "checksum..." appears, 20 minutes pass, which is a reasonable time to move the file to the partition containing the git-annex repository ... or to read it twice?
This looks "mostly reasonable", perhaps a little long.
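As a rough sanity check on those figures, here is a small sketch of the arithmetic. It assumes that 8.7G means 8.7 GB with 1 GB = 1000 MB, and that throughput is constant. The function name throughputMBps is made up for this illustration.

```haskell
import Text.Printf (printf)

-- Average throughput in MB/s for a given size (GB) and duration (minutes).
-- Assumption: 1 GB = 1000 MB.
throughputMBps :: Double -> Double -> Double
throughputMBps sizeGB minutes = sizeGB * 1000 / (minutes * 60)

main :: IO ()
main = do
    let size = 8.7  -- size of my_big_file.dv in GB
    -- One full read in 7 minutes (best case observed).
    printf "one read in 7 min:   %.1f MB/s\n" (throughputMBps size 7)
    -- One full read in 12 minutes (worst case observed).
    printf "one read in 12 min:  %.1f MB/s\n" (throughputMBps size 12)
    -- Hypothesis: the 20-minute phase moves the data twice (read twice, or read plus write).
    printf "two passes in 20 min: %.1f MB/s\n" (throughputMBps (2 * size) 20)
```

Under these assumptions, a single read works out to roughly 12 to 21 MB/s, and two passes over the data in 20 minutes falls inside that range. So the 20-minute phase is at least consistent with the data being moved twice, though this does not show what actually happens.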
Source code in Hash.hs says:
    mstat <- liftIO $ catchMaybeIO $ getFileStatus file
    case (mstat, fast) of
        (Just stat, False) -> do
            filesize <- liftIO $ getFileSize' file stat
            showAction "checksum"
            check <$> hashFile hash file filesize
        _ -> return True
I expected "checksum..." to appear before the checksum is actually computed, and the source code appears to confirm that (I am trying to compensate for my ignorance of Haskell with knowledge of OCaml, pure functions, closures, and functional programming in general, including C# and reactive programming).
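To check my reading of the do block, here is a minimal standalone sketch with the same shape as the quoted branch. It is not git-annex code: showActionStub, hashFileStub, runSketch and the fake digest "abc123" are hypothetical stand-ins, and it runs in plain IO rather than the Annex monad. It only illustrates that statements in a monadic do block run in the order they are written.

```haskell
import Data.IORef (IORef, newIORef, modifyIORef, readIORef)

-- Hypothetical stand-in for showAction: records the message instead of printing it.
showActionStub :: IORef [String] -> String -> IO ()
showActionStub events name = modifyIORef events (("(" ++ name ++ "...)") :)

-- Hypothetical stand-in for hashFile: records that hashing started, returns a fake digest.
hashFileStub :: IORef [String] -> FilePath -> IO String
hashFileStub events path = do
    modifyIORef events (("hashing " ++ path) :)
    return "abc123"

-- Same shape as the quoted branch: announce, then hash, then compare the result.
runSketch :: IO [String]
runSketch = do
    events <- newIORef []
    showActionStub events "checksum"
    ok <- (== "abc123") <$> hashFileStub events "my_big_file.dv"
    modifyIORef events ((if ok then "ok" else "failed") :)
    -- Events were prepended, so reverse them to get chronological order.
    reverse <$> readIORef events

main :: IO ()
main = runSketch >>= mapM_ putStrLn
```

In this sketch the "(checksum...)" event is recorded before the hashing event. That matches my expectation for the quoted code, assuming the Annex monad sequences its actions the same way.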
- Is it true that the checksum is computed after "checksum..." appears?
- Why do 7 minutes pass before "checksum..." appears? What happens during that time?
- What happens in the 20 minutes after "checksum..." appears and before "ok"?