Windows: substantial per-file cost for `add`

Please describe the problem.

Performance on Windows 10.0.19041 even on small text files is 10+ times slower than Git alone, and seems to have dropped over the past months of (windows, git, or git-annex developments).

What steps will reproduce the problem?

Compare performance of git add and git commit in a Git repo with vs without git annex init.

The observations below have been taken from this DataLad issue. The hardware of the test system is described over there, but only the relative performance is relevant here.

git init: 0.1s
git annex init: 5.7s

First timing info is the command executed in an annex-repo (after git annex init), the second timing is the same command executed in a plain Git repo.

after creating a 3-byte text file:

git add file: 1.9s (0.1s)
git commit file -m msg: 3.2s (0.1s)

after creating two new 3-byte test files:

git add .: 4.5s (0.1s)
git commit -m msg: 2.4s (0.1s)

after creating eight more 3-byte text files:

git add .: 13.1s (0.1s)
git commit -m msg: 1.6 (0.1s)

now adding a 225 MB binary file

git add binfiile: 16s (13s)
git commit -m msg: 2s (0.2s)

no changes

git commit --amend -m msg: 2s (0.1s)

So it looks as if there is a substantial per-file cost of add in an annex repo that is not explained by the underlying Git repo performance.

What version of git-annex are you using? On what operating system?

Windows 10.0.19041
Git 2.33.0.windows.2
git-annex 8.20210804-g1d3f59a9d

Please provide any additional information below.

The linked datalad issue has more information on the configuration of the Git installation, but that only seems to affect performance broadly, and not git-annex specifically.

I am reporting this behavior now, because it has worsened since I last looked into performance on windows. It is unclear to me, which developments have led to this (also the windows version has progressed since then). However, even back then, it looked like there is a windows-specific performance issues that cannot be explained by the general handling of crippled filesystems or adjusted branches (comparing performance on an NTFS drive mounted on Debian vs a native windows box).

Thanks for git-annex!

fixed (enable filter.annex.process) --Joey

RSS Atom

comment 1

Each git add has to run a new git annex smudge process. git commit will often run it as well. This is discussed in detail in git smudge clean interface suboptiomal.

According to https://github.com/datalad/datalad/issues/5994#issuecomment-923744895 git-annex 8.20201127 had the same performance that you are seeing with the current version. So it cannot be any changes in the past year that impacted git-annex performance.

But https://github.com/datalad/datalad/issues/5317 seems like it would have been using around the same version of git-annex, and the timing there was much faster. Unfortunatly the issue does not say what version of git-annex was used, but it seems likely it would have been around 8.20201127 since that was released 2 months prior.

So are the timings in those issues comparable, or is it an apples/oranges comparison with different hardware or OS version?

And if the timings between those issues are not comparable, what exactly is this new issue comparing with when it says the peformance has worsened?

Comment by joey — Tue Sep 21 14:29:12 2021

Remove comment

Performance stats on crippled Debian NTFS-mount

I am responding to your comments in an altered order:

So are the timings in those issues comparable, or is it an apples/oranges comparison with different hardware or OS version? And if the timings between those issues are not comparable, what exactly is this new issue comparing with when it says the performance has worsened?

The timing are comparable insofar as they come from the same hardware and "same" OS, but that OS has seen various updates over time, hence cannot be considered constant.

Each git add has to run a new git annex smudge process. git commit will often run it as well. This is discussed in detail in git smudge clean interface suboptiomal.

Yes, I understand that this is a limitation that is outside the control of git-annex. However, the difference between windows (with implied crippling) and a crippled FS mounted on a proper OS are substantial. Here are a bunch of stats, matching those above.

From https://github.com/datalad/datalad/issues/5317#issuecomment-760767158 a description on how the real windows test system used above relates to my laptop used for the stats below:

The machine is a little older (1 year older than the machine I used for the stats posted before), but not a cripple (Core i5, two physical cores at 2.6GHz, 8GB RAM, mSATA SSD, uptodate bios from 2019, and an uptodate win10).

Here I am using a Core i7-6500U CPU @ 2.50GHz, 8GB RAM, mSATA SSD (via USB3). git 2.33.0, git annex version 8.20210903:

For convenience, I am putting the values from my original post with the stats computed on windows in brackets.

git init: 0.01s [0.1s]
git annex init: 0.1s [5.7s]

First timing info is the command executed in an annex-repo (after git annex init), the second timing is the same command executed in a plain Git repo.

after creating a 3-byte text file:

git add file: 0.05s (0.01s) [1.9s (0.1s)]
git commit file -m msg: 0.08s (0.02s) [3.2s (0.1s)]

after creating two new 3-byte test files:

git add .: 0.08s (0.01s) [4.5s (0.1s)]
git commit -m msg: 0.04s (0.02s) [2.4s (0.1s)]

after creating eight more 3-byte text files:

git add .: 0.36s (0.01s) [13.1s (0.1s)]
git commit -m msg: 0.04s (0.01s) [1.6 (0.1s)]

now adding a 225 MB binary file

git add binfiile: 10.8s (10.9s) [16s (13s)]
git commit -m msg: 1.2s (0.03s) [2s (0.2s)]

no changes

git commit --amend -m msg: 1.3s (0.01s) [2s (0.1s)]

So taken together, we can clearly see the price that needs to be paid for the smudge filter approach. However, it is nowhere near the penalty paid on the real windows system, both in absolute terms, as well as relative to plain git operations. Critically (for me) the total difference for a datalad create amounts to:

25s on a crippled NTFS drive on windows
0.9s on a crippled NTFS drive on Debian

(on otherwise fairly comparable hardware).

I think this indicates that something is slower on windows that cannot be explained by

Git generally being slower on Windows
git-annex generally being slower when juggling smudge filters and adjusted branches
an overall performance difference between my two test machines

Comment by mih — Wed Sep 22 18:37:10 2021

Remove comment

comment 3

Look at the 3 git add results, dividing windows runtime by linux:

1.9/0.05 = 38.00
4.5/0.08 = 56.25
13.1/0.36 = 36.39

The middle is slightly an outlier, and it would be better to have more data points, but what this says to me is it's probably around 38x more expensive on windows than on linux for git-annex smudge --clean to run.

git-annex smudge --clean makes on the order of 4000 syscalls, including opening 200 files, execing git 8 times, and statting 500 files. That's around 10x as many syscalls as git add makes. And it's run once per file. So relatively small differences in syscall performance between windows and linux can add up.

I've looked at just this kind of comparisons before, and it has always seemed explainable by differences in syscall overhead. I don't really see anything in your numbers that says otherwise.

I'm still curious if there's an older version of git-annex that was faster (after it stopped using direct mode in v7). If I've understood correctly, you don't seem to be saying that there is.

If it's always been this slow, then about all I can think to do to improve it is profile git-annex smudge --clean on windows and see if anything other than those syscalls is somehow being slow.

Comment by joey — Wed Sep 22 19:33:58 2021

Remove comment

v7 v v8 comparison

I'm still curious if there's an older version of git-annex that was faster (after it stopped using direct mode in v7). If I've understood correctly, you don't seem to be saying that there is.

Here is some data that seems to support this view.

I dug up another windows box on the higher performance end

windows 11 (not 10!) build 10.0.22000
intel 6-core xeon E-2136 @ 3.3ghz
NVMe drive

All stats done with

git 2.32.0.windows.2

I am reporting two values each, the first for git annex 8.20210715-g7334893d4 and the second for 7.20191107-g8ea269ef7 (the oldest windows build that I could find). Consequently, the first measurement is for a v8 repo, the second for a v7 repo (but no direct mode).

init

git init: 0.1s / 0.1s
git annex init: 3.2s / 3.2s

after creating a 3-byte text file:

git add file: 1.0s / 0.9s
git commit file -m msg: 1.8s / 1.8s

after creating two new 3-byte test files:

git add .: 1.9s / 1.7s
git commit -m msg: 0.9s / 0.9s

(NB: the add increase is caused by unique keys, adding a bunch of identical files is the same as adding one file)

after creating eight more 3-byte text files:

git add .: 6.5s / 6.8s
git commit -m msg: 0.9s / 0.9s

now adding a 280 MB binary file

git add binfiile: 9.8s / 9.9s
git commit -m msg: 1.2s / 1.2s

no changes

git commit --amend -m msg: 1.2s / 1.2s

So there is pretty much no change, and in particular no change attributable to the last ~2 years of git-annex evolution.

However, what remains is that a substantially more capable windows workstation (internal NVMe, faster CPU) comes nowhere near the performance of my 6 year old Debian laptop with an external USB3 drive -- despite having to go through the same smudge filter complications.

I would be grateful, if you could have a look at the windows implementation of smudge --clean. Please let me know what kind of contribution could help to push such an effort. Thanks!

Comment by mih — Thu Sep 23 06:29:30 2021

Remove comment

comment 5

Thank you for verifying this is not a recent speed reversion.

You could use git -c filter.annex.clean= add smallfile to avoid the overhead of the smudge filter. Using git annex add --force-small or git annex add with an annex.largefiles config will also avoid it.

ghc 8.12 has a new IO manager for windows, which I think is likely to be faster. (It avoids the win32 API.) I would want to look at that before other windows-specific optimisations. The windows build is still using lts-16.27 (ghc 8.8.3), there is not yet an lts release with ghc 8.12.

I don't know if there are really any windows-specific optimisations to be had in the smudge code other than such low-level stuff.

Comment by joey — Thu Sep 23 17:22:19 2021

Remove comment

comment 6

I noticed in the strace that smudge --clean ran git cat-file 2 more times than necessary. Also was able to avoid updating the git-annex branch, which eliminates several calls to git (depending on the number of remotes). On Linux, this made it 25% faster. Might be more on Windows.

Rest of the strace looks clean, nothing else stands out as unnecessary.

Comment by joey — Thu Sep 23 17:48:19 2021

Remove comment

comment 7

Here is a time profile, git-annex built with stack build --profile on linux.

        Fri Sep 24 18:18 2021 Time and Allocation Profiling Report  (Final)

           git-annex +RTS -p -RTS smudge --clean x

        total time  =        0.01 secs   (13 ticks @ 1000 us, 1 processor)
        total alloc =   1,732,704 bytes  (excludes profiling overheads)

COST CENTRE                    MODULE                           SRC                                                    %time %alloc

createProcess                  Utility.Process                  Utility/Process.hs:(177,1)-(180,16)                     23.1   16.3
MAIN                           MAIN                             <built-in>                                               7.7    1.5
>>=.\                          Data.Attoparsec.Internal.Types   Data/Attoparsec/Internal/Types.hs:(148,9)-(149,44)       7.7    0.2
searchParser                   Options.Applicative.Common       src/Options/Applicative/Common.hs:(136,1)-(151,40)       7.7    0.9
splitPath.f                    System.FilePath.Posix.ByteString System/FilePath/Posix/../Internal.hs:(857,9)-(862,49)    7.7    1.6
withCatFileMetaDataHandle      Annex.CatFile                    Annex/CatFile.hs:(86,1)-(89,46)                          7.7    0.0
withAltRepo                    Annex.GitOverlay                 Annex/GitOverlay.hs:(107,1)-(124,31)                     7.7    0.1
cmdname                        Types.Command                    Types/Command.hs:80:11-17                                7.7    0.1
stop                           Utility.CoProcess                Utility/CoProcess.hs:(58,1)-(63,46)                      7.7    0.2
relPathCwdToFile               Utility.Path.AbsRel              Utility/Path/AbsRel.hs:(76,1)-(86,36)                    7.7    0.5
getEnv                         Utility.Env                      Utility/Env.hs:32:1-18                                   7.7    0.4

Note that the profiler saw a 0.01 second runtime, but actual runtime is more like 0.04 seconds (actually 0.08 with this profiling build, but non-profiling builds are faster). The rest of the runtime must be linking and RTS startup?

It would be good to get a similar profile on windows for comparison.

Comment by joey — Fri Sep 24 22:21:45 2021

Remove comment

comment 8

It would be good to get a similar profile on windows for comparison.

I tried to create a profile*. Here is what I am seeing on a Windows 10 system, version 2004, OS build 19041.1237:

        Tue Sep 28 14:48 2021 Time and Allocation Profiling Report  (Final)

       git-annex.exe +RTS -p -RTS smudge --clean x

    total time  =        0.00 secs   (1 ticks @ 1000 us, 1 processor)
    total alloc =     956,208 bytes  (excludes profiling overheads)

COST CENTRE               MODULE                               SRC                                                         %time %alloc

CAF                       GHC.IO.Handle.FD                     <entire-module>                                             100.0    5.5
MAIN                      MAIN                                 <built-in>                                                    0.0    1.1
createProcess             Utility.Process                      Utility\Process.hs:(177,1)-(180,16)                           0.0    8.5
relPathCwdToFile          Utility.Path.AbsRel                  Utility\Path\AbsRel.hs:(76,1)-(86,36)                         0.0    2.0
absPath                   Utility.Path.AbsRel                  Utility\Path\AbsRel.hs:(55,1)-(66,45)                         0.0    2.0
checkForRepo.gitSignature Git.Construct                        Git\Construct.hs:234:9-56                                     0.0    2.1
parse.sep                 Git.Config                           Git\Config.hs:(163,9)-(165,38)                                0.0    2.8
needsGitLinkFixup         Annex.Fixup                          Annex\Fixup.hs:(149,1)-(155,34)                               0.0    3.0
clean                     Command.Smudge                       Command\Smudge.hs:(90,1)-(173,60)                             0.0    3.6
parseCmd.subcmds          CmdLine                              CmdLine.hs:109:9-64                                           0.0    3.0
parseCmd.mkcommand        CmdLine                              CmdLine.hs:(110,9)-(113,31)                                   0.0    1.1
cmds                      CmdLine.GitAnnex                     CmdLine\GitAnnex.hs:(132,1)-(244,9)                           0.0    1.3
addGitAnnexGlobalOptions  CmdLine.GitAnnex                     CmdLine\GitAnnex.hs:247:1-97                                  0.0    1.6
splitPath.f.(...)         System.FilePath.Windows.ByteString   System\FilePath\Windows\..\Internal.hs:862:17-49              0.0    1.4
splitPath.f.(...)         System.FilePath.Windows.ByteString   System\FilePath\Windows\..\Internal.hs:861:17-49              0.0    1.8
splitPath.f               System.FilePath.Windows.ByteString   System\FilePath\Windows\..\Internal.hs:(857,9)-(862,49)       0.0    1.7
<>                        Options.Applicative.Builder.Internal src\Options\Applicative\Builder\Internal.hs:(96,3)-(97,47)    0.0    1.7
stepParser                Options.Applicative.Common           src\Options\Applicative\Common.hs:(187,1)-(194,35)            0.0    3.1
searchParser              Options.Applicative.Common           src\Options\Applicative\Common.hs:(136,1)-(151,40)            0.0    1.7
pure                      Options.Applicative.Internal         src\Options\Applicative\Internal.hs:55:3-21                   0.0    1.1
mplus                     Options.Applicative.Internal         src\Options\Applicative\Internal.hs:(215,3)-(219,52)          0.0    1.0
>>=                       Options.Applicative.Internal         src\Options\Applicative\Internal.hs:64:3-50                   0.0    1.3
>>=                       Options.Applicative.Internal         src\Options\Applicative\Internal.hs:(200,3)-(204,54)          0.0    1.2
encodeChar.go             Codec.Binary.UTF8.String             Codec\Binary\UTF8\String.hs:(52,3)-(67,25)                    0.0    1.0
encodeChar                Codec.Binary.UTF8.String             Codec\Binary\UTF8\String.hs:(50,1)-(67,25)                    0.0    2.7
encode                    Codec.Binary.UTF8.String             Codec\Binary\UTF8\String.hs:72:1-29                           0.0    2.3
unconsB                   Codec.Binary.UTF8.Generic            Codec\Binary\UTF8\Generic.hs:297:1-18                         0.0    1.5
foldr                     Data.ByteString.UTF8                 Data\ByteString\UTF8.hs:(171,1)-(173,40)                      0.0    1.1
catch                     Control.Monad.Catch                  src\Control\Monad\Catch.hs:333:3-32                           0.0    1.5




[...]

*I'm not familiar with Haskell, and don't 100% understand what you ran, so I'll detail how I created it in case I did something obvious wrong. I build git-annex at state b9aa2ce8d1 from source using stack Version 2.7.3, Git revision 7927a3aec32e2b2e5e4fb5be76d0d50eddcc197f x86_64 hpack-0.34.4. It used ghc 8.8.4. I ran stack build --profile to build the executable.

I created an empty git/git annex repository - no files committed to Git or git-annex at all. In this repository, I ran git-annex.exe +RTS -p -RTS smudge --clean x (as shown in your profile -- as this command didn't return, I just killed it after a few seconds). The profile above is the first section of the resulting .prof file.

Hope this helps and was the correct procedure. Let me know if I should repeat any step any differently. Thanks!

Comment by adina.wagner — Tue Sep 28 12:53:13 2021

Remove comment

comment 8

Corresponding time profile on Windows. This was run on a github CI instance, so I don't know if the CPU was busy with other tasks.

    Mon Sep 27 22:45 2021 Time and Allocation Profiling Report  (Final)

       git-annex.exe +RTS -p -RTS smudge --clean x

    total time  =        0.02 secs   (22 ticks @ 1000 us, 1 processor)
    total alloc =   6,753,624 bytes  (excludes profiling overheads)

COST CENTRE                    MODULE                     SRC                                                 %time %alloc

createProcess                  Utility.Process            Utility\Process.hs:(177,1)-(180,16)                  50.0   18.7
openLock.\                     Utility.LockFile.Windows   Utility\LockFile\Windows.hs:(64,17)-(65,76)           4.5    0.0
toInodeCache'                  Utility.InodeCache         Utility\InodeCache.hs:(196,1)-(206,34)                4.5    3.0
checkSentinalFile.loadoldcache Utility.InodeCache         Utility\InodeCache.hs:(260,9)-(261,83)                4.5    0.3
adjustGitEnv                   Git.Env                    Git\Env.hs:(18,1)-(22,7)                              4.5   41.2
pipeReadStrict'.go             Git.Command                Git\Command.hs:(80,9)-(85,37)                         4.5    0.1
mergeGitConfig                 Types.GitConfig            Types\GitConfig.hs:(284,1)-(299,54)                   4.5    0.0
fixupRepo                      Annex.Fixup                Annex\Fixup.hs:(37,1)-(42,31)                         4.5    0.0
noCommit                       Command                    Command.hs:56:1-37                                    4.5    0.0
getFileSize                    System.PosixCompat.Files   src\System\PosixCompat\Files.hsc:(358,1)-(359,76)     4.5    1.5
searchParser                   Options.Applicative.Common src\Options\Applicative\Common.hs:(136,1)-(151,40)    4.5    0.2
catch                          Control.Monad.Catch        src\Control\Monad\Catch.hs:333:3-32                   4.5    0.2
getFileSize                    Utility.FileSize           Utility\FileSize.hs:37:1-80                           0.0    1.5
getFileType                    System.PosixCompat.Files   src\System\PosixCompat\Files.hsc:(350,1)-(355,62)     0.0    2.9
getFileStatus                  System.PosixCompat.Files   src\System\PosixCompat\Files.hsc:(312,1)-(337,24)     0.0    6.0
unconsB                        Codec.Binary.UTF8.Generic  Codec\Binary\UTF8\Generic.hs:297:1-18                 0.0    1.2

(Whole profile at https://tmp.joeyh.name/windows-profile)

What stands out is that createProcess is twice as expensive as linux, and fully half the runtime is apparently spent just forking a few processes.

Here are the git processes started and percent of time to start them, from the more detailed profiling:

git config  9%
git cat-file    9%
git check-attr  4.5%
git diff (reconcileStaged)  9%
git show-ref (reconcileStaged)  9%
git write-tree (reconcileStaged)    9%

(The reconcileStaged stuff only happened because git init didn't do it -- I forgot to put git-annex in the path the way I ran it on the CI builder, and so init didn't do everything it usually would. A second run with that fixed had a createProcess percent reduced to 30%, though still at around2 0.2s total runtime.)

This is not super slow on the Windows CI, it's competative with Linux, though my Linux laptop probably has a slower CPU (1.5ghz).

Comment by joey — Tue Sep 28 16:22:36 2021

Remove comment

comment 9

So most of the time is spent in createProcess. Forking is not slow (on linux anyway) so why are 4 createProcess 23% of runtime on linux?

Here is a strace --relative-timestamps on linux, showing a single createProcess call, for reading git config.

 0.000196 pipe([11, 12])            = 0
 0.000134 pipe([13, 14])            = 0
 0.000067 rt_sigprocmask(SIG_BLOCK, [INT], [], 8) = 0
 0.000074 vfork()                   = 1412528
 0.000552 close(12)                 = 0
 0.000058 fcntl(11, F_SETFD, FD_CLOEXEC) = 0
 0.000095 close(14)                 = 0
 0.000050 fcntl(13, F_SETFD, FD_CLOEXEC) = 0
 0.000047 read(13, "", 4)           = 0
 0.000050 close(13)                 = 0
 0.000061 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
 0.000062 futex(0x98336c8, FUTEX_WAKE_PRIVATE, 1) = 1
 0.000061 futex(0x9833660, FUTEX_WAKE_PRIVATE, 1) = 1
 0.000066 fcntl(11, F_GETFL)        = 0 (flags O_RDONLY)
 0.000047 fcntl(11, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
 0.000073 ioctl(11, TCGETS, 0x7ffc091b0d80) = -1 ENOTTY (Inappropriate ioctl for device)

Total runtime of createProcess above is 1.69 ms. And from the profile on windows, it's taking around 1.7 ms per createProcess.

I'd say that at least 1.32 ms of that is necessary, leaving out the futex and rt_sigprocmask that are probably GHC runtime stuff, and the ioctl and read and fcntl F_GETFL which seem unnecessary. If those were optimised out, the total git-annex smudge --clean runtime would speed up by only 10% or so.

Feels like I've reached the end of profiling. Most of the time is being spent starting git processes, and it can't be sped up significantly without starting fewer git processes.

(I do wish that git check-attr could be removed, but it's needed for the annex.largefiles .gitattributes support.)

Comment by joey — Tue Sep 28 17:09:21 2021

Remove comment

comment 10

Since my profiling above shows windows is as fast as linux (though probably on faster hardware), one thing I am wondering is if antivirus could be slowing it down. I know AV on windows can slow down things like writing to files, because it blocks closing the file until it finishes scanning. Maybe the github windows CI does not run AV, but your windows does, mih?

Comment by joey — Tue Sep 28 17:57:50 2021

Remove comment

comment 12

@adina.wagner wonderful, thanks for the second windows profile..

You need to run git-annex init first, unsure if you did. And you need to create the file x, containing eg "foo", and pipe that file to git-annex smudge --clean x

I think your profile reflects it being stuck waiting for stdin for some time, with the GHC.IO.Handle.FD at the top. Otherwise, it would probably look more like the one I did on window, since createProcess is near the top. It would be good to verify that.

Comment by joey — Tue Sep 28 18:01:17 2021

Remove comment

comment 13

For future reference, I modified the datalad/git-annex workflow to build on windows to do the profiling, here's how the end of that workflow looks:

  - name: Build git-annex
    run: stack install --profile --no-haddock --extra-include-dirs=$PWD --extra-lib-dirs=$PWD --local-bin-path .

  - name: profile
    run: |
      git init profile
      cd profile
      PATH=..:$PATH ../git-annex.exe init
      echo foo > x
      ../git-annex.exe +RTS -p -RTS smudge --clean x
      cat git-annex.prof

Comment by joey — Tue Sep 28 18:13:08 2021

Remove comment

comment 14

And you need to create the file x, containing eg "foo", and pipe that file to git-annex smudge --clean x

Thanks much for clarifying! I did run git annex init, but did not create and pipe the file into the command. Here is the profile after running it correctly:

    Tue Sep 28 21:00 2021 Time and Allocation Profiling Report  (Final)

       git-annex.exe +RTS -p -RTS smudge x --clean

    total time  =        0.01 secs   (13 ticks @ 1000 us, 1 processor)
    total alloc =   5,906,000 bytes  (excludes profiling overheads)

COST CENTRE           MODULE                             SRC                                                %time %alloc

cmdname               Types.Command                      Types\Command.hs:80:11-17                           15.4    0.0
MAIN                  MAIN                               <built-in>                                           7.7    0.1
catchMaybeIO          Utility.Exception                  Utility\Exception.hs:53:1-63                         7.7    0.2
readish               Utility.PartialPrelude             Utility\PartialPrelude.hs:(57,1)-(59,20)             7.7    0.4
createProcess         Utility.Process                    Utility\Process.hs:(177,1)-(180,16)                  7.7    6.9
readInodeCache        Utility.InodeCache                 Utility\InodeCache.hs:(175,1)-(186,20)               7.7    0.6
read                  Git.Config                         Git\Config.hs:(41,1)-(43,33)                         7.7    0.0
CAF:cmd3              Command.Find                       <no location info>                                   7.7    0.0
normalise.dropDots    System.FilePath.Windows.ByteString System\FilePath\Windows\..\Internal.hs:1018:9-34     7.7    0.0
execParserPure.pinfo' Options.Applicative.Extra          src\Options\Applicative\Extra.hs:(136,5)-(138,51)    7.7    0.0
uncons                Data.ByteString.UTF8               Data\ByteString\UTF8.hs:(166,1)-(167,38)             7.7    1.0
catch                 Control.Monad.Catch                src\Control\Monad\Catch.hs:333:3-32                  7.7    0.3
doesPathExist         Utility.RawFilePath                Utility\RawFilePath.hs:80:1-49                       0.0    4.0
getFileSize           Utility.FileSize                   Utility\FileSize.hs:37:1-80                          0.0    3.9
toInodeCache'         Utility.InodeCache                 Utility\InodeCache.hs:(196,1)-(206,34)               0.0    9.7
assertLocal           Git                                Git.hs:(131,1)-(137,28)                              0.0    1.1
clean                 Command.Smudge                     Command\Smudge.hs:(90,1)-(173,60)                    0.0    1.1
escape                Database.Persist.Sqlite            Database\Persist\Sqlite.hs:(605,1)-(610,24)          0.0    1.2
getFileType           System.PosixCompat.Files           src\System\PosixCompat\Files.hsc:(350,1)-(355,62)    0.0    9.4
getFileStatus         System.PosixCompat.Files           src\System\PosixCompat\Files.hsc:(312,1)-(337,24)    0.0   19.4
getFileSize           System.PosixCompat.Files           src\System\PosixCompat\Files.hsc:(358,1)-(359,76)    0.0    3.9
unconsB               Codec.Binary.UTF8.Generic          Codec\Binary\UTF8\Generic.hs:297:1-18                0.0    3.7
foldr                 Data.ByteString.UTF8               Data\ByteString\UTF8.hs:(171,1)-(173,40)             0.0    2.8
decode.choose         Data.ByteString.UTF8               Data\ByteString\UTF8.hs:(72,3)-(78,39)               0.0    1.8
decode                Data.ByteString.UTF8               Data\ByteString\UTF8.hs:(68,1)-(124,32)              0.0    1.3

one thing I am wondering is if antivirus could be slowing it down. I know AV on windows can slow down things like writing to files, because it blocks closing the file until it finishes scanning.

The one I am using doesn't run AV, and I believe @mih's doesn't either.

Comment by adina.wagner — Tue Sep 28 19:19:20 2021

Remove comment

comment 15

cmdname               Types.Command                      Types\Command.hs:80:11-17                           15.4    0.0

I don't trust this latest profile, because cmdname is a a field of a data structure. There's no code there to take 15% of runtime, it's basically following a pointer. I think probably the profiler got confused there.

Anyway, git-annex smudge --clean x < x is running as fast in that profile as it does on linux, and is certainly nowhere near the 1.9s runtime of git add that this bug report is about. I wonder if it also runs that fast for @mih?

Possibly something else is making git add slow.

Comment by joey — Tue Sep 28 20:33:00 2021

Remove comment

comment 16

git add x takes less than 0.1 seconds run on the github windows CI in a git-annex repo.

  - name: profile
    run: |
      git init profile
      cd profile
      PATH=..:$PATH ../git-annex.exe init
      echo foo > x
      date
      PATH=..:$PATH git add x
      date

Output:

ok
(recording state in git...)
Tue Sep 28 21:51:46 CUT 2021
warning: LF will be replaced by CRLF in x.
The file will have its original line endings in your working directory
Tue Sep 28 21:51:46 CUT 2021

So, whatever made it slow for @mih is not a problem here, and I guess not where @adina ran it either.

Comment by joey — Tue Sep 28 21:53:49 2021

Remove comment

comment 17

The new git-annex filter-process should improve this speed a lot. It avoids a new process being started for each file that is added.

git config filter.annex.process 'git-annex filter-process'

That may become the default in v9, or possibly in new v8 repositories.

There is a tradeoff, since git add of a large file to the annex gets slower when it's enabled. Only about 5% in my benchmarks on linux, but maybe more on windows, I don't know.

Comment by joey — Tue Nov 16 16:28:11 2021

Remove comment

Translates to Windows!

Sorry for being silent for so long. I just got a contemporary machine with windows, such that hardware age should no longer be a concern for any performance comparison.

I did not yet find the time to re-assess this issue in full, but I tried the new filter-process setting with a simple datalad create (this only adds a few tiny files, but nevertheless took long, and was the original motivation for this issue). Enabling the new setting reduces the runtime by 25% (from 4.5s to 3.5s on average).

Comment by mih — Thu Nov 25 07:34:48 2021

Remove comment

More statistics

I looked into the global affect of this switch on a large and versatile set of use cases in the form of the DataLad test suite: https://github.com/datalad/datalad/pull/6245

It is worth keeping in mind that there are only small-size files involved!

The benefit is somewhere between noticeable and remarkable. An overall runtime reduction of 16% with benefits ranging from 5% to 32% depending on the tested functionality.

This will need a bit of further investigation to drill down on the reason for this large variability, but given that the sign is always in the right direction this is really great! Thx much!

Comment by mih — Thu Nov 25 13:09:11 2021

Remove comment

Even more impact on real systems

It seems the impact on real systems is even more sever: https://github.com/datalad/datalad/pull/6245#issuecomment-979992551 -- factor 2-3 faster on that particular machine. Maybe consumer systems do some kind of expensive process startup inspection (antivirus?) that bring a swarm of child processes to a crawl.

Comment by mih — Fri Nov 26 14:23:34 2021

Remove comment

comment 21

@mih I've been pondering enabling filter.annex.process by default in new repositories in Windows or generally. (Enabling it in existing repositories kind of needs a new major version so an upgrade can set it, although introducing repository minor versions is also a possibility.)

Enabling it earlier in datalad is fine by me, more experience with it being used would be good.

It would also be useful to get some benchmarks of git add when large files are added to the annex (eg git -c annex.largefiles=anything add). As I said, that suffers around 5% on performance on Linux, at least when the files are small enough to still mostly fit in disk cache (1 gb on a 4 gb system with some web browsers etc running). It may be that Windows will pay a higher price. I don't have real Windows machines to run such a benchmark on myself. Please post any such benchmarks to incremental hashing for add.

Comment by joey — Mon Nov 29 17:17:29 2021

Remove comment

comment 22

Enabling filter.annex.process in new repositories would unfortunately break using that repository with an older version of git-annex before that was added to it. So it seems that git-annex init cannot do it for annex.version 8 and will need to wait for v9.

Of course, if you are sure users of a repository will not be using an old version of git-annex, it's fine to enable it.

I'm going to close this bug now, since ?v9 changes is open to get it enabled by default in v9.

Comment by joey — Tue Jan 11 17:24:49 2022

Remove comment

comment 23

v9 is implemented, and enables filter-process. Not yet the default, but will be eventually.

Comment by joey — Fri Jan 21 17:04:11 2022

Remove comment

Add a comment