projects/datalad/bugs-done/Windows: substantial per-file cost for `add`git-annexhttp://git-annex.branchable.com/projects/datalad/bugs-done/Windows__58___substantial_per-file_cost_for___96__add__96__/git-annexikiwiki2023-01-05T17:30:31Zcomment 1http://git-annex.branchable.com/projects/datalad/bugs-done/Windows__58___substantial_per-file_cost_for___96__add__96__/comment_1_8623b98bf21427ae5734d74e2576af5b/joey2023-01-05T17:30:31Z2021-09-21T14:29:12Z
<p>Each <code>git add</code> has to run a new <code>git annex smudge</code>
process. <code>git commit</code> will often run it as well.
This is discussed in detail in
<a href="http://git-annex.branchable.com/todo/git_smudge_clean_interface_suboptiomal/">git smudge clean interface suboptiomal</a>.</p>
<p>According to
<a href="https://github.com/datalad/datalad/issues/5994#issuecomment-923744895">https://github.com/datalad/datalad/issues/5994#issuecomment-923744895</a>
git-annex 8.20201127 had the same performance that you are seeing
with the current version. So it cannot be any changes in the past year
that impacted git-annex performance.</p>
<p>But <a href="https://github.com/datalad/datalad/issues/5317">https://github.com/datalad/datalad/issues/5317</a> seems like it would
have been using around the same version of git-annex, and the timing there
was much faster. Unfortunatly the issue does not say what version
of git-annex was used, but it seems likely it would have been around
8.20201127 since that was released 2 months prior.</p>
<p>So are the timings in those issues comparable, or is it an apples/oranges
comparison with different hardware or OS version?</p>
<p>And if the timings between those issues are not comparable, what exactly
is this new issue comparing with when it says the peformance has worsened?</p>
Performance stats on crippled Debian NTFS-mounthttp://git-annex.branchable.com/projects/datalad/bugs-done/Windows__58___substantial_per-file_cost_for___96__add__96__/comment_2_80c5ec6993e8a720fbd1841fdd21e311/mih2023-01-05T17:30:31Z2021-09-22T18:37:10Z
<p>I am responding to your comments in an altered order:</p>
<blockquote><p>So are the timings in those issues comparable, or is it an apples/oranges comparison with different hardware or OS version? And if the timings between those issues are not comparable, what exactly is this new issue comparing with when it says the performance has worsened?</p></blockquote>
<p>The timing are comparable insofar as they come from the same hardware and "same" OS, but that OS has seen various updates over time, hence cannot be considered constant.</p>
<blockquote><p>Each git add has to run a new git annex smudge process. git commit will often run it as well. This is discussed in detail in git smudge clean interface suboptiomal.</p></blockquote>
<p>Yes, I understand that this is a limitation that is outside the control of git-annex. However, the difference between windows (with implied crippling) and a crippled FS mounted on a proper OS are substantial. Here are a bunch of stats, matching those above.</p>
<p>From https://github.com/datalad/datalad/issues/5317#issuecomment-760767158 a description on how the real windows test system used above relates to my laptop used for the stats below:</p>
<blockquote><p>The machine is a little older (1 year older than the machine I used for the stats posted before), but not a cripple (Core i5, two physical cores at 2.6GHz, 8GB RAM, mSATA SSD, uptodate bios from 2019, and an uptodate win10).</p></blockquote>
<p>Here I am using a Core i7-6500U CPU @ 2.50GHz, 8GB RAM, mSATA SSD (via USB3). git 2.33.0, git annex version 8.20210903:</p>
<p>For convenience, I am putting the values from my original post with the stats computed on windows in <em>brackets</em>.</p>
<ul>
<li>git init: 0.01s <em>[0.1s]</em></li>
<li>git annex init: 0.1s <em>[5.7s]</em></li>
</ul>
<p>First timing info is the command executed in an annex-repo (after git annex init), the second timing is the same command executed in a plain Git repo.</p>
<p>after creating a 3-byte text file:</p>
<ul>
<li>git add file: 0.05s (0.01s) <em>[1.9s (0.1s)]</em></li>
<li>git commit file -m msg: 0.08s (0.02s) <em>[3.2s (0.1s)]</em></li>
</ul>
<p>after creating two new 3-byte test files:</p>
<ul>
<li>git add .: 0.08s (0.01s) <em>[4.5s (0.1s)]</em></li>
<li>git commit -m msg: 0.04s (0.02s) <em>[2.4s (0.1s)]</em></li>
</ul>
<p>after creating eight more 3-byte text files:</p>
<ul>
<li>git add .: 0.36s (0.01s) <em>[13.1s (0.1s)]</em></li>
<li>git commit -m msg: 0.04s (0.01s) <em>[1.6 (0.1s)]</em></li>
</ul>
<p>now adding a 225 MB binary file</p>
<ul>
<li>git add binfiile: 10.8s (10.9s) <em>[16s (13s)]</em></li>
<li>git commit -m msg: 1.2s (0.03s) <em>[2s (0.2s)]</em></li>
</ul>
<p>no changes</p>
<ul>
<li>git commit --amend -m msg: 1.3s (0.01s) <em>[2s (0.1s)]</em></li>
</ul>
<p>So taken together, we can clearly see the price that needs to be paid for the smudge filter approach. However, it is nowhere near the penalty paid on the real windows system, both in absolute terms, as well as relative to plain git operations. Critically (for me) the total difference for a <code>datalad create</code> amounts to:</p>
<ul>
<li>25s on a crippled NTFS drive on windows</li>
<li>0.9s on a crippled NTFS drive on Debian</li>
</ul>
<p>(on otherwise fairly comparable hardware).</p>
<p>I think this indicates that something is slower on windows that <em>cannot</em> be explained by</p>
<ul>
<li>Git generally being slower on Windows</li>
<li>git-annex generally being slower when juggling smudge filters and adjusted branches</li>
<li>an overall performance difference between my two test machines</li>
</ul>
comment 3http://git-annex.branchable.com/projects/datalad/bugs-done/Windows__58___substantial_per-file_cost_for___96__add__96__/comment_3_85f07d906b836ac968ab877258180c0c/joey2023-01-05T17:30:31Z2021-09-22T19:33:58Z
<p>Look at the 3 git add results, dividing windows runtime by linux:</p>
<pre><code>1.9/0.05 = 38.00
4.5/0.08 = 56.25
13.1/0.36 = 36.39
</code></pre>
<p>The middle is slightly an outlier, and it would be better to have more data
points, but what this says to me is it's probably around 38x more expensive
on windows than on linux for git-annex smudge --clean to run.</p>
<p>git-annex smudge --clean makes on the order of 4000 syscalls, including
opening 200 files, execing git 8 times, and statting 500 files. That's
around 10x as many syscalls as git add makes. And it's run once per file. So
relatively small differences in syscall performance between windows and
linux can add up.</p>
<p>I've looked at just this kind of comparisons before, and it has always
seemed explainable by differences in syscall overhead. I don't
really see anything in your numbers that says otherwise.</p>
<p>I'm still curious if there's an older version of git-annex that was faster
(after it stopped using direct mode in v7).
If I've understood correctly, you don't seem to be saying that there is.</p>
<p>If it's always been this slow, then about all I can think to do to improve
it is profile git-annex smudge --clean on windows and see if anything other
than those syscalls is somehow being slow.</p>
v7 v v8 comparisonhttp://git-annex.branchable.com/projects/datalad/bugs-done/Windows__58___substantial_per-file_cost_for___96__add__96__/comment_4_0d2ab75208877136534040b726c7f97b/mih2023-01-05T17:30:31Z2021-09-23T06:29:30Z
<blockquote><p>I'm still curious if there's an older version of git-annex that was faster (after it stopped using direct mode in v7). If I've understood correctly, you don't seem to be saying that there is.</p></blockquote>
<p>Here is some data that seems to support this view.</p>
<p>I dug up another windows box on the higher performance end</p>
<ul>
<li>windows 11 (not 10!) build 10.0.22000</li>
<li>intel 6-core xeon E-2136 @ 3.3ghz</li>
<li>NVMe drive</li>
</ul>
<p>All stats done with</p>
<ul>
<li>git 2.32.0.windows.2</li>
</ul>
<p>I am reporting two values each, the first for git annex 8.20210715-g7334893d4 and the second for 7.20191107-g8ea269ef7 (the oldest windows build that I could find). Consequently, the first measurement is for a v8 repo, the second for a v7 repo (but no direct mode).</p>
<p>init</p>
<ul>
<li>git init: 0.1s / 0.1s</li>
<li>git annex init: 3.2s / 3.2s</li>
</ul>
<p>after creating a 3-byte text file:</p>
<ul>
<li>git add file: 1.0s / 0.9s</li>
<li>git commit file -m msg: 1.8s / 1.8s</li>
</ul>
<p>after creating two new 3-byte test files:</p>
<ul>
<li>git add .: 1.9s / 1.7s</li>
<li>git commit -m msg: 0.9s / 0.9s</li>
</ul>
<p>(NB: the <code>add</code> increase is caused by unique keys, adding a bunch of identical files is the same as adding one file)</p>
<p>after creating eight more 3-byte text files:</p>
<ul>
<li>git add .: 6.5s / 6.8s</li>
<li>git commit -m msg: 0.9s / 0.9s</li>
</ul>
<p>now adding a 280 MB binary file</p>
<ul>
<li>git add binfiile: 9.8s / 9.9s</li>
<li>git commit -m msg: 1.2s / 1.2s</li>
</ul>
<p>no changes</p>
<ul>
<li>git commit --amend -m msg: 1.2s / 1.2s</li>
</ul>
<p>So there is pretty much no change, and in particular no change attributable to the last ~2 years of git-annex evolution.</p>
<p>However, what remains is that a substantially more capable windows workstation (internal NVMe, faster CPU) comes nowhere near the performance of my 6 year old Debian laptop with an external USB3 drive -- despite having to go through the same smudge filter complications.</p>
<p>I would be grateful, if you could have a look at the windows implementation of <code>smudge --clean</code>. Please let me know what kind of contribution could help to push such an effort. Thanks!</p>
comment 5http://git-annex.branchable.com/projects/datalad/bugs-done/Windows__58___substantial_per-file_cost_for___96__add__96__/comment_5_891b532bd662b1198351803b82ed634e/joey2023-01-05T17:30:31Z2021-09-23T17:22:19Z
<p>Thank you for verifying this is not a recent speed reversion.</p>
<p>You could use <code>git -c filter.annex.clean= add smallfile</code>
to avoid the overhead of the smudge filter.
Using <code>git annex add --force-small</code> or <code>git annex add</code> with an
annex.largefiles config will also avoid it.</p>
<p>ghc 8.12 has a new IO manager for windows, which
I think is likely to be faster. (It avoids the win32 API.) I would
want to look at that before other windows-specific optimisations.
The windows build is still using lts-16.27 (ghc 8.8.3), there is not
yet an lts release with ghc 8.12.</p>
<p>I don't know if there are really any windows-specific optimisations to
be had in the smudge code other than such low-level stuff.</p>
comment 6http://git-annex.branchable.com/projects/datalad/bugs-done/Windows__58___substantial_per-file_cost_for___96__add__96__/comment_6_3947b44c88820da49d8a2d1afd3595ff/joey2023-01-05T17:30:31Z2021-09-23T17:48:19Z
<p>I noticed in the strace that smudge --clean ran git cat-file 2
more times than necessary. Also was able to avoid updating the git-annex
branch, which eliminates several calls to git (depending on the number of
remotes). On Linux, this made it 25% faster. Might be more on Windows.</p>
<p>Rest of the strace looks clean, nothing else stands out as unnecessary.</p>
comment 7http://git-annex.branchable.com/projects/datalad/bugs-done/Windows__58___substantial_per-file_cost_for___96__add__96__/comment_7_1c867ba26de03f921dd347c14b8ec402/joey2023-01-05T17:30:31Z2021-09-24T22:21:45Z
<p>Here is a time profile, git-annex built with <code>stack build --profile</code>
on linux.</p>
<pre><code> Fri Sep 24 18:18 2021 Time and Allocation Profiling Report (Final)
git-annex +RTS -p -RTS smudge --clean x
total time = 0.01 secs (13 ticks @ 1000 us, 1 processor)
total alloc = 1,732,704 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
createProcess Utility.Process Utility/Process.hs:(177,1)-(180,16) 23.1 16.3
MAIN MAIN <built-in> 7.7 1.5
>>=.\ Data.Attoparsec.Internal.Types Data/Attoparsec/Internal/Types.hs:(148,9)-(149,44) 7.7 0.2
searchParser Options.Applicative.Common src/Options/Applicative/Common.hs:(136,1)-(151,40) 7.7 0.9
splitPath.f System.FilePath.Posix.ByteString System/FilePath/Posix/../Internal.hs:(857,9)-(862,49) 7.7 1.6
withCatFileMetaDataHandle Annex.CatFile Annex/CatFile.hs:(86,1)-(89,46) 7.7 0.0
withAltRepo Annex.GitOverlay Annex/GitOverlay.hs:(107,1)-(124,31) 7.7 0.1
cmdname Types.Command Types/Command.hs:80:11-17 7.7 0.1
stop Utility.CoProcess Utility/CoProcess.hs:(58,1)-(63,46) 7.7 0.2
relPathCwdToFile Utility.Path.AbsRel Utility/Path/AbsRel.hs:(76,1)-(86,36) 7.7 0.5
getEnv Utility.Env Utility/Env.hs:32:1-18 7.7 0.4
</code></pre>
<p>Note that the profiler saw a 0.01 second runtime, but actual runtime is more
like 0.04 seconds (actually 0.08 with this profiling build, but non-profiling
builds are faster). The rest of the runtime must be linking and RTS startup?</p>
<p>It would be good to get a similar profile on windows for comparison.</p>
comment 8http://git-annex.branchable.com/projects/datalad/bugs-done/Windows__58___substantial_per-file_cost_for___96__add__96__/comment_8_4ee757b45706816c3296dd4b29f05fbf/adina.wagner2023-01-05T17:30:31Z2021-09-28T12:53:13Z
<blockquote><p>It would be good to get a similar profile on windows for comparison.</p></blockquote>
<p>I tried to create a profile*. Here is what I am seeing on a Windows 10 system, version 2004, OS build 19041.1237:</p>
<pre><code> Tue Sep 28 14:48 2021 Time and Allocation Profiling Report (Final)
git-annex.exe +RTS -p -RTS smudge --clean x
total time = 0.00 secs (1 ticks @ 1000 us, 1 processor)
total alloc = 956,208 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
CAF GHC.IO.Handle.FD <entire-module> 100.0 5.5
MAIN MAIN <built-in> 0.0 1.1
createProcess Utility.Process Utility\Process.hs:(177,1)-(180,16) 0.0 8.5
relPathCwdToFile Utility.Path.AbsRel Utility\Path\AbsRel.hs:(76,1)-(86,36) 0.0 2.0
absPath Utility.Path.AbsRel Utility\Path\AbsRel.hs:(55,1)-(66,45) 0.0 2.0
checkForRepo.gitSignature Git.Construct Git\Construct.hs:234:9-56 0.0 2.1
parse.sep Git.Config Git\Config.hs:(163,9)-(165,38) 0.0 2.8
needsGitLinkFixup Annex.Fixup Annex\Fixup.hs:(149,1)-(155,34) 0.0 3.0
clean Command.Smudge Command\Smudge.hs:(90,1)-(173,60) 0.0 3.6
parseCmd.subcmds CmdLine CmdLine.hs:109:9-64 0.0 3.0
parseCmd.mkcommand CmdLine CmdLine.hs:(110,9)-(113,31) 0.0 1.1
cmds CmdLine.GitAnnex CmdLine\GitAnnex.hs:(132,1)-(244,9) 0.0 1.3
addGitAnnexGlobalOptions CmdLine.GitAnnex CmdLine\GitAnnex.hs:247:1-97 0.0 1.6
splitPath.f.(...) System.FilePath.Windows.ByteString System\FilePath\Windows\..\Internal.hs:862:17-49 0.0 1.4
splitPath.f.(...) System.FilePath.Windows.ByteString System\FilePath\Windows\..\Internal.hs:861:17-49 0.0 1.8
splitPath.f System.FilePath.Windows.ByteString System\FilePath\Windows\..\Internal.hs:(857,9)-(862,49) 0.0 1.7
<> Options.Applicative.Builder.Internal src\Options\Applicative\Builder\Internal.hs:(96,3)-(97,47) 0.0 1.7
stepParser Options.Applicative.Common src\Options\Applicative\Common.hs:(187,1)-(194,35) 0.0 3.1
searchParser Options.Applicative.Common src\Options\Applicative\Common.hs:(136,1)-(151,40) 0.0 1.7
pure Options.Applicative.Internal src\Options\Applicative\Internal.hs:55:3-21 0.0 1.1
mplus Options.Applicative.Internal src\Options\Applicative\Internal.hs:(215,3)-(219,52) 0.0 1.0
>>= Options.Applicative.Internal src\Options\Applicative\Internal.hs:64:3-50 0.0 1.3
>>= Options.Applicative.Internal src\Options\Applicative\Internal.hs:(200,3)-(204,54) 0.0 1.2
encodeChar.go Codec.Binary.UTF8.String Codec\Binary\UTF8\String.hs:(52,3)-(67,25) 0.0 1.0
encodeChar Codec.Binary.UTF8.String Codec\Binary\UTF8\String.hs:(50,1)-(67,25) 0.0 2.7
encode Codec.Binary.UTF8.String Codec\Binary\UTF8\String.hs:72:1-29 0.0 2.3
unconsB Codec.Binary.UTF8.Generic Codec\Binary\UTF8\Generic.hs:297:1-18 0.0 1.5
foldr Data.ByteString.UTF8 Data\ByteString\UTF8.hs:(171,1)-(173,40) 0.0 1.1
catch Control.Monad.Catch src\Control\Monad\Catch.hs:333:3-32 0.0 1.5
[...]
</code></pre>
<p>*I'm not familiar with Haskell, and don't 100% understand what you ran, so I'll detail how I created it in case I did something obvious wrong.
I build git-annex at state b9aa2ce8d1 from source using stack Version 2.7.3, Git revision 7927a3aec32e2b2e5e4fb5be76d0d50eddcc197f x86_64 hpack-0.34.4. It used ghc 8.8.4.
I ran <code>stack build --profile</code> to build the executable.</p>
<p>I created an empty git/git annex repository - no files committed to Git or git-annex at all. In this repository, I ran <code>git-annex.exe +RTS -p -RTS smudge --clean x</code> (as shown in your profile -- as this command didn't return, I just killed it after a few seconds). The profile above is the first section of the resulting <code>.prof</code> file.</p>
<p>Hope this helps and was the correct procedure. Let me know if I should repeat any step any differently. Thanks! <img src="http://git-annex.branchable.com/smileys/smile.png" alt=":)" /></p>
comment 8http://git-annex.branchable.com/projects/datalad/bugs-done/Windows__58___substantial_per-file_cost_for___96__add__96__/comment_8_b46545c78628379dd50d306c87550b9d/joey2023-01-05T17:30:31Z2021-09-28T16:22:36Z
<p>Corresponding time profile on Windows. This was run on a github CI
instance, so I don't know if the CPU was busy with other tasks.</p>
<pre><code> Mon Sep 27 22:45 2021 Time and Allocation Profiling Report (Final)
git-annex.exe +RTS -p -RTS smudge --clean x
total time = 0.02 secs (22 ticks @ 1000 us, 1 processor)
total alloc = 6,753,624 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
createProcess Utility.Process Utility\Process.hs:(177,1)-(180,16) 50.0 18.7
openLock.\ Utility.LockFile.Windows Utility\LockFile\Windows.hs:(64,17)-(65,76) 4.5 0.0
toInodeCache' Utility.InodeCache Utility\InodeCache.hs:(196,1)-(206,34) 4.5 3.0
checkSentinalFile.loadoldcache Utility.InodeCache Utility\InodeCache.hs:(260,9)-(261,83) 4.5 0.3
adjustGitEnv Git.Env Git\Env.hs:(18,1)-(22,7) 4.5 41.2
pipeReadStrict'.go Git.Command Git\Command.hs:(80,9)-(85,37) 4.5 0.1
mergeGitConfig Types.GitConfig Types\GitConfig.hs:(284,1)-(299,54) 4.5 0.0
fixupRepo Annex.Fixup Annex\Fixup.hs:(37,1)-(42,31) 4.5 0.0
noCommit Command Command.hs:56:1-37 4.5 0.0
getFileSize System.PosixCompat.Files src\System\PosixCompat\Files.hsc:(358,1)-(359,76) 4.5 1.5
searchParser Options.Applicative.Common src\Options\Applicative\Common.hs:(136,1)-(151,40) 4.5 0.2
catch Control.Monad.Catch src\Control\Monad\Catch.hs:333:3-32 4.5 0.2
getFileSize Utility.FileSize Utility\FileSize.hs:37:1-80 0.0 1.5
getFileType System.PosixCompat.Files src\System\PosixCompat\Files.hsc:(350,1)-(355,62) 0.0 2.9
getFileStatus System.PosixCompat.Files src\System\PosixCompat\Files.hsc:(312,1)-(337,24) 0.0 6.0
unconsB Codec.Binary.UTF8.Generic Codec\Binary\UTF8\Generic.hs:297:1-18 0.0 1.2
</code></pre>
<p>(Whole profile at <a href="https://tmp.joeyh.name/windows-profile">https://tmp.joeyh.name/windows-profile</a>)</p>
<p>What stands out is that createProcess is twice as expensive as linux, and
fully half the runtime is apparently spent just forking a few processes.</p>
<p>Here are the git processes started and percent of time to start them,
from the more detailed profiling:</p>
<pre><code>git config 9%
git cat-file 9%
git check-attr 4.5%
git diff (reconcileStaged) 9%
git show-ref (reconcileStaged) 9%
git write-tree (reconcileStaged) 9%
</code></pre>
<p>(The reconcileStaged stuff only happened because <code>git init</code>
didn't do it -- I forgot to put git-annex in the path the way I ran it
on the CI builder, and so init didn't do everything it usually would.
A second run with that fixed had a createProcess percent reduced to 30%,
though still at around2 0.2s total runtime.)</p>
<p>This is not super slow on the Windows CI, it's competative with Linux,
though my Linux laptop probably has a slower CPU (1.5ghz).</p>
comment 9http://git-annex.branchable.com/projects/datalad/bugs-done/Windows__58___substantial_per-file_cost_for___96__add__96__/comment_9_409a95cb01a4b26d46d7aaff042026a0/joey2023-01-05T17:30:31Z2021-09-28T17:09:21Z
<p>So most of the time is spent in createProcess. Forking is not slow
(on linux anyway) so why are 4 createProcess 23% of runtime on linux?</p>
<p>Here is a strace --relative-timestamps on linux, showing a single createProcess
call, for reading <code>git config</code>.</p>
<pre><code> 0.000196 pipe([11, 12]) = 0
0.000134 pipe([13, 14]) = 0
0.000067 rt_sigprocmask(SIG_BLOCK, [INT], [], 8) = 0
0.000074 vfork() = 1412528
0.000552 close(12) = 0
0.000058 fcntl(11, F_SETFD, FD_CLOEXEC) = 0
0.000095 close(14) = 0
0.000050 fcntl(13, F_SETFD, FD_CLOEXEC) = 0
0.000047 read(13, "", 4) = 0
0.000050 close(13) = 0
0.000061 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
0.000062 futex(0x98336c8, FUTEX_WAKE_PRIVATE, 1) = 1
0.000061 futex(0x9833660, FUTEX_WAKE_PRIVATE, 1) = 1
0.000066 fcntl(11, F_GETFL) = 0 (flags O_RDONLY)
0.000047 fcntl(11, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
0.000073 ioctl(11, TCGETS, 0x7ffc091b0d80) = -1 ENOTTY (Inappropriate ioctl for device)
</code></pre>
<p>Total runtime of createProcess above is 1.69 ms.
And from the profile on windows, it's taking around 1.7 ms per createProcess.</p>
<p>I'd say that at least 1.32 ms of that is necessary, leaving out the <code>futex</code>
and <code>rt_sigprocmask</code> that are probably GHC runtime stuff, and the <code>ioctl</code>
and <code>read</code> and <code>fcntl F_GETFL</code> which seem unnecessary.
If those were optimised out, the total <code>git-annex smudge --clean</code>
runtime would speed up by only 10% or so.</p>
<p>Feels like I've reached the end of profiling. Most of the time is being spent
starting git processes, and it can't be sped up significantly without
starting fewer git processes.</p>
<p>(I do wish that <code>git check-attr</code> could be removed, but it's needed for the
annex.largefiles .gitattributes support.)</p>
comment 10http://git-annex.branchable.com/projects/datalad/bugs-done/Windows__58___substantial_per-file_cost_for___96__add__96__/comment_10_4e3cec5a4ce33dadfd0ce7a8d17bfaa0/joey2023-01-05T17:30:31Z2021-09-28T17:57:50Z
<p>Since my profiling above shows windows is as fast as linux (though
probably on faster hardware), one thing I am wondering is if antivirus
could be slowing it down. I know AV on windows can slow down things like
writing to files, because it blocks closing the file until it finishes
scanning. Maybe the github windows CI does not run AV, but your windows
does, mih?</p>
comment 12http://git-annex.branchable.com/projects/datalad/bugs-done/Windows__58___substantial_per-file_cost_for___96__add__96__/comment_12_f1a92680ab7874aa0c102cbf4b8a445a/joey2023-01-05T17:30:31Z2021-09-28T18:01:17Z
<p>@adina.wagner wonderful, thanks for the second windows profile..</p>
<p>You need to run <code>git-annex init</code> first, unsure if you did. And
you need to create the file <code>x</code>, containing eg "foo", and pipe
that file to <code>git-annex smudge --clean x</code></p>
<p>I think your profile reflects it being stuck waiting
for stdin for some time, with the <code>GHC.IO.Handle.FD</code> at the top.
Otherwise, it would probably look more
like the one I did on window, since createProcess is near the
top. It would be good to verify that.</p>
comment 13http://git-annex.branchable.com/projects/datalad/bugs-done/Windows__58___substantial_per-file_cost_for___96__add__96__/comment_13_e38a372b6d9f498160369a9c3071939a/joey2023-01-05T17:30:31Z2021-09-28T18:13:08Z
<p>For future reference, I modified the datalad/git-annex workflow to build on windows
to do the profiling, here's how the end of that workflow looks:</p>
<pre><code> - name: Build git-annex
run: stack install --profile --no-haddock --extra-include-dirs=$PWD --extra-lib-dirs=$PWD --local-bin-path .
- name: profile
run: |
git init profile
cd profile
PATH=..:$PATH ../git-annex.exe init
echo foo > x
../git-annex.exe +RTS -p -RTS smudge --clean x
cat git-annex.prof
</code></pre>
comment 14http://git-annex.branchable.com/projects/datalad/bugs-done/Windows__58___substantial_per-file_cost_for___96__add__96__/comment_14_26472db382e75e03cf10c6e5770ef42b/adina.wagner2023-01-05T17:30:31Z2021-09-28T19:19:20Z
<blockquote><p>And you need to create the file x, containing eg "foo", and pipe that file to git-annex smudge --clean x</p></blockquote>
<p>Thanks much for clarifying! I did run <code>git annex init</code>, but did not create and pipe the file into the command. Here is the profile after running it correctly:</p>
<pre><code> Tue Sep 28 21:00 2021 Time and Allocation Profiling Report (Final)
git-annex.exe +RTS -p -RTS smudge x --clean
total time = 0.01 secs (13 ticks @ 1000 us, 1 processor)
total alloc = 5,906,000 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
cmdname Types.Command Types\Command.hs:80:11-17 15.4 0.0
MAIN MAIN <built-in> 7.7 0.1
catchMaybeIO Utility.Exception Utility\Exception.hs:53:1-63 7.7 0.2
readish Utility.PartialPrelude Utility\PartialPrelude.hs:(57,1)-(59,20) 7.7 0.4
createProcess Utility.Process Utility\Process.hs:(177,1)-(180,16) 7.7 6.9
readInodeCache Utility.InodeCache Utility\InodeCache.hs:(175,1)-(186,20) 7.7 0.6
read Git.Config Git\Config.hs:(41,1)-(43,33) 7.7 0.0
CAF:cmd3 Command.Find <no location info> 7.7 0.0
normalise.dropDots System.FilePath.Windows.ByteString System\FilePath\Windows\..\Internal.hs:1018:9-34 7.7 0.0
execParserPure.pinfo' Options.Applicative.Extra src\Options\Applicative\Extra.hs:(136,5)-(138,51) 7.7 0.0
uncons Data.ByteString.UTF8 Data\ByteString\UTF8.hs:(166,1)-(167,38) 7.7 1.0
catch Control.Monad.Catch src\Control\Monad\Catch.hs:333:3-32 7.7 0.3
doesPathExist Utility.RawFilePath Utility\RawFilePath.hs:80:1-49 0.0 4.0
getFileSize Utility.FileSize Utility\FileSize.hs:37:1-80 0.0 3.9
toInodeCache' Utility.InodeCache Utility\InodeCache.hs:(196,1)-(206,34) 0.0 9.7
assertLocal Git Git.hs:(131,1)-(137,28) 0.0 1.1
clean Command.Smudge Command\Smudge.hs:(90,1)-(173,60) 0.0 1.1
escape Database.Persist.Sqlite Database\Persist\Sqlite.hs:(605,1)-(610,24) 0.0 1.2
getFileType System.PosixCompat.Files src\System\PosixCompat\Files.hsc:(350,1)-(355,62) 0.0 9.4
getFileStatus System.PosixCompat.Files src\System\PosixCompat\Files.hsc:(312,1)-(337,24) 0.0 19.4
getFileSize System.PosixCompat.Files src\System\PosixCompat\Files.hsc:(358,1)-(359,76) 0.0 3.9
unconsB Codec.Binary.UTF8.Generic Codec\Binary\UTF8\Generic.hs:297:1-18 0.0 3.7
foldr Data.ByteString.UTF8 Data\ByteString\UTF8.hs:(171,1)-(173,40) 0.0 2.8
decode.choose Data.ByteString.UTF8 Data\ByteString\UTF8.hs:(72,3)-(78,39) 0.0 1.8
decode Data.ByteString.UTF8 Data\ByteString\UTF8.hs:(68,1)-(124,32) 0.0 1.3
</code></pre>
<blockquote><p> one thing I am wondering is if antivirus could be slowing it down. I know AV on windows can slow down things like writing to files, because it blocks closing the file until it finishes scanning.</p></blockquote>
<p>The one I am using doesn't run AV, and I believe @mih's doesn't either.</p>
comment 15http://git-annex.branchable.com/projects/datalad/bugs-done/Windows__58___substantial_per-file_cost_for___96__add__96__/comment_15_94d2048dcd33712e08f2206a5182b23f/joey2023-01-05T17:30:31Z2021-09-28T20:33:00Z
<pre><code>cmdname Types.Command Types\Command.hs:80:11-17 15.4 0.0
</code></pre>
<p>I don't trust this latest profile, because cmdname is a
a field of a data structure. There's no code there to take 15% of
runtime, it's basically following a pointer.
I think probably the profiler got confused there.</p>
<p>Anyway, <code>git-annex smudge --clean x < x</code> is running as fast in that profile
as it does on linux, and is certainly nowhere near the 1.9s runtime of <code>git add</code>
that this bug report is about. I wonder if it also runs that fast for @mih?</p>
<p>Possibly something else is making git add slow.</p>
comment 16http://git-annex.branchable.com/projects/datalad/bugs-done/Windows__58___substantial_per-file_cost_for___96__add__96__/comment_16_0a170534b266ca55afbfcfe191ff18fc/joey2023-01-05T17:30:31Z2021-09-28T21:53:49Z
<p><code>git add x</code> takes less than 0.1 seconds run on the github windows CI in a
git-annex repo.</p>
<pre><code> - name: profile
run: |
git init profile
cd profile
PATH=..:$PATH ../git-annex.exe init
echo foo > x
date
PATH=..:$PATH git add x
date
</code></pre>
<p>Output:</p>
<pre><code>ok
(recording state in git...)
Tue Sep 28 21:51:46 CUT 2021
warning: LF will be replaced by CRLF in x.
The file will have its original line endings in your working directory
Tue Sep 28 21:51:46 CUT 2021
</code></pre>
<p>So, whatever made it slow for @mih is not a problem here, and I guess
not where @adina ran it either.</p>
comment 17http://git-annex.branchable.com/projects/datalad/bugs-done/Windows__58___substantial_per-file_cost_for___96__add__96__/comment_17_7db34973cb362bee271545c3c58690b4/joey2023-01-05T17:30:31Z2021-11-16T16:28:11Z
<p>The new <code>git-annex filter-process</code> should improve this speed a lot.
It avoids a new process being started for each file that is added.</p>
<p>git config filter.annex.process 'git-annex filter-process'</p>
<p>That may become the default in v9, or possibly in new v8 repositories.</p>
<p>There is a tradeoff, since <code>git add</code> of a large file to the annex gets
slower when it's enabled. Only about 5% in my benchmarks on linux, but
maybe more on windows, I don't know.</p>
Translates to Windows!http://git-annex.branchable.com/projects/datalad/bugs-done/Windows__58___substantial_per-file_cost_for___96__add__96__/comment_18_2e8e840cabbc547ef758568bb2779f9a/mih2023-01-05T17:30:31Z2021-11-25T07:34:48Z
<p>Sorry for being silent for so long. I just got a contemporary machine with windows, such that hardware age should no longer be a concern for any performance comparison.</p>
<p>I did not yet find the time to re-assess this issue in full, but I tried the new filter-process setting with a simple <code>datalad create</code> (this only adds a few tiny files, but nevertheless took long, and was the original motivation for this issue). Enabling the new setting reduces the runtime by 25% (from 4.5s to 3.5s on average).</p>
More statisticshttp://git-annex.branchable.com/projects/datalad/bugs-done/Windows__58___substantial_per-file_cost_for___96__add__96__/comment_19_c89853eca5c16e9dfa4a4acc982ab526/mih2023-01-05T17:30:31Z2021-11-25T13:09:11Z
<p>I looked into the global affect of this switch on a large and versatile set of use cases in the form of the DataLad test suite: https://github.com/datalad/datalad/pull/6245</p>
<p>It is worth keeping in mind that there are only small-size files involved!</p>
<p>The benefit is somewhere between noticeable and remarkable. An overall runtime reduction of 16% with benefits ranging from 5% to 32% depending on the tested functionality.</p>
<p>This will need a bit of further investigation to drill down on the reason for this large variability, but given that the sign is always in the right direction this is really great! Thx much!</p>
Even more impact on real systemshttp://git-annex.branchable.com/projects/datalad/bugs-done/Windows__58___substantial_per-file_cost_for___96__add__96__/comment_20_281ded4caa0dd407431b205ce6746c05/mih2023-01-05T17:30:31Z2021-11-26T14:23:34Z
It seems the impact on real systems is even more sever: https://github.com/datalad/datalad/pull/6245#issuecomment-979992551 -- factor 2-3 faster on that particular machine. Maybe consumer systems do some kind of expensive process startup inspection (antivirus?) that bring a swarm of child processes to a crawl.
comment 21http://git-annex.branchable.com/projects/datalad/bugs-done/Windows__58___substantial_per-file_cost_for___96__add__96__/comment_21_cc08c222cc36e3a0462f370df12eb129/joey2023-01-05T17:30:31Z2021-11-29T17:17:29Z
<p>@mih I've been pondering enabling filter.annex.process by default in new
repositories in Windows or generally. (Enabling it in existing repositories
kind of needs a new major version so an upgrade can set it, although introducing
repository minor versions is also a possibility.)</p>
<p>Enabling it earlier in datalad is fine by me, more experience with it being
used would be good.</p>
<p>It would also be useful to get some benchmarks of <code>git add</code> when
large files are added to the annex (eg <code>git -c annex.largefiles=anything
add</code>). As I said, that suffers around 5% on performance on Linux, at least
when the files are small enough to still mostly fit in disk cache (1 gb on
a 4 gb system with some web browsers etc running). It may be
that Windows will pay a higher price. I don't have real Windows machines
to run such a benchmark on myself. Please post any such benchmarks
to <a href="http://git-annex.branchable.com/todo/incremental_hashing_for_add/">incremental hashing for add</a>.</p>
comment 22http://git-annex.branchable.com/projects/datalad/bugs-done/Windows__58___substantial_per-file_cost_for___96__add__96__/comment_22_1c001d885bd4d7f60b3260d494406e5c/joey2023-01-05T17:30:31Z2022-01-11T17:24:49Z
<p>Enabling filter.annex.process in new repositories would unfortunately
break using that repository with an older version of git-annex before that
was added to it. So it seems that <code>git-annex init</code> cannot do it for
annex.version 8 and will need to wait for v9.</p>
<p>Of course, if you are sure users of a repository will not be using an old
version of git-annex, it's fine to enable it.</p>
<p>I'm going to close this bug now, since <a href="http://git-annex.branchable.com/todo/v9_changes/">v9 changes</a> is open to get it
enabled by default in v9.</p>
comment 23http://git-annex.branchable.com/projects/datalad/bugs-done/Windows__58___substantial_per-file_cost_for___96__add__96__/comment_23_cedee281d3f6f97499e70dacd0560f70/joey2023-01-05T17:30:31Z2022-01-21T17:04:11Z
<p>v9 is implemented, and enables filter-process. Not yet the default, but
will be eventually.</p>