Please describe the problem.
Performance on Windows 10.0.19041, even on small text files, is 10+ times slower than Git alone, and seems to have dropped over the past months (due to Windows, Git, or git-annex developments).
What steps will reproduce the problem?
Compare performance of git add and git commit in a Git repo with vs without git annex init.
The observations below have been taken from this DataLad issue. The hardware of the test system is described over there, but only the relative performance is relevant here.
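A minimal sketch of how such a comparison could be scripted, assuming Python 3 with git and git-annex available on the PATH. The helper names time_command and slowdown are hypothetical and were not used for the measurements reported here.

```python
import subprocess
import sys
import time


def time_command(argv, cwd=None):
    """Run argv once in cwd and return the wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(
        argv,
        cwd=cwd,
        check=True,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return time.perf_counter() - start


def slowdown(annex_seconds, plain_seconds):
    """Relative slowdown of the annex repo compared to the plain Git repo."""
    return annex_seconds / plain_seconds


if __name__ == "__main__":
    # Harmless self-check: time an interpreter start that does nothing.
    print(f"{time_command([sys.executable, '-c', 'pass']):.3f}s")
```

To reproduce a row of the observations, one would call time_command(["git", "add", "file"], cwd=...) once in a repository that had git annex init run and once in a plain Git repository, then pass both results to slowdown. Each command is timed only once, because a repeated git add of an unchanged file is not comparable to the first run.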
git init: 0.1s
git annex init: 5.7s
In each line below, the first timing is for the command executed in an annex repo (after git annex init); the timing in parentheses is for the same command executed in a plain Git repo.
after creating a 3-byte text file:
git add file: 1.9s (0.1s)
git commit file -m msg: 3.2s (0.1s)
after creating two new 3-byte test files:
git add .: 4.5s (0.1s)
git commit -m msg: 2.4s (0.1s)
after creating eight more 3-byte text files:
git add .: 13.1s (0.1s)
git commit -m msg: 1.6s (0.1s)
now adding a 225 MB binary file
git add binfiile: 16s (13s)
git commit -m msg: 2s (0.2s)
no changes
git commit --amend -m msg: 2s (0.1s)
So it looks as if there is a substantial per-file cost of add in an annex repo that is not explained by the underlying Git repo performance.
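As a rough back-of-the-envelope check of this per-file cost, the three git add timings for small files above can be divided by the number of newly added files. This is only a sketch: it assumes the plain-Git time is the right baseline to subtract and that the cost scales linearly with the file count, which three data points only loosely support.

```python
# (new files added, git add seconds in annex repo, git add seconds in plain Git),
# taken from the small-file measurements above.
measurements = [
    (1, 1.9, 0.1),
    (2, 4.5, 0.1),
    (8, 13.1, 0.1),
]


def per_file_overhead(n_files, annex_seconds, plain_seconds):
    """Extra seconds per added file attributable to the annex repo."""
    return (annex_seconds - plain_seconds) / n_files


overheads = [round(per_file_overhead(*m), 1) for m in measurements]
print(overheads)
```

With these numbers the extra cost comes out at roughly 1.6 to 2.2 seconds per added file, which is consistent with a fixed per-file overhead rather than a one-time startup cost.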
What version of git-annex are you using? On what operating system?
- Windows 10.0.19041
- Git 2.33.0.windows.2
- git-annex 8.20210804-g1d3f59a9d
Please provide any additional information below.
The linked datalad issue has more information on the configuration of the Git installation, but that only seems to affect performance broadly, and not git-annex specifically.
I am reporting this behavior now because it has worsened since I last looked into performance on windows. It is unclear to me which developments have led to this (the windows version has also progressed since then). However, even back then, it looked like there is a windows-specific performance issue that cannot be explained by the general handling of crippled filesystems or adjusted branches (comparing performance on an NTFS drive mounted on Debian vs a native windows box).
Thanks for git-annex!
Each git add has to run a new git annex smudge process. git commit will often run it as well. This is discussed in detail in git smudge clean interface suboptiomal.
According to https://github.com/datalad/datalad/issues/5994#issuecomment-923744895 git-annex 8.20201127 had the same performance that you are seeing with the current version. So it cannot be any changes in the past year that impacted git-annex performance.
But https://github.com/datalad/datalad/issues/5317 seems like it would have been using around the same version of git-annex, and the timing there was much faster. Unfortunately the issue does not say what version of git-annex was used, but it seems likely it would have been around 8.20201127 since that was released 2 months prior.
So are the timings in those issues comparable, or is it an apples/oranges comparison with different hardware or OS version?
And if the timings between those issues are not comparable, what exactly is this new issue comparing with when it says the performance has worsened?
I am responding to your comments in an altered order:
The timings are comparable insofar as they come from the same hardware and "same" OS, but that OS has seen various updates over time, hence cannot be considered constant.
Yes, I understand that this is a limitation that is outside the control of git-annex. However, the difference between windows (with implied crippling) and a crippled FS mounted on a proper OS is substantial. Here are a bunch of stats, matching those above.
From https://github.com/datalad/datalad/issues/5317#issuecomment-760767158 a description on how the real windows test system used above relates to my laptop used for the stats below:
Here I am using a Core i7-6500U CPU @ 2.50GHz, 8GB RAM, mSATA SSD (via USB3). git 2.33.0, git annex version 8.20210903:
For convenience, I am putting the values from my original post with the stats computed on windows in brackets.
First timing info is the command executed in an annex-repo (after git annex init), the second timing is the same command executed in a plain Git repo.
after creating a 3-byte text file:
after creating two new 3-byte test files:
after creating eight more 3-byte text files:
now adding a 225 MB binary file
no changes
So taken together, we can clearly see the price that needs to be paid for the smudge filter approach. However, it is nowhere near the penalty paid on the real windows system, both in absolute terms, as well as relative to plain git operations. Critically (for me) the total difference for a datalad create amounts to: (on otherwise fairly comparable hardware).
I think this indicates that something is slower on windows that cannot be explained by
Look at the 3 git add results, dividing windows runtime by linux:
The middle is slightly an outlier, and it would be better to have more data points, but what this says to me is it's probably around 38x more expensive on windows than on linux for git-annex smudge --clean to run.
git-annex smudge --clean makes on the order of 4000 syscalls, including opening 200 files, execing git 8 times, and statting 500 files. That's around 10x as many syscalls as git add makes. And it's run once per file. So relatively small differences in syscall performance between windows and linux can add up.
I've looked at just this kind of comparison before, and it has always seemed explainable by differences in syscall overhead. I don't really see anything in your numbers that says otherwise.
I'm still curious if there's an older version of git-annex that was faster (after it stopped using direct mode in v7). If I've understood correctly, you don't seem to be saying that there is.
If it's always been this slow, then about all I can think to do to improve it is profile git-annex smudge --clean on windows and see if anything other than those syscalls is somehow being slow.
Here is some data that seems to support this view.
I dug up another windows box on the higher performance end
All stats done with
I am reporting two values each, the first for git annex 8.20210715-g7334893d4 and the second for 7.20191107-g8ea269ef7 (the oldest windows build that I could find). Consequently, the first measurement is for a v8 repo, the second for a v7 repo (but no direct mode).
init
after creating a 3-byte text file:
after creating two new 3-byte test files:
(NB: the add increase is caused by unique keys, adding a bunch of identical files is the same as adding one file)
after creating eight more 3-byte text files:
now adding a 280 MB binary file
no changes
So there is pretty much no change, and in particular no change attributable to the last ~2 years of git-annex evolution.
However, what remains is that a substantially more capable windows workstation (internal NVMe, faster CPU) comes nowhere near the performance of my 6 year old Debian laptop with an external USB3 drive -- despite having to go through the same smudge filter complications.
I would be grateful if you could have a look at the windows implementation of smudge --clean. Please let me know what kind of contribution could help to push such an effort. Thanks!
Thank you for verifying this is not a recent speed regression.
You could use git -c filter.annex.clean= add smallfile to avoid the overhead of the smudge filter. Using git annex add --force-small or git annex add with an annex.largefiles config will also avoid it.
ghc 8.12 has a new IO manager for windows, which I think is likely to be faster. (It avoids the win32 API.) I would want to look at that before other windows-specific optimisations. The windows build is still using lts-16.27 (ghc 8.8.3), there is not yet an lts release with ghc 8.12.
I don't know if there are really any windows-specific optimisations to be had in the smudge code other than such low-level stuff.
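The two small-file workarounds mentioned at the start of this comment could be wrapped in a script roughly like this. It is a minimal sketch, assuming Python 3 and that the file is known to belong in Git rather than in the annex; the function names are hypothetical, and only the git command lines themselves come from the comment above.

```python
import subprocess


def add_small_file_argv(path, use_annex=False):
    """Build a command line that adds path to Git without the smudge/clean filter.

    use_annex=False: plain git add with the annex clean filter disabled.
    use_annex=True:  git annex add, forcing the file to be treated as small.
    """
    if use_annex:
        return ["git", "annex", "add", "--force-small", path]
    return ["git", "-c", "filter.annex.clean=", "add", path]


def add_small_file(path, repo=".", use_annex=False):
    """Run the chosen command inside repo; raises if the command fails."""
    subprocess.run(add_small_file_argv(path, use_annex), cwd=repo, check=True)
```

Passing the arguments as a list avoids shell quoting issues with the empty filter.annex.clean= value, which matters on Windows shells in particular.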
I noticed in the strace that smudge --clean ran git cat-file 2 more times than necessary. Also was able to avoid updating the git-annex branch, which eliminates several calls to git (depending on the number of remotes). On Linux, this made it 25% faster. Might be more on Windows.
Rest of the strace looks clean, nothing else stands out as unnecessary.
Here is a time profile, git-annex built with stack build --profile on linux.
Note that the profiler saw a 0.01 second runtime, but actual runtime is more like 0.04 seconds (actually 0.08 with this profiling build, but non-profiling builds are faster). The rest of the runtime must be linking and RTS startup?
It would be good to get a similar profile on windows for comparison.
I tried to create a profile*. Here is what I am seeing on a Windows 10 system, version 2004, OS build 19041.1237:
*I'm not familiar with Haskell, and don't 100% understand what you ran, so I'll detail how I created it in case I did something obviously wrong. I built git-annex at state b9aa2ce8d1 from source using stack Version 2.7.3, Git revision 7927a3aec32e2b2e5e4fb5be76d0d50eddcc197f x86_64 hpack-0.34.4. It used ghc 8.8.4. I ran stack build --profile to build the executable.
I created an empty git/git annex repository - no files committed to Git or git-annex at all. In this repository, I ran git-annex.exe +RTS -p -RTS smudge --clean x (as shown in your profile -- as this command didn't return, I just killed it after a few seconds). The profile above is the first section of the resulting .prof file.
Hope this helps and was the correct procedure. Let me know if I should repeat any step any differently. Thanks!
Corresponding time profile on Windows. This was run on a github CI instance, so I don't know if the CPU was busy with other tasks.
(Whole profile at https://tmp.joeyh.name/windows-profile)
What stands out is that createProcess is twice as expensive as linux, and fully half the runtime is apparently spent just forking a few processes.
Here are the git processes started and percent of time to start them, from the more detailed profiling:
(The reconcileStaged stuff only happened because git init didn't do it -- I forgot to put git-annex in the path the way I ran it on the CI builder, and so init didn't do everything it usually would. A second run with that fixed had a createProcess percent reduced to 30%, though still at around 0.2s total runtime.)
This is not super slow on the Windows CI, it's competitive with Linux, though my Linux laptop probably has a slower CPU (1.5 GHz).
So most of the time is spent in createProcess. Forking is not slow (on linux anyway) so why are 4 createProcess 23% of runtime on linux?
Here is a strace --relative-timestamps on linux, showing a single createProcess call, for reading git config.
Total runtime of createProcess above is 1.69 ms. And from the profile on windows, it's taking around 1.7 ms per createProcess.
I'd say that at least 1.32 ms of that is necessary, leaving out the futex and rt_sigprocmask that are probably GHC runtime stuff, and the ioctl and read and fcntl F_GETFL which seem unnecessary. If those were optimised out, the total git-annex smudge --clean runtime would speed up by only 10% or so.
Feels like I've reached the end of profiling. Most of the time is being spent starting git processes, and it can't be sped up significantly without starting fewer git processes.
(I do wish that git check-attr could be removed, but it's needed for the annex.largefiles .gitattributes support.)
Since my profiling above shows windows is as fast as linux (though probably on faster hardware), one thing I am wondering is if antivirus could be slowing it down. I know AV on windows can slow down things like writing to files, because it blocks closing the file until it finishes scanning. Maybe the github windows CI does not run AV, but your windows does, mih?
@adina.wagner wonderful, thanks for the second windows profile.
You need to run git-annex init first, unsure if you did. And you need to create the file x, containing eg "foo", and pipe that file to git-annex smudge --clean x
I think your profile reflects it being stuck waiting for stdin for some time, with the GHC.IO.Handle.FD at the top. Otherwise, it would probably look more like the one I did on windows, since createProcess is near the top. It would be good to verify that.
For future reference, I modified the datalad/git-annex workflow to build on windows to do the profiling, here's how the end of that workflow looks:
Thanks much for clarifying! I did run git annex init, but did not create and pipe the file into the command. Here is the profile after running it correctly:
The one I am using doesn't run AV, and I believe @mih's doesn't either.
I don't trust this latest profile, because cmdname is a field of a data structure. There's no code there to take 15% of runtime, it's basically following a pointer. I think probably the profiler got confused there.
Anyway, git-annex smudge --clean x < x is running as fast in that profile as it does on linux, and is certainly nowhere near the 1.9s runtime of git add that this bug report is about. I wonder if it also runs that fast for @mih?
Possibly something else is making git add slow.
git add x takes less than 0.1 seconds when run on the github windows CI in a git-annex repo.
Output:
So, whatever made it slow for @mih is not a problem here, and I guess not where @adina ran it either.
The new git-annex filter-process should improve this speed a lot. It avoids a new process being started for each file that is added.
git config filter.annex.process 'git-annex filter-process'
That may become the default in v9, or possibly in new v8 repositories.
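For tools that set up repositories programmatically, the setting above could be applied with a small helper like the following. This is a minimal sketch, assuming Python 3 and a git-annex version that already provides filter-process; the function name and the injectable run parameter are hypothetical and exist only to make the helper easy to test.

```python
import subprocess

# The exact configuration from the comment above, as an argument list.
FILTER_PROCESS_ARGV = [
    "git", "config", "filter.annex.process", "git-annex filter-process",
]


def enable_filter_process(repo=".", run=subprocess.run):
    """Enable the long-running git-annex filter process in the given repo.

    `run` defaults to subprocess.run; a stub can be passed in for testing.
    """
    return run(FILTER_PROCESS_ARGV, cwd=repo, check=True)
```

The setting is written to the repository's local Git config, so it only affects that one clone and can be reverted with git config --unset filter.annex.process.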
There is a tradeoff, since git add of a large file to the annex gets slower when it's enabled. Only about 5% in my benchmarks on linux, but maybe more on windows, I don't know.
Sorry for being silent for so long. I just got a contemporary machine with windows, such that hardware age should no longer be a concern for any performance comparison.
I did not yet find the time to re-assess this issue in full, but I tried the new filter-process setting with a simple datalad create (this only adds a few tiny files, but nevertheless took long, and was the original motivation for this issue). Enabling the new setting reduces the runtime by 25% (from 4.5s to 3.5s on average).
I looked into the global effect of this switch on a large and versatile set of use cases in the form of the DataLad test suite: https://github.com/datalad/datalad/pull/6245
It is worth keeping in mind that there are only small-size files involved!
The benefit is somewhere between noticeable and remarkable. An overall runtime reduction of 16% with benefits ranging from 5% to 32% depending on the tested functionality.
This will need a bit of further investigation to drill down on the reason for this large variability, but given that the sign is always in the right direction this is really great! Thx much!
@mih I've been pondering enabling filter.annex.process by default in new repositories in Windows or generally. (Enabling it in existing repositories kind of needs a new major version so an upgrade can set it, although introducing repository minor versions is also a possibility.)
Enabling it earlier in datalad is fine by me, more experience with it being used would be good.
It would also be useful to get some benchmarks of git add when large files are added to the annex (eg git -c annex.largefiles=anything add). As I said, that suffers around 5% on performance on Linux, at least when the files are small enough to still mostly fit in disk cache (1 gb on a 4 gb system with some web browsers etc running). It may be that Windows will pay a higher price. I don't have real Windows machines to run such a benchmark on myself. Please post any such benchmarks to incremental hashing for add.
Enabling filter.annex.process in new repositories would unfortunately break using that repository with an older version of git-annex before that was added to it. So it seems that git-annex init cannot do it for annex.version 8 and will need to wait for v9.
Of course, if you are sure users of a repository will not be using an old version of git-annex, it's fine to enable it.
I'm going to close this bug now, since v9 changes is open to get it enabled by default in v9.
v9 is implemented, and enables filter-process. Not yet the default, but will be eventually.