When git-annex filter-process is enabled (v9 and above), git add pipes the content of files into it, but that content is thrown away, and the file is read again by git-annex to generate a hash. It would improve performance to hash the content provided via the pipe.
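For context, enabling filter-process amounts to this git config setting (v9 repositories get it configured automatically):

    git config filter.annex.process 'git-annex filter-process'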
When filter-process is not enabled, git-annex smudge --clean reads the file to hash it, then reads it a second time to copy it into .git/annex/objects. When annex.addunlocked is enabled, git-annex add does the same. It would improve performance to read the file once, copying and hashing it at the same time.
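The single-pass idea can be illustrated at the shell level: tee writes the copy while sha256sum consumes the same stream, so the file is only read once (paths here are examples):

    # one read of the file, copy and hash happen simultaneously
    tee /tmp/copy-of-file < file | sha256sum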
The incrementalhash branch has a start at implementing this.
I lost steam on this branch when I realized that it would need to
re-implement Annex.Ingest.ingest in order to populate
.git/annex/objects/. And it's not as simple as writing an object file
and moving it into place there, because annex.thin means a hard link should
be made, and if the filesystem supports CoW, that should be used rather
than writing the file again.
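A sketch of the three population strategies involved (the paths and the fallback ordering are illustrative, not the branch's actual code):

    # annex.thin: hard link into the object store, no data copied
    ln file .git/annex/objects/.../KEY
    # CoW-capable filesystems (btrfs, XFS): clone the extents instead of copying
    cp --reflink=always file .git/annex/objects/.../KEY
    # otherwise: fall back to a plain copy
    cp file .git/annex/objects/.../KEY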
A benchmark on Linux showed that git add of a 1 GB file is about 5% slower with filter-process enabled than with it disabled. That's due to the piping overhead to filter-process (see git smudge clean interface suboptimal). git-annex add with annex.addunlocked has similar performance to git add with filter-process disabled. git-annex add without annex.addunlocked is about 25% faster than both, and only reads the file once; but it also does not copy the file, so of course it's faster, and always will be.
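Something along these lines reproduces the comparison (the exact invocation used for the benchmark is an assumption):

    dd if=/dev/urandom of=bigfile bs=1M count=1024    # 1 GB test file
    git config filter.annex.process 'git-annex filter-process'
    time git add bigfile                              # with filter-process
    # reset the repo state between runs, then:
    git config --unset filter.annex.process
    time git add bigfile                              # without filter-process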
Probably the disk cache helps all of those cases a fair amount, unless it's too small to hold the file. So it's not clear how much implementing this would really speed them up.
This does not really affect default configurations. Performance is only impacted when annex.addunlocked or annex.largefiles is configured, and in a few cases where an already annexed file is added by git add or git commit -a.
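Concretely, those are settings like the following (the values are examples):

    git config annex.addunlocked true               # add files in unlocked form
    git config annex.largefiles 'largerthan=100kb'  # only annex files over 100kb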
So is the complication of implementing this worth it? Users who need maximum speed can use git-annex add.
Following https://git-annex.branchable.com/bugs/Windows__58__substantial_per-file_cost_for___96__add__96__/#comment-7e6fd13d8b60d99cbc5e642e1d76f10e I've run a few quick benchmarks for git add with and without the filter.annex.process setting. I'm using a Lenovo ThinkPad T14 with Windows 10 Pro Version 20H2, OS build 19042.1348. It has an AMD Ryzen 7 PRO 4750U with Radeon Graphics 1.70 GHz and 16GB RAM. It ran on battery when I tested this.

Relevant software:
- git-annex:
- Git:
The machine is pretty much brand new; there is no antivirus, and little software installed other than git and git-annex. It is sometimes moody, but today it felt fast.
I created files of different sizes, from 1MB to 50GB, and added them sequentially into the same repo, on a busy machine running the datalad test suite in the background. There are more timings for "with config" because there was an additional global setting I forgot to remove initially, and I only added the really large files in later runs, so I have a few more timings for smaller files.
What I ran:
- without config:
- with config:
- example run protocol:
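The sequential adds could look something like this (sizes and invocations are assumptions, not the original protocol):

    for size in 1 10 100 1024 10240 51200; do   # 1MB .. 50GB
        dd if=/dev/zero of="file-${size}MB" bs=1M count="$size"
        time git add "file-${size}MB"
    done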
Hope this is helpful.
Thanks @adina.wagner, that is very useful. It seems reliable enough, despite the other workload, to say that the overhead on Windows is around 14%-33%. Interesting that the worst case is with the smallest file.