Would you rather that git checkout got a lot faster at checking out a lot of files, and git add got a lot faster at adding a lot of small files, if the tradeoff was that git add and git commit -a got slower at adding large files to the annex than they are now?
Being able to make that choice is what I'm working on now. Of course, we'd rather it were all fast, but due to the suboptimal design of git's smudge/clean interface, that is not possible without improvements to git. But I seem to have a plan that will work around enough of the problems to let that choice be made.
Today I've been laying the groundwork by implementing git's pkt-line interface and the long-running filter process protocol.
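For readers who haven't run into those bits of git plumbing: pkt-line is git's simple length-prefixed framing, and the long-running filter process protocol (described in gitattributes(5)) lets git start a single filter process, do a "git-filter-client" / "git-filter-server" version and capability handshake with it, and then send each file's smudge or clean request (command=, pathname=, and the content) over pkt-lines to that one process, instead of spawning a fresh filter per file. A rough illustration of the framing itself, not git-annex's actual code:

    # each packet is a 4-digit hex length that counts its own 4 header bytes,
    # followed by the payload; the special packet "0000" is a flush-pkt
    # (in the real protocol, textual payloads also end with a newline)
    payload='version=2'
    printf '%04x%s' $(( ${#payload} + 4 )) "$payload"   # prints: 000dversion=2
    printf '0000'                                       # flush-pkt ends a section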
Next step will be to add support for that in git-annex smudge, so that users who want to can enable it with:

    git config filter.annex.process 'git-annex filter-process'
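For context, that would be an addition to the smudge/clean configuration that git-annex init already sets up, which looks roughly like this (details may vary between versions):

    # .git/config
    [filter "annex"]
        smudge = git-annex smudge -- %f
        clean = git-annex smudge --clean -- %f

    # .gitattributes
    * filter=annex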
I can imagine filter.annex.process becoming enabled by default at some point in v9, if most users prefer it over the current method, which would still be available by unsetting the config.
Today's work was sponsored by Mark Reidenbach on Patreon.
If you have all locked files, git will never run the smudge/clean filter, so its performance will not matter to you. If you have non-annexed files in git, it does run the filter on those though, and this will improve that case.
Not when annex.supportunlocked=false, right? (Of course, that's not the default setting, so the smudge filter changes will still help by default.)

Right, annex.supportunlocked avoids all these performance considerations.
But, it seems possible that enabling filter-process will speed it up enough to not need annex.supportunlocked. I'd be interested to know if that works for you.
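Concretely, the two options being compared here come down to a couple of git configs; if I understand the config right, a change to annex.supportunlocked only takes effect after re-running git annex init:

    # avoid the smudge/clean filter entirely (locked files only)
    git config annex.supportunlocked false
    git annex init

    # or keep unlocked file support and try the long-running filter instead
    git config filter.annex.process 'git-annex filter-process'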
Until this is ready, I have for now gotten around the slowness by setting things up so that the file size is ignored and git instead makes the decision for me based on file extension. This is much faster than running every file through git annex smudge. I might end up sticking with this even with the faster filter API.
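The exact setting isn't quoted above, but one way to get that effect (the extensions here are only examples) is to limit in .gitattributes which paths get the annex filter at all, so git skips the filter for everything else:

    # .gitattributes (illustrative only; adjust patterns to taste)
    * !filter
    *.mp4 filter=annex
    *.iso filter=annex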
This is probably waaaaay too late but I wanted to chime in with my 2 cents to the question from Joey.
My personal preference would be to optimize for large files as opposed to many small files. My reasoning is that we can work around the slowness with many small files by tar-ing them. This is also, IMO, usually the better option (at least for me), since we rarely have use cases where a gajillion small files directly need to be accessible.
Scientific datasets often do have many small files, e.g. the results of computing something for many subsets of data under many combinations of parameter settings. The files need to be directly accessible so that downstream analysis tools can read them.
This has been implemented for a while; see day 642 cost model for how I got around the dilemma. Upgrade to repository v9 to use it.
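Assuming an older repository, that upgrade is just the following (see the git-annex upgrade man page for details and caveats):

    git annex upgrade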