Would you rather that git checkout got a lot faster at checking out a lot of files, and git add got a lot faster at adding a lot of small files, if the tradeoff was that git add and git commit -a got slower at adding large files to the annex than they are now?
Being able to make that choice is what I'm working on now. Of course, we'd rather it were all fast, but due to the suboptimal design of git's smudge/clean interface, that is not possible without improvements to git. But I seem to have a plan that will work around enough of the problems to let that choice be made.
Today I've been laying the groundwork by implementing git's pkt-line interface and the long-running filter process protocol.
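For readers who haven't run into those bits of git plumbing: pkt-line is git's simple length-prefixed framing, and the long-running filter process protocol (described in gitattributes(5)) lets git start a single filter process, do a "git-filter-client" / "git-filter-server" version and capability handshake with it, and then send each file's smudge or clean request (command=, pathname=, and the content) over pkt-lines to that one process, instead of spawning a fresh filter per file. A rough illustration of the framing itself, not git-annex's actual code:

    # each packet is a 4-digit hex length that counts its own 4 header bytes,
    # followed by the payload; the special packet "0000" is a flush-pkt
    # (in the real protocol, textual payloads also end with a newline)
    payload='version=2'
    printf '%04x%s' $(( ${#payload} + 4 )) "$payload"   # prints: 000dversion=2
    printf '0000'                                       # flush-pkt ends a section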
Next step will be to add support for that in git-annex smudge, so that users who want to can enable it with:

    git config filter.annex.process 'git-annex filter-process'
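For context, that would be an addition to the smudge/clean configuration that git-annex init already sets up, which looks roughly like this (details may vary between versions):

    # .git/config
    [filter "annex"]
        smudge = git-annex smudge -- %f
        clean = git-annex smudge --clean -- %f

    # .gitattributes
    * filter=annex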
I can imagine filter.annex.process becoming enabled by default at some point in v9, if most users prefer it over the current method, which would still be available by unsetting the config.
Today's work was sponsored by Mark Reidenbach on Patreon.
If you have all locked files, git will never run the smudge/clean filter, so its performance will not matter to you. If you have non-annexed files in git, it does run the filter on those though, and this will improve that case.
Not when annex.supportunlocked=false, right? (Of course, that's not the default setting, so the smudge filter changes will still help by default.)

Right, annex.supportunlocked avoids all these performance considerations.
But, it seems possible that enabling filter-process will speed it up enough to not need annex.supportunlocked. I'd be interested to know if that works for you.
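Concretely, the two options being compared here come down to a couple of git configs; if I understand the config right, a change to annex.supportunlocked only takes effect after re-running git annex init:

    # avoid the smudge/clean filter entirely (locked files only)
    git config annex.supportunlocked false
    git annex init

    # or keep unlocked file support and try the long-running filter instead
    git config filter.annex.process 'git-annex filter-process'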
Until this is ready, I have for now gotten around the slowness by setting things up so that the file size is ignored and git instead makes the decision for me based on file extension. This is much faster than running every file through git annex smudge. I might end up sticking with this even with the faster filter API.
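The exact setting isn't quoted above, but one way to get that effect (the extensions here are only examples) is to limit in .gitattributes which paths get the annex filter at all, so git skips the filter for everything else:

    # .gitattributes (illustrative only; adjust patterns to taste)
    * !filter
    *.mp4 filter=annex
    *.iso filter=annex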
This is probably waaaaay too late but I wanted to chime in with my 2 cents to the question from Joey.
My personal preference would be to optimize for large files as opposed to many small files. My reasoning is that we can work around the slowness with many small files by tar-ing them. This is also, IMO, usually the better option (at least for me), since we rarely have use cases where a gajillion small files directly need to be accessible.
Scientific datasets often do have many small files, e.g. the results of computing something for many subsets of data under many combinations of parameter settings. The files need to be directly accessible so that downstream analysis tools can read them.
This has been implemented for a while; see day 642 cost model for how I got around the dilemma. Upgrade to repository v9 to use it.
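Assuming an older repository, that upgrade is just the following (see the git-annex upgrade man page for details and caveats):

    git annex upgrade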