git-annex uses git's smudge/clean interface to implement v6 unlocked files. However, the interface is suboptimal for git-annex's needs. While git-annex works around most of the problems with the interface, it can't avoid some consequences of this poor fit, and it has to do some surprising things to make it work as well as it does.
First, how git's smudge/clean interface is meant to work: The smudge filter is run on the content of files as stored in a repo before they are written to the work tree, and can alter the content in arbitrary ways. The clean filter reverses the smudge filter, so git can use it to get the content to store in the repo. See gitattributes(5) for details.
It was originally used for minor textual changes (e.g., line ending conversion), but it's general enough to be used to add large file support to git. git-lfs uses it that way.
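To make the mechanics concrete, here is roughly how a filter driver is wired up; the driver name and commands are hypothetical, a minimal sketch of the interface rather than any real tool's configuration:

    # .gitattributes routes matching files through a filter driver, e.g.:
    #     *.bin filter=demo

    # Register the driver's commands; git replaces %f with the path of the
    # file being filtered, relative to the top of the repository.
    git config filter.demo.smudge 'demo-smudge %f'   # repo content -> work tree
    git config filter.demo.clean  'demo-clean %f'    # work tree -> repo content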
The first problem with the interface was that it ran a command once per file. This was later fixed by extending it to support long-running filter processes, which git-lfs uses. git-annex can also use that interface, when `git-annex filter-process` is enabled; that is the case in v9 repositories and above.
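In git's configuration, a long-running filter is registered via filter.&lt;driver&gt;.process in place of the per-file smudge/clean commands; continuing the hypothetical driver above:

    # One persistent process speaks git's filter protocol for all files,
    # replacing one fork/exec of a filter command per file.
    git config filter.demo.process 'demo-filter-daemon'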
A second problem with the interface, which AFAIK affects git-lfs, is that git buffers the output of the smudge filter in memory before updating the working tree. If the smudge filter emits a large file, git can use a lot of memory. Of course, on modern computers this needs to be hundreds of megabytes to be very noticeable. git-lfs may tend to be used with files that are not that large. git-annex avoids this problem by not using the smudge filter in the usual way, as described below.
A third problem with the interface is that piping large file contents between git and filters is inefficient. It seems this must affect git-lfs too, but perhaps it's used on less enormous data sets than git-annex.
To avoid the problem, `git-annex smudge --clean` relies on a not very well documented trick: it is fed a possibly large file on stdin, but it closes the FD without reading it. git gets a SIGPIPE and stops reading and sending the file. Instead of reading from stdin, git-annex abuses the fact that git provides the clean filter with the work tree filename, and reads and cleans the file itself, more efficiently.
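A clean filter using the trick might look roughly like this sketch; the script and its helper are hypothetical, not git-annex's actual implementation:

    #!/bin/sh
    # Hypothetical clean filter, invoked by git as: clean-via-file %f
    # Close stdin without reading it; git gets SIGPIPE and stops piping
    # the (possibly huge) file content to us.
    exec 0<&-
    # Instead, read the work tree file directly by name, and emit the
    # cleaned content (e.g. a small pointer) on stdout.
    emit-pointer-for "$1"   # hypothetical helper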
git-lfs differs from git-annex in that all the large files in the repository are usually present in the working tree; it doesn't have a way to drop content that is not wanted locally while keeping other content locally available, as git-annex does. So it also does not need to be able to get content on demand the way git-annex can. It further differs in that it uses a central server, which is trusted to retain content, so it doesn't try to avoid losing the local copy, which could be the only copy, as git-annex does. (All AFAIK; I have not looked at git-lfs recently.)
Those properties of git-lfs make it fit fairly well into the smudge/clean interface. Conversely, the different properties of git-annex make it a poor fit.
git-annex needs to be able to update the working tree itself, to make large file content available or not available. But this would cause git to think the file is modified.
The way git-annex works around this is to run git update-index on files after updating them. Git then runs the clean filter, and the clean filter tells git there's not been any real modification of the file.
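Concretely, that looks something like the following; the exact flags git-annex passes are my assumption:

    # After populating or depopulating a work tree file, refresh its index
    # entry; git re-runs the clean filter, which reports no real change.
    git update-index -q --refresh path/to/annexed/file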
git-annex needs to hard link from its object store to a work tree file, to avoid keeping two copies of the file on disk while preventing a rm or git checkout from deleting the only local copy. But the smudge interface does not provide a way to update the worktree itself.
So, git-annex's smudge filter does not actually provide the large file content. It just echoes back the file as checked into git, and remembers that git wanted to check out that file. git-annex installs post-checkout, post-merge, and pre-commit hooks, which update the working tree files to make content from git-annex available. Of course, that means git sees modifications to the working tree, so git-annex then has to run git update-index on the files, which runs the clean filter, as described above.
(Not emitting large files from the smudge filter also avoids the problem with git leaking memory described earlier.)
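The hooks themselves are thin shims; as best I recall (treat the exact contents as an assumption), they amount to this:

    #!/bin/sh
    # .git/hooks/post-checkout (post-merge is equivalent)
    # Populate or depopulate annexed work tree files affected by the checkout.
    git annex smudge --update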
And here are the consequences of git-annex's workarounds:

* It doesn't use the long-running filter process interface by default, so `git add` of a lot of files runs `git-annex smudge --clean` once per file, which is slower than it could be. Using `git-annex add` avoids this problem. So does enabling `git-annex filter-process`, which is the default in v9.
* After a git-annex get/drop, or a git checkout or pull that affects a lot of files, the clean filter gets run once per file, which is, again, slower than ideal. Enabling `git-annex filter-process` can speed this up in some cases, and is the default in v9.
* When `git-annex filter-process` is enabled, it cannot use the trick described above that `git-annex smudge --clean` uses to avoid git piping the whole content of large files through it. The whole file content has to be read, even when git-annex does not need to see it. This mainly slows down `git add` when it is being used with an annex.largefiles configuration to add a large file to the annex, by about 5%. (Incremental hashing for add would improve performance.)
* In a rare situation, git-annex would like to get git to run the clean filter, but it cannot because git has the index locked. So git-annex has to print an ugly warning message saying that `git status` will show modifications to files that are not really modified, and giving a command to fix the `git status` display.
* git does not run any hook after a `git stash`, `git reset --hard`, or `git cherry-pick`, so after these operations, annexed files remain unpopulated until the user runs `git annex fix` (see the commands sketched after this list).
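For example, recovering after one of the hook-less operations in the last point; `git annex fix` is the command named above, while the update-index refresh is my assumption of what the ugly warning message suggests:

    git reset --hard HEAD~1   # no hook runs; annexed files are left unpopulated
    git annex fix             # repopulate them

    # If git status still shows spurious modifications, refreshing the
    # index re-runs the clean filter and clears them (assumed fix-up).
    git update-index -q --refresh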
The best fix would be to improve git's smudge/clean interface:

* Add hooks run after every work tree update, or at least after `git stash` and `git reset --hard`.
* Avoid buffering smudge filter output in memory.
* Allow the smudge filter to modify the work tree itself. (I developed a patch series for this in 2016, but it didn't land. --Joey)
* Allow the clean filter to read work tree files itself, to avoid the overhead of sending huge files through a pipe.
Why were the smudge filter improvements not accepted upstream? All I could find was this post, which explains it was "discarded" because of "bitrot"...
Also: the third "suboptimal" point is that "piping large file contents between git and filters is inefficient". Why? Aren't there ways to efficiently pipe contents between processes? For example, pv uses the splice(2) system call instead of just read/write on the pipes...
Thanks for any details!
At the same time I was working on that patch set, there was another patch being developed that affected the same filters (adding the long-running filter interface to git). I think there may have also been some uncertainty, on the part of the git developers, about taking the filter interface in the direction I wanted.
(Also, honestly, I can only rebase large C patch sets so many times before my time feels better spent doing something else. :-/)
On piping efficiency, splice() doesn't avoid the whole file needing to be read in and written to the pipe, which is the main bottleneck. And the way the smudge/clean filters are used by git-annex (and git-lfs), bouncing the unaltered file content back out the other side using splice() doesn't seem like it would be useful.

The way git-annex uses the '%f' file, the smudge filter does not actually write to it, so that's fine. The clean filter does directly access the file, but this would only be a problem if git ran the clean filter with stdin not matching the file. As far as I know git never does, because the clean filter is only used by `git add` (or `commit -a`).

However, the lack of clarity regarding when it's safe to really access '%f' is indeed a problem with git's interface; git-annex has to very carefully work around git's current, not fully specified behavior.
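For reference, '%f' enters the picture via the filter driver configuration; as best I recall (verify against your repository's .git/config), git-annex init sets roughly:

    git config filter.annex.smudge 'git-annex smudge -- %f'
    git config filter.annex.clean 'git-annex smudge --clean -- %f'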
If you're dealing with a repository where you never intend to use unlocked files at all, does it work to just disable the smudge filter entirely?
I'm currently running into the speed issues on `git add`, but I don't have any use for unlocked files in this particular use case, so I'd be happy just to turn off that whole step.

@tomdhunt yes:
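Something along these lines, unsetting the filter driver commands that git-annex init configured (my reconstruction; adjust to whatever filter.annex.* settings your repository actually has):

    git config --unset filter.annex.smudge
    git config --unset filter.annex.clean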
But it seems likely that `git add` would not be slow if you upgraded the repository version to v9 instead. Most scenarios where it is slow have been fixed in that version.

Well, there has been one bug noticed in v9 so far; it happens to be fixed in the upcoming release, and it only affected an edge case involving empty files.
I'm not waiting to make v9 the default because of a lack of testing. There were some complaints that v8 became the default too quickly and caused problems when a mixture of git-annex versions was in use, and so it made sense to slow down the upgrade speed.
Anyway, you can enable the speedup in v8 repositories:
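That is, as best I recall from the git-annex filter-process documentation:

    git config filter.annex.process 'git-annex filter-process'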
What about v10? FWIW, I added a test run for DataLad to operate on v10 by default -- remains green.
FWIW, 10.20220624-g769be12 seems to not have `--version` for `upgrade`, and I see nothing.

cool -- seems to work (as commit took only 3m on that slow box). I will go this route for now.
v10 only differs in a locking mechanic, and anyway I'm using it myself.

`git-annex upgrade` will go from v8 to v9; no --version is needed.