git-annex should use smudge/clean filters.
implementation todo list
Reconcile staged changes into the associated files database, whenever the database is queried. This is needed to handle eg:
    git add largefile
    git mv largefile othername
    git annex move othername --to foo
    # fails to drop content from associated file othername,
    # because it doesn't know it has that name
    # git commit clears up this mess
If an annexed file's content is not present, and its pointer file is copied to a new name and added, it does not get added as an associated file. (If the content is present, it does get added.)
If an unlocked file's content is not present, and a new file with identical content is added with
git add, the unlocked file is populated, but git-annex is unable to update the index, so git status will say that it has been modified.
If an annexed file is deleted in the index, and another annexed file uses the same key, and git annex get/drop is run, the index update that's done to prevent status showing the file as modified adds the deleted file back to the index.
Also, if the user is getting files, and modifying files at the same time, and they stage their modifications, the modification may get unstaged in a race when a file is got and the updated worktree file staged in the index.
I don't know if this is worth worrying about, because there's also of course a race where the modification to the worktree file may get reverted when git-annex updates the content. Those races are much smaller, but do exist.
get/drop operations on unlocked files lead to an update of the index. Only one process can update the index at a time, so eg, git annex get at the same time as a git commit may display an ugly warning (or the git commit could fail to start if run at just the right time).
Two git-annex get processes can also try to update the index at the same time and encounter this problem (git annex get -J is ok).
Potentially: Use git's new
filter.<driver>.process interface, which will let only 1 git-annex process be started by git when processing multiple files, and so should be faster. See Long Running Filter Process.. it's not currently actually a win, but might be a good way to improve git to work better with v6.
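For reference, enabling it could look like the following; the filter-process subcommand name is just a placeholder for whatever ends up speaking git's long-running filter protocol:

    # speculative: one long-running git-annex process per git invocation,
    # instead of one smudge/clean invocation per file; when set, this takes
    # precedence over the single-shot filter.annex.smudge/clean settings
    git config filter.annex.process 'git-annex filter-process'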
Checking out a different branch causes git to smudge all changed files, and write their content. This does not honor annex.thin. A warning message is printed in this case.
This is particularly wasteful when checking out an adjusted unlocked branch, which causes 2x the space to be used.
"git annex proxy" could be used to handle this. Make it run the git command with smudge filters disabled and then scan through the changed files in the work tree, and update pointer files to be hard links to their content.
git-annex adjust and git-annex sync could both use that internally when checking out the adjusted branch, and merging a branch into HEAD.
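A rough sketch of that approach, assuming a hypothetical helper annex_object_path that maps a key to its object file under .git/annex/objects:

    # run the checkout with the smudge/clean filters neutralized,
    # so pointer files land in the work tree unexpanded
    git -c filter.annex.smudge=cat -c filter.annex.clean=cat checkout otherbranch
    # then fix up the changed files, hard linking present content into place
    git diff --name-only 'HEAD@{1}' HEAD | while read -r f; do
        key=$(head -n1 "$f" 2>/dev/null | sed -n 's|^/*annex/objects/||p')
        [ -n "$key" ] || continue            # not an annex pointer file
        obj=$(annex_object_path "$key")      # hypothetical helper
        [ -e "$obj" ] && ln -f "$obj" "$f"   # annex.thin style hard link
    done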
(My enhanced smudge/clean patch set also fixed this problem, in a much nicer way...)
When git runs the smudge filter, it buffers all its output in ram before writing it to a file. So, checking out a branch with large v6 unlocked files can cause git to use a lot of memory.
This needs to be fixed in git, but my proposed interface in http://thread.gmane.org/gmane.comp.version-control.git/294425 would avoid the problem for git checkout, since it would use the new interface and not the smudge filter.
Last verified with git 2.18 in 2018.
Note that the long-running filter process interface has the same problem.
Eventually (but not yet), make v6 the default for new repositories. Note that the assistant forces repos into direct mode; that will need to be changed then, and it should enable annex.thin instead.
- Later still, remove support for direct mode, and enable automatic v5 to v6 upgrades.
historical notes
2013: Currently, this does not look likely to work. In particular, the clean filter needs to consume all stdin from git, which consists of the entire content of the file. It cannot optimise by directly accessing the file in the repository, because git may be cleaning a different version of the file during a merge.
So every git status would need to read the entire content of all
available files, and checksum them, which is too expensive.
Update from GitTogether: Peff thinks a new interface could be added to git to handle this sort of case in an efficient way.. just needs someone to do the work. --Joey
Update 2015: git status only calls the clean filter for files that the index says are modified, so this is no longer a problem. --Joey
background
The clean filter is run when files are staged for commit. So a user could copy any file into the annex, git add it, and git-annex's clean filter causes the file's key to be staged, while its value is added to the annex.
The smudge filter is run when files are checked out. Since git annex repos have partial content, this would not git annex get the file content. Instead, if the content is not currently available, it would need to do something like return empty file content. (Sadly, it cannot create a symlink, as git still wants to write the file afterwards.)
So the nice current behavior of unavailable files being clearly missing due to dangling symlinks, would be lost when using smudge/clean filters. (Contact git developers to get an interface to do this?)
Instead, we get the nice behavior of not having to remember to git annex
add files, and just being able to use git add or git commit -a,
and have it use git-annex when .gitattributes says to. Also, annexed
files can be directly modified without having to git annex unlock.
configuration
In .gitattributes, the user would put something like "* filter=git-annex". This way they could control which files are annexed vs added normally.
It would also be good to allow using this without having to specify the files in .gitattributes. Just use "* filter=git-annex" there, and then let git-annex decide which files to annex and which to pass through the smudge and clean filters as-is. The smudge filter can just read a little of its input to see if it's a pointer to an annexed file. The clean filter could apply annex.largefiles to decide whether to annex a file's content or not.
For files not configured this way in .gitattributes, git-annex could continue to use its symlink method -- this would preserve backwards compatibility, and even allow mixing the two methods in a repo as desired. (But not switching an existing repo between indirect and direct modes; the user decides which mode to use when adding files to the repo.)
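For example, .gitattributes and the annex.largefiles configuration might then look like this (the largerthan threshold is only an illustration):

    # .gitattributes: run everything through the git-annex filter
    * filter=git-annex

    # and let git-annex itself decide what is worth annexing;
    # everything else passes through to git unchanged
    git config annex.largefiles 'largerthan=100kb'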
clean
The trick is doing it efficiently. Since git a2b665d, v1.7.4.1, something like this works to provide a filename to the clean script:
    git config --global filter.huge.clean 'huge-clean %f'
This could avoid it needing to read all the current file content from stdin when doing eg, a git status or git commit. Instead it is passed the filename that git is operating on, in the working directory. (Update: No, doesn't work; git may be cleaning a different file content than is currently on disk, and git requires all stdin be consumed too.)
So, WORM could just look at that file and easily tell if it is one it already knows (same mtime and size). If so, it can short-circuit and do nothing, file content is already cached.
SHA1 has a harder job. Would not want to re-sha1 the file every time, probably. So it'd need a local cache of file stat info, mapped to known objects.
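For example, a stat cache could be keyed on device, inode, size, and mtime. The cache file name and key format below are made up for illustration, and GNU stat is assumed:

    # look up (device:inode:size:mtime) before re-hashing the file
    file="$1"
    cache=.git/annex/statcache
    sig=$(stat -c '%d:%i:%s:%Y' "$file")
    key=$(grep -F -- "$sig " "$cache" 2>/dev/null | head -n1 | cut -d' ' -f2)
    if [ -z "$key" ]; then
        # only checksum when the stat signature is unknown
        key="SHA1-s$(stat -c %s "$file")--$(sha1sum "$file" | cut -d' ' -f1)"
        echo "$sig $key" >> "$cache"
    fi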
But: Even with %f, git actually passes the full file content to the clean
filter, and if it fails to consume it all, it will crash (may only happen
if the file is larger than some chunk size; tried with 500 mb file and
saw a SIGPIPE.) This means unnecessary work needs to be done,
and it slows down everything, from git status to git commit.
Showstopper: I have sent a patch to the git mailing list to address
this: http://marc.info/?l=git&m=131465033512157&w=2 (Update: apparently
it can't be fixed.)
Update: I tried this again (2015) and it seems that git status and git add avoid re-sending the file content to the clean filter, as long as the file stat has not changed. I'm not sure when git started doing that, but it seems to avoid this problem. --Joey
smudge
The smudge script can also be provided a filename with %f, but it cannot directly write to the file or git gets unhappy.
Still the case in 2015. Means an unnecessary read and pipe of the file even if the content is already locally available on disk. --Joey
partial checkouts
.. Are very important, otherwise a repo can't scale past the size of the smallest client's disk!
It would be nice if the smudge filter could hard link a work tree file to the annex object.
But currently, the smudge filter can't modify the work tree file on its own -- git always modifies the file after getting the output of the smudge filter, and will stumble over any modifications that the smudge filter makes. And, it's important that the smudge filter never fail as that will leave the repo in a bad state.
Seems the best that can be done is for the smudge filter to copy from the annex object when the object is present. When it's not present, the smudge filter should provide a pointer to its content.
The clean filter should detect when it's operating on that pointer file.
I've a demo implementation of this technique in the scripts below.
deduplication
.. Is nice; needing 2 copies of every annexed file is annoying.
Unfortunately, when using smudge/clean, git merge does not preserve a
smudged file in the work tree when renaming it. It instead deletes the old
file and asks the smudge filter to smudge the new filename.
So, copies need to be maintained in .git/annex/objects, though it's ok to use hard links to the work tree files. (Although somewhat unsafe since modification of the file will lose the old version. annex.thin setting can enable this.)
Even if hard links are used, smudge needs to output the content of an annexed file, which will result in duplication when merging in renames of files.
design
Goal: Get rid of current direct mode, using smudge/clean filters instead to cover the same use cases, more flexibly and robustly.
Use case 1:
A user wants to be able to edit files, and git-add, git commit, without needing to worry about using git-annex to unlock files, add files, etc.
Use case 2:
Using git-annex on a crippled filesystem that does not support symlinks.
Data:
- An annex pointer file has as its first line the git-annex key that it's standing in for (prefixed with "annex/objects/", similar to an annex symlink target); see the example after this list. Subsequent lines of the file might be a message saying that the file's content is not currently available. An annex pointer file is checked into the git repository the same way that an annex symlink is checked in.
- A file map is maintained by git-annex, to keep track of the keys that are used by files in the working tree.
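For instance, a pointer file following that format might contain (the key shown is made up):

    annex/objects/SHA256E-s1048576--35a9e381b1a27567549b5f8a6f783c167ebf809f1c4d6a9e367240484d8ce281.jpg
    This file's content is not currently available in this repository.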
Configuration:
.gitattributes tells git which files to use git-annex's smudge/clean filters with. Typically, all files except for dotfiles:
    * filter=annex
    .* !filter
annex.largefiles tells git-annex which files should in fact be put in the annex. Other files are passed through the smudge/clean filters as-is and have their contents stored in git.
annex.direct is repurposed to configure how git-annex adds files. When set to false, it adds symlinks and when true it adds pointer files.
git-annex clean:
Run by git add (and diff and status, etc), and passed the filename, as well as fed the file content on stdin.
Look at configuration to decide if this file's content belongs in the annex. If not, output the file content to stdout.
Generate annex key from filename and content from stdin.
Hard link (when annex.thin is set) or copy the file to .git/annex/objects, if the object is not already present there.
This is done to prevent losing the only copy of a file when eg doing a git checkout of a different branch, or merging a commit that renames or deletes a file. But, with annex.thin no attempt is made to protect the object from being modified. If a user wants to protect object contents from modification, they should use
git annex add, not git add, or they can git annex lock after adding, or not enable annex.thin.
Update file map.
Output the pointer file content to stdout.
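A minimal sh sketch of those steps, assuming hypothetical helpers wants_annex (the annex.largefiles check) and update_filemap, and ignoring the hashed subdirectory layout that the real .git/annex/objects uses:

    #!/bin/sh
    # sketch of the clean flow: $1 is %f, the file content arrives on stdin
    f="$1"
    tmp=$(mktemp)
    cat >"$tmp"                          # git requires all of stdin to be consumed
    if ! wants_annex "$f"; then          # hypothetical annex.largefiles check
        cat "$tmp"; rm -f "$tmp"; exit 0 # not annexed: pass content through
    fi
    key="SHA256-s$(stat -c %s "$tmp")--$(sha256sum "$tmp" | cut -d' ' -f1)"
    obj=".git/annex/objects/$key"
    if [ ! -e "$obj" ]; then
        mkdir -p "$(dirname "$obj")"
        ln "$f" "$obj" 2>/dev/null || cp "$tmp" "$obj"  # annex.thin: hard link, else copy
    fi
    update_filemap "$key" "$f"           # hypothetical file map update
    rm -f "$tmp"
    echo "annex/objects/$key"            # the pointer, to be stored in git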
git-annex smudge:
Run by eg git checkout, and passed the filename, as well as fed the pointer file content on stdin.
Update file map.
When an object is present in the annex, outputs its content to stdout. Otherwise, outputs the file pointer content.
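And a matching sketch of the smudge side, with the same hypothetical helpers:

    #!/bin/sh
    # sketch of the smudge flow: $1 is %f, the pointer content arrives on stdin
    f="$1"
    pointer=$(cat)
    key=$(printf '%s\n' "$pointer" | head -n1 | sed 's|^/*annex/objects/||')
    update_filemap "$key" "$f"           # hypothetical file map update
    obj=".git/annex/objects/$key"        # again ignoring the hashed layout
    if [ -e "$obj" ]; then
        cat "$obj"                       # content present: output it
    else
        printf '%s\n' "$pointer"         # not present: leave the pointer in place
    fi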
git annex direct/indirect:
Previously these commands switched in and out of direct mode. Now they become no-ops.
git annex lock/unlock:
Makes sense for these to change to switch files between using git-annex symlinks and pointers. So, this provides both a way to transition repositories to using pointers, and a cleaner unlock/lock for repos using symlinks.
unlock will stage a pointer file, and will link the content of the object from .git/annex/objects to the work tree file.
lock will replace the current work tree file with the symlink, and stage it, and lock down the permissions of the annex object.
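With that in place, editing a locked file could look like this:

    git annex unlock bigfile    # stages a pointer; work tree file becomes the content
    echo tweak >> bigfile       # edit in place, no symlink in the way
    git add bigfile             # clean filter stores the new version in the annex
    git annex lock bigfile      # back to a locked-down symlink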
file map
The file map needs to map from Key -> [File]. A File -> Key map
seems useful to have, but in practice is not worthwhile.
Drop and get operations need to know what files in the work tree use a given key in order to update the work tree. And, we don't want to overwrite a work tree file if it's been modified when dropping or getting.
git-annex commands that look at annex symlinks to get keys to act on will
need to fall back to either consulting the file map, or looking at the staged
file to see if it's a pointer to a key. So a File -> Key map is a possible
optimisation.
Question: If the smudge/clean filters update the file map incrementally based on the pointer files they generate/see, will the result always be consistent with the content of the working tree?
This depends on when git calls the smudge/clean filters and on what. In particular:
- Does the clean filter always get called when adding a relevant file to git? Yes.
- Is the clean filter called at any other time? Yes, for example git diff will clean relevant modified files to generate the diff. So, the clean filter may see file versions that have not yet been staged in git.
- Is the clean filter ever passed content not in the work tree? I don't think so, but not 100% sure.
- Is the smudge filter always called when git updates a relevant file in the work tree? Yes.
- Is the smudge filter called at any other time? Seems unlikely but then there could be situations with a detached work tree or such.
- Does git call any useful hooks when removing a file from the work tree, or converting it to not be annexed, or for git mv of an annexed file? No!
From this analysis, any file map generated by the smudge/clean filters is necessarily potentially inaccurate. It may list deleted files. It may or may not reflect current unstaged changes from the work tree.
Follows that any use of the file map needs to verify the info from it, and throw out bad cached info (updating the map to match reality).
When downloading a key, check if the files listed in the file map are still pointer files in the work tree, and only replace them with the content if so.
When dropping a key, check if the files listed for it in the file map are unmodified in the work tree, and are staged as pointers to the key, and only reset them to the pointers if so. Note that this means that a modified work tree file that has not yet been staged, but that corresponds to a key, won't be reset when the key is dropped. This is probably not a big deal; the user will either add the file, which will add the key back, or reset the file.
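A sketch of that drop-time check; filemap_lookup is a hypothetical helper that lists the work tree files recorded for a key:

    key="$1"
    filemap_lookup "$key" | while read -r f; do
        # still staged as a pointer to this key?
        git cat-file blob ":$f" 2>/dev/null | head -n1 | grep -q "annex/objects/$key" || continue
        # work tree file unmodified relative to the index?
        git diff --quiet -- "$f" || continue
        printf 'annex/objects/%s\n' "$key" > "$f"   # safe to reset it to the pointer
    done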
Does the File -> Key map have any benefits given this inaccuracy?
The answer seems to be no; any answer that map gives may be inaccurate and
needs to be verified by looking at actual repo content, so might as well
just look at the repo content in the first place..
Upgrading
annex.version changes to 6
git config for filter.annex.smudge and filter.annex.clean is set up.
.gitattributes is updated with a stock configuration, unless it already mentions "filter=annex".
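Concretely, the upgrade might run something along these lines (the exact smudge/clean command syntax here is an assumption):

    git config annex.version 6
    git config filter.annex.smudge 'git-annex smudge -- %f'
    git config filter.annex.clean 'git-annex smudge --clean -- %f'
    # add the stock attributes unless the filter is already mentioned
    if ! grep -qs 'filter=annex' .gitattributes; then
        printf '%s\n' '* filter=annex' '.* !filter' >> .gitattributes
    fi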
Upgrading a direct mode repo needs to switch it out of bare mode, and
needs to run git annex unlock on all files (or reach the same result).
So will need to stage changes to all annexed files.
When a repo has some clones indirect and some direct, the upgraded repo will have all files unlocked, necessarily in all clones. This happens automatically, because when the direct repos are upgraded that causes the files to be unlocked, while the indirect upgrades don't touch the files.
test files
huge-smudge:
    #!/bin/sh
    read f
    file="$1"
    echo "smudging $f" >&2
    if [ -e ~/"$f" ]; then
        cat ~/"$f" # possibly expensive copy here
    else
        echo "$f not available"
    fi
huge-clean:
    #!/bin/sh
    file="$1"
    cat >/tmp/file
    # in real life, this should be done more efficiently, not trying to read
    # the whole file content!
    if grep -q 'not available' /tmp/file; then
        awk '{print $1}' /tmp/file # provide what we would if the content were avail!
        exit 0
    fi
    echo "cleaning $file" >&2
    # XXX store file content here
    echo "$file"
.gitattributes:
    *.huge filter=huge
in .git/config:
[filter "huge"]
clean = huge-clean %f
smudge = huge-smudge %f

Have you checked what the smudge filter sees when the input is a symlink? Because git supports tracking symlinks, so it should also support pushing symlinks through a smudge filter, right? Either way: yes, contact the git devs, one can only ask and hope. And if you can demonstrate the awesomeness of git-annex they might get more interested.
I think that is not true? If git wants the file to be cleaned, it thinks that the file was changed. So you would have to SHA1 it anyway if you don't want to rely on WORM (which git already does in the first step anyway).
If I understand this right, this feature should allow us to say to git: “Hey, from now on whenever I git add a *.png file, add it to git-annex instead!”
How about saying to git-annex: “Hey, whenever I git-annex add a file which is not *.png, add it to git instead! Or at least leave it unadded so that I can decide later.” Is it possible now? If not, would it be reasonable to add such a feature?
It seems git 2.5 allows smudge filters to not read all of stdin:
https://github.com/git/git/blob/master/Documentation/RelNotes/2.5.0.txt
" * Filter scripts were run with SIGPIPE disabled on the Git side, expecting that they may not read what Git feeds them to filter. We however treated a filter that does not read its input fully before exiting as an error. We no longer do and ignore EPIPE when writing to feed the filter scripts.
This changes semantics, but arguably in a good way. If a filter can produce its output without fully consuming its input using whatever magic, we now let it do so, instead of diagnosing it as a programming error."
Why not use hardlinks on Windows? This could fix the "only direct mode" issue too. Symlinks on Windows want admin privileges, but hardlinks don't. At least not with this tool: http://schinagl.priv.at/nt/ln/ln.html
Hi. I read the post http://git-annex.branchable.com/tips/unlocked_files/, and got to this post through a comment, to check the limitations of repo v6. I read through this post, but I still don't grasp the drawbacks of repo v6. I like the idea of mixed-mode repositories with locked and unlocked files co-existing, and I need repo v6 for that, right? I also like annex.largefiles to control what is annexed and what is stored as normal git files. With these two features combined I can configure my project and let users work without too much burden of controlling what is annexed, what is not, and what is locked or unlocked. I want to use repo v6, but what are the drawbacks of adopting it? Is it that git could be buggy, or that the smudge filters could be buggy? Sorry for not being able to grasp this from the current post. I appreciate any help. Thanks.
@torarnv thanks for pointing that out.. I finally got around to verifying that, and was able to speed up the smudge filter. Also this avoids the problem that git for some reason buffers the whole file content in memory when it sends it to the smudge filter, which is a pretty bad memory leak in git that no longer affects this.