I'm considering ways to get rid of direct mode, replacing it with something better, implemented using smudge filters.
git-lfs
I started by trying out git-lfs, to see what I can learn from it. My feeling is that git-lfs brings an admirable simplicity to using git with large files. For example, it uses a pre-push hook to automatically upload file contents before pushing a branch.
But its simplicity comes at the cost of being centralized. You can't make a git-lfs repository locally, clone it onto another drive, and have the local repositories interoperate to pass file contents around. Everything has to go back through a centralized server. I'm willing to pay complexity costs for decentralization.
Its simplicity also means that the user doesn't have much control over what files are present in their checkout of a repository. git-lfs downloads all the files in the work tree. It doesn't have facilities for dropping files to free up space, or for configuring a repository to only want to get a subset of files in the first place. Some of this could be added to it I suppose.
I also noticed that git-lfs uses twice the disk space, at least when initially adding files. It keeps a copy of the file in .git/lfs/objects/, in addition to the copy in the working tree. That copy seems to be necessary due to the way git smudge filters work, to avoid data loss. Of course, git-annex manages to avoid that duplication when using symlinks, and its direct mode also avoids that duplication (at the cost of some robustness). I'd like to keep git-annex's single local copy feature if possible.
replacing direct mode
Anyway, as smudge/clean filters stand now, they can't be used to set up git-annex symlinks; their interface doesn't allow it. But, I was able to think up a design that uses smudge/clean filters to cover the same use cases that direct mode covers now.
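As a minimal sketch of how such a design might be wired up, here is how a git filter driver is registered and routed via gitattributes. The `annex-clean` and `annex-smudge` command names are placeholders, not real git-annex commands; the eventual interface may well differ:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

# The clean filter runs on `git add`, turning file content into a small
# pointer; the smudge filter runs on checkout, turning the pointer back
# into content. %f expands to the path of the file being filtered.
git config filter.annex.clean "annex-clean %f"
git config filter.annex.smudge "annex-smudge %f"

# Route every file in the repository through the filter.
echo '* filter=annex' > .gitattributes
```

With this in place, a plain `git add` would invoke the clean filter, so the small pointer file gets committed instead of the large content.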
Thanks to the clean filter, adding a file with `git add` would check in a small file that points to the git-annex object. In the same repository, you could also use `git annex add` to check in a git-annex symlink, which would protect the object from modification, in the good old indirect mode way. `git annex lock` and `git annex unlock` could switch a file between those two modes.
So this allows mixing directly writable annexed files and locked down annexed files in the same repository. All regular git commands and all git-annex commands can be used on both sorts of files. Workflows could develop where a file starts out unlocked, but once it's done, is locked to prevent accidental edits and archived away or published.
That's much more flexible than the current direct mode, and I think it will be able to be implemented in a simpler, more scalable, and robust way too. I can lose the direct mode merge code, and remove hundreds of lines of other special cases for direct mode.
The downside, perhaps, is that for a repository to be usable on a crippled filesystem, all the files in it will need to be unlocked. A file can't easily be unlocked in one checkout and locked in another checkout.
that would be pretty awesome! thanks so much for looking into what others are doing: it takes great maturity and respect for others, something that is so often missing online...
i hope this can solve a bunch of WTF issues i've had with direct mode, which has already improved by leaps and bounds, mind you.
speaking of which - would it make sense for git-annex to support lfs remotes out of the box? or is it considered builtin to git (i.e. if you install git-lfs, can you already have a hybrid lfs/annex repo)?
Sure, it would be great to have a special remote supporting the git-lfs storage backend. This would let git-annex repos be uploaded to github along with the annexed files, which is a nice diversity to have in addition to gitlab's support for git-annex.
The API is documented, so it's certainly doable, as an external special remote even.
`git add` rather than `git annex add`. But if `git add` would add files via the smudge/clean process, how would one check files directly into git? Would it no longer be possible?

If I'm understanding correctly, that one downside (requiring all checkouts to have all files unlocked if any filesystem requires it) seems to be a fairly major limitation, no? Changing the concept of locked/unlocked files from being a local, per-repo concern to a global one seems like quite a major change.
For instance, it would mean that any public repo using git-annex for distributing a set of data files would either have to have all files unlocked, or else no one would be able to clone it onto a FAT32-formatted external hdd?
FWIW, the particular use case I'm concerned about personally is having my annexes on my android device.
I'm concerned about that too. But it may be possible to finesse it: when git-annex is running on a crippled filesystem, it may be able to unlock all files as it gets content for them, producing a local fork.
The first difficulty would be avoiding or autoresolving conflicts between locked and unlocked when merging changes into that fork. I think this is very tractable; such a conflict comes down mostly to the symlink bit in the tree object.
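The symlink-bit point can be seen with plain git: a locked file would live in the tree as a symlink (mode 120000), while an unlocked file is an ordinary blob (mode 100644). A small demonstration, using made-up pointer content rather than anything git-annex actually produces:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

# An "unlocked" file: an ordinary file whose content is a small pointer.
echo "pointer:abc123" > unlocked.bin

# A "locked" file: a symlink into the annex object store.
# (The target need not exist for git to record the link.)
ln -s .git/annex/objects/abc123 locked.bin

git add unlocked.bin locked.bin

# The staged tree entries differ only in their mode bits:
# 100644 for the unlocked file, 120000 for the locked symlink.
git ls-files -s
```

So a locked/unlocked merge conflict is, at the tree level, just two entries for the same path with different modes, which is why auto-resolving it looks tractable.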
The real difficulty would be that any pushes from that fork would include its change converting all files to unlocked. It's fairly mechanical to convert such a commit into one that doesn't unlock files, though, so perhaps that could be automated somehow on push or merge.
There's also a small and probably easy to implement git change that would avoid all this complexity: If git's smudge filters were optionally able to run on the link-text of symlinks, then a file could be unlocked locally without changing what's in the repo and all the smudge stuff would still work on it.
Crippled filesystems aside, I think there's value in being able to unlock files across clones of a repo. For example, a repo could have a workflow where the files for the current episode/experiment/whatever start out unlocked and are locked once it's complete.
Thanks for the quick reply (and all your work on this!)
Interesting, that change to git does sound like it should be relatively small compared to the workarounds needed. But in any case, glad to hear you're thinking about the issue.
Also curious what your thoughts are on the performance issues you had identified previously with using smudge/clean on larger repos. Do the changes in git 2.5 address all your concerns? Or are there still some cases where this will potentially result in significant slow-down?
I'm still not entirely happy with the smudge/clean interface's performance. At least git doesn't fall over if the clean filter declines to read all the content of the large file on stdin anymore, which was the main thing preventing an optimised use of it. Still, git has to run the clean filter once per file, which is suboptimal. And, the smudge filter can't modify the file in the work tree, but instead has to pass the whole file content back to git for writing, which is going to result in a lot of unnecessary context switches and slowdown. Especially in git-annex's case, where all it really needs to do is make a hard link to the content.