Hi, we use git in a research lab to store both code and the text logs that this code generates. Until now we used plain git, which compresses the text files when they are committed (our text log files are about 100 MB each). Recently our institutional git server started to support git-annex, so we are moving from a git-only setup to git plus git-annex. git-annex is working like a charm, and operations on our repository are faster since the migration. However, the files stored with git-annex (under .git/annex) are not compressed. I understand that git-annex was designed to store binary files, for which compression often does not make sense, but I would like to know whether it is possible to enable compression when pushing files with git-annex.
Thank you for your attention!
The difficulty with compressing annexed files is that they have to be available on disk in uncompressed form in order for the work tree to point to the content of the files. Notice that, even though git does compress .git/objects, the checked-out files in the working tree are not themselves compressed.
git-annex does support compressing files that are stored on special remotes. Simply enabling encryption when initializing a special remote will also compress the data stored in it. A couple of special remotes like bup also compress content natively.
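As a rough illustration of the encrypted special remote route, here is a minimal sketch that only builds the git-annex command lines for creating a directory special remote with encryption=shared and then moving content into it. The remote name "compressed-store", the path /srv/annex-store, and the helper names are made up for the example; whether a directory remote is usable on your server is an assumption.

```python
import shlex


def initremote_command(name, directory, encryption="shared"):
    """Build the argv that creates an encrypted directory special remote.

    `name` and `directory` are placeholders chosen for this example.
    """
    return [
        "git", "annex", "initremote", name,
        "type=directory",
        f"directory={directory}",
        f"encryption={encryption}",
    ]


def move_command(name, path="."):
    """Build the argv that moves annexed content under `path` to the remote."""
    return ["git", "annex", "move", "--to", name, path]


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed first.
    commands = (
        initremote_command("compressed-store", "/srv/annex-store"),
        move_command("compressed-store"),
    )
    for argv in commands:
        print(shlex.join(argv))
```

The commands are printed rather than executed; once they look right for your setup, they can be run by hand or passed to subprocess.run.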
Using a filesystem that supports compression is the only way I know of to transparently compress files located in the working tree.
Thanks for your comments, joey! In fact, compressing files in the working tree is not mandatory. The main question is how to compress them on the git server (for quota reasons). When we were using only git, it was slow (because of the huge files) but the files were compressed. Now, with git-annex, operations are faster but the repository is much larger (due to the lack of compression), and that is a problem because we have reached the disk quota on the git server.
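To get a rough idea of how much quota compression would actually recover, something like the following sketch could be run against a copy of the annexed objects. It uses zlib (the same family of compression git applies to its objects) as a stand-in, so the numbers are only an estimate; the function names and the default path .git/annex/objects in the usage part are assumptions for the example.

```python
import os
import zlib


def compressed_size(path, level=6, chunk_size=1 << 20):
    """Return the zlib-compressed size of a file, reading it in chunks."""
    compressor = zlib.compressobj(level)
    total = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total += len(compressor.compress(chunk))
    total += len(compressor.flush())
    return total


def estimate_savings(root):
    """Walk `root` and return (raw_bytes, compressed_bytes) over all files."""
    raw = 0
    packed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            raw += os.path.getsize(path)
            packed += compressed_size(path)
    return raw, packed


if __name__ == "__main__":
    raw, packed = estimate_savings(".git/annex/objects")
    ratio = packed / raw if raw else 0.0
    print(f"raw: {raw} bytes, compressed: {packed} bytes, ratio: {ratio:.2f}")
```

Text logs usually compress well, so a low ratio here would suggest that a compressing special remote or a compressed filesystem on the server is worth the effort.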
Perhaps git-annex could have first-class support for a local special remote inside the .git/annex dir, where files that aren't checked out are stored in a more efficient manner.
This would mainly be useful for old versions of files that you want to keep in the repo but don't need immediate access to, or for bare repos as in OP's case. Once special remotes support compression, it might actually make sense to make it the default storage method for bare repos.
Ideally these could be set to be any local special remote backend; bup, for example, would make an ideal candidate for storing old versions of documents efficiently.
Having files in such a "local special remote" would then be equivalent to having them in the regular .git/annex/objects dir for tracking purposes.