Although data on a special remote can already be compressed as a side effect of enabling encryption (gpg compresses before encrypting), there are use cases in which it would come in handy to have a standalone compression option for a special remote like "directory".
For example, I use git annex for very large scientific tomographic datasets and for files originating from their processing, like segmentations, distance maps, and skeletons. While compressing the raw data makes little sense, compressing e.g. segmentations and skeletons has a huge impact on the effective file size. Since compressing files of a few GBs to TBs is time-consuming, I prefer to have an uncompressed version in the working tree (so I do not use file formats that compress by default, e.g. .nii.gz), but it would be very helpful to have the option to push precious or older versions to a remote that then uses compression. Using encryption for this is a bit of an overkill and takes considerably longer than compressing with e.g. pbzip. A compressed file system is no option for this purpose, because the special remote is supposed to live on a restrictive archive server.
Though I guess it would be possible to write a special remote wrapper for this, I wonder if this might qualify as an officially supported option for the already existing special remotes like "directory" or "rsync": e.g., in conjunction with the existing encryption option, something like compression with possible values like pbzip, bzip, pigz, and gzip.
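To make this concrete, the interface might look something like the following (hypothetical syntax: only the compression= parameter is the new proposal, the rest is existing initremote usage):

```
# hypothetical compression= option alongside the existing encryption= one
git annex initremote archive type=directory directory=/mnt/archive \
    encryption=none compression=pbzip
```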
Does seem like a good idea.
Piping through an external command avoids needing to link git-annex with multiple compression libraries, or the risk of picking the wrong ones. I suppose that it's no big deal for a special remote to require that everyone who uses it has xz or whatever installed.
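For illustration, the piping approach boils down to running a standard tool as a filter, here with pigz standing in for whatever compressor a remote might be configured with:

```
pigz -c  < objectfile    > objectfile.gz   # compress while storing to the remote
pigz -dc < objectfile.gz > objectfile      # decompress while retrieving
```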
But: the name of the command that is used cannot be stored in the remote config in the git-annex branch, because that would be insecure; anyone who could alter the branch could cause an arbitrary command to be run. A mapping in the git-annex code from compression formats to known-safe command names would be one way around that, or a git config setting that has to be manually set when enabling the remote. Or both.
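A sketch of the git config variant; the setting name is invented for illustration, nothing like it exists yet:

```
# hypothetical per-remote setting, set manually when enabling the remote,
# so the command name never comes from the (untrusted) git-annex branch
git config remote.archive.annex-compression-command pigz
```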
Would it perhaps be possible to set the compression using filters like file name/extension?
For example, I wouldn't want git-annex to waste time compressing multimedia files that are already at maximum entropy; since they make up the majority of my special remote's content, re-writing them would be very time-intensive (even more so when remote storage is involved). Certain compressors might also work better on some file types than on others.
This could be very important to scientists using datalad, as they are likely to A. be working with very specific kinds of data where certain compressors might significantly outperform others, and B. have large quantities of data where compression is essential.
If compressors are going to be limited to a known-safe selection, an important aspect to keep in mind would be compression levels, as some compressors like zstd can range from lzo-like performance characteristics to lzma-like ones.
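For instance, zstd alone spans that whole range (real flags; the file name is just an example):

```
zstd -1 -T0 -f segmentation.raw   # fast: lzo-like throughput, modest ratio
zstd -19 -f segmentation.raw      # slow: lzma-like compression ratio
```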
Definitely a +1 on this one though, it would be very useful for my use case as well.
This would be extremely useful!
Especially if it were possible to turn compression on/off based on each file's name or metadata.
There would need to be a way for git-annex to tell if an object on a remote was compressed or not, and with what compressor. The two reasonable ways to do it are to use different namespaces for objects, or to make compression an immutable part of a special remote's configuration that is set at initremote time.
(A less reasonable way (IMHO) would be to record in the remote's state for each object what compression was used for it; this would bloat the git-annex branch.)
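To make the namespace idea concrete, compressed objects could live under a prefixed name, so both variants of a key can coexist on a remote (layout invented purely for illustration):

```
SHA256E-s1048576--8d3a...f2.seg        # uncompressed namespace
gzip/SHA256E-s1048576--8d3a...f2.seg   # hypothetical compressed namespace
```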
A problem with using different namespaces is that git-annex then will have to do extra work to check each one when retrieving or checking presence of an object.
When chunking is enabled, git-annex already has to check for chunked and unchunked versions of an object. This can include several different chunk sizes that have been in use at different times.
So, we have a bit of an exponential blowup when combining chunking with compression namespaces. Probably the number of chunk sizes tried is less than 4 (e.g. logged chunk size, currently configured chunk size (when different), unchunked), and the number of possible compressors is less than 10, which works out to a 20x or so increase in overhead. So I'm cautious of the namespaces approach.
It would be possible to configure, at initremote time, a single compressor that may be used, but let some objects be compressed and others not. That would only double the overhead, and only for remotes with a compressor configured.
As for enabling compression based on filename, though, it seems to me that an easier way to do that would be to have one remote with compression enabled and a second one without it, and use preferred content to control which files are stored in which.
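For example, with a compressing remote and a plain one (remote names and extensions invented for illustration), existing preferred content expressions can already route files by name:

```
# send segmentations and skeletons to the compressing remote,
# everything else to the plain one
git annex wanted archive-compressed 'include=*.seg or include=*.skel'
git annex wanted archive-plain 'exclude=*.seg and exclude=*.skel'
```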