todo/option for (fast) compression on special remotes like "directory"git-annexhttp://git-annex.branchable.com/todo/option_for___40__fast__41___compression_on_special_remotes_like___34__directory__34__/git-annexikiwiki2024-01-25T15:53:59Zcomment 1http://git-annex.branchable.com/todo/option_for___40__fast__41___compression_on_special_remotes_like___34__directory__34__/comment_1_8620167702725aa8fb0e42dfe3820520/Ilya_Shlyakhter2019-02-13T16:48:08Z2019-02-13T16:48:08Z
+1 for this. Maybe add a config option to pipe a file through a particular program before sending it to a remote?
comment 2http://git-annex.branchable.com/todo/option_for___40__fast__41___compression_on_special_remotes_like___34__directory__34__/comment_2_f964bb1a14f099e69f1e780ee7c7ef68/joey2019-03-19T19:57:36Z2019-03-19T17:35:52Z
<p>Does seem like a good idea.</p>
<p>Piping avoids needing to link git-annex with multiple compression
libraries, or picking the wrong compression libraries.
I suppose that it's no big deal for a special remote to require that
everyone who uses it has xz or whatever installed.</p>
<p>But: The <em>name</em> of the command that is used cannot be stored in the remote
config in the git-annex branch because that would be insecure. A mapping in
the git-annex code from compression formats to known-safe command names
would be one way, or a git config setting that has to be manually set when
enabling the remote. Or both.</p>
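<p>A minimal sketch of the "mapping plus local override" idea described above, written in Python for illustration rather than as git-annex code. The table <code>SAFE_COMPRESSORS</code>, the function names, and the <code>local_override</code> parameter are all hypothetical; the only point is that the remote config carries a format name, never a command.</p>

```python
import subprocess

# Hypothetical allow-list: format name -> argv of a trusted compressor.
# Only the format name would be stored in the remote config; the command
# itself is never read from the git-annex branch.
SAFE_COMPRESSORS = {
    "gzip": ["gzip", "-c"],
    "xz": ["xz", "-c"],
    "zstd": ["zstd", "-c"],
}


def compressor_argv(fmt, local_override=None):
    """Return the argv to run for a compression format.

    local_override stands in for a value the user set by hand in their
    local git config when enabling the remote; it wins over the table.
    """
    if local_override is not None:
        return list(local_override)
    try:
        return list(SAFE_COMPRESSORS[fmt])
    except KeyError:
        raise ValueError(f"unknown compression format: {fmt!r}") from None


def compress(fmt, data, local_override=None):
    """Pipe bytes through the selected compressor and return its output."""
    argv = compressor_argv(fmt, local_override)
    result = subprocess.run(argv, input=data, stdout=subprocess.PIPE, check=True)
    return result.stdout
```

<p>With this shape, a format name that is not in the table is rejected instead of being executed, and running <code>compress</code> requires the named program to be installed locally.</p>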
comment 3http://git-annex.branchable.com/todo/option_for___40__fast__41___compression_on_special_remotes_like___34__directory__34__/comment_3_d4672f72a00509186cfc5dd85e1da140/Atemu2021-04-20T18:05:27Z2021-04-20T18:05:27Z
<p>Would it perhaps be possible to set the compression using filters like file name/extension?<br />
For example, I wouldn't want git-annex to waste time compressing multimedia files that are already close to maximum entropy. Since they make up the majority of my special remote's content, rewriting them would be very time-intensive (even more so when remote solutions are involved).
Certain compressors might also work better on some file types than on others. <br />
This could be very important to scientists using datalad, as they are likely to (A) be working with very specific kinds of data where certain compressors might significantly outperform others and (B) have large quantities of data where compression is essential.</p>
<p>If compressors are going to be limited to a known-safe selection, an important aspect to keep in mind would be compression levels, as some compressors like zstd can range from lzo-like to lzma-like performance characteristics.</p>
<p>Definitely a +1 on this one though, it would be very useful for my use-case as well.</p>
comment 4http://git-annex.branchable.com/todo/option_for___40__fast__41___compression_on_special_remotes_like___34__directory__34__/comment_4_4560eff896578ac2779b03ec3484d0b2/lucas.gautheron2021-06-05T10:10:57Z2021-06-05T10:10:57Z
<p>This would be <em>extremely</em> useful!</p>
<p>Especially if it was possible to turn compression on/off based on each file name or metadata.</p>
comment 5http://git-annex.branchable.com/todo/option_for___40__fast__41___compression_on_special_remotes_like___34__directory__34__/comment_5_a154a025d441db4e0140fee821d1791c/joey2024-01-25T15:53:59Z2024-01-23T16:25:28Z
<p>There would need to be a way for git-annex to tell if an object on a remote
was compressed or not, and with what compressor. The two reasonable ways to
do it are to use different namespaces for objects, or to make compression
an immutable part of a special remote's configuration that is set at
initremote time.</p>
<p>(A less reasonable way (IMHO) would be to record in the remote's state
for each object what compression was used for it; this would bloat the
git-annex branch.)</p>
<p>A problem with using different namespaces is that git-annex then will
have to do extra work to check for each one when retrieving or checking
presence of an object.</p>
<p>When chunking is enabled, git-annex already has to check for chunked and
unchunked versions of an object. This can include several different chunk
sizes that have been in use at different times.</p>
<p>So, we have a bit of an exponential blowup when combining chunking with
compression namespaces. Probably the number of chunk sizes tried is less
than 4 (eg logged chunk size, currently configured chunk size (when
different), unchunked), and the number of possible compressors is less than
10. So a 20x or so increase in overhead. So I'm cautious of the namespaces approach.</p>
<p>It would be possible to configure a single compressor that may be used at
initremote time, but let some objects be compressed and others not. That would
only double the overhead, and only for remotes with a compressor
configured.</p>
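<p>To make the overhead comparison concrete, here is a small Python sketch that enumerates the lookups each approach would need. The chunk configurations and compressor names are illustrative assumptions, not values taken from git-annex, and the function names are hypothetical.</p>

```python
from itertools import product


def lookups_with_namespaces(chunk_configs, compressors):
    """Every chunk config must be tried in every compression namespace,
    including the uncompressed one (represented here as None)."""
    return list(product(chunk_configs, [None] + list(compressors)))


def lookups_with_fixed_compressor(chunk_configs, compressor):
    """With one compressor fixed at initremote time, each chunk config is
    tried at most twice: uncompressed and with that compressor."""
    return list(product(chunk_configs, [None, compressor]))


# Illustrative inputs only.
chunk_configs = ["logged", "configured", "unchunked"]
compressors = ["gzip", "xz", "zstd", "lz4"]

namespaced = lookups_with_namespaces(chunk_configs, compressors)
fixed = lookups_with_fixed_compressor(chunk_configs, "zstd")
```

<p>With these made-up inputs, the namespaces approach needs 15 lookups in the worst case against 6 for a single configured compressor, compared with 3 when no compression is involved. The exact factor depends entirely on how many chunk sizes and compressors are in play.</p>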
<p>As far as enabling compression based on filename though, it seems to me
that an easier way to do that would be to have one remote with compression
enabled, and a second one without it, and use preferred content to control
which files are stored in which.</p>
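<p>The two-remote idea can be illustrated with a small Python sketch that mimics the routing decision. This is not how git-annex evaluates preferred content; the remote names <code>plain</code> and <code>compressed</code> and the pattern list are hypothetical. In git-annex itself the equivalent would be preferred content expressions using glob terms such as <code>include=*.mp4</code>, set per remote with <code>git annex wanted</code>.</p>

```python
from fnmatch import fnmatch

# Hypothetical patterns for files that are already compressed and would
# gain little from a second compression pass.
UNCOMPRESSED_PATTERNS = ["*.mp4", "*.mkv", "*.jpg", "*.flac"]


def remote_for(filename):
    """Pick which of two hypothetical remotes should store a file."""
    name = filename.lower()
    if any(fnmatch(name, pattern) for pattern in UNCOMPRESSED_PATTERNS):
        return "plain"       # remote without compression
    return "compressed"      # remote with compression enabled
```

<p>Here a video or image file would be sent to the uncompressed remote, while everything else would go to the compressed one.</p>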