I wholeheartedly agree that this could be useful. On that note I (meekly) query why not put annex.backend
also in the set of configs stored in the git-annex branch? Personally I have a script/command file called
_scripts/init-git-annex.cmd checked into the repo so that after cloning I can easily git config any settings
I need and also git annex init the clone by running said script. In OOP terminology it works as a kind
of constructor, if you will.
It's already possible to configure the backend via .gitattributes;
that is strictly better than a git-annex config setting for it would be:
It allows selecting different backends for different types of files, and
even in different branches.
Anyway, this todo can only be about one thing, and that's
maxextensionlength, not backends or whatever other suggestions might
later be added to it.
I went back and looked at why this config was added in the first place, and
it seemed to be partly to support extensions of length 5, but also partly
someone wanted to configure it to something large, so the entire filename
(or a lot of it) was treated as an "extension" and appeared in a key!
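To make the effect of the setting concrete, here is a minimal Python sketch of how a maximum extension length might decide which part of a filename ends up in a key. It is not git-annex's actual code: the function name `select_extension`, the `max_parts` limit, and the rule of walking components from the right are assumptions for illustration, and the real rules may differ (for example in which characters are accepted).

```python
import os


def select_extension(filename: str, max_len: int, max_parts: int = 2) -> str:
    """Return the trailing extension(s) of filename that would be kept.

    Hypothetical rule: walk the dot-separated components from the right,
    keep at most max_parts of them, and stop at the first component that
    is empty or longer than max_len characters.
    """
    # Everything after the first dot of the base name is a candidate.
    components = os.path.basename(filename).split(".")[1:]
    kept = []
    for component in reversed(components):
        if len(kept) == max_parts:
            break
        if not component or len(component) > max_len:
            break
        kept.append(component)
    return "".join("." + c for c in reversed(kept))


print(select_extension("reads.fastq.gz", max_len=4))   # .gz
print(select_extension("reads.fastq.gz", max_len=5))   # .fastq.gz
# A very large limit lets most of the filename through as an "extension".
print(select_extension("report.final-draft-for-review-v2", max_len=100))
```

With a small limit only the short trailing components survive; with a very large one, nearly the whole name after the first dot is kept, which is the situation described above.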
It should probably be hard capped to something sensible. Setting something
wacky like that locally is shooting your own foot, but once it becomes a global
setting, you can make other users shoot their feet.
I can see how, if a repo's domain includes files that have a longer than
usual extension, it would be useful to set this globally. But I also
have to push back against adding global configs, because the general
trend is toward users requesting a global config for everything, which
is not where I want git-annex to end up. Global configs can be an imposition
on other users of the repository, and also risk slowing git-annex down.
It's already possible to configure the backend via .gitattributes;
Thanks for reminding me about that option. I shall use the .gitattributes trick to shorten
my init-git-annex.cmd script.
joey:
It should probably be hard capped to something sensible.
As long as that sensible cap is >= 8, which is the value I use and therefore my lower bound. I tend to think
that at most two extension components are what you usually need for identifying a file's "type" (think ".tar.gz").
As for the length of a component, Windows (and probably other modern desktop environments) produces
for instance such extensions as ".search-ms" (9 characters excluding the dot) for saved Explorer searches.
I don't think that's an unreasonably long extension component for this use case. For more standard file types
to be exchanged via public methods maybe at most five characters per component should be the maximum. However,
git-annex should be lenient towards what it accepts as long as that is reasonably possible to uphold. IMHO, a practical
upper limit for annex.maxextensionlength could be 16 to 20 characters.
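A hard cap of this kind amounts to clamping whatever value is configured. The following is only a sketch of that idea in Python: the name `effective_max_extension_length` is hypothetical, and the cap of 20 is simply the upper end of the range suggested above, not a value git-annex is known to use.

```python
HARD_CAP = 20  # assumed cap, taken from the 16 to 20 suggestion above


def effective_max_extension_length(configured: int, hard_cap: int = HARD_CAP) -> int:
    """Clamp a configured extension length to a hard upper limit."""
    if configured < 0:
        raise ValueError("extension length cannot be negative")
    return min(configured, hard_cap)


print(effective_max_extension_length(8))     # 8
print(effective_max_extension_length(1000))  # 20
```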
WORM and URL keys already contain filenames or parts of filenames. Docs warn that too-long keys might cause issues on Windows, but also that 512 and 384 length hashes already have that problem. So it seems that permitting larger maxextensionlength should not add new issues?
What about adding annex.maxextensionlength to .gitattributes, rather than to git-annex-config? This would solve the use case of .fastq.gz files common in bioinformatics repos.
Well, maxextensionlength does at least have some connection to the
filename, which does point at it being somewhat suitable for
gitattributes. Eg, you might want to allow files with one long extension
to include it all, but not others.
The current implementation uses a single git check-attr process which
reports on all 4 currently supported git-annex attributes for every query.
So adding a new attribute also slows down queries for all the rest by some
amount. To quantify that, I did a benchmark, modifying git-annex to
request (but ignore) an additional attribute. git-annex add of 10000
small files did not slow down by any measurable amount due to that change.
(1:53.78 before the change and 1:52.00 afterwards actually, so the
difference is lost in the noise)
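The following Python sketch shows the general shape of asking one `git check-attr` process about several attributes at once, which is why one more attribute adds little per-query cost. It is not git-annex's implementation (that is written in Haskell): the attribute list is an illustrative assumption, `annex.maxextensionlength` is only the attribute proposed in this thread, and the helper names `parse_check_attr_z` and `query_attrs` are made up here. It relies on the documented `--stdin -z` mode, where each result is emitted as path, attribute, and value, each terminated by NUL.

```python
import os
import subprocess
from typing import Dict, Iterable, Sequence

# Illustrative list; the last entry is the attribute proposed in this thread.
ATTRS = ["annex.backend", "annex.largefiles", "annex.numcopies",
         "annex.maxextensionlength"]


def parse_check_attr_z(raw: bytes) -> Dict[str, Dict[str, str]]:
    """Parse `git check-attr -z` output into {path: {attribute: value}}."""
    fields = raw.split(b"\0")
    if fields and fields[-1] == b"":
        fields.pop()  # drop the empty piece after the final NUL
    if len(fields) % 3 != 0:
        raise ValueError("expected path/attribute/value triples")
    result: Dict[str, Dict[str, str]] = {}
    for i in range(0, len(fields), 3):
        path, attr, value = (os.fsdecode(f) for f in fields[i:i + 3])
        result.setdefault(path, {})[attr] = value
    return result


def query_attrs(paths: Iterable[str], attrs: Sequence[str] = ATTRS,
                repo: str = ".") -> Dict[str, Dict[str, str]]:
    """Ask a single git check-attr process about every path and attribute."""
    stdin = b"".join(os.fsencode(p) + b"\0" for p in paths)
    out = subprocess.run(
        ["git", "check-attr", "--stdin", "-z", *attrs],
        input=stdin, cwd=repo, check=True, capture_output=True,
    ).stdout
    return parse_check_attr_z(out)


sample = (b"reads.fastq.gz\0annex.backend\0SHA256E\0"
          b"reads.fastq.gz\0annex.maxextensionlength\0unspecified\0")
print(parse_check_attr_z(sample))
```

Calling `query_attrs(["reads.fastq.gz"])` needs `git` on the PATH and must be run inside a repository; the parser can be exercised on its own, as the sample at the end does.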
An attribute does impose more overhead than git-annex
config though, since it has to be queried for every file rather than one
time. I benchmarked the overhead of checking annex.maxextensionlength
once per file added, over 10000 files. It slowed down by around 5%.
git-annex add already queries the annex.largefiles attribute once per file.
So, if it could also query annex.maxextensionlength at the same time,
it would be free to support the attribute. And then I'd certainly be in
favor of it. This would need some nontrivial refactoring of the code.
What problems do you see this causing, other than some possible loss of deduplication?
Hard caps are the ultimate global configs
The "need a clear criteria for adding git-annex-config settings" issue will need to be resolved before I do anything about this.