Please describe the problem.
The original report was about a change in the behavior of git-annex (as a whole installed bundle): .json files started to be added to git-annex instead of git, even though they remained text files and .gitattributes was configured to send all files with a text mimetype to git.
This happened because libmagic added detection of .json files and now reports them as application/json instead of text/plain as before.
Note that the initial idea behind adding treatment of mime types was to automate the most reasonable decision about what goes into git and what goes into the annex, based on whether a file is text or binary.
The bug report referenced above was just marked "done" with a comment that "the magic database changing behavior is not a bug in git-annex", without actually addressing the underlying issue. I even somehow got the wrong impression that "we had it fixed" and was surprised to stumble into it again. I think the issue should be properly addressed, ideally without requiring users to adjust their .gitattributes files (and introducing a dependency on a newer git-annex version), so that the desired behavior of text files going into git, not git-annex, is maintained even across changes in the libmagic DB.
One way, IMHO the easiest now that the -k (keep going) issue was fixed in libmagic, would be for git-annex to treat a "mimetype" specification as "if any mimetype matches" and ask libmagic about all mimetypes of a file, e.g.:
$> file --mime-type -Lk 1.json
1.json: application/json\012- text/plain
so that if a structured text file soon acquires an additional, more specific mimetype (e.g., .md could be reported as application/markdown, just not yet), previous specifications in .gitattributes would still work. After all, those files remain text/plain files!
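To make the proposal concrete, here is a minimal Python sketch of the "any mimetype matches" idea. It is not how git-annex is implemented; the helper names `parse_mime_types` and `any_mimetype_matches` are made up for illustration. It assumes the `file -k` output format quoted above, where successive types are separated by `\012- `, and it assumes glob-style matching of the type.

```python
import fnmatch
import re

# Separator between successive matches in `file -k` output, as in the sample
# above: a literal backslash-012 (or a real newline) followed by "- ".
_SEPARATOR = re.compile(r"(?:\\012|\n)- ")


def parse_mime_types(file_output: str) -> list[str]:
    """Split `file --mime-type -Lk` output into the list of reported types."""
    # Drop the leading "name: " prefix if present (it is absent in brief mode).
    head, sep, tail = file_output.partition(": ")
    types = tail if sep else head
    return [t.strip() for t in _SEPARATOR.split(types) if t.strip()]


def any_mimetype_matches(file_output: str, glob: str) -> bool:
    """Hypothetical rule: true if ANY reported type matches the glob."""
    return any(fnmatch.fnmatchcase(t, glob) for t in parse_mime_types(file_output))


sample = r"1.json: application/json\012- text/plain"
print(parse_mime_types(sample))                    # ['application/json', 'text/plain']
print(any_mimetype_matches(sample, "text/plain"))  # True
```

Under this rule a configuration written against text/plain keeps working for 1.json, because text/plain is still among the reported types even though application/json now comes first.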
If strict matching by the most specialized mime type is needed (I am not sure yet about a use case that would really need it), an additional "mimetypefirstguess" or the like could be added.
Looks like we're agreed this is not necessary, so done --Joey
mimeencoding=binary as the ultimate decision maker. So we are probably doomed (unless Joey sees reason in the reasoning above and implements that as well) to go through all datasets and auto-adjust them to use mimeencoding instead of mimetype.
Yes, mimeencoding=binary is intended for those cases where you just want a robust (presumably) text/binary division.
The "any mimetype matches" approach seems like it could break things too. Consider:
Currently a shell script is found to be only text/x-shellscript, so it would match the above. If git-annex were changed to consider all reported mime types, the shell script, being also text/plain, would not match.
And then, once the mime database solves the halting problem and helpfully starts flagging shell scripts as AI/buggy (all shell scripts are presumably buggy so maybe that AI has an easy job), the behavior on the above example would change for a third time, back to matching.
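The concern can be sketched in Python. The exact rule being referred to is not reproduced here, so this assumes, purely for illustration, a negated rule of the form "not text/plain". The type lists are also assumptions based on the description above (only text/x-shellscript today, text/x-shellscript plus text/plain if all types were considered).

```python
import fnmatch


def any_type_matches(types, glob):
    """True if any of the given MIME types matches the glob."""
    return any(fnmatch.fnmatchcase(t, glob) for t in types)


# Assumed reports for the same shell script under the two behaviors.
reported_today = ["text/x-shellscript"]
reported_all = ["text/x-shellscript", "text/plain"]

# Hypothetical negated rule: the file matches when it is NOT text/plain.
matches_today = not any_type_matches(reported_today, "text/plain")
matches_with_all_types = not any_type_matches(reported_all, "text/plain")

print(matches_today)           # True
print(matches_with_all_types)  # False
```

A positive rule gains matches when more types are considered, while a negated rule loses them. Switching mimetype to "any type matches" could therefore silently change the outcome of existing negated configurations.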