be robust to additions of more specific mime types to libmagic (e.g. application/json)

Please describe the problem.

The original report was about a change in the behavior of git-annex (as a whole installed bundle): .json files started to be added to git-annex instead of git, even though those .json files remained text files and .gitattributes was configured to send all files with a text mimetype to git. It happened because libmagic added detection of .json files and now reports them as application/json instead of text/plain as before.

Note that the initial demand/idea behind adding treatment of mime types was actually to automate the most reasonable decision about what goes into git and what goes into annex, based on whether a file is a text file or a binary.

The bug report referenced above was just marked "done" with a comment that "the magic database changing behavior is not a bug in git-annex" without actually addressing the underlying issue. I even somehow got the wrong impression that "we had it fixed" and was surprised to stumble into it again. I think the issue should be properly addressed, ideally without requiring users to adjust their .gitattributes files (and without introducing a dependency on a newer git-annex version), so that the desired behavior of text files going into git, not git-annex, is maintained even across changes in the libmagic DB.

One way, IMHO the easiest now that the -k (keep going) issue has been fixed in libmagic, would be for git-annex to treat a "mimetype" specification as "if any mimetype matches" and ask libmagic about all mimetypes of a file, e.g.:

$> file --mime-type -Lk 1.json
1.json: application/json\012- text/plain

so that if any structured text file soon acquires an additional, more specific mimetype (e.g., .md could be reported as application/markdown, just not yet), previous specifications in .gitattributes would still work -- after all, those files remain text/plain files!
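To make the "if any mimetype matches" idea concrete, here is a minimal Python sketch, not how git-annex (which is written in Haskell) does or would implement it. The helper names parse_mime_types and matches_any are made up for illustration, and the sketch assumes the separator between matches appears literally as "\012- " as in the output above, and that the file name itself does not contain ": ".

```python
from fnmatch import fnmatchcase


def parse_mime_types(file_output):
    """Split one line of `file --mime-type -k` output into its MIME types.

    Hypothetical helper: assumes the "name: type\\012- type" layout shown
    in the report, with the separator printed as literal characters.
    """
    # Drop the "name: " prefix (assumes the name has no ": " in it).
    _, _, types = file_output.partition(": ")
    # Split on the literal backslash-012-dash-space separator.
    return [t.strip() for t in types.split("\\012- ") if t.strip()]


def matches_any(pattern, mime_types):
    """True if any reported MIME type matches the glob pattern."""
    return any(fnmatchcase(m, pattern) for m in mime_types)


reported = parse_mime_types("1.json: application/json\\012- text/plain")
print(reported)
print(matches_any("text/*", reported))
```

With only the first reported type, a text/* rule would miss this file; with all reported types, the text/plain entry keeps the old rule working.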

If strict matching by the most specialized mime type is needed (I am not sure yet about a use case where it really would be), an additional "mimetypefirstguess" or similar could be added.

Looks like we're agreed this is not necessary, so done --Joey


oh hoh, there is mimeencoding now

OH!! "Much ado about nothing". As Joey reported in datalad issues, there is now handling of mimeencoding=binary as the ultimate decision maker. So we are probably doomed (unless Joey sees reason in the reasoning above and implements that as well) to go through all datasets and auto-adjust them to use mimeencoding instead of mimetype.

Comment by yarikoptic — Fri Dec 20 19:54:04 2019


comment 2

Yes, mimeencoding=binary is intended for those cases where you just want a robust (presumably) text/binary division.

The "any mimetype matches" approach seems like it could break things too. Consider:

(not mimetype=text/plain and (mimetype=text/* or mimetype=application/json)) or mimetype=AI/buggy

Currently a shell script is found to be only text/x-shellscript, so it would match the above. If git-annex were changed to consider all reported mime types, the shell script, being also text/plain, would not match.

And then, once the mime database solves the halting problem and helpfully starts flagging shell scripts as AI/buggy (all shell scripts are presumably buggy so maybe that AI has an easy job), the behavior on the above example would change for a third time, back to matching.
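The three outcomes described above can be checked with a small Python sketch. This is only a stand-in for the expression, not git-annex's matcher: the functions expr, first_only and any_match are hypothetical names, and the lists of reported types (including AI/buggy) are the assumed inputs from the example.

```python
from fnmatch import fnmatchcase


def expr(mt):
    """The example expression; mt(pattern) answers 'mimetype=pattern'."""
    return (
        (not mt("text/plain") and (mt("text/*") or mt("application/json")))
        or mt("AI/buggy")
    )


def first_only(types):
    """Current behavior as described: only the first reported type counts."""
    return lambda pattern: fnmatchcase(types[0], pattern)


def any_match(types):
    """Proposed behavior: a term is true if any reported type matches."""
    return lambda pattern: any(fnmatchcase(t, pattern) for t in types)


today = ["text/x-shellscript"]
with_k = ["text/x-shellscript", "text/plain"]
future = ["AI/buggy", "text/x-shellscript", "text/plain"]

print(expr(first_only(today)))  # matches
print(expr(any_match(with_k)))  # no longer matches
print(expr(any_match(future)))  # matches again
```

The flip comes from the negated term: under "any type matches", "not mimetype=text/plain" becomes false for every text file, so expressions that use negation change meaning.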

Comment by joey — Fri Dec 20 20:00:14 2019
