Please consider making the *E backends lower-case the file extensions. This improves deduplication, especially with image files, where cameras often use *.JPG, while other software such as the Dropbox Android app renames the images to *.jpg.
We talked about this very briefly a few months ago:
(Log starts 2015-08-06 21:11:40 CET)
tribut is there a backend that lowercases the extension? or do the *E backends do that anyway?
joeyh it does not
tribut but would it make sense? or am i missing something?
joeyh I don't know.. the extension is only there for stupid programs that follow symlinks and
check extensions. If such a program cares about .GIF vs .gif, you might have a problem
joeyh I think that you can git-annex migrate from hashE to hash, then migrate back, and it'll
update to the new file extension.
tribut i was thinking about content-identical images with .JPG or .jpg extension
tribut and because even the most retarded of programs wont care, i thought the backend could
lowercase the extension
joeyh ah, sure, using the E backend reduces the ability to de-duplicate
joeyh I'd not want to add a e backend set just for this. There's no requirement that the
extension extraction code be stable, so it could be considered changing it to lower-case
joeyh otoh, I have no idea if some programs are dumb enough to care about .gif vs .GIF
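To make the deduplication point concrete, here is a minimal sketch in Python. It assumes a simplified key layout of `BACKEND-s<size>--<hash><ext>`; the function `annex_key` and its `keep_ext` and `lowercase_ext` parameters are hypothetical names made up for illustration, and the extension handling via `os.path.splitext` is only a stand-in for git-annex's real extension extraction code, which has its own rules.

```python
import hashlib
import os


def annex_key(content: bytes, filename: str, keep_ext: bool = True,
              lowercase_ext: bool = False) -> str:
    """Build a simplified SHA256 / SHA256E style key (illustration only)."""
    digest = hashlib.sha256(content).hexdigest()
    if not keep_ext:
        # Plain SHA256 style: the filename plays no part in the key.
        return f"SHA256-s{len(content)}--{digest}"
    # Stand-in for the real extension extraction, e.g. ".JPG".
    ext = os.path.splitext(filename)[1]
    if lowercase_ext:
        # Hypothetical behaviour requested in this thread.
        ext = ext.lower()
    return f"SHA256E-s{len(content)}--{digest}{ext}"


photo = b"identical image bytes"

# Same content, extension differing only in case: two distinct keys.
print(annex_key(photo, "IMG_0001.JPG") == annex_key(photo, "IMG_0001.jpg"))  # False

# Without the extension the keys coincide.
print(annex_key(photo, "IMG_0001.JPG", keep_ext=False)
      == annex_key(photo, "IMG_0001.jpg", keep_ext=False))  # True

# With the hypothetical lower-casing the keys coincide as well.
print(annex_key(photo, "IMG_0001.JPG", lowercase_ext=True)
      == annex_key(photo, "IMG_0001.jpg", lowercase_ext=True))  # True
```

In this model the only thing separating the two keys is the case of the extension, which is why lower-casing it (or dropping it) lets the two copies share one key.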
Joey implemented the '--force' option in git annex migrate at my request to deal with this issue.
It's not the smoothest method, but it works well enough.
Thanks for the hint, CandyAngel.
This is useful; however, it has to be done periodically. I would still prefer if those keys were never created in the first place.
Let's step back a second and think about why the extension is included at all: Some programs, particularly on OSX but also IIRC one or two on linux such as calibre, when presented with a symlink, follow the symlink and look at the extension of the file it points to in order to guess what type of file it is.
That's the entire reason. Since people may end up using such a program or sharing a repo with someone who does, it defaults to SHA256E.
If you are not in that situation, and are being bothered by the extension being preserved, you can just set annex.backend to SHA256.
Also, if someone does have that problem with a repo using SHA256, they can
git annex adjust --unlocked
and get around the problem that way. (Though there are enough caveats about unlocked files that it may not be a suitable solution for everyone.)
And I still don't know if some program might care about a particular casing of a filename extension. It seems possible at least. There's also the problem that, if git-annex started lower-casing extensions when adding new files, it would cause exactly the problem that michael is complaining about above for users in the opposite situation: those who have already added files with upper-case extensions would see deduplication stop working for those files after git-annex changed.
(Most of which, on review, I said already in the IRC log at the top.)
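The compatibility concern about changing the extension handling can be sketched roughly as follows. This is a simplified model rather than git-annex's actual code: `sha256e_key` and its `lowercase_ext` flag are hypothetical, and the key layout `SHA256E-s<size>--<hash><ext>` is reproduced only to the extent needed for the comparison.

```python
import hashlib
import os


def sha256e_key(content: bytes, filename: str, lowercase_ext: bool) -> str:
    """Simplified SHA256E style key; lowercase_ext models the proposed change."""
    digest = hashlib.sha256(content).hexdigest()
    ext = os.path.splitext(filename)[1]
    if lowercase_ext:
        ext = ext.lower()
    return f"SHA256E-s{len(content)}--{digest}{ext}"


scan = b"same bytes in both files"

# Key recorded in the repository before the hypothetical change.
existing = sha256e_key(scan, "scan.JPG", lowercase_ext=False)

# Key computed for an identical file added after the hypothetical change.
added_later = sha256e_key(scan, "copy.JPG", lowercase_ext=True)

# The two copies used to share a key; after the change they no longer do.
print(existing == added_later)  # False
```

Under this model, a repository that already holds keys ending in `.JPG` would only get deduplication back after migrating those existing keys.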
My feeling at this point is that defaulting to SHA256E was probably a mistake: it makes git-annex worse for users who are not afflicted by stupid symlink-following software in an effort to cater to those who are. It would probably be better to switch to SHA256 by default and let whatever users are bitten by that either manually enable SHA256E or use unlocked files.