For some file types, e.g. images and sound, I would also like to add a locality-sensitive hash (LSH) and be able to find duplicates that way. That would let me find, for example, duplicate images where the image metadata or the quality was changed.
Ideally, I would want to add such metadata to Git-Annex. I'm not sure if this should be a backend or kept separate from the backends (as there will be collisions, by design).
I also want fast lookups for a specific hash. I'm not sure whether Git-Annex metadata allows for that. A Git-Annex backend naturally has this feature, but then how would it handle collisions? I would want this LSH backend to just link to a list of matching files, i.e. symlinks to the real files (via the SHA backend or so).
Is anything like that already supported?
If not, would it be possible to add such support to Git-Annex? How?
One of the requirements of a backend is that a collision means the content is identical, so collisions are trivial to handle: it doesn't matter which copy you keep. For dealing with "near duplicates", I'd suggest adding a field with
git annex metadata -s lsh="$(lsh "$filename")" "$filename"
or something like that. The metadata is attached to the file's content rather than its name, which will ensure that the LSH gets recomputed if the content changes and will never get computed more than once for identical content. I think the main drawback of this method is that it's a little more complicated to print metadata en masse than it is to print the key because
git annex find
doesn't support printing metadata fields (as far as I know it can only filter on them). It's certainly possible to construct a command to do it; it's just a little more involved than the commands for finding duplicates.
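To make the "little more involved" command concrete, here is a rough sketch of one way to do it, not a tested recipe. It assumes a hypothetical program lsh that prints a locality-sensitive hash for the file given as its argument, and that each file has a single lsh value. The function names tag_lsh, list_lsh and group_dupes are made up for illustration; the git-annex options used are -s and --get on git annex metadata, and --metadata and --print0 on git annex find.

```shell
#!/usr/bin/env bash
# Sketch only: `lsh` is a hypothetical program that prints a
# locality-sensitive hash of the file given as its argument.

# Store the LSH of each given file in its git-annex metadata.
tag_lsh() {
    local f
    for f in "$@"; do
        git annex metadata -s lsh="$(lsh "$f")" "$f"
    done
}

# Print "LSH<TAB>FILE" for every annexed file that has an lsh field.
list_lsh() {
    local f
    git annex find --metadata 'lsh=*' --print0 |
    while IFS= read -r -d '' f; do
        printf '%s\t%s\n' "$(git annex metadata --get lsh "$f")" "$f"
    done
}

# Read "LSH<TAB>FILE" lines on stdin and print only the lines whose
# LSH is shared by at least two files, with equal hashes adjacent.
group_dupes() {
    LC_ALL=C sort | awk -F '\t' '
        NR > 1 && $1 == prev { if (!shown) print held; print; shown = 1; next }
        { prev = $1; held = $0; shown = 0 }'
}
```

With these defined, something like "tag_lsh *.jpg" followed by "list_lsh | group_dupes" should list the candidate near-duplicates, and "git annex find --metadata lsh=VALUE" should look up the files for one specific hash. Note that git annex find only lists files whose content is present by default, so files that are not in the local repository would be skipped.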