Is there a better way to de-duplicate in a way that considers multiple backends?
Multiple backends can be added to the .git/config annex.backends entry, but what is the purpose of the secondary backends? The first is used when adding new files, but the second (third, fourth, ...) do not seem to serve any purpose. (Or am I missing something?)
Here's my use case, problem, and a possible solution. I frequently use git-annex to de-duplicate. The default SHA256E backend has caused issues since filename case is significant, so I have partially switched to SHA256. I also occasionally use other backends. Now when I'm given an arbitrary file, as far as I can tell, I have to try to de-duplicate once for every possible backend, which amounts to something like
for i in SHA256E SHA256 SKEIN256 ... ; do
    [ -f /tmp/afile.pdf ] && git annex import --clean-duplicates --backend="$i" /tmp/afile.pdf
done
even though my .git/config has annex.backends = "SHA256E SHA256 SKEIN256 ...". I was surprised that --clean-duplicates does not honour all listed annex.backends. In this case hashing multiple times as needed seems quite reasonable IMO, so adding multiple backend support for --clean-duplicates would solve the problem. If you're not keen to modify this existing behaviour, it might instead be sensible to make it opt-in by explicitly specifying all backends to consider, like
git annex import --clean-duplicates --backends="SHA256E SHA256 SKEIN256" /tmp/afile.pdf
or
git annex import --clean-duplicates --backends="$( git config --get annex.backends )" /tmp/afile.pdf
Moving this loop into git-annex would also allow hashing to be parallelized; it currently cannot because the file could disappear.
PS. Thanks for git-annex Joey. I have around 100 annexes and rely on them on a daily basis.
-supernaught
When git-annex is adding a file, a backend can choose to not generate any key, and then it will try the next backend in the list.
The only backend that does that is the URL backend. So if someone lists URL first for some reason, it'll fall back to a backend that is usable. It could just as well crash in that edge case; the annex.backends UI happened before the needs of backends were perfectly understood. (As did the "backend" name...)
Anyway, I see the use case, but..
git annex import actually honors annex.backend settings in .gitattributes before annex.backends in git-config. So, relying on it using the latter to make it check multiple backends won't always work. I don't think it would be good to complicate the .gitattributes annex.backends and --backend to support a list of backends. It seems it would be just as fast for you to run git-annex import once per backend, rather than complicating it to try multiple backends.
I think that if annex.backends were not a list for historical reasons, I'd be suggesting a small shell script is your best option.
And so rather than add a new feature just because annex.backends is historically a list, I'd rather perhaps deprecate annex.backends as unnecessarily complicated, and make annex.backend be a single-backend setting. (Just did that.)
Sorry this didn't quite go the way you wanted! If there is a disadvantage to the simple shell script option, please do let me know..
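For reference, a minimal sketch of what the simple shell script option could look like, wrapping the loop from the original question in a function. The function name dedup_all_backends and the GIT_ANNEX override variable are hypothetical, introduced only for illustration (the override allows a dry run with echo); the git annex invocation itself is the one quoted in the question. Since annex.backends is being deprecated, the backend list is passed explicitly rather than read from git-config.

```shell
#!/bin/sh
# Sketch only: dedup_all_backends and GIT_ANNEX are hypothetical names,
# not part of git-annex.
#
# usage: dedup_all_backends FILE BACKEND...
# Runs "git annex import --clean-duplicates" once per backend, and stops
# as soon as the file has been removed as a duplicate.
dedup_all_backends() {
    file=$1
    shift
    for backend in "$@"; do
        # The file disappears once some backend recognises it as a duplicate.
        [ -f "$file" ] || return 0
        # GIT_ANNEX defaults to "git annex"; left unquoted on purpose so the
        # default splits into the command and its subcommand.
        ${GIT_ANNEX:-git annex} import --clean-duplicates --backend="$backend" "$file"
    done
    return 0
}
```

It would be called as dedup_all_backends /tmp/afile.pdf SHA256E SHA256 SKEIN256. The file is hashed once per backend, sequentially, just as in the hand-written loop.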
Your solution is reasonable. Simpler is better.
Thanks.