I am using Google Cloud as a special remote to back up photos in one annex.
It's in my interests, due to storage costs, to keep duplicates to a minimum. I've noticed that the cloud bucket has a lot more files than I would have expected.
I performed the following to get the number of files in the bucket:
$ gsutil ls gs://cloud-<uuid>/ | wc -l
<one large number>
and of course, in the git-annex:
$ find . -path .git -o \( -type f -print \) | wc -l
<one not as large number, about 60% of the larger number>
When I took a sample of the list, I found many files that shared the exact same key but had, for example, both .jpg and .JPG suffixes.
Some of these files, when checked with git-annex whereused --key=, turned out to be files I had re-arranged into different folders. I'm running a script at the moment to get a wider sample of what's happening with these duplicates.
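To widen the sample, the extension-only duplicates can be spotted with a small pipeline. A minimal sketch, assuming the bucket listing has been saved one object name per line (the file name bucket.txt and the sample keys below are made up):

```shell
# bucket.txt stands in for a saved listing of annex keys from the bucket.
cat > bucket.txt <<'EOF'
SHA256E-s100--aaa111.jpg
SHA256E-s100--aaa111.JPG
SHA256E-s200--bbb222.png
EOF

# Strip the trailing extension so SHA256E keys that differ only in
# suffix collapse to the same string, then print the repeated ones.
sed 's/\.[^.]*$//' bucket.txt | sort | uniq -d
```

Each line printed is a content hash that is stored more than once in the bucket.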
One of the problems I speculate may be coming into play is that I originally copied files to the cloud remote via an older Windows git-annex version (repository version 5, I think).
I think the interoperability issues have been ironed out over time, but this might be a remnant from when they were more prolific.... though I do still have one issue that I will report separately once I've gathered more data.
Any advice on how I might deal with the duplicates and get them deleted from the bucket?
OK, please ignore. These scenarios account for only about 23 of the duplicates. There may still be, or have been, a problem... maybe? Because I did add the same file in two different places, with the two case-differing (but otherwise identical) suffixes.
I am running another audit to work out why I have so many more files in the cloud than reachable in the annex. I suppose it's because I removed files previously, but I'm not sure.
Apologies again. I've realised that my count from the find command is totally wrong. It should be:
The reason being... for some reason I have some regular files that contain the key and some actual symbolic links, all intermixed.
Is there something I should do to make this all uniform?
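For what it's worth, the two intermixed forms can be told apart mechanically: a locked file is a symbolic link into .git/annex/objects, while an unlocked file is a small regular "pointer" file whose content names the key. A plain-shell sketch with made-up names (no git-annex needed, just an illustration of the check):

```shell
# Fake the two work-tree states git-annex can leave behind.
ln -sf '.git/annex/objects/xx/yy/KEY/KEY' locked.jpg      # symlink form
printf '/annex/objects/xx/yy/KEY/KEY\n' > unlocked.jpg    # pointer-file form

# -L tests whether a path is a symbolic link.
for f in locked.jpg unlocked.jpg; do
  if [ -L "$f" ]; then
    echo "$f: locked (symlink)"
  else
    echo "$f: unlocked (pointer file)"
  fi
done
```

Running git annex lock (or git annex unlock) across the tree is the usual way to make the state uniform, if a uniform state is what's wanted.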
git-annex preserves the filename extension as-is; it does not try to normalize it to lower case or anything. See the discussion at Lower-case extension for SHA256E and similar.
You can configure git-annex to use SHA256 rather than the default SHA256E, so the extension is not part of the key and identical content will deduplicate better. Eg, in .gitattributes:
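For instance, a line like the following sets the backend for everything in the repository (narrow the `*` pattern if only some paths should be affected):

```
* annex.backend=SHA256
```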
You could then run
git-annex migrate
on existing files to switch them to that backend. Bear in mind that you would have to re-upload the files to your special remote, and would have to run git-annex dropunused --from that remote to clean up the duplicate data stored there.
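Put together, the cleanup would look roughly like this (a sketch, not verified here; `cloud` stands in for the special remote's name, and git annex unused must be run first so dropunused has a numbered list of keys to work from):

```
git annex migrate .                    # rewrite keys to the SHA256 backend
git annex copy --to cloud .            # re-upload content under the new keys
git annex unused --from cloud          # find keys on the remote no longer referenced
git annex dropunused --from cloud all  # delete the duplicate objects there
```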