I am using Google Cloud as a special remote to back up photos in one annex.
It's in my interests, due to storage costs, to keep duplicates to a minimum. I've noticed that the cloud bucket has a lot more files than I would have expected.
I performed the following to get the number of files in the bucket:
$ gsutil ls gs://cloud-<uuid>/ | wc -l
<one large number>
and of course, in the git-annex:
$ find . -path .git -o \( -type f -print \) | wc -l
<one not as large number, about 60% of the larger number>
When I took a sample of the list, I found many files that shared the exact same key but had, for example, both .jpg and .JPG suffixes.
Some of these files, when checked with git-annex whereused --key=, turned out to be files I had re-arranged into different folders. I'm running a script at the moment to get a wider sample of what's happening with these duplicates.
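To widen the sample, the extension-only duplicates can be spotted with a small pipeline. A minimal sketch, assuming the bucket listing has been saved one object name per line (the file name bucket.txt and the sample keys below are made up):

```shell
# bucket.txt stands in for a saved listing of annex keys from the bucket.
cat > bucket.txt <<'EOF'
SHA256E-s100--aaa111.jpg
SHA256E-s100--aaa111.JPG
SHA256E-s200--bbb222.png
EOF

# Strip the trailing extension so SHA256E keys that differ only in
# suffix collapse to the same string, then print the repeated ones.
sed 's/\.[^.]*$//' bucket.txt | sort | uniq -d
```

Each line printed is a content hash that is stored more than once in the bucket.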
One of the problems I speculate may be coming into play is that I originally copied files to the cloud remote via an older Windows git-annex version (repository version 5, I think).
I think the interoperability issues have been ironed out over time, but this might be a remnant from when they were more prolific.... though I do still have one issue that I will report separately once I've gathered more data.
Any advice on how I might deal with the duplicates and get them deleted from the bucket?
OK, please ignore. These scenarios account for only about 23 of the duplicates. There may still be, or have been, a problem... maybe? Because I did add the same file in two different places, with the two case-differing (but otherwise identical) suffixes.
I am running another audit to work out why I have so many more files in the cloud than reachable in the annex. I suppose it's because I removed files previously, but I'm not sure.
Apologies again. I've realised that my count from the find command is totally wrong. It should be:
The reason being... for some reason I have some regular files that contain the key and some actual symbolic links, all intermixed.
Is there something I should do to make this all uniform?
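For what it's worth, the two intermixed forms can be told apart mechanically: a locked file is a symbolic link into .git/annex/objects, while an unlocked file is a small regular "pointer" file whose content names the key. A plain-shell sketch with made-up names (no git-annex needed, just an illustration of the check):

```shell
# Fake the two work-tree states git-annex can leave behind.
ln -sf '.git/annex/objects/xx/yy/KEY/KEY' locked.jpg      # symlink form
printf '/annex/objects/xx/yy/KEY/KEY\n' > unlocked.jpg    # pointer-file form

# -L tests whether a path is a symbolic link.
for f in locked.jpg unlocked.jpg; do
  if [ -L "$f" ]; then
    echo "$f: locked (symlink)"
  else
    echo "$f: unlocked (pointer file)"
  fi
done
```

Running git annex lock (or git annex unlock) across the tree is the usual way to make the state uniform, if a uniform state is what's wanted.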
git-annex preserves the filename extension as-is; it does not try to normalize it to lower case or anything. See the discussion at Lower-case extension for SHA256E and similar.
You can configure git-annex to use SHA256 rather than the default SHA256E, so the extension is not part of the key and identical content will deduplicate better. Eg, in .gitattributes:
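For instance, a line like the following sets the backend for everything in the repository (narrow the `*` pattern if only some paths should be affected):

```
* annex.backend=SHA256
```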
You could then run
git-annex migrate
on existing files to switch them to that backend. Bear in mind that you would have to re-upload the files to your special remote, and would have to run git-annex dropunused --from that remote to clean up the duplicate data stored there.
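Put together, the cleanup would look roughly like this (a sketch, not verified here; `cloud` stands in for the special remote's name, and git annex unused must be run first so dropunused has a numbered list of keys to work from):

```
git annex migrate .                    # rewrite keys to the SHA256 backend
git annex copy --to cloud .            # re-upload content under the new keys
git annex unused --from cloud          # find keys on the remote no longer referenced
git annex dropunused --from cloud all  # delete the duplicate objects there
```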