Please describe the problem.
FTR, I can work around this specific problem (somehow, not sure how yet); I also didn't intend to report two bugs in 10.20230626 today, but I found them both in a specific repo while trying to work around new error output from 10.20230626, and both of them seemed to justify being reported. (To avoid future confusion: the other bug relates to a change in the meaning of sync.)
In short, I've ended up with the "same" podcast known in git-annex by two different encodings of the filename, both of which appear in git annex list, but only one of which appears in the checked out annex file system.
While git annex list can show these uniquely, there doesn't appear to be a way to identify the relevant file uniquely to, eg, git annex drop, or git annex whereis, or even a more specific git annex list. They do not appear to accept the "encoded format" that is output by git annex list, which makes round-tripping printed filenames difficult. And since the files differ only by encoding, I'm not even sure there is any way to specify just one of them.
I found this while trying to debug what had happened with a podcast downloaded from:
https://popculturedetective.agency/feed/podcast/
with git-annex; specifically the episode with the title starting "A Conversation with Artist Simon ...", where the artist's surname has a Latin accented character in it.
My git-annex list is showing two references to that podcast, with subtly different names:
ewen@basadi:~/Music/podcasts/archive/Pop_Culture_Detective__Audio_Files$ git annex list | grep "A_Conv"
XXXX_ "A_Conversation_with_Artist_Simon_Sta\314\212lenhag.mp3"
XXXX_ "A_Conversation_with_Artist_Simon_St\303\245lenhag.mp3"
ewen@basadi:~/Music/podcasts/archive/Pop_Culture_Detective__Audio_Files$
one of which matches the file now checked out on disk in that directory:
ewen@basadi:~/Music/podcasts/archive/Pop_Culture_Detective__Audio_Files$ LANG=C ls -B | grep A_Conv
A_Conversation_with_Artist_Simon_St\303\245lenhag.mp3
ewen@basadi:~/Music/podcasts/archive/Pop_Culture_Detective__Audio_Files$
and the other does not. As best I can tell \303\245 is the UTF-8 for U+00E5 ("a" with small ring above), and \314\212 is the UTF-8 for U+030A (combining character, small ring above; after an "a"). So both are somewhat legitimate ways to encode that particular accented "a".
The podcast feed (now?) has the 61 cc 8a variant in the title of the podcast episode (ie, a, plus combining ring; equivalent to a\314\212 as git annex now encodes it).
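To make the two byte sequences concrete, here is a minimal shell sketch that prints the UTF-8 bytes of each spelling of the surname. The hexbytes helper is a name made up for this illustration; it only wraps od and is not part of git or git-annex.

```shell
# hexbytes: hypothetical helper, prints stdin as one continuous hex string
hexbytes() {
    od -An -tx1 | tr -d ' \t\n'
    echo
}

# Precomposed spelling: U+00E5, written here as the octal escapes \303\245
printf 'St\303\245lenhag' | hexbytes
# Decomposed spelling: plain "a" followed by U+030A, octal escapes \314\212
printf 'Sta\314\212lenhag' | hexbytes
```

The first line of output contains c3a5 where the second contains 61cc8a, which is the only difference between the two names.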
Digging back through the git history, it appears I had the archive/Pop_Culture_Detective__Audio_Files/A_Conversation_with_Artist_Simon_St\303\245lenhag.mp3 variant by 2022-11-04, and downloaded the archive/Pop_Culture_Detective__Audio_Files/A_Conversation_with_Artist_Simon_Sta\314\212lenhag.mp3 version on 2022-11-03, the day before. (My commit comment from that date implies I was fixing libsyn filenames, I think to remove a URL suffix on them and/or avoid duplicate downloads; so I may also have upgraded git annex around that timeframe.)
I also have a file hard linked to this podcast (from when it was first downloaded, 2022-11-03) which has the other encoding, implying that at one point git-annex put the other variation into the checked out files (since I hard link all "newly downloaded" files into another directory, targeted at the content file inside the annex, to make them easier to play back).
ewen@basadi:~/Music/podcasts$ ls -ilB A_Conversation_with_Artist_Simon_Stålenhag.mp3
19692405 -r--r--r-- 2 ewen staff 42854272 3 Nov 2022 A_Conversation_with_Artist_Simon_Stålenhag.mp3
ewen@basadi:~/Music/podcasts$ ls -ilBL archive/Pop_Culture_Detective__Audio_Files/A_Conversation_with_Artist_Simon_Stålenhag.mp3
19692405 -r--r--r-- 2 ewen staff 42854272 3 Nov 2022 archive/Pop_Culture_Detective__Audio_Files/A_Conversation_with_Artist_Simon_Stålenhag.mp3
ewen@basadi:~/Music/podcasts$ ls -ilB archive/Pop_Culture_Detective__Audio_Files/A_Conversation_with_Artist_Simon_Stålenhag.mp3
19757493 lrwxr-xr-x 1 ewen staff 206 4 Nov 2022 archive/Pop_Culture_Detective__Audio_Files/A_Conversation_with_Artist_Simon_Stålenhag.mp3 -> ../../.git/annex/objects/Xm/3v/SHA256E-s42854272--5b156789d2152e69dd0738bb75d42ddd9172a891e9646cc53d86963bd6014dc2.mp3/SHA256E-s42854272--5b156789d2152e69dd0738bb75d42ddd9172a891e9646cc53d86963bd6014dc2.mp3
ewen@basadi:~/Music/podcasts$ ls -il .git/annex/objects/Xm/3v/SHA256E-s42854272--5b156789d2152e69dd0738bb75d42ddd9172a891e9646cc53d86963bd6014dc2.mp3/SHA256E-s42854272--5b156789d2152e69dd0738bb75d42ddd9172a891e9646cc53d86963bd6014dc2.mp3
19692405 -r--r--r-- 2 ewen staff 42854272 3 Nov 2022 .git/annex/objects/Xm/3v/SHA256E-s42854272--5b156789d2152e69dd0738bb75d42ddd9172a891e9646cc53d86963bd6014dc2.mp3/SHA256E-s42854272--5b156789d2152e69dd0738bb75d42ddd9172a891e9646cc53d86963bd6014dc2.mp3
ewen@basadi:~/Music/podcasts$
Given the podcast feed itself (still) contains the UTF-8 (hex) 61 cc 8a, which is the version I seem to have from first download, it feels like git annex might have changed to canonicalising the UTF-8 in a way that it didn't previously, and handled having files with the "old" encoding by (a) changing to the new encoding (eg, in the checked out file) and (b) retaining the old and new encodings in the list of files. (And it seems like this would have happened in a 2022 release of git-annex.)
What seems to have changed with 10.20230626 is (from the 10.20230626 changelog):
* Many commands now quote filenames that contain unusual characters the
same way that git does, to avoid exposing control characters to the
terminal.
which makes sense as far as it goes, so now the two different encodings known to git annex are visible in the "list".
But the format in the output for filenames containing UTF-8 is not accepted by, eg, "git annex drop", which threw error messages, which I noticed today.
In particular (a) commands receiving file names do not seem to understand these escaped versions, which makes round tripping of filenames (eg "git annex list" filtered and then handed back to "git annex drop") more difficult than in all previous versions of git annex, (b) the UTF-8 characters are output as octal escapes of each individual byte, which makes matching them against filenames in the file system more difficult (although it seems usually they should match LANG=C ls -B output -- I just got unlucky here with git annex learning about two variations on the name...).
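As a stopgap for the round-tripping problem in (a), the quoted form can be decoded outside git-annex. The following is only a sketch, assuming the output follows git's usual quoting convention (double quotes, \NNN octal escapes, and the short escapes such as \t, \" and \\). unquote_path is a hypothetical helper name, not something git or git-annex provides.

```shell
# unquote_path: hypothetical helper (not part of git or git-annex).
# Turns one git-style quoted name, such as
#   "A_Conversation_with_Artist_Simon_St\303\245lenhag.mp3"
# back into the raw filename bytes. Unquoted names pass through unchanged.
unquote_path() (
    case $1 in
        \"*\")
            # Placeholder byte for escaped backslashes; a real 0x01 byte in a
            # name would itself have been escaped, so it cannot collide.
            soh=$(printf '\001')
            inner=${1#\"}
            inner=${inner%\"}
            inner=$(printf '%s\n' "$inner" | LC_ALL=C sed \
                -e 's/\\\\/'"$soh"'/g' \
                -e 's/\\"/"/g' \
                -e 's/\\\([0-7][0-7][0-7]\)/\\0\1/g' \
                -e 's/'"$soh"'/\\\\/g')
            # printf %b expands \0NNN octal escapes and \t, \n, \\ and friends
            printf '%b\n' "$inner"
            ;;
        *)
            printf '%s\n' "$1"
            ;;
    esac
)
```

With that in place, something like git annex whereis "$(unquote_path '"A_Conversation_with_Artist_Simon_St\303\245lenhag.mp3"')" hands the real bytes to the command. Names ending in a newline would lose it in the command substitution, so this is not a complete solution.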
What steps will reproduce the problem?
Something like:
git annex importfeed --template='archive/${feedtitle}/${itemtitle}${extension}' https://popculturedetective.agency/feed/podcast/
git annex list archive/Pop_Culture_Detective__Audio_Files
LANG=C ls -B archive/Pop_Culture_Detective__Audio_Files/*
git annex list "A_Conversation_with_Artist_Simon_St\303\245lenhag.mp3"
but it may require first importing the feed with git-annex older than 10.20230626 and then again with git annex 10.20230626 (I'm not entirely clear how I ended up in this exact state; but that specific feed definitely was imported with an older git-annex first, I'm just not 100% certain how old).
What version of git-annex are you using? On what operating system?
Now git-annex 10.20230626, on macOS, installed from Home Brew; before git-annex 10.2022xxxx, on macOS, installed from Home Brew (around early November 2022):
ewen@basadi:~$ git annex version
git-annex version: 10.20230626
build flags: Assistant Webapp Pairing FsEvents TorrentParser MagicMime Benchmark Feeds Testsuite S3 WebDAV
dependency versions: aws-0.24.1 bloomfilter-2.0.1.0 cryptonite-0.30 DAV-1.3.4 feed-1.3.2.1 ghc-9.4.4 http-client-0.7.13.1 persistent-sqlite-2.13.1.1 torrent-10000.1.3 uuid-1.3.15 yesod-1.6.2.1
key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 BLAKE2B256E BLAKE2B256 BLAKE2B512E BLAKE2B512 BLAKE2B160E BLAKE2B160 BLAKE2B224E BLAKE2B224 BLAKE2B384E BLAKE2B384 BLAKE2BP512E BLAKE2BP512 BLAKE2S256E BLAKE2S256 BLAKE2S160E BLAKE2S160 BLAKE2S224E BLAKE2S224 BLAKE2SP256E BLAKE2SP256 BLAKE2SP224E BLAKE2SP224 SHA1E SHA1 MD5E MD5 WORM URL X*
remote types: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar git-lfs httpalso borg hook external
operating system: darwin x86_64
supported repository versions: 8 9 10
upgrade supported from repository versions: 0 1 2 3 4 5 6 7 8 9 10
ewen@basadi:~$
Please provide any additional information below.
Additional example of the "cannot use output name as input" problem:
ewen@basadi:~/Music/podcasts/archive/Pop_Culture_Detective__Audio_Files$ git annex list | grep A_Conv
XXXX_ "A_Conversation_with_Artist_Simon_Sta\314\212lenhag.mp3"
XXXX_ "A_Conversation_with_Artist_Simon_St\303\245lenhag.mp3"
ewen@basadi:~/Music/podcasts/archive/Pop_Culture_Detective__Audio_Files$ git annex list "A_Conversation_with_Artist_Simon_Sta\314\212lenhag.mp3"
here
|bethel
||nas01
|||web
||||bittorrent
|||||
error: pathspec 'A_Conversation_with_Artist_Simon_Sta\314\212lenhag.mp3' did not match any file(s) known to git
Did you forget to 'git add'?
list: 1 failed
ewen@basadi:~/Music/podcasts/archive/Pop_Culture_Detective__Audio_Files$ git annex list "A_Conversation_with_Artist_Simon_St\303\245lenhag.mp3"
here
|bethel
||nas01
|||web
||||bittorrent
|||||
error: pathspec 'A_Conversation_with_Artist_Simon_St\303\245lenhag.mp3' did not match any file(s) known to git
Did you forget to 'git add'?
list: 1 failed
ewen@basadi:~/Music/podcasts/archive/Pop_Culture_Detective__Audio_Files$
And proof that the actual downloaded file origin is the same:
ewen@basadi:~/Music/podcasts/archive/Pop_Culture_Detective__Audio_Files$ git annex whereis | sed -n '/A_Conv/,/ok/p'
whereis "A_Conversation_with_Artist_Simon_Sta\314\212lenhag.mp3" (4 copies)
00000000-0000-0000-0000-000000000001 -- web
4e10813c-063b-4cd9-9680-db6685b1c5e8 -- bethel_data_drive [bethel]
680fc999-1dc8-465d-b5ae-5defdb18d019 -- basadi (Mac Mini 2020) [here]
9f693b73-3283-45fa-83d3-251d57da7cd3 -- Synology DS216+ [nas01]
web: https://popculturedetective.agency/podcast-download/18821/a-conversation-with-artist-simon-sta%cc%8alenhag.mp3
ok
whereis "A_Conversation_with_Artist_Simon_St\303\245lenhag.mp3" (4 copies)
00000000-0000-0000-0000-000000000001 -- web
4e10813c-063b-4cd9-9680-db6685b1c5e8 -- bethel_data_drive [bethel]
680fc999-1dc8-465d-b5ae-5defdb18d019 -- basadi (Mac Mini 2020) [here]
9f693b73-3283-45fa-83d3-251d57da7cd3 -- Synology DS216+ [nas01]
web: https://popculturedetective.agency/podcast-download/18821/a-conversation-with-artist-simon-sta%cc%8alenhag.mp3
ok
ewen@basadi:~/Music/podcasts/archive/Pop_Culture_Detective__Audio_Files$
(where I had to use that command variant, as there doesn't seem to be any input method to specify those two encodings separately )
Have you had any luck using git-annex before? (Sometimes we get tired of reading bug reports all day and a lil' positive end note does wonders)
Thanks again for writing git-annex. It's been mostly trouble free for 10 years as a "podcatcher". Today seems to be something of a "find all the surprises" day, having upgraded git-annex (from Home Brew) earlier this week.
git-annex doesn't mess with filename encodings except for the new escaping on output. But as far as the path from podcast to disk, it's bytes in and bytes out with no concern for encoding. So I don't know how you got these two files, but it seems a reasonable hypothesis that the podcast changed the title encoding (and perhaps then changed it back).
It's strange to me that you don't have two separate files visible on disk. These are different filenames after all. I wonder if you are using a filesystem or OS that treats those as the same filename. That would certainly make it harder to list the files some other way and choose one to paste to git rm.
Note that git has the same problem of many commands not accepting filenames output by other commands.
The best workaround I know of is to use git -c core.quotePath=false to get it to output the actual filenames.
I do think it might be worth adding an option to git-annex to make it accept the quotepath formatted filenames. I don't think it can by default.
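To illustrate the core.quotePath workaround, here is a small sketch that builds a throwaway repository and lists the same file with quoting on and off. The demo_quotepath function name and the temporary directory are assumptions made for the example; only git init, git add and git ls-files are used.

```shell
# demo_quotepath: hypothetical demo, runs in a subshell so the cd is contained
demo_quotepath() (
    repo=$(mktemp -d) || exit 1
    cd "$repo" || exit 1
    git init -q .
    name=$(printf 'St\303\245lenhag.mp3')   # precomposed spelling
    : > "$name"
    git add -- "$name"
    # Quoting on: bytes above 0x7f appear as octal escapes inside double quotes
    git -c core.quotePath=true ls-files
    # Quoting off: the filename bytes are printed verbatim
    git -c core.quotePath=false ls-files
    cd / && rm -rf "$repo"
)

# Only run the demo when git is installed
if command -v git >/dev/null 2>&1; then
    demo_quotepath
fi
```

The second form is the one that can be pasted or piped back into another command. Names containing a double quote, backslash or control character are still quoted even with core.quotePath=false.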
Yes, it's also somewhat strange to me that I ended up with two variations of this filename (by UTF-8 encoding) on disk.
The filesystem is the default modern macOS one (APFS, case preserving but case insensitive). From a quick experiment, it looks like the encoding of "a with circle above" is one of the "case preserving but case insensitive" things of the APFS file system. Ie, I can create either variation, but whichever variation is created first is treated the same as the other variation when it comes to opening/updating the file. Which I guess makes sense, as they're two different UTF-8 encodings of ultimately the same glyph.
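A quick experiment along those lines can be sketched in shell: create a file under one spelling and test whether the other spelling reaches it. check_normalization is a hypothetical helper name, and the result depends entirely on the filesystem holding the directory passed in.

```shell
# check_normalization DIR: hypothetical helper. Creates a file under the
# precomposed name inside a scratch directory below DIR, then checks whether
# the decomposed name resolves to the same file.
check_normalization() (
    dir=$(mktemp -d "${1:-.}/normcheck.XXXXXX") || exit 1
    nfc=$(printf 'St\303\245lenhag')      # U+00E5
    nfd=$(printf 'Sta\314\212lenhag')     # "a" + U+030A
    : > "$dir/$nfc"
    if [ -e "$dir/$nfd" ]; then
        echo "normalization-insensitive: both spellings name one file"
    else
        echo "normalization-sensitive: the spellings are distinct names"
    fi
    rm -rf "$dir"
)

check_normalization "${TMPDIR:-/tmp}"
```

On a filesystem that behaves as described above, the first message would be expected; on one that compares names byte for byte, the second.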
FWIW, I also noticed while investigating this, this morning, that the encoding on disk in my annex had switched back to being consistent between my linked copy (2022-11-03) and the annex archive directory (2022-11-04). So with the two versions known to the annex, it feels like I might be seeing a "first to be created" race in the UTF-8 encoding used when sync runs.
Also FTR, in addition to the possibility that the podcast RSS changed encoding between the two runs, it's also possible that this got canonicalised by, eg, shell expansion, around 2022-11-03 / 2022-11-04; it definitely looks like I did a bunch of automated "git mv ..." to fix up filenames around that point. (And then possibly the next git annex podcast fetch re-learnt the other name.)
Since I seem to have stabilised on one encoding on disk right now, I'm going to try to make git-annex forget about the other encoding of the name, to tidy up this particular confusion for now.
But I agree it would be helpful to have a command variant that can accept the octal-encoded byte sequences (now) output by git annex list. Both for cases like this, and in general for round tripping output back to input (something I do in some cases to handle scripted checks on annexed files against other things).
Thanks for the reply,
Ewen
Interestingly, just renaming the file on disk (git mv) is sufficient to make the second (duplicate) entry go away, as the second one gets flagged as "deleted". And if I commit both changes, then it seems to be persistent. Ie, after the commit, I can git mv the file back to the original on-disk name, and commit that, and git annex list only shows the one name. That seems to survive git annex sync --no-content and even another run of my podcast fetching. So I think that dance solves my immediate "cannot reference by name" problem -- ie, move the one on disk aside, commit, move back, commit.
(I still have a problem with my auto-cleanup automation for this repository -- git annex drop ... if it's no longer linked into the "podcasts to play" repo -- but I'm fairly sure I can fix the detection of that somehow. And the few special cases that no longer auto-drop by "name from git annex list" I can drop by hand via wildcards or tab-completion.)
Other than the feature request (some way to feed the escaped output back in as input) I think this bug is resolved. Thanks for your comments.
Ewen
Interestingly, git rm does have a way to make it accept quotepath formatted filenames:
I don't think that's the first thing I would have reached for though. It's not a common option supported by other git commands. I would have probably instead used git -c core.quotePath=false ls-files, filtered the output to only have one of the two related filenames, and passed it to git rm as a parameter.
If git had a common way to accept quotepath input, I'd think that git-annex should support it, but since it doesn't, I'm unsure that it's worth complicating git-annex, since core.quotePath=false can already be used.
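The filter-and-pass-on approach could look roughly like the sketch below. paths_matching is a hypothetical helper, and it assumes no tracked filename contains a newline. The commented usage deliberately uses git rm --cached rather than plain git rm; that is an assumption on my part, chosen so the shared on-disk file is left alone.

```shell
# paths_matching PATTERN: hypothetical helper. Lists tracked files whose raw
# name contains the given bytes. grep runs in the C locale so the match is
# byte for byte rather than by character.
paths_matching() {
    git -c core.quotePath=false ls-files | LC_ALL=C grep -F -- "$1"
}

# Example use, commented out because it changes the index:
#   decomposed=$(printf 'Sta\314\212lenhag')
#   paths_matching "$decomposed" | while IFS= read -r path; do
#       git rm --cached -- "$path"
#   done
```

Building the pattern with printf and octal escapes means the exact bytes printed by git annex list can be reused without typing the accented character.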