Please describe the problem.
I'm experimenting with the compute special remote by trying to convert FLAC files to .opus.
Some of the music files have unicode characters in the filename, which leads to an incorrect error message saying that the file is not checked into the repository.
It is possible that I'm just doing something wrong here, but as far as I can tell, the unicode characters are simply stripped by git-annex.
What steps will reproduce the problem?
- Commit a file with unicode characters in the filename to the git repository
- Invoke a compute remote with that file
- git annex complains that the file is not checked into the git repository
What version of git-annex are you using? On what operating system?
I'm running on Linux and my locale is de_DE.UTF-8:
$ locale
LANG=de_DE.UTF-8
LC_CTYPE="de_DE.UTF-8"
LC_NUMERIC="de_DE.UTF-8"
LC_TIME="de_DE.UTF-8"
LC_COLLATE="de_DE.UTF-8"
LC_MONETARY="de_DE.UTF-8"
LC_MESSAGES="de_DE.UTF-8"
LC_PAPER="de_DE.UTF-8"
LC_NAME="de_DE.UTF-8"
LC_ADDRESS="de_DE.UTF-8"
LC_TELEPHONE="de_DE.UTF-8"
LC_MEASUREMENT="de_DE.UTF-8"
LC_IDENTIFICATION="de_DE.UTF-8"
LC_ALL=
git-annex was installed using Homebrew.
git-annex version: 10.20250520
build flags: Pairing DBus DesktopNotify TorrentParser MagicMime Servant Benchmark Feeds Testsuite S3 WebDAV
dependency versions: aws-0.24.4 bloomfilter-2.0.1.2 crypton-1.0.4 DAV-1.3.4 feed-1.3.2.1 ghc-9.8.4 http-client-0.7.19 persistent-sqlite-2.13.3.1 torrent-10000.1.3 uuid-1.3.16
key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 BLAKE2B256E BLAKE2B256 BLAKE2B512E BLAKE2B512 BLAKE2B160E BLAKE2B160 BLAKE2B224E BLAKE2B224 BLAKE2B384E BLAKE2B384 BLAKE2BP512E BLAKE2BP512 BLAKE2S256E BLAKE2S256 BLAKE2S160E BLAKE2S160 BLAKE2S224E BLAKE2S224 BLAKE2SP256E BLAKE2SP256 BLAKE2SP224E BLAKE2SP224 SHA1E SHA1 MD5E MD5 WORM URL GITBUNDLE GITMANIFEST VURL X*
remote types: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar git-lfs httpalso borg rclone hook external compute mask
operating system: linux x86_64
supported repository versions: 8 9 10
upgrade supported from repository versions: 0 1 2 3 4 5 6 7 8 9 10
local repository version: 10
Please provide any additional information below.
Here is a minimal reproduction of the problem:
$ git init compute-unicode
$ cd compute-unicode
$ touch "A filename without Unicode characters.txt"
$ touch "Ä filename with Unicöde chäracters.txt"
$ git add .
$ git commit -m "Demo"
[main (Root-Commit) 3655a71] Demo
2 files changed, 0 insertions(+), 0 deletions(-)
create mode 100644 A filename without Unicode characters.txt
create mode 100644 "\303\204 filename with Unic\303\266de ch\303\244racters.txt"
$ git annex init
init ok
(recording state in git...)
$ git annex initremote passthrough type=compute program=git-annex-compute-passthrough
initremote passthrough ok
(recording state in git...)
$ git annex addcomputed --to=passthrough "A filename without Unicode characters.txt" works.txt
addcomputed passthrough
(adding works.txt...) (checksum...)
ok
(recording state in git...)
$ git annex addcomputed --to=passthrough "Ä filename with Unicöde chäracters.txt" fails.txt
addcomputed passthrough
git-annex: The computation needs an input file that is not checked into the git repository: filename with Unicde chracters.txt
failed
addcomputed: 1 failed
Note how the unicode characters are simply missing in git-annex's message: " filename with Unicde chracters.txt".
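One way to check that the mangled name is what you get when every non-ASCII byte is dropped (rather than some other transformation) is to simulate the stripping and compare. This is only a sketch: `strip_high_bytes` is a hypothetical helper, not something git-annex provides, and it assumes the filename is UTF-8 encoded, so that all bytes of the umlauts are in the range 0x80-0xFF.

```shell
#!/bin/sh
# Hypothetical helper (not part of git-annex): delete every byte >= 0x80
# from stdin. LC_ALL=C makes tr operate on bytes instead of characters.
strip_high_bytes() {
    LC_ALL=C tr -d '\200-\377'
}

# Example use: feed the original filename through the filter and compare
# the result by eye with the name printed in git-annex's error message.
#   printf '%s\n' "Ä filename with Unicöde chäracters.txt" | strip_high_bytes
```

If the result matches the name in the error message byte for byte, that would suggest the bytes are being discarded somewhere rather than re-encoded.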
I first thought this was a problem with my script, but it seems that git-annex strips the Unicode characters before invoking it.
The passthrough-remote looks like this (adapted from the ImageMagick example):
#!/bin/sh
set -e
if [ -z "$1" ] || [ -z "$2" ]; then
    echo "Specify the input file, followed by the output file." >&2
    echo "Example: input.txt output.txt" >&2
    exit 1
fi
echo "INPUT: $1" > /tmp/passthrough.log
echo "OUTPUT: $2" >> /tmp/passthrough.log
echo "INPUT $1"
read input
echo "OUTPUT $2"
read output
if [ -n "$input" ]; then
    cat "$input" > "$output"
fi
The log file in /tmp/passthrough.log doesn't have the Unicode characters:
INPUT: filename with Unicde chracters.txt
OUTPUT: fails.txt
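Because a terminal or pager can itself fail to render non-ASCII characters, it may be worth logging the raw bytes of the argument as well. The following is a minimal sketch; `show_bytes` is a hypothetical helper name, and it relies only on the POSIX `printf`, `od` and `tr` utilities.

```shell
#!/bin/sh
# Hypothetical helper: print the bytes of its first argument as one
# continuous string of lowercase hex digits, followed by a newline.
show_bytes() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
    echo
}

# Possible use inside the passthrough script:
#   echo "INPUT bytes: $(show_bytes "$1")" >> /tmp/passthrough.log
```

In UTF-8, "Ä" is the byte pair c3 84, so if those bytes are absent from the hex dump, they were already gone before the script ran.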
Have you had any luck using git-annex before? (Sometimes we get tired of reading bug reports all day and a lil' positive end note does wonders)
I've been happily managing my important data (as well as things like my music collection) with git-annex for a few years now, with it making sure that everything has several copies on different external storage media.
That's unusual. Linux and Homebrew? I just want to check you didn't typo there and mean to say you're on OSX.
Tried just now (including the same locale setting) and it does not fail for me:
There are 3 possibilities here:
The best way to track down which of these is the problem is strace, so could you please try this:
Here's how that strace looks for me, when the characters are making it through unscathed:
(I commented out the passthrough.log writing from the script to keep the strace easier to follow.)
I don't see how git-annex could be stripping even invalid unicode here. When it runs the compute program it uses process withCreatePipe. That is documented to use the default encoding. git-annex sets the default encoding in useFileSystemEncoding.
With that said, git-annex is here using hGetLineUntilExitOrEOF, and if hGetChar ever failed with an encoding error, it does look like that would skip over the problem and return the rest of the string.
It would not hurt to throw in a fileEncoding on the compute process's handles, but I'd really want to be able to reproduce this first.
I have also tried with filenames that are not valid unicode at all, and they pass through ok. Eg:
Thank you for taking the time to look into this!
I am indeed using Homebrew on Linux. I'm on Bluefin, which uses Fedora Silverblue as a base. Software there is generally installed either as Flatpak or via Homebrew because the root image is immutable.
I see the behavior on both my desktop and laptop, both running a recent Bluefin version (bluefin-dx:latest, based on Fedora Silverblue 42), but it just occurred to me that I could try it in a virtual machine, too.
When using a slightly older release of Bluefin I had on that VM, everything worked fine, but when I updated to the latest version, the addcomputed command started failing. Interestingly, it works fine with files that were created before the update (including ones with unicode filenames), but when I create a new file with unicode characters after updating to the latest image, addcomputed fails on it, which seems to indicate this is likely not a git-annex problem after all.
After a bit of research, I found this Linux problem that broke unicode handling in filenames, but I'm by no means certain that it is the cause of the problem, and if it is, there might be nothing you can do in git-annex to fix it.
Unless you want to pursue this further, I'm fine with just closing the bug as not applicable.
Still, I've added the requested strace log below -- I couldn't see a meaningful difference between the logs that worked, and the ones that failed, other than the failure itself and the missing unicode character escapes.
Grepping the failure strace log for "filename" yields the following:
Also with the /tmp/passthrough.log commented out.
I haven't used strace before, but if I'm reading this right, it looks like the characters get lost as or after git-annex receives them, but before the passthrough script is called. There is a ton of output between the git-annex execve (14498) and the one for the passthrough script (14507), mostly seems to be loading libraries and examining the .git directory. It also loads the git.mo translation files and system locale settings in-between, but there is no obvious point of failure.
Interestingly I get the same behavior for the invalid byte sequence example as for the unicode characters:
They are simply stripped.
Nice work investigating this. I would not have guessed a kernel bug might be involved. But I am not convinced one is, either.
I agree with your analysis of your strace. The filename is getting into git-annex ok. Then it runs the compute program with the mangled filename.
I don't see how a kernel bug would cause git-annex to mangle the filename though. As far as git-annex addcomputed is concerned, the filename is just a parameter to use as input to the computation. Such parameters are not limited to filenames actually. And so they pass through git-annex addcomputed without being exposed to any kernel syscall that might do something wrong on a buggy kernel.
Unless, that is, the Haskell process library, or indeed the kernel itself, does something with parameters passed to the compute program.
(This strace does rule out my theories around hGetLineUntilExitOrEOF.)
What are the versions of git-annex in the VM where it worked vs where it didn't?
And, if you can possibly download and unpack the linuxstandalone tarball, and use that to run git-annex in the bad VM, that would be a useful check that the problem does not somehow involve the homebrew build. https://git-annex.branchable.com/install/Linux_standalone/
The version on the VM is the same one I reported in the initial post: 10.20250520, installed via Homebrew. git-annex wasn't originally installed on that VM, so I installed it at that version to test it.
When everything worked at first, I updated the VM to the Bluefin version I was running on my laptop, thinking that might be the problem, and then had the strange results I reported above.
Since the git-annex installation itself had not changed between when things worked and when they stopped, I started to suspect something like the kernel bug I mentioned (because the kernel had changed).
I'm now also having trouble reproducing the problem in the VM at all. The files that were failing before are now added without problems again, as are newly created files -- though I had to shut down and later restart the VM in between. I wish I had thought of making a full snapshot when I started experimenting, but I didn't.
The only machine that exhibits the problem consistently now is my laptop.
With the standalone tarball (10.20250521-g1a9e6bf26b56c39429d4a096bf733e57e5684e1b, using the ./runshell) addcomputed works as expected on my laptop -- Unicode characters are shown with the backslash escape, whereas the Homebrew build alone fails by stripping the unicode characters.
Hm.
Running the git-annex executable from /home/linuxbrew/.linuxbrew/bin/ inside of the runshell works as well -- it doesn't strip the characters. That might mean that it is not the Homebrew build that is broken, but that something about my environment is simply screwed up.
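Since the same executable behaves differently inside and outside of the runshell, comparing the locale-related environment of the two shells might help narrow this down. The sketch below is only a starting point: `locale_env` is a hypothetical helper, and the variable list is a guess at what could differ (glibc consults LOCPATH for locale data and GCONV_PATH for its iconv modules).

```shell
#!/bin/sh
# Hypothetical helper: print the locale-related environment variables of
# the current shell, sorted so that two dumps can be compared with diff.
locale_env() {
    env | grep -E '^(LANG|LANGUAGE|LC_[A-Z_]+|LOCPATH|GCONV_PATH)=' | sort
}

# Possible use (the /tmp paths are placeholders):
#   in the normal shell:      locale_env > /tmp/env-plain.txt
#   inside of the runshell:   locale_env > /tmp/env-runshell.txt
#   then:                     diff /tmp/env-plain.txt /tmp/env-runshell.txt
```

Any variable that shows up in only one of the two dumps would be a candidate to set by hand in the normal shell, to see whether the Homebrew build then stops stripping the characters.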