Recent comments posted to this site:
The assistant has some very tricky, and probably also fragile code that gathers related inotify events. That would need to be factored out for this.
Looked in more detail into fixing this by moving the ignore check to after
a set of files has been gathered and fed through git ls-files. Unfortunately
that will be complicated significantly by the fact that, after the ignore check
it currently does things like re-writing symlinks to annex objects when the
link target needs updating. There is a chicken and egg problem here,
because the type of Change that gets queued depends on parts of that same
code having run.
BTW: Another way this same bug can manifest is that an annex object is added to a submodule, and the assistant updates its symlink to point out of the submodule, to the wrong annex objects directory.
There is some very delicate timing going on in
Assistant.Threads.Committer in order to gather Changes that happen close
together in time. Which makes me think that even a simple approach of
running git ls-files once per changed file, before the ignore check,
might throw the timing off enough to be a problem. As well as being murder
on the CPU when eg, a lot of files have been moved around.
Note that replace assistant with assist would fix this bug,
since git-annex assist does use git ls-files. Not that implementing
that would be any easier than just fixing this bug. But, fixing this bug
moves the assistant in the direction of that todo one way or the other.
Congrats I guess, that's the first LLM-generated patch to git-annex, and it seems approximately correct.
It was unambiguously helpful to get the hint that Remote/Git.hs:485
was the location of the bug. That probably saved 10 minutes of my time.
But, I probably would have found it easier to fix this on my own without seeing that patch than it was to fix it given that patch. I had to do a considerable amount of thinking about whether the patch was correct, or just confidently sounding incorrect in a different manner than a human-generated patch would be. (Not helped, certainly, by this being an area of the code with no type system guardrails helping it be correct.)
For one thing, I wondered, why does it use isUnescapedInURIComponent rather than isUnescapedInURI? The latter handles '/' correctly without needing a special case.
Being faced with an LLM-generated patch also meant that I needed to consider what its license is. I was faced with needing to clean-room my own version, which is a bit difficult given how short the patch is (while probably still long enough to be copyrightable).
But, it turns out that git-annex already contains essentially the same code in Remote/S3.hs, in genericPublicUrl:
baseurl Posix.</> escapeURIString skipescape p
where
-- Don't need to escape '/' because the bucket object
-- is not necessarily a single url component.
-- But do want to escape eg '+' and ' '
skipescape '/' = True
skipescape c = isUnescapedInURIComponent c
This code was presumably in the LLM's training set, and certainly appeared to be available to it for context, so its mirroring of this could simply be a case of Garbage In, Garbage Out.
Note that "skipescape" is a much better name than the LLM-generated "escchar" which behaves backwards from what its name suggests.
Why did I use isUnescapedInURIComponent in that and isUnescapedInURI in Remote/WebDav/DavLocation.hs? I doubt there was a good reason for either choice, but a full analysis did find a reason to prefer the isUnescapedInURIComponent approach, to handle a path containing '[' or '].
So, in 8fd9b67ed82ca0f39796a8d59431d42a7eb84957, I've factored out a general purpose function, and fixed this bug by using it.
SHA256E keys (like SHA256E-s107998--4545...) contain only alphanumeric characters
The keyFile encoding produces no % or & characters
Incorrect statements FWIW.
Certian SHA*E keys will also be affected by this bug.
TL;DR, the patch
diff --git a/Remote/Git.hs b/Remote/Git.hs
index 6b7dc77d98..4faaea082d 100644
--- a/Remote/Git.hs
+++ b/Remote/Git.hs
@@ -482,7 +482,12 @@ inAnnex' repo rmt st@(State connpool duc _ _ _ _) key
keyUrls :: GitConfig -> Git.Repo -> Remote -> Key -> [String]
keyUrls gc repo r key = map tourl locs'
where
- tourl l = Git.repoLocation repo ++ "/" ++ l
+ tourl l = Git.repoLocation repo ++ "/" ++ escapeURIString escchar l
+ -- Escape characters that are not allowed unescaped in a URI
+ -- path component, but don't escape '/' since the location
+ -- is a path with multiple components.
+ escchar '/' = True
+ escchar c = isUnescapedInURIComponent c
-- If the remote is known to not be bare, try the hash locations
-- used for non-bare repos first, as an optimisation.
locs
seems to work well. Built in https://github.com/datalad/git-annex/pull/251 (CI tests still run), tested locally:
❯ /usr/bin/git-annex version
git-annex version: 10.20260115+git119-g43a3f3aaf2-1~ndall+1
build flags: Assistant Webapp Inotify DBus DesktopNotify TorrentParser MagicMime Benchmark Feeds Testsuite S3 WebDAV Servant
dependency versions: aws-0.24.1 bloomfilter-2.0.1.2 crypton-0.34 DAV-1.3.4 feed-1.3.2.1 ghc-9.6.6 http-client-0.7.17 torrent-10000.1.3 uuid-1.3.15 yesod-1.6.2.1
...
❯ /usr/bin/git-annex get --from origin video.mkv
get video.mkv (from origin...) ok
(recording state in git...)
to work. Here is claude's analysis which lead it to the fix:
Bug Analysis: fails_to_get_from_apache2_server_URL_backend_file
Root Cause
The bug is in Remote/Git.hs:485 — the keyUrls function constructs URLs by simple string concatenation without URL-encoding the path components:
tourl l = Git.repoLocation repo ++ "/" ++ l
How the failure occurs
1. Key: URL--yt:https://www.youtube.com/watch,63v,613ZXfZfnRfyM
2. keyFile encoding (Annex/Locations.hs:783-795) converts : → &c and / → %:
URL--yt&chttps&c%%www.youtube.com%watch,63v,613ZXfZfnRfyM
3. keyUrls concatenates this directly into the URL path:
https://datasets.datalad.org/.../.git//annex/objects/zZ/3v/URL--yt&chttps&c%%www.youtube.com%watch,63v,613ZXfZfnRfyM/...
4. parseURIRelaxed (Utility/Url/Parse.hs:45-47) tries to parse this URL. It calls escapeURIString isAllowedInURI first, but % is allowed in URIs (it's the percent-encoding introducer), so it passes through unescaped.
5. parseURI then sees %%w and %wa which are invalid percent-encoding sequences (% must be followed by two hex digits). The parse fails, returning Nothing.
6. download' (Utility/Url.hs:389-391) hits the Nothing branch and returns "invalid url".
Why SHA256E keys work
SHA256E keys (like SHA256E-s107998--4545...) contain only alphanumeric characters, -, and .. The keyFile encoding produces no % or & characters, so the concatenated URL is always valid.
The fix
keyUrls in Remote/Git.hs:485 needs to URL-encode the path components. Other remotes already do this:
- S3 (Remote/S3.hs:1221-1229): uses escapeURIString with a custom predicate keeping / but encoding everything else
- WebDAV (Remote/WebDAV/DavLocation.hs:35): uses escapeURIString isUnescapedInURI
A colleague used a wrong config, which was pointing to minio console rather than the S3 endpoint. When they ran initremote, the console wrongfully replied 200-OK when PUTting the annex-uuid file, same when they then pushed the data. The minio console always redirect to a login page, and doesn't fail on PUT ( which is non-compliant ). So the dataset recorded all the data being present in that remote, while there was no trace of any buckets or objects in the S3.
steps to reproduce:
git init test_s3
cd test_s3/
git-annex init
export AWS_ACCESS_KEY_ID=john AWS_SECRET_ACCESS_KEY=doe
git annex initremote -d test_remote host="play.min.io" bucket="test_bucket" type=S3 encryption=none autoenable=true port=9443 protocol=https chunk=1GiB requeststyle=pathecho test > test_annexed_file
git-annex add test_annexed_file
git commit -m 'add annexed file'
git-annex copy --fast --to test_remote
I am showing it with --fast flag here, as this is what datalad uses by default. Without --fast, it fails with (HeaderException {headerErrorMessage = "ETag missing"}) failed which is better.
So to sum it up, the unfortunate circumstances are:
- the initremote PUT of annex-uuid is not performing check that the annex-uuid file was effectively pushed in a bucket.
- minio console replies with 200-OK for all http requests
- datalad uses
push --fastby default, which recorded files as being pushed without performing a HEAD after push. I guess that's for performance reason, but that is dangerous if a server or reverse-proxy ends-up responding 200-OK to all requests after init.
Thanks for your help!
My remote (version 1) uses a database that is multithreaded but has process-level locking. Despite using async, multiple remote processes are still being started in testremote. Right now I have it working with POSIX advisory locks and open/close the database for each operation in each thread in each process, but that's a lot of overhead. Is there a better way to do this? I could make them coordinate via IPC, or have them release the lock only when idle/when others are waiting, but it seems like it shouldn't be that complex.
I get clear failures when I use testremote. On real workloads (with -j 24) it is more confusing. There are not errors, but at some point the git-annex command hangs. Quite possibly a bug in my code, given testremote is failing.
I guess my question is: Is there a way to force git-annex to only use one special remote process, either by configuration or by having all but the first return "use the other one" (without -j 1 always)? And does the way this is handled differ between actual use and testremote?
Or to put it another way: how do you envision one should design a special remote that supports concurrency and relies on a database with process-level locking?
This is due to the assistant not supporting submodules. Nothing has ever been done to make it support them.
When git check-ignore --stdin is passed a path in a submodule, it exits.
We can see this happen near the top of the log:
fatal: Pathspec 'code/containers/.codespellrc' is in submodule 'code/containers'
git check-ignore EOF: user error
The subseqent "resource vanished (Broken pipe)" are each time git-annex tries to talk to git check-ignore.
Indeed, looking at the source code to check-ignore, if it's passed a path inside a submodule, it errors out, and so won't be listening to stdin for any more paths:
joey@darkstar:~/tmp/t>git check-ignore --stdin
r/x
fatal: Pathspec 'r/x' is in submodule 'r'
- exit 128
And I was able to reproduce this by having a submodule with a file in it, and starting the assistant.
In some cases, the assistant still added files despite check-ignore having crashed. (It will even add gitignored files when check-ignore has crashed.) In other cases not. The problem probably extends beyond check-ignore to also staging files. Eg, "git add submodule/foo bar" will error out on the file in the submodule and not ever get to the point of adding the second file.
Fixing this would need an inexpensive way to query git about whether a file
is in a submodule. Passing the files that
the assistant gathers through git ls-files --modified --others
might be the only way to do that.
Using that at all efficiently would need some other changes, because it needs to come before the ignore check, which it currently does for each file event. The ignore check would need to be moved to the point where a set of files has been gathered, so ls-files can be run once on the set of files.