Recent comments posted to this site:

comment 4

Unfortunately, that design doesn't optimize the preferred content expression that you wanted to use:

include=docs/* or (include=*.md and exclude=*/*)

In this case, the exclude limits the include to md files in the top directory, not subdirectories, but with the current design it will recurse and find all files to handle the include=*.md.

To optimise that, it needs to look at when includes are ANDed with excludes. With "exclude=*/*", only files in the root directory can match, and those are always listed. So, that include can be filtered out before step #3 above.

The other cases of excludes that can be ANDed with an include are:

  • exclude=bar/* -- This needs to do a full listing, same reasons I discussed in comment 2.
  • exclude=*/foo.* -- Also needs a full listing.
  • exclude=foo -- Also needs a full listing.
  • exclude=foo.* -- Also needs a full listing.
  • exclude=*[/]* -- Same as "exclude=*/*"
  • exclude=*[//]* -- Same (and so on for other numbers of slashes).
  • exclude=*/** -- Same (and so on for more asterisks in the front or back)
  • exclude=*[/]** -- Same (and so on for more slashes and asterisks in the front or back)
  • exclude=* -- Pointless to AND with an include since the combination can never match. May as well optimise it anyway by avoiding a full listing.
  • exclude=** -- Same as above (and so on)
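
The cases in this list can be sketched as a small classifier. This is only an illustration in Python, not git-annex code: the function name and the three labels are made up here, and it assumes that a run of asterisks behaves like a single one and that a bracket expression holding only slashes behaves like a plain "/".

```python
import re


def classify_exclude(glob: str) -> str:
    """Classify an exclude glob that is ANDed with an include.

    Hypothetical labels:
      "top-level-only" - the exclude rejects every file in a subdirectory
      "never-matches"  - the exclude rejects every file
      "full-listing"   - anything else; a full listing is still needed
    """
    # Assumption: [/], [//], ... match exactly one "/".
    simplified = re.sub(r"\[/+\]", "/", glob)
    # Assumption: "**", "***", ... match the same paths as "*".
    simplified = re.sub(r"\*+", "*", simplified)
    if simplified == "*":
        return "never-matches"
    if simplified == "*/*":
        return "top-level-only"
    return "full-listing"


for g in ["*/*", "*[//]*", "*/**", "*", "bar/*", "foo.*"]:
    print(g, "->", classify_exclude(g))
```

Only the "top-level-only" and "never-matches" results would let the include be dropped before any subdirectory is listed.
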
Comment by joey
comment 3

There are some complications in possible preferred content expressions:

"include=foo*/*" -- we want "foo/*" but also "foooooom/*"... but what if there are 100 such subdirectories? It would be an unexpected cost to need to make so many requests. Like exclude=, the optimisation should not be used in this case.

"include=foo/bar" -- we want only this file, so we would prefer to avoid recursing through the rest of foo. If there are multiple ones like this that are all in the same subdirectory, it might be nice to make one single request to find them all. But this seems like an edge case, and one request per include is probably acceptable.

Here's a design:

  1. Get preferred content expression of the remote.
  2. Filter for "include=" that contain a "/" in the value. If none are found, do the usual full listing of the bucket.
  3. If any of those includes contain a glob before a "/", do the usual full listing of the bucket. (This handles the "include=foo*/*" case)
  4. Otherwise, list the top level of the bucket with delimiter set to "/".
  5. Include all the top-level files in the list.
  6. Filter the includes to ones that start with a subdirectory in the CommonPrefixes.
  7. For each remaining include, make a request to list the bucket, with the prefix set to the non-glob directory from the include. For example, for "include=foo/bar/*", set prefix to "foo/bar/", but for "include=foo/*bar", set prefix to "foo/". And for "include=foo/bar", set prefix to "foo/".
  8. Add back the prefixes to each file in the responses.
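
The prefix rules in the list above (full listing when no include has a directory, full listing when a glob appears before a "/", otherwise the non-glob directory as the prefix) can be sketched like this. It is a Python illustration only, not the git-annex implementation; plan_prefixes and GLOB_CHARS are hypothetical names, and the top-level listing and CommonPrefixes filtering are left out because they need the S3 response.

```python
from typing import List, Optional

# Assumption: these characters are what make a path component a glob.
GLOB_CHARS = set("*?[")


def plan_prefixes(includes: List[str]) -> Optional[List[str]]:
    """Return one listing prefix per include, or None for a full listing."""
    with_dir = [inc for inc in includes if "/" in inc]
    if not with_dir:
        return None  # no include names a directory: usual full listing
    prefixes = []
    for inc in with_dir:
        # Everything before the last "/" is the directory part.
        directory = inc.rpartition("/")[0]
        if any(c in GLOB_CHARS for c in directory):
            return None  # glob before a "/": usual full listing
        prefixes.append(directory + "/")
    return prefixes


print(plan_prefixes(["foo/bar/*", "foo/*bar", "foo/bar"]))
print(plan_prefixes(["foo*/*"]))
```

The sketch keeps one prefix per include, matching the "one request per include" trade-off discussed above, so repeated prefixes are not merged.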

Note that step #1 hides some complexity, because currently preferred content is loaded and parsed to a MatchFiles, which does not allow introspecting to get the expression. Since we only care about include expressions, it would suffice to add to MatchFiles a field matchInclude :: Maybe String that gets set for includes.

Comment by joey
comment 2

I do think it would be possible to avoid the overhead of listing the contents of subdirectories that are not preferred content. At least sometimes.

When a bucket is listed with a "/" delimiter, S3 does not recurse into subdirectories. E.g., if the bucket contains "foo", "bar/...", and "baz/...", the response will list only the file "foo", and CommonPrefixes contains "bar" and "baz".
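
To make that behaviour concrete, here is a small in-memory imitation of a delimiter listing, written in Python. It makes no S3 calls; list_with_delimiter is a hypothetical helper and the key names are invented. One detail that differs from the shorthand above: S3 reports each common prefix with its trailing delimiter, so the sketch returns "bar/" and "baz/".

```python
from typing import Iterable, List, Tuple


def list_with_delimiter(
    keys: Iterable[str], prefix: str = "", delimiter: str = "/"
) -> Tuple[List[str], List[str]]:
    """Imitate a bucket listing with a prefix and a delimiter.

    Returns (contents, common_prefixes) for the given in-memory keys.
    """
    contents: List[str] = []
    common_prefixes: List[str] = []
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _tail = rest.partition(delimiter)
        if sep:
            # The key lies below a "subdirectory": report only the rolled-up prefix.
            rolled_up = prefix + head + delimiter
            if rolled_up not in common_prefixes:
                common_prefixes.append(rolled_up)
        else:
            contents.append(key)
    return contents, common_prefixes


bucket = ["foo", "bar/a", "bar/b/c", "baz/d"]
print(list_with_delimiter(bucket))
print(list_with_delimiter(bucket, prefix="bar/"))
```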

So, git-annex could make that request, and then if "include=bar/*" is not in preferred content, but "include=foo/*" is, it could make a request to list files prefixed by "foo/". And so avoid listing all the files in "bar".

If preferred content contained "include=foo/x/*" and "include=foo/y/*", when CommonPrefixes includes "foo", git-annex could follow up with 2 requests to list those subdirectories.

So this ends up making at most 1 additional request per subdirectory included in preferred content.

When preferred content excludes a subdirectory though, more requests would be needed. For "exclude=bar/*", if the response lists 100 other subdirectories in CommonPrefixes, it would need to make 100 separate requests to list those while avoiding listing bar. That could easily be more expensive than the current behavior. So it does not seem to make sense to try to optimise handling of excludes.

Comment by joey
comment 1

This preferred content expression will match only md files in the top, and files in the docs subdirectory:

include=docs/* or (include=*.md and exclude=*/*)

I got this wrong at first; this version will work! The "include=*.md" matches files with that extension anywhere in the tree, so the "exclude=*/*" is needed to limit to ones not in a subdirectory.
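
The effect of the expression can be checked with a quick Python sketch. It uses fnmatch as a stand-in for git-annex's glob matching, on the assumption that both let "*" match across "/" for these patterns; wanted is a hypothetical name and the file names are invented.

```python
from fnmatch import fnmatchcase


def wanted(path: str) -> bool:
    """Evaluate: include=docs/* or (include=*.md and exclude=*/*)"""
    in_docs = fnmatchcase(path, "docs/*")
    is_md = fnmatchcase(path, "*.md")      # matches anywhere in the tree
    in_subdir = fnmatchcase(path, "*/*")   # any path containing a "/"
    return in_docs or (is_md and not in_subdir)


for p in ["README.md", "docs/guide.txt", "src/notes.md", "main.c"]:
    print(p, wanted(p))
```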

Only preferred content is downloaded, but S3 is still queried for the entire list of files in the bucket.

Comment by joey
How do I get GETGITREMOTENAME to work in INITREMOTE?

I am writing an external special remote using this protocol. It is a little similar to the directory remote, in that there's a path on the local system where content is stored.

I don't want this location to be saved in the git-annex branch, and I thought I would be able to use GETGITREMOTENAME to persist it myself. However, I'm running into an issue where GETGITREMOTENAME fails during INITREMOTE (presumably since the remote has not yet been created). It does work during PREPARE, but that feels a bit late to ask for a required piece of configuration.

What are my options? Ideally it would behave much like the directory= field of the directory remote, but I can hand-manage it too if that's the recommendation, as long as I get some identifier for this remote (there can be multiple of these in the same repo).

Comment by Katie
comment 7

The www-authenticate header is also sent when the request for /config is a 401. So git-annex can use that to set the wwwauth field.

The capability fields indicate capabilities of git. I checked and git-credential-oauth does not rely on those capabilities.

(Wildly, git-credential-oauth is looking for "GitLab", "GitHub", and "Gitea" in order to sniff what backend it's authenticating to, and that's all it uses the wwwauth for.)

Comment by joey
comment 6

Forgejo-aneksajo also creates the repository for requests to /config, and will git-annex-init it if the request comes from a git-annex user agent and the user has write permissions.

Hmm, then git-annex pull will create a repository. Which is going further than "push to create".

I do think my idea in comment #2 would be better than how you implemented that. But it's also not directly relevant to this bug report.

I did open "support push to create".

Comment by joey
comment 5

git push seems to first make a GET request for something like /m.risse/test-push-oauth2.git/info/refs?service=git-receive-pack, which responds with a 401 and www-authenticate: Basic realm="Gitea" among the headers. Git then seems to pass this information on to the git-credential-helper.

git annex push likewise receives a 401 response from the /config endpoint with the same www-authenticate header, so it could pass it on to the credential helper too.
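
As a rough Python sketch of what passing it on could look like: take the headers of the 401 response and add each www-authenticate value as a wwwauth[] line to the credential helper input. This is not git-annex code; credential_request is a hypothetical helper and the URL in the example is a placeholder.

```python
from typing import Iterable, Tuple
from urllib.parse import urlsplit


def credential_request(url: str, response_headers: Iterable[Tuple[str, str]]) -> str:
    """Build credential helper input from a URL and the headers of a 401 response."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.port:
        host += f":{parts.port}"
    lines = [f"protocol={parts.scheme}", f"host={host}"]
    for name, value in response_headers:
        # Header names are case-insensitive; keep every www-authenticate value.
        if name.lower() == "www-authenticate":
            lines.append(f"wwwauth[]={value}")
    # A blank line ends the input.
    return "\n".join(lines) + "\n\n"


print(credential_request(
    "https://example.com/config",  # placeholder URL
    [("www-authenticate", 'Basic realm="Gitea"')],
), end="")
```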

I am not sure where the capabilities are coming from...

Comment by matrss
comment 4

The chicken-and-egg problem you are describing is actually something msz has already encountered and reported, but that issue is fixed: Forgejo-aneksajo also creates the repository for requests to /config, and will git-annex-init it if the request comes from a git-annex user agent and the user has write permissions. More about that here:

So that's not it... I've investigated a bit and I think I led you astray with the comment about a "non-existing repository". I am also seeing the issue with a pre-created repository, and even with a pre-created and git-annex-init'ialized repository.

The issue is actually that for ATRIS I rely on git-credential-oauth's "Gitea-like-Server" discovery here: https://github.com/hickford/git-credential-oauth/blob/f01271d94c70b9280c19f489f90c05e9aba0d757/main.go#L206

When doing a git push origin main the git-credential-oauth helper actually receives this request:

$ git push origin main
capability[]=authtype
capability[]=state
protocol=https
host=atris.fz-juelich.de
wwwauth[]=Basic realm="Gitea"

while with git annex push it is just this:

$ git annex push
protocol=https
host=atris.fz-juelich.de

Git-credential-oauth recognizes that it is talking to a Gitea/Forgejo server based on this wwwauth[]=Basic realm="Gitea" data. Without it, and in the absence of a more specific configuration for the server, it doesn't try to handle the request and falls back to the standard http credential handling of git. I am not sure where these capability and wwwauth fields are coming from, but I think git-annex should somehow do the same as git here...
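
The difference between the two requests can be illustrated with a short Python sketch that parses the key=value lines a helper receives and checks the wwwauth values. It is not how git-credential-oauth is implemented (that is Go code, linked above); both function names are hypothetical, and the plain substring test for "Gitea" is an assumption about the discovery logic.

```python
from typing import Dict, List


def parse_credential_input(text: str) -> Dict[str, List[str]]:
    """Parse credential helper input into a dict of value lists.

    Keys ending in "[]" may repeat; the "[]" suffix is dropped from the key.
    """
    fields: Dict[str, List[str]] = {}
    for line in text.splitlines():
        if not line:
            break  # a blank line ends the input
        key, sep, value = line.partition("=")
        if not sep:
            continue  # skip lines that are not key=value
        if key.endswith("[]"):
            key = key[:-2]
        fields.setdefault(key, []).append(value)
    return fields


def looks_like_gitea(fields: Dict[str, List[str]]) -> bool:
    # Assumption: a substring check on the wwwauth values.
    return any("Gitea" in value for value in fields.get("wwwauth", []))


from_git_push = (
    "capability[]=authtype\ncapability[]=state\nprotocol=https\n"
    'host=atris.fz-juelich.de\nwwwauth[]=Basic realm="Gitea"\n'
)
from_git_annex_push = "protocol=https\nhost=atris.fz-juelich.de\n"

print(looks_like_gitea(parse_credential_input(from_git_push)))
print(looks_like_gitea(parse_credential_input(from_git_annex_push)))
```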


I've gotten at the data git sends to the credential helper with this trivial script:

$ cat ~/bin/git-credential-echo 
#!/usr/bin/env bash

exec cat >&2

and configuring it as my credential helper.

I have to say, I like this pattern of processes communicating over simple line-based protocols :)

Comment by matrss