I wanted to manage and synchronize descriptive files (README.md etc.) at the top of a large S3 bucket via git-annex, so that I could keep the files in a git repo and rely on the importtree/exporttree functionality to keep the bucket and repo in sync.
Looking at special_remotes/S3/ I didn't spot any option to achieve that.
I am not sure what the best option for this would be, given that greedy me might eventually also want to sync some docs/ prefix there. Maybe there could be a whitelist of keys/paths to include and/or exclude? Or maybe some preferred content include expression could be specific enough to not demand a full bucket traversal (unrealistic in any feasible time) but rather limit it to the top level, e.g. include=^docs/ and include=^*.md, or something smarter?
This preferred content expression will match only md files in the top, and files in the docs subdirectory:
include=docs/* or (include=*.md and exclude=*/*)
I got this wrong at first; this version will work! The "include=*.md" matches files with that extension anywhere in the tree, so the "exclude=*/*" is needed to limit to ones not in a subdirectory.
Only preferred content is downloaded, but S3 is still queried for the entire list of files in the bucket.
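To make the behaviour of that expression concrete, here is a minimal sketch in Python. It assumes the globs can be approximated with `fnmatch.fnmatchcase`, where `*` also matches `/`. The `wanted` helper is hypothetical and only mirrors the boolean structure of the expression; it is not how git-annex evaluates preferred content.

```python
from fnmatch import fnmatchcase

def wanted(path: str) -> bool:
    """Approximate: include=docs/* or (include=*.md and exclude=*/*)."""
    in_docs = fnmatchcase(path, "docs/*")
    is_md = fnmatchcase(path, "*.md")
    in_subdir = fnmatchcase(path, "*/*")  # exclude=*/* rejects these paths
    return in_docs or (is_md and not in_subdir)

for p in ["README.md", "docs/guide.txt", "src/notes.md", "data/file.bin"]:
    print(p, wanted(p))
```

With these sample paths, README.md and docs/guide.txt are wanted, while src/notes.md and data/file.bin are not.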
I do think it would be possible to avoid the overhead of listing the contents of subdirectories that are not preferred content. At least sometimes.
When a bucket is listed with a "/" delimiter, S3 does not recurse into subdirectories. E.g., if the bucket contains "foo", "bar/...", and "baz/...", the response will list only the file "foo", and CommonPrefixes will contain "bar/" and "baz/" (S3 reports each prefix with its trailing delimiter).
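The delimiter behaviour can be imitated locally without talking to S3. The sketch below is an assumption-laden model, not S3 client code: `list_level` is a hypothetical function that takes a plain list of keys and returns the files at one level plus the rolled-up prefixes, which is roughly what a ListObjects-style request with the Prefix and Delimiter parameters reports as Contents and CommonPrefixes.

```python
def list_level(keys, prefix="", delimiter="/"):
    """Mimic one delimiter-based listing: return (files, common_prefixes)."""
    files, prefixes = [], []
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Everything up to and including the first delimiter is rolled up
            # into a single common prefix instead of being listed.
            rolled = prefix + rest.split(delimiter, 1)[0] + delimiter
            if rolled not in prefixes:
                prefixes.append(rolled)
        else:
            files.append(key)
    return files, prefixes

bucket = ["foo", "bar/1", "bar/sub/2", "baz/3"]
print(list_level(bucket))                 # (['foo'], ['bar/', 'baz/'])
print(list_level(bucket, prefix="bar/"))  # (['bar/1'], ['bar/sub/'])
```

The second call shows that a follow-up request with a prefix again stops at the next delimiter, so each request only ever sees one directory level.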
So, git-annex could make that request, and then if "include=bar/*" is not in preferred content, but "include=foo/*" is, it could make a request to list files prefixed by "foo/". And so avoid listing all the files in "bar".
If preferred content contained "include=foo/x/*" and "include=foo/y/*", when CommonPrefixes includes "foo", git-annex could follow up with 2 requests to list those subdirectories.
So this ends up making at most 1 additional request per subdirectory included in preferred content.
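A rough sketch of how the follow-up requests could be chosen, in Python. Everything here is hypothetical: `plan_requests` is an illustrative name, and the includes are assumed to be already reduced to directory prefixes (e.g. "foo/x/" for "include=foo/x/*"). Skipping includes whose top-level directory is missing from CommonPrefixes is my reading of the idea above, not something git-annex does today.

```python
def plan_requests(common_prefixes, include_dirs):
    """Return the prefixes to list after the initial top-level request."""
    present = set(common_prefixes)
    plan = []
    for d in include_dirs:
        top = d.split("/", 1)[0] + "/"
        # Only follow up when the top-level directory exists in the bucket,
        # and never request the same prefix twice.
        if top in present and d not in plan:
            plan.append(d)
    return plan

prefixes = ["bar/", "foo/"]              # CommonPrefixes from the first request
includes = ["foo/x/", "foo/y/", "qux/"]  # directory parts of the includes
print(plan_requests(prefixes, includes))  # ['foo/x/', 'foo/y/']
```

Here "bar/" is never listed at all, and the total cost is the top-level request plus one request per included subdirectory that is present.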
When preferred content excludes a subdirectory though, more requests would be needed. For "exclude=bar/*", if the response lists 100 other subdirectories in CommonPrefixes, it would need to make 100 separate requests to list those while avoiding listing bar. That could easily be more expensive than the current behavior. So it does not seem to make sense to try to optimise handling of excludes.
There are some complications in possible preferred content expressions:
- "include=foo*/*" -- we want "foo/*" but also "foooooom/*"... but what if there are 100 such subdirectories? It would be an unexpected cost to need to make so many requests. Like exclude=, the optimisation should not be used in this case.
- "include=foo/bar" -- we want only this file.. so would prefer to avoid recursing through the rest of foo. If there are multiple ones like this that are all in the same subdirectory, it might be nice to make one single request to find them all. But this seems like an edge case, and one request per include is probably acceptable.
Here's a design:
"include=foo*/*" case)
"include=foo/bar/*", set prefix to "foo/bar/", but for "include=foo/*bar", set prefix to "foo/". And for "include=foo/bar", set prefix to "foo/".
Note that step #1 hides some complexity, because currently preferred content is loaded and parsed to a MatchFiles, which does not allow introspecting to get the expression. Since we only care about include expressions, it would suffice to add to MatchFiles a
matchInclude :: Maybe String
which gets set for includes.
Unfortunately, that design doesn't optimize the preferred content expression that you were wanting to use:
include=docs/* or (include=*.md and exclude=*/*)
In this case, the exclude limits the include to md files in the top directory, not subdirectories, but with the current design it will recurse and find all files to handle the "include=*.md".
To optimise that, it needs to look at when includes are ANDed with excludes. With "exclude=*/*", only files in the root directory can match, and those are always listed. So, that include can be filtered out before step #3 above.
The other cases of excludes that can be ANDed with an include are:
- exclude=bar/* -- This needs to do a full listing, same reasons I discussed in comment 2.
- exclude=*/foo.* -- Also needs a full listing.
- exclude=foo -- Also needs a full listing.
- exclude=foo.* -- Also needs a full listing.
- exclude=*[/]* -- Same as "exclude=*/*"
- exclude=*[//]* -- Same (and so on for other numbers of slashes).
- exclude=*/** -- Same (and so on for more asterisks in the front or back)
- exclude=*[/]** -- Same (and so on for more slashes and asterisks in the front or back)
- exclude=* -- Pointless to AND with an include since the combination can never match. May as well optimise it anyway by avoiding a full listing.
- exclude=** -- Same as above (and so on)
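Two pieces of this can be sketched in a few lines of Python: deriving the listing prefix from an include glob, and sorting the ANDed excludes into the cases listed above. Both function names are hypothetical, and the sketch assumes that `*`, `?` and `[` are the only glob metacharacters. It is a rough model of the idea, not git-annex code.

```python
import re

def listing_prefix(include_glob):
    """Longest literal directory prefix of a glob; "" means no usable prefix."""
    positions = [include_glob.find(c) for c in "*?[" if c in include_glob]
    literal = include_glob[:min(positions)] if positions else include_glob
    # Keep everything up to and including the last "/" of the literal part.
    return literal[:literal.rfind("/") + 1]

def classify_exclude(exclude_glob):
    """Classify an exclude that is ANDed with an include."""
    g = re.sub(r"\[/+\]", "/", exclude_glob)  # [/], [//], ... act like "/"
    g = re.sub(r"\*+", "*", g)                # runs of "*" act like one "*"
    if g == "*":
        return "never-matches"  # the ANDed include can never match
    if g == "*/*":
        return "root-only"      # only top-level files can match
    return "full-listing"

print(listing_prefix("foo/bar/*"))  # foo/bar/
print(listing_prefix("foo/*bar"))   # foo/
print(listing_prefix("foo/bar"))    # foo/
print(classify_exclude("*[//]**"))  # root-only
print(classify_exclude("bar/*"))    # full-listing
```

An empty prefix, as for "include=foo*/*" or "include=*.md", means there is no directory to narrow the request to, so that include would fall back to a full listing unless its exclude is classified as root-only or never-matches.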