Hi,
Is it possible to use the hashdirlower layout with the S3 special remote, instead of storing all the files in the root of the bucket?
That comment was just bad wording or possibly I conflated S3 with how other remotes that do use hash directories work. I've corrected the comment.
According to Amazon's documentation, S3 does not have a concept of directories; "foo/bar" and "foo_bar" and "foo\bar" are all just opaque strings as far as it's concerned. So I don't see any point in using hash directories with S3.
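To illustrate that point, here is a minimal sketch using boto3 (the bucket name and key below are made up): a key containing "/" is uploaded and read back without any directory ever existing or being created.

    import boto3

    s3 = boto3.client("s3")

    # To S3 this whole key is one opaque string; no "abc" or "def"
    # directory has to exist for the PUT to succeed.
    key = "abc/def/SHA256E-s1024--0123abcd"  # hypothetical key name
    s3.put_object(Bucket="my-annex-bucket", Key=key, Body=b"file contents")

    # Retrieval uses the same full string.
    obj = s3.get_object(Bucket="my-annex-bucket", Key=key)
    print(obj["Body"].read())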
(1) aws s3 ls and aws s3 cp do have the concept of directories; (2) an S3 tips page says that "latency on S3 operations depends on key names since prefix similarities become a bottleneck at more than about 100 requests per second. If you have need for high volumes of operations, it is essential to consider naming schemes with more variability at the beginning of the key names, like alphanumeric or hex hash codes in the first 6 to 8 characters, to avoid internal “hot spots” within S3 infrastructure."

Thanks Joey for your answer. However, even if directories are indeed totally virtual in S3, file paths are split on "/" characters in, for example, the S3 web console (it's the same for Wasabi). It's quite convenient!
Moreover, some S3-compatible services do create directories using "/" delimiters.
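For what it's worth, the "directory" view those tools show is derived client-side from a delimiter, not stored in S3. A small boto3 sketch (bucket name assumed): listing with Delimiter="/" makes S3 group keys by their leading "/"-separated prefix and return those groups as CommonPrefixes, which is what the web console and aws s3 ls present as folders.

    import boto3

    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket="my-annex-bucket", Delimiter="/")

    # Keys sharing a "/"-separated leading prefix are grouped here...
    for p in resp.get("CommonPrefixes", []):
        print(p["Prefix"])   # e.g. "abc/"

    # ...while keys with no "/" show up as plain objects.
    for o in resp.get("Contents", []):
        print(o["Key"])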
The stuff @Ilya found about prefix similarities causing bottlenecks in S3 infra is interesting. git-annex keys have a common prefix and the hash at the end, so it could have a performance impact.
But that info also seems out of date when it talks about a 6-8 character prefix length. And the rate limit has also been raised significantly, to 3000-5000 ops/sec per prefix. See https://stackoverflow.com/questions/52443839/s3-what-exactly-is-a-prefix-and-what-ratelimits-apply
From that, it seems S3 does actually treat '/' as a prefix delimiter. (Based on a single, not very clear email from Amazon support, and not documented anywhere else.) So a single level of hash "directories" could increase the rate limit accordingly.
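For illustration only, here is a rough sketch of what that could look like: deriving one "directory" level from a hash of the key name spreads keys across prefixes, roughly in the spirit of the hashdirlower layout. The exact 3-character MD5 prefix used here is an assumption, not how the S3 special remote actually names objects today.

    import hashlib

    def prefixed_key(annex_key: str) -> str:
        # One "directory" level taken from a hash of the key, so that keys
        # sharing the "SHA256E-..." prefix land under different S3 prefixes,
        # each with its own request rate limit.
        h = hashlib.md5(annex_key.encode()).hexdigest()
        return f"{h[:3]}/{annex_key}"

    print(prefixed_key("SHA256E-s1048576--0123abcd.bin"))  # hypothetical key
    # -> "<3 hex chars>/SHA256E-s1048576--0123abcd.bin"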
If git-annex exceeded those rate limits, it would start getting 503 responses from S3, so it wouldn't slow down but would instead fail whatever operation it was doing. I can't recall anyone complaining of 503s from S3.