importtree from S3 slows to halt even with prefix

Please describe the problem.

I have been running

git annex --debug import --from s3-dandiarchive master

from an S3 bucket which is versioned but I did not enable versioning for this "import" case (due to git-annex unable to sense versioning read-only) and expected it to "quickly" import tree (with about 7k files) from S3. Note that some of the keys have many older revisions for one reason or another.

But currently that process, started hours ago yesterday IIRC, is

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                     
3912831 dandi     20   0 1024.1g  51.7g  16000 S 100.0  82.4     19,48 git-annex

CPU heavy and very slow (now, started faster flipping through pages) on actually "importing" while listing a page every 30 seconds or so

[2024-11-12 14:59:23.587433059] (Remote.S3) Header: [("Date","Tue, 12 Nov 2024 19:59:23 GMT")]

[2024-11-12 14:59:58.073945529] (Remote.S3) Response status: Status {statusCode = 200, statusMessage = "OK"}
[2024-11-12 14:59:58.074057102] (Remote.S3) Response header 'x-amz-id-2': 'sxDUdIkuRLs3jjjTyIbFaI+cQqLCGpTXZNFcvykT2+F6OcqVRM2IMn6P1YquVrdH3fXmV9nRnTDs9EtOtctV05GptcIaBaF2'
[2024-11-12 14:59:58.07410232] (Remote.S3) Response header 'x-amz-request-id': 'Y35X1Z41GMF9PHY8'
[2024-11-12 14:59:58.074135941] (Remote.S3) Response header 'Date': 'Tue, 12 Nov 2024 19:59:24 GMT'
[2024-11-12 14:59:58.074167094] (Remote.S3) Response header 'x-amz-bucket-region': 'us-east-2'
[2024-11-12 14:59:58.074197609] (Remote.S3) Response header 'Content-Type': 'application/xml'
[2024-11-12 14:59:58.074228873] (Remote.S3) Response header 'Transfer-Encoding': 'chunked'
[2024-11-12 14:59:58.074259342] (Remote.S3) Response header 'Server': 'AmazonS3'
[2024-11-12 14:59:58.171273277] (Remote.S3) String to sign: "GET\n\n\nTue, 12 Nov 2024 19:59:58 GMT\n/dandiarchive/"
[2024-11-12 14:59:58.171355688] (Remote.S3) Host: "dandiarchive.s3.amazonaws.com"
[2024-11-12 14:59:58.17139206] (Remote.S3) Path: "/"
[2024-11-12 14:59:58.17142278] (Remote.S3) Query string: "prefix=dandisets%2F"
[2024-11-12 14:59:58.171463294] (Remote.S3) Header: [("Date","Tue, 12 Nov 2024 19:59:58 GMT")]

and not sure how many pages it got so far.

I suspect (can't tell from above) that it is using API to list all versions of keys, not just current version, even though I have not asked for versioned support.

Note: bucket is too heavy (about 300 million keys IIRC) to list all of it for all the versions. I do not have information ready on how many versions of keys in the dandisets/ prefix - could be some hundreds of thousands, but I would still expect/hope it to complete by now. Nothing seems to be done on filesystem or to git store yet (du says it is 280k total size) -- git-annex is just being fed information from S3.

What steps will reproduce the problem?

add s3 importtree special remote matching

bucket=dandiarchive datacenter=US encryption=none fileprefix=dandisets/ host=s3.amazonaws.com importtree=yes name=s3-dandiarchive port=80 publicurl=https://dandiarchive.s3.amazonaws.com/ signature=anonymous storageclass=STANDARD type=S3 timestamp=1731015643s

run annex import from it

What version of git-annex are you using? On what operating system?

invocation of static-git-annex-10.20241031 (build by kyleam https://git.kyleam.com/static-annex/ ... but I think I tried a different one before):

(dandisets-2) dandi@drogon:/mnt/backup/dandi/dandiset-manifests$ /home/dandi/git-annexes/static-git-annex-10.20241031/bin/git-annex version
git-annex version: 10.20241031
build flags: Pairing DBus DesktopNotify TorrentParser MagicMime Servant Benchmark Feeds Testsuite S3 WebDAV
dependency versions: aws-0.24.2 bloomfilter-2.0.1.2 crypton-1.0.1 DAV-1.3.4 feed-1.3.2.1 ghc-9.8.3 http-client-0.7.17 persistent-sqlite-2.13.3.0 torrent-10000.1.3 uuid-1.3.16
key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 BLAKE2B256E BLAKE2B256 BLAKE2B512E BLAKE2B512 BLAKE2B160E BLAKE2B160 BLAKE2B224E BLAKE2B224 BLAKE2B384E BLAKE2B384 BLAKE2BP512E BLAKE2BP512 BLAKE2S256E BLAKE2S256 BLAKE2S160E BLAKE2S160 BLAKE2S224E BLAKE2S224 BLAKE2SP256E BLAKE2SP256 BLAKE2SP224E BLAKE2SP224 SHA1E SHA1 MD5E MD5 WORM URL GITBUNDLE GITMANIFEST VURL X*
remote types: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar git-lfs httpalso borg rclone hook external
operating system: linux x86_64
supported repository versions: 8 9 10
upgrade supported from repository versions: 0 1 2 3 4 5 6 7 8 9 10
local repository version: 10

Calling this done although memory use improvements still seem possible.. --Joey

RSS Atom

comment 1

At the end (after over a day of torturing that poor bucket, whenever it took just few minutes for s3cmd sync to get everything including content) it crashed with

[2024-11-12 22:58:00.366878941] (Remote.S3) Response status: Status {statusCode = 200, statusMessage = "OK"}
[2024-11-12 22:58:00.373456754] (Remote.S3) Response header 'x-amz-id-2': 'DGXJztoRJRuHQrcOqs3FtnEUJomRz+53jawFoKoRbKQATcvAppqJcfcAVfR1d8cu7uepkEDvSXo='
[2024-11-12 22:58:00.384304583] (Remote.S3) Response header 'x-amz-request-id': 'W1PSPV7ZSBKJ7HTT'
[2024-11-12 22:58:00.38437407] (Remote.S3) Response header 'Date': 'Wed, 13 Nov 2024 03:50:18 GMT'
[2024-11-12 22:58:00.384436037] (Remote.S3) Response header 'x-amz-bucket-region': 'us-east-2'
[2024-11-12 22:58:00.384486611] (Remote.S3) Response header 'Content-Type': 'application/xml'
[2024-11-12 22:58:00.384533794] (Remote.S3) Response header 'Transfer-Encoding': 'chunked'
[2024-11-12 22:58:00.384581117] (Remote.S3) Response header 'Server': 'AmazonS3'

git-annex: Unable to list contents of s3-dandiarchive: Network.Socket.recvBuf: resource vanished (Connection reset by peer)
failed
[2024-11-12 22:58:00.565431711] (Utility.Process) process [3912839] done ExitSuccess
import: 1 failed

attesting that it is doing something unnecessary -- either listing full bucket (unlikely) or listing all versions of keys under the prefix (e.g. using ListObjectVersions instead of ListObjectsV2).

It would have been useful if logs included the API call involved here.

Comment by yarikoptic — Wed Nov 13 18:08:59 2024

Remove comment

comment 2

No, it does not request versions from S3 when versioning is not enabled.

This feels fairly similar to git-annex-import stalls and uses all ram available. But I don't think it's really the same, that one used versioning, and relied on preferred content to filter the wanted files.

Is the size of the whole bucket under the fileprefix, in your case, large enough that storing a list of all the files (without the versions) could logically take as much memory as you're seeing? At one point you said it was 7k files, but later hundreds of thousands, so I'm confused about how big it is.

Is this bucket supposed to be public? I am having difficulty finding an initremote command that works.

It also seems quite possible, looking at the code, that it's keeping all the responses from S3 in memory until it gets done with listing all the files, which would further increase memory use. I don't see any O(N^2) operations though.

Comment by joey — Thu Nov 14 18:23:54 2024

Remove comment

comment 3

This is the initremote for it:

git-annex initremote dandiarchive type=S3 encryption=none fileprefix=dandisets/ bucket=dandiarchive publicurl=https://dandiarchive.s3.amazonaws.com/ signature=anonymous host=s3.amazonaws.com datacenter=US importtree=yes

It started at 1 API call per second, but it slowed down as memory rapidly went up. 3 gb in a few minutes, so I think there is definitely a memory leak involved.

Comment by joey — Thu Nov 14 18:50:37 2024

Remove comment

comment 4

I suspect one way the CLI tool is faster, aside from not leaking memory, is that there is a max-key max-keys parameter that git-annex is not using. Less pagination would speed it up.

Comment by joey — Thu Nov 14 19:05:48 2024

Remove comment

comment 5

Apparently gbrNextMarker is Nothing despite the response being truncted. So git-annex is looping forever, getting the same first page each time, and storing it all in a list.

I think this is a bug in the aws library, or I'm using it wrong. It looks for a NextMarker in the response XML, but accoccording to https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjects.html

This element is returned only if you have the delimiter request parameter specified. If the response does not include the NextMarker element and it is truncated, you can use the value of the last Key element in the response as the marker parameter in the subsequent request to get the next set of object keys.

Comment by joey — Thu Nov 14 19:21:33 2024

Remove comment

comment 6

Fixed in 4b87669ae229c89eadb4ff88eba927e105c003c4. Now it runs in seconds.

Note that this bug does not seem to affect S3 remotes that have versioning enabled.

Comment by joey — Thu Nov 14 20:14:29 2024

Remove comment

comment 7

Trying the same command but with versioning=yes, I have verified that

it does not have the same loop forever behavior
it does use a lot of memory quite quickly

Going back to the unversioned command, I was able to reduce the memory use by 20% by processing each result, rather than building up a list of results and processing at the end. It will be harder to do that in the versioning case, but I expect it will improve it at least that much, and probably more, since it will be able to GC all the delete markers.

Comment by joey — Fri Nov 15 17:16:51 2024

Remove comment

comment 8

Did same memory optimisation for the versioned case, and the results are striking! Running the command until it had made 45 API requests, it was using 592788 kb of memory. Now it uses only 110968 kb.

Of that, about 78900 kb are used at startup, so it grew 29836 kb. At that point, it has gathered 23537 changes. So about 1 kb is used per change. That seems a bit more memory than really should be needed, each change takes about 75 bytes of data, eg:

"y3RixvrmLvr1oWJ7meEa4vWK6B.C.aad",3340,"dandisets/000003/draft/dandiset.jsonld",2021-09-28 02:12:39 UTC

I did try some further memory optimisation, making it avoid storing the same filename repeatedly in memory when gathering versioned changes. Which oddly didn't save any memory.

Memory profiling might let this be improved further, but needing 1 gb of memory to import a million changes to files doesn't seem too bad.

Update: Did some memory profiling, nothing stuck out as badly wrong. Lists and tuples are using as much memory as anything.

Comment by joey — Fri Nov 15 17:48:08 2024

Remove comment

Add a comment