CLDSRV-918: raise SDK request timeout in functional test client#6187
Conversation
Hello tcarmet,My role is to assist you with the merge of this Available options
Available commands
Status report is not available. |
Incorrect fix versionThe
Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:
Please check the |
❌ 2 Tests Failed:
View the top 2 failed test(s) by shortest run time
View the full list of 2 ❄️ flaky test(s)
To view more test analytics, go to the Test Analytics Dashboard |
|
LGTM |
Request integration branchesWaiting for integration branch creation to be requested by the user. To request integration branches, please comment on this pull request with the following command: Alternatively, the |
|
LGTM |
28c72f9 to
f908622
Compare
|
LGTM |
|
Heavy server-side operations (multi-megabyte UploadPartCopy, 1000-key batch deletes) send no response bytes until they complete and can exceed the SDK request timeout under CI load, failing unrelated tests. Raise it to 30s - ample headroom while staying under mocha's 40s global timeout so a genuine hang still surfaces as the SDK TimeoutError.
5a7ea24 to
362f2d9
Compare
Incorrect fix versionThe
Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:
Please check the |
Request integration branchesWaiting for integration branch creation to be requested by the user. To request integration branches, please comment on this pull request with the following command: Alternatively, the |
|
/create_integration_branches |
Integration data createdI have created the integration data for the additional destination branches.
The following branches will NOT be impacted:
You can set option The following options are set: create_integration_branches |
Waiting for approvalThe following approvals are needed before I can proceed with the merge:
The following options are set: create_integration_branches |
ConflictA conflict has been raised during the creation of I have not created the integration branch. Here are the steps to resolve this conflict: git fetch
git checkout -B w/9.4/bugfix/CLDSRV-918-raise-test-request-timeout origin/development/9.4
git merge origin/bugfix/CLDSRV-918-raise-test-request-timeout
# <intense conflict resolution>
git commit
git push -u origin w/9.4/bugfix/CLDSRV-918-raise-test-request-timeoutThe following options are set: create_integration_branches |
Waiting for approvalThe following approvals are needed before I can proceed with the merge:
The following options are set: create_integration_branches |
|
/approve |
Build failedThe build for commit did not succeed in branch bugfix/CLDSRV-918-raise-test-request-timeout The following options are set: approve, create_integration_branches |
Build failedThe build for commit did not succeed in branch w/9.4/bugfix/CLDSRV-918-raise-test-request-timeout The following options are set: approve, create_integration_branches |
|
I have successfully merged the changeset of this pull request
The following branches have NOT changed:
This pull request did not target the following hotfix branch(es) so they
Please check the status of the associated issue CLDSRV-918. Goodbye tcarmet. The following options are set: approve, create_integration_branches |
The AWS SDK v3 migration introduced a 5-second request timeout in the functional test client where SDK v2 previously ran with its 120-second default. Test operations that get no response bytes until the server finishes working - a 50MB server-side part copy, 1000-key batch deletes - regularly exceed 5 seconds of socket silence under CI load, producing
Connection timed out after 5000 msfailures on more than a dozen unrelated branches over the past 60 days, including one merge-queue hit.Raising the timeout to 30 seconds restores the pre-migration behavior within mocha's 40-second global cap: a genuine hang still fails the test, surfacing as the SDK's informative
TimeoutErrorinstead of a generic mocha timeout.Why this also bumps arsenal (ARSN-594, scality/Arsenal#2640):
Raising the timeout uncovered a latent server-crash bug. In
@smithy/node-http-handler, arequestTimeoutbelow 6s is registered throughrequest.setTimeout(), which Node cleans up when the request completes. At 6s and above, registration is deferred by 3s; for any request lasting longer than 3s the timer then attaches directly to the socket (socket.setTimeout()) and is never removed when the request finishes. With a keep-alive agent, the pooled socket keeps that stale timer and its handler later destroys the connection out from under whichever request is using it - so the 30s config turns every >3s test request into a delayed, random mid-stream disconnect.A client vanishing mid-stream then hit a latent bug in arsenal's
retrieveData: itsonce()guard around the iteration callback was created per data-layer callback invocation rather than per iteration, so a double callback during the teardown race ran the iteration callback twice, raisingCallback was already calledas an uncaughtException and killing the server. This crashed thefile-ft-testssuites deterministically (3 out of 3 runs, 200+ cascading ECONNREFUSED failures each). The arsenal fix makes the guard iteration-scoped; a client disconnect can no longer take the server down.Reference for the previous 120s behavior:
aws-sdk@2.1692.0shipshttpOptions: { timeout: 120000 }(lib/config.js#L626-L628), documented as "Sets the socket to timeout after timeout milliseconds of inactivity on the socket. Defaults to two minutes (120000)" (AWS.Config docs).tests/functional/aws-node-sdk/test/support/config.jsatea4659029configured no timeout at all, so every functional test ran with the 120s default.