CLDSRV-936 Lifecycle noncurrent expiration stalls when listing truncates on a bare master#6205

Open
nicolas2bert wants to merge 3 commits into
development/9.3from
bugfix/CLDSRV-936/listLifecycleNonCurrents-baremaster

@nicolas2bert

Contributor

The bug in one sentence
The lifecycle noncurrent listing returns the marker version-id-marker=null when a scan page stops on a "bare master" (an object with no internal versionId). CloudServer can't decode 'null', so the worker crashes. The bucket hasn't changed in months, so it crashes every cycle and the noncurrent versions are never expired.

How a bare master (null master) gets created

  • bucket created, not versioned yet.

  • PUT foo -> stored as a plain master key "foo", the value has NO versionId field (x-amz-version-id: "null"). <- this is the "bare master"

  • versioning ENABLED later on the bucket.

  • foo is NEVER re-PUT -> the bare master stays as-is, forever.
    So a bare master = an object written before versioning and never re-PUT afterward. If it had been overwritten, the system would have converted it into a normal null version (id 99999..., isNull2:true), which is harmless. The bug can only trigger while the bare master has never been overwritten (which is the case here).

How the noncurrent listing works in metadata (bucketd)
The bucket-processor asks CloudServer for a DelimiterNonCurrent listing, which hits bucketd.
bucketd scans the keys in order, newest->oldest per key. For each key:

  • the first version seen = the current one -> not returned, but its last-modified is kept in memory.

  • the following versions = noncurrent -> returned, with a staleDate (= the in-memory last-modified).

The staleDate is the last-modified of the version that replaced it, i.e. the moment that version became noncurrent. This staleDate is what gets compared to NoncurrentDays: a noncurrent version is only returned if staleDate < beforeDate, i.e. if it has been noncurrent for long enough.
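The per-key rule above can be sketched as follows. This is a simplified illustration, not bucketd's actual code: the function name `listNonCurrents`, the entry shape `{ key, versionId, lastModified }`, and the in-memory array are assumptions made for the example. It also assumes the in-memory last-modified is refreshed after every version, so that each noncurrent gets the date of the version that replaced it.

```javascript
// Hypothetical sketch: `entries` is sorted by key, then newest -> oldest per key.
// Dates are ISO-8601 strings, which compare correctly as plain strings.
function listNonCurrents(entries, beforeDate) {
    const results = [];
    let currentKey;
    let staleDate; // last-modified of the previously scanned (newer) version
    for (const entry of entries) {
        if (entry.key !== currentKey) {
            // First version seen for this key: the current one, never returned.
            currentKey = entry.key;
        } else if (staleDate < beforeDate) {
            // Noncurrent and stale for long enough: eligible for expiration.
            results.push({ key: entry.key, versionId: entry.versionId, staleDate });
        }
        // Keep this version's last-modified for the next (older) version.
        staleDate = entry.lastModified;
    }
    return results;
}

const sample = [
    { key: 'bar', versionId: 'v3', lastModified: '2024-03-01T00:00:00.000Z' },
    { key: 'bar', versionId: 'v2', lastModified: '2024-02-01T00:00:00.000Z' },
    { key: 'bar', versionId: 'v1', lastModified: '2024-01-01T00:00:00.000Z' },
    { key: 'foo', versionId: 'v1', lastModified: '2024-01-15T00:00:00.000Z' },
];
const expirable = listNonCurrents(sample, '2024-02-15T00:00:00.000Z');
console.log(expirable);
```

In this sample only bar/v1 is returned: it became noncurrent on 2024-02-01 (when v2 was written), which is before the cutoff, while bar/v2 became noncurrent on 2024-03-01 and is still too recent.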

Two limits bound each page:

  • max-keys (default 1000) -> number of noncurrent versions returned.

  • maxScannedLifecycleListingEntries (default 10000) -> number of entries scanned, so that a bucket full of current objects but with few noncurrents still responds quickly.

When a limit is hit -> IsTruncated=true + a continuation marker (NextKeyMarker + NextVersionIdMarker), and backbeat re-requests the next page with those markers, until IsTruncated=false.
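A minimal sketch of how the two limits bound a page, and how the caller keeps paging until the listing is complete. The names `scanPage` and `listAll`, the `isNonCurrent` flag, and the use of an array index in place of the real NextKeyMarker/NextVersionIdMarker pair are all assumptions for illustration; the real listing resumes from markers, not from an index.

```javascript
// Hypothetical sketch: one page stops when either limit is reached.
function scanPage(entries, startIndex, { maxKeys = 1000, maxScanned = 10000 } = {}) {
    const contents = [];
    let scanned = 0;
    let i = startIndex;
    while (i < entries.length && contents.length < maxKeys && scanned < maxScanned) {
        scanned += 1; // every entry counts toward the scan limit
        if (entries[i].isNonCurrent) {
            contents.push(entries[i]); // only noncurrents count toward max-keys
        }
        i += 1;
    }
    // Truncated when entries remain; nextIndex plays the role of the markers.
    return { contents, isTruncated: i < entries.length, nextIndex: i };
}

// Caller side: re-request pages until isTruncated is false.
function listAll(entries, limits) {
    const all = [];
    let page = { isTruncated: true, nextIndex: 0 };
    while (page.isTruncated) {
        page = scanPage(entries, page.nextIndex, limits);
        all.push(...page.contents);
    }
    return all;
}

// 25 entries, every 5th one noncurrent.
const demoEntries = Array.from({ length: 25 }, (_, n) => ({ id: n, isNonCurrent: n % 5 === 4 }));
const firstPage = scanPage(demoEntries, 0, { maxKeys: 2, maxScanned: 10 });
const everything = listAll(demoEntries, { maxKeys: 2, maxScanned: 10 });
console.log(firstPage.nextIndex, everything.length);
```

The scan limit is what lets a page end on an entry that is not itself a returned noncurrent, which matters for the marker computation described next.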

The NextVersionIdMarker is computed as getVersionId() || 'null'. And that's exactly where it goes wrong: on a bare master, getVersionId() is undefined -> the marker becomes the string 'null'.
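The fallback can be reproduced in isolation. In this sketch `makeEntry` and `nextMarkers` are hypothetical helpers and the version ids are placeholders; only the `getVersionId() || 'null'` expression comes from the description above.

```javascript
// Hypothetical stand-in for a metadata entry; only getVersionId() matters here.
function makeEntry(key, versionId) {
    return { key, getVersionId: () => versionId };
}

// Markers derived from the last scanned entry of a truncated page.
function nextMarkers(lastScanned) {
    return {
        NextKeyMarker: lastScanned.key,
        // A bare master has no versionId, so the fallback string is used.
        NextVersionIdMarker: lastScanned.getVersionId() || 'null',
    };
}

const regular = nextMarkers(makeEntry('bar', 'placeholder-id-982'));
const bare = nextMarkers(makeEntry('foo', undefined));
console.log(regular.NextVersionIdMarker); // placeholder-id-982
console.log(bare.NextVersionIdMarker);    // null (the string, not the value)
```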

Example of the issue:

scan#   key             kind                          marker becomes
1       bar             master (regular)              982... (real id)
2       bar \0 982...   current version               982...
3       bar \0 981...   non-current -> expirable      981...
...     ...             ~10k entries, all real ids    ...
10000   foo             BARE master (no versionId)    getVersionId() || 'null' = "null"
>>> STOP (scanned 10000): IsTruncated=true, NextKeyMarker="foo", NextVersionIdMarker="null"   <-- the poison marker
10001   baz             ...                           (next page - never reached)

The noncurrent versions to delete are on the next page, behind that marker.
Condition to trigger the issue: the 10000th scanned entry is a bare master.
If it were a normal version/master, the marker would be a valid id and nothing would break.

NOTE: Only scan-limit truncation can produce 'null'; max-keys truncation always stops on a real noncurrent version, so its marker is always a real id.

Why it's a bug
The marker round-trip is asymmetric: 'null' is guarded on the way out, but not on the way in.

  • Page 1: version-id-marker= (empty) -> returns NextVersionIdMarker:"null" (encode handles 'null' properly -> OK).

  • Page 2: version-id-marker=null -> decode("null") = Error (decode does NOT handle 'null') -> encode(Error) -> TypeError -> CloudServer worker crash.

bucketd answers page 2 without any problem (HTTP 200). The crash is 100% on the CloudServer side, while re-encoding the version-id-marker=null from the request itself.
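The shape of the failure can be reproduced with stand-in functions. The hex scheme below is purely an assumption chosen so the example runs on its own; it is not the real version-id encoding, and `encode`, `decode`, `encodeOutbound`, and `reencodeInbound` are hypothetical names. What the sketch preserves is the pattern described above: decode returns an Error object rather than throwing, and encode then calls a string method on that Error.

```javascript
// Stand-in encoder: expects a string, emits two hex digits per character.
function encode(id) {
    return id.split('').map(c => c.charCodeAt(0).toString(16).padStart(2, '0')).join('');
}

// Stand-in decoder: returns (does not throw) an Error on malformed input.
function decode(str) {
    if (!/^(?:[0-9a-f]{2})+$/.test(str)) {
        return new Error('InvalidArgument');
    }
    return str.match(/[0-9a-f]{2}/g).map(h => String.fromCharCode(parseInt(h, 16))).join('');
}

// Way out: 'null' is passed through untouched.
function encodeOutbound(id) {
    return id === 'null' ? id : encode(id);
}

// Way in (unguarded): the decode result is re-encoded without any check.
function reencodeInbound(marker) {
    return encode(decode(marker));
}

let crash;
try {
    reencodeInbound(encodeOutbound('null'));
} catch (err) {
    crash = err;
}
console.log(crash instanceof TypeError); // true
```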

Why it stays stuck
The processor gives up after its retries and moves on to something else. But the versions to expire are on page 2, which crashes every time. The bucket doesn't change -> every cycle page 1 re-emits 'null', page 2 re-crashes.

Net progress = zero, indefinitely -> "noncurrent versions never removed".

The fix: treat version-id-marker=null as "no marker" instead of blindly decoding it.
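A hedged sketch of what such a guard could look like. It is not the code from this branch: `decodeMarkerSketch` and `decodeHex` are hypothetical names, the hex decoder is a stand-in for the real one, and returning `undefined` for "no marker" (including for an empty string) is an assumption.

```javascript
// Stand-in decoder (assumed scheme): returns an Error on malformed input.
function decodeHex(str) {
    if (!/^(?:[0-9a-f]{2})+$/.test(str)) {
        return new Error('InvalidArgument');
    }
    return str.match(/[0-9a-f]{2}/g).map(h => String.fromCharCode(parseInt(h, 16))).join('');
}

// Guarded decode: 'null' (or an absent marker) means "no marker".
function decodeMarkerSketch(marker) {
    if (marker === undefined || marker === '' || marker === 'null') {
        return undefined; // nothing to decode, nothing to re-encode
    }
    // Anything else still goes through the decoder, so malformed
    // markers keep being rejected with an Error.
    return decodeHex(marker);
}

console.log(decodeMarkerSketch('null'));     // undefined
console.log(decodeMarkerSketch('76393832')); // v982
console.log(decodeMarkerSketch('@@@bad@@@') instanceof Error); // true
```

With a guard of this kind, the caller has to check for both outcomes before re-encoding: skip the marker when the result is `undefined`, and reject the request when it is an Error.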

@bert-e

bert-e commented Jun 24, 2026

Contributor

Hello nicolas2bert,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options

  • /after_pull_request: Wait for the given pull request id to be merged before continuing with the current one.
  • /bypass_author_approval: Bypass the pull request author's approval
  • /bypass_build_status: Bypass the build and test status
  • /bypass_commit_size: Bypass the check on the size of the changeset TBA
  • /bypass_incompatible_branch: Bypass the check on the source branch prefix
  • /bypass_jira_check: Bypass the Jira issue check
  • /bypass_peer_approval: Bypass the pull request peers' approval
  • /bypass_leader_approval: Bypass the pull request leaders' approval
  • /approve: Instruct Bert-E that the author has approved the pull request. (authored: ✍️)
  • /create_pull_requests: Allow the creation of integration pull requests.
  • /create_integration_branches: Allow the creation of integration branches.
  • /no_octopus: Prevent Wall-E from doing any octopus merge and use multiple consecutive merge instead
  • /unanimity: Change review acceptance criteria from one reviewer at least to all reviewers
  • /wait: Instruct Bert-E not to run until further notice.

Available commands

  • /help: Print Bert-E's manual in the pull request.
  • /status: Print Bert-E's current status in the pull request TBA
  • /clear: Remove all comments from Bert-E from the history TBA
  • /retry: Re-start a fresh build TBA
  • /build: Re-start a fresh build TBA
  • /force_reset: Delete integration branches & pull requests, and restart merge process from the beginning.
  • /reset: Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

@bert-e

bert-e commented Jun 24, 2026

Contributor

Branches have diverged

This pull request's source branch bugfix/CLDSRV-936/listLifecycleNonCurrents-baremaster has diverged from
development/9.3 by more than 50 commits.

To avoid any integration risks, please re-synchronize them using one of the
following solutions:

  • Merge origin/development/9.3 into bugfix/CLDSRV-936/listLifecycleNonCurrents-baremaster
  • Rebase bugfix/CLDSRV-936/listLifecycleNonCurrents-baremaster onto origin/development/9.3

Note: If you choose to rebase, you may have to ask me to rebuild
integration branches using the reset command.

@nicolas2bert nicolas2bert force-pushed the bugfix/CLDSRV-936/listLifecycleNonCurrents-baremaster branch from fc1d50e to 1602dc3 Compare June 24, 2026 09:09
@bert-e

bert-e commented Jun 24, 2026

Contributor

Incorrect fix version

The Fix Version/s in issue CLDSRV-936 contains:

  • None

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

  • 9.3.12

  • 9.4.0

Please check the Fix Version/s of CLDSRV-936, or the target
branch of this pull request.

@nicolas2bert nicolas2bert force-pushed the bugfix/CLDSRV-936/listLifecycleNonCurrents-baremaster branch 2 times, most recently from 8b35ba5 to 02aaafa Compare June 24, 2026 09:13
@codecov

codecov Bot commented Jun 24, 2026


❌ 29 Tests Failed:

Tests completed: 8341 | Failed: 29 | Passed: 8312 | Skipped: 0
View the top 3 failed test(s) by shortest run time
should allow writes after deleting data with quotas below the current number of inflights::quota evaluation with scuba metrics should allow writes after deleting data with quotas below the current number of inflights
Stack Traces | 0.008s run time
Invalid response body while trying to fetch http://127.0.0.1:8000/quota-test-bucket8/?quota=true: Premature close
should allow zero as a valid RequestsPerSecond value::Test put bucket rate limit should allow zero as a valid RequestsPerSecond value
Stack Traces | 0.008s run time
ifError got unwanted exception: Invalid response body while trying to fetch http://127.0.0.1:8000/putratelimitestbucket/?rate-limit: Premature close
should set the rate limit config::Test put bucket rate limit should set the rate limit config
Stack Traces | 0.008s run time
ifError got unwanted exception: Invalid response body while trying to fetch http://127.0.0.1:8000/putratelimitestbucket/?rate-limit: Premature close
should update existing rate limit config::Test put bucket rate limit should update existing rate limit config
Stack Traces | 0.008s run time
ifError got unwanted exception: Invalid response body while trying to fetch http://127.0.0.1:8000/putratelimitestbucket/?rate-limit: Premature close
should accept limits equal to (nodes x workers)::Test put bucket rate limit validation against node and worker count should accept limits equal to (nodes x workers)
Stack Traces | 0.009s run time
ifError got unwanted exception: Invalid response body while trying to fetch http://127.0.0.1:8000/putratelimitestbucket/?rate-limit: Premature close
should accept limits greater than (nodes x workers)::Test put bucket rate limit validation against node and worker count should accept limits greater than (nodes x workers)
Stack Traces | 0.009s run time
ifError got unwanted exception: Invalid response body while trying to fetch http://127.0.0.1:8000/putratelimitestbucket/?rate-limit: Premature close
should allow writes after multi-deleting data with quotas below the current number of inflights::quota evaluation with scuba metrics should allow writes after multi-deleting data with quotas below the current number of inflights
Stack Traces | 0.009s run time
Invalid response body while trying to fetch http://127.0.0.1:8000/quota-test-bucket10/?quota=true: Premature close
should return the rate limit config::Test get bucket rate limit should return the rate limit config
Stack Traces | 0.009s run time
ifError got unwanted exception: Invalid response body while trying to fetch http://127.0.0.1:8000/getratelimitestbucket/?rate-limit: Premature close
should delete the bucket rate limit config::Test delete bucket rate limit should delete the bucket rate limit config
Stack Traces | 0.012s run time
ifError got unwanted exception: Invalid response body while trying to fetch http://127.0.0.1:8000/deleteratelimitestbucket/?rate-limit: Premature close
should allow writes after deleting data with quotas::quota evaluation with scuba metrics should allow writes after deleting data with quotas
Stack Traces | 0.016s run time
Invalid response body while trying to fetch http://127.0.0.1:8000/quota-test-bucket8/?quota=true: Premature close
should decrease the inflights when performing multi object delete::quota evaluation with scuba metrics should decrease the inflights when performing multi object delete
Stack Traces | 0.016s run time
Invalid response body while trying to fetch http://127.0.0.1:8000/quota-test-bucket10/?quota=true: Premature close
should allow a restore if the quota is full but the objet fits with its reserved storage space::quota evaluation with scuba metrics should allow a restore if the quota is full but the objet fits with its reserved storage space
Stack Traces | 0.017s run time
Invalid response body while trying to fetch http://127.0.0.1:8000/quota-test-bucket15/?quota=true: Premature close
should only evaluate quota and not update inflights for PutObject with the x-scal-s3-version-id header::quota evaluation with scuba metrics should only evaluate quota and not update inflights for PutObject with the x-scal-s3-version-id header
Stack Traces | 0.017s run time
Invalid response body while trying to fetch http://127.0.0.1:8000/quota-test-bucket13/?quota=true: Premature close
should not increase the inflights when the object is being rewritten with a smaller object::quota evaluation with scuba metrics should not increase the inflights when the object is being rewritten with a smaller object
Stack Traces | 0.018s run time
Invalid response body while trying to fetch http://127.0.0.1:8000/quota-test-bucket9/?quota=true: Premature close
should not return QuotaExceeded if the quota is not exceeded::quota evaluation with scuba metrics should not return QuotaExceeded if the quota is not exceeded
Stack Traces | 0.018s run time
Invalid response body while trying to fetch http://127.0.0.1:8000/quota-test-bucket4/?quota=true: Premature close
should reduce inflights when aborting MPU::quota evaluation with scuba metrics should reduce inflights when aborting MPU
Stack Traces | 0.02s run time
Invalid response body while trying to fetch http://127.0.0.1:8000/quota-test-bucket-mpu2/?quota=true: Premature close
should return QuotaExceeded when trying to CopyObject in a bucket with quota::quota evaluation with scuba metrics should return QuotaExceeded when trying to CopyObject in a bucket with quota
Stack Traces | 0.02s run time
Invalid response body while trying to fetch http://127.0.0.1:8000/quota-test-bucket2/?quota=true: Premature close
should return QuotaExceeded when trying to complete MPU in a bucket with quota::quota evaluation with scuba metrics should return QuotaExceeded when trying to complete MPU in a bucket with quota
Stack Traces | 0.02s run time
Invalid response body while trying to fetch http://127.0.0.1:8000/quota-test-bucket3/?quota=true: Premature close
should not update the inflights if the API errored after evaluating quotas (deletion)::quota evaluation with scuba metrics should not update the inflights if the API errored after evaluating quotas (deletion)
Stack Traces | 0.024s run time
Invalid response body while trying to fetch http://127.0.0.1:8000/quota-test-bucket11/?quota=true: Premature close
should not update the inflights if the quota check is passing but the object is already restored::quota evaluation with scuba metrics should not update the inflights if the quota check is passing but the object is already restored
Stack Traces | 0.024s run time
Invalid response body while trying to fetch http://127.0.0.1:8000/quota-test-bucket14/?quota=true: Premature close
should reduce inflights when completing MPU with fewer parts than uploaded::quota evaluation with scuba metrics should reduce inflights when completing MPU with fewer parts than uploaded
Stack Traces | 0.024s run time
Invalid response body while trying to fetch http://127.0.0.1:8000/quota-test-bucket-mpu1/?quota=true: Premature close
should return QuotaExceeded when trying to restore an object in a bucket with quota::quota evaluation with scuba metrics should return QuotaExceeded when trying to restore an object in a bucket with quota
Stack Traces | 0.026s run time
Invalid response body while trying to fetch http://127.0.0.1:8000/quota-test-bucket7/?quota=true: Premature close
should return QuotaExceeded when trying to copy a part in a bucket with quota::quota evaluation with scuba metrics should return QuotaExceeded when trying to copy a part in a bucket with quota
Stack Traces | 0.027s run time
Invalid response body while trying to fetch http://127.0.0.1:8000/quota-test-bucket6/?quota=true: Premature close
should log correct objectGetRetention operation with all required fields::Server Access Logs - File Output With default signature should log correct objectGetRetention operation with all required fields
Stack Traces | 0.039s run time
Expected 4 log entries, got 5

5 !== 4
should return QuotaExceeded when trying to copyObject in a versioned bucket with quota::quota evaluation with scuba metrics should return QuotaExceeded when trying to copyObject in a versioned bucket with quota
Stack Traces | 0.039s run time
Invalid response body while trying to fetch http://127.0.0.1:8000/quota-test-bucket12/?quota=true: Premature close
should return QuotaExceeded when trying to PutObject in a bucket with quota::quota evaluation with scuba metrics should return QuotaExceeded when trying to PutObject in a bucket with quota
Stack Traces | 0.084s run time
Invalid response body while trying to fetch http://127.0.0.1:8000/quota-test-bucket1/?quota=true: Premature close
should not evaluate quotas if the backend is not available::quota evaluation with scuba metrics should not evaluate quotas if the backend is not available
Stack Traces | 30s run time
Timeout of 30000ms exceeded. For async tests and hooks, ensure "done()" is called; if returning a Promise, ensure it resolves. (.../tests/sur/quota.js)
View the full list of 7 ❄️ flaky test(s)
"after each" hook for "should fail if trying to overwrite a delete marker"::MPU with x-scal-s3-version-id header With default signature "after each" hook for "should fail if trying to overwrite a delete marker"

Flake rate in main: 100.00% (Passed 0 times, Failed 86 times)

Stack Traces | 0.013s run time
We encountered an internal error. Please try again.
"after each" hook for "should fail if trying to overwrite a delete marker"::MPU with x-scal-s3-version-id header With v4 signature "after each" hook for "should fail if trying to overwrite a delete marker"

Flake rate in main: 100.00% (Passed 0 times, Failed 76 times)

Stack Traces | 0.015s run time
We encountered an internal error. Please try again.
should accept large quota::Test update bucket quota should accept large quota

Flake rate in main: 4.17% (Passed 23 times, Failed 1 times)

Stack Traces | 0.005s run time
Invalid response body while trying to fetch http://127.0.0.1:8000/updatequotatestbucket/?quota=true: Premature close
should return the quota::Test get bucket quota should return the quota

Flake rate in main: 13.81% (Passed 668 times, Failed 107 times)

Stack Traces | 0.007s run time
Invalid response body while trying to fetch http://127.0.0.1:8000/getquotatestbucket/?quota=true: Premature close
should update quota with XML format::Test update bucket quota should update quota with XML format

Flake rate in main: 4.17% (Passed 23 times, Failed 1 times)

Stack Traces | 0.006s run time
Expected no error, but got FetchError: Invalid response body while trying to fetch http://127.0.0.1:8000/updatequotatestbucket/?quota=true: Premature close
should update quota with explicit JSON content-type::Test update bucket quota should update quota with explicit JSON content-type

Flake rate in main: 4.17% (Passed 23 times, Failed 1 times)

Stack Traces | 0.006s run time
Invalid response body while trying to fetch http://127.0.0.1:8000/updatequotatestbucket/?quota=true: Premature close
should update the quota, using json parsing by default::Test update bucket quota should update the quota, using json parsing by default

Flake rate in main: 4.17% (Passed 23 times, Failed 1 times)

Stack Traces | 0.007s run time
Invalid response body while trying to fetch http://127.0.0.1:8000/updatequotatestbucket/?quota=true: Premature close


Comment thread tests/unit/api/apiUtils/lifecycle.js Outdated
// a malformed (non-null) marker that decode rejects by returning an Error
const result = decodeVersionIdMarker('@@@bad@@@');
assert(result instanceof Error);
assert.strictEqual(result.is.InvalidArgument, true);

@BourgoisMickael BourgoisMickael Jun 24, 2026

Contributor

It would be better to compare the value instead of a boolean, so that the failure message shows the actual vs. expected value instead of false !== true.

Suggested change
- assert.strictEqual(result.is.InvalidArgument, true);
+ assert.strictEqual(result.message, 'InvalidArgument');

Comment thread tests/unit/api/apiUtils/lifecycle.js Outdated
Comment thread lib/api/apiUtils/object/lifecycle.js Outdated
Apply Prettier formatting to the four files touched by this branch so the
prettier:diff CI check passes. Pure formatting, no logic changes.
@nicolas2bert nicolas2bert force-pushed the bugfix/CLDSRV-936/listLifecycleNonCurrents-baremaster branch from 02aaafa to 046ca1a Compare June 25, 2026 10:09
4 participants