rollup-node can get stuck with safe=genesis after restart during partial sync #501

Description

@shu-unifra

Summary

A rollup-node follower/sequencer node can get permanently stuck after being restarted during partial sync. On restart, the node restores the execution head from the local execution provider, but the forkchoice state's safe and finalized blocks remain at genesis. The L1 watcher then continues deriving later batches, and the chain orchestrator repeatedly fails with InvalidBatchReorg { safe_block_number: 0, ... }.

This was observed with scrolltech/rollup-node:v1.0.7-rc6 on a custom/dev Scroll-compatible chain using persistent storage.

Environment

  • Image: scrolltech/rollup-node:v1.0.7-rc6
  • Deployment: Kubernetes StatefulSet with persistent /data
  • Role: follower / standby sequencer, sequencing enabled but automatic sequencing disabled
  • Data source: L1 RPC + blob/S3 batch data
  • Discovery: disabled, trusted peers configured
  • Source checked locally at commit: bc3d500

No private keys or node keys are relevant to this issue.

What happened

The node was partially synced. Before restart, it had imported historical derived L2 blocks. After a restart, startup found the last L2 block that existed in the execution node and set the local L2 head to that block:

Checking for L2 head block in EN l2_head_block_number=157478
Checking for L2 head block in EN l2_head_block_number=157477
Checking for L2 head block in EN l2_head_block_number=157476
Found L2 head block in EN l2_head_block_number=157476

Then the engine driver started with this forkchoice state:

Starting engine driver fcs="ForkchoiceState {
  head: BlockInfo { number: 157476, hash: 0xbe4fe851ea59ee7cbf959165dc7cb6b45f987fa515d36a62721d358b4fc0cc25 },
  safe: BlockInfo { number: 0, hash: 0xf9f7c524dce38b51a4d28ec2f18680773e5ba9d3f5f430d0e05f92cfeb65b1bc },
  finalized: BlockInfo { number: 0, hash: 0xf9f7c524dce38b51a4d28ec2f18680773e5ba9d3f5f430d0e05f92cfeb65b1bc }
}"

The L1 watcher then started from a later finalized L1 block:

Starting L1 watcher l1_block_startup_info=FinalizedBlockNumber(14092206)

The next derived batch needed to continue from L2 block 157479, but because safe was still at genesis, the orchestrator rejected every batch:

Handling derived batch batch_info="BatchInfo { index: 799, hash: 0x57b75e730f5ae14637cababbe13721508c1f965ea019ffdc3fae9bc1938242b4 }" num_blocks=684
Reorging chain to derived block block_number=157479
Encountered error in the chain orchestrator err="InvalidBatchReorg { batch_info: BatchInfo { index: 799, hash: 0x57b75e730f5ae14637cababbe13721508c1f965ea019ffdc3fae9bc1938242b4 }, safe_block_number: 0, derived_block_number: 157479 }"

The same error repeated for later batches with increasing derived block numbers. The node stopped making L2 progress.

RPC status at this point showed:

{
  "l1": {
    "status": "Syncing",
    "latest": 38713685,
    "finalized": 38713685,
    "processed": 15671705
  },
  "l2": {
    "status": "Synced",
    "head": {
      "number": 157476,
      "hash": "0xbe4fe851ea59ee7cbf959165dc7cb6b45f987fa515d36a62721d358b4fc0cc25"
    },
    "safe": {
      "number": 0,
      "hash": "0xf9f7c524dce38b51a4d28ec2f18680773e5ba9d3f5f430d0e05f92cfeb65b1bc"
    },
    "finalized": {
      "number": 0,
      "hash": "0xf9f7c524dce38b51a4d28ec2f18680773e5ba9d3f5f430d0e05f92cfeb65b1bc"
    }
  }
}

Two height samples taken 20s apart (RPC head and DB head) showed no progress:

rpc1=157476 db1=157476
rpc2=157476 db2=157476
delta_rpc=0 delta_db=0

Relevant database state

The rollup DB metadata had:

l1_finalized_block|38713685
l1_latest_block|38713685
l1_processed_block|14151705
l2_head_block|157476

The rollup DB still had safe block records ahead of the execution node's persisted head:

select max(block_number) from l2_block where reverted=0;
157478

But the execution node's highest block was 157476; blocks 157477 and 157478 were missing from the execution provider after restart:

eth_getBlockByNumber(157476) -> 0xbe4fe851ea59ee7cbf959165dc7cb6b45f987fa515d36a62721d358b4fc0cc25
eth_getBlockByNumber(157477) -> null
eth_getBlockByNumber(157478) -> null
eth_getBlockByNumber(157479) -> null
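
The mismatch above can be expressed as a small check. This is only a sketch: `safe_blocks_missing_from_en` and the `en_has_block` callback are hypothetical names, standing in for a query over the non-reverted `l2_block` rows and an `eth_getBlockByNumber` lookup respectively.

```rust
/// Returns the block numbers that the rollup DB records as safe but that
/// the execution node (EN) cannot serve.
///
/// Both inputs are stand-ins (assumptions, not rollup-node APIs):
/// - `db_safe_numbers`: block numbers of non-reverted rows in `l2_block`
/// - `en_has_block`: true if `eth_getBlockByNumber(n)` returns a block
pub fn safe_blocks_missing_from_en(
    db_safe_numbers: &[u64],
    en_has_block: impl Fn(u64) -> bool,
) -> Vec<u64> {
    db_safe_numbers
        .iter()
        .copied()
        .filter(|n| !en_has_block(*n))
        .collect()
}
```

With the state reported here (DB rows up to 157478, EN head at 157476), such a check would flag 157477 and 157478 as the rows that need to be reconciled.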

Expected behavior

After restart during partial sync, the node should recover to a consistent forkchoice state and continue syncing.

In particular, if startup rolls l2_head_block back to the latest block present in the execution provider, it should also make safe/finalized consistent with the recovered execution state, or prune/reconcile rollup DB safe block records that are above the recovered execution head.

The node should not continue with:

head = recovered EN block
safe = genesis
finalized = genesis

when the L1 watcher is going to continue deriving later batches.
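
One possible shape for that reconciliation is sketched below. It is not the rollup-node implementation: the `BlockInfo` and `ForkchoiceState` types are simplified local stand-ins, and `reconcile_forkchoice` plus the two lookup callbacks are hypothetical. It also assumes that blocks at or below the recovered head in the execution node match the DB's safe block records.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BlockInfo {
    pub number: u64,
    pub hash: [u8; 32],
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ForkchoiceState {
    pub head: BlockInfo,
    pub safe: BlockInfo,
    pub finalized: BlockInfo,
}

/// Builds a forkchoice state that is consistent with the recovered EN head.
///
/// The lookups are hypothetical DB queries: each returns the highest
/// safe (or finalized) block record at or below the given block number.
/// Genesis is used only when no such record exists.
pub fn reconcile_forkchoice(
    genesis: BlockInfo,
    recovered_head: BlockInfo,
    latest_safe_at_or_below: impl Fn(u64) -> Option<BlockInfo>,
    latest_finalized_at_or_below: impl Fn(u64) -> Option<BlockInfo>,
) -> ForkchoiceState {
    // Safe must never be ahead of what the execution node actually has.
    let safe = latest_safe_at_or_below(recovered_head.number).unwrap_or(genesis);
    // Finalized must never be ahead of safe.
    let finalized = latest_finalized_at_or_below(safe.number).unwrap_or(genesis);
    ForkchoiceState {
        head: recovered_head,
        safe,
        finalized,
    }
}
```

Under these assumptions, safe block rows above the recovered head would still need to be pruned or marked reverted in the rollup DB, so that the next derived batch continues from the reconciled safe block.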

Actual behavior

The node starts with a non-genesis head but genesis safe/finalized, then continuously fails with:

InvalidBatchReorg { safe_block_number: 0, derived_block_number: <large number> }

It does not make further L2 progress without manual intervention.

Workaround used

The node was recovered manually by:

  1. Calling rollupNodeAdmin_revertToL1Block to rewind rollup-node state to a previous known-good L1 batch boundary.
  2. Restarting the pod again so the node could replay from that clean point.

After doing this, forkchoice recovered to a consistent state and syncing resumed:

{
  "l2": {
    "head": { "number": 162854 },
    "safe": { "number": 162854 },
    "finalized": { "number": 162854 }
  }
}

Two subsequent samples taken 30s apart showed progress again:

rpc1=167868 db1=167452
rpc2=174031 db2=173904
delta_rpc=6163 delta_db=6452

Source locations that look related

Startup finds the latest L2 head block present in the execution provider and updates the FCS head / DB head:

  • crates/node/src/args.rs around the startup flow that calls ForkchoiceState::from_provider, prepare_l1_watcher_start_info, and then scans for l2_head_block_number in the execution provider.

prepare_l1_watcher_start_info resets processing batches and returns L1 startup info, but does not appear to restore an L2 safe/finalized forkchoice state:

  • crates/database/db/src/operations.rs around prepare_l1_watcher_start_info.

The DB has a helper to fetch the latest safe L2 block:

  • crates/database/db/src/operations.rs around get_latest_safe_l2_info.

The orchestrator fails when safe_block_number != derived_block_number - 1:

  • crates/chain-orchestrator/src/lib.rs around the InvalidBatchReorg check in handle_derived_batch.
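
As a rough illustration of the condition described above (not the actual code in `handle_derived_batch`), the check amounts to requiring that the derived block directly extends the current safe block. The error type and function name here are local stand-ins:

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum DerivationCheckError {
    /// Mirrors the fields seen in the logged error; the real error also
    /// carries `batch_info`, omitted here for brevity.
    InvalidBatchReorg {
        safe_block_number: u64,
        derived_block_number: u64,
    },
}

/// Accepts a derived block only if it is the direct successor of the
/// current safe block, i.e. safe_block_number == derived_block_number - 1.
pub fn check_derived_extends_safe(
    safe_block_number: u64,
    derived_block_number: u64,
) -> Result<(), DerivationCheckError> {
    // checked_sub avoids underflow when derived_block_number is 0.
    if derived_block_number.checked_sub(1) == Some(safe_block_number) {
        Ok(())
    } else {
        Err(DerivationCheckError::InvalidBatchReorg {
            safe_block_number,
            derived_block_number,
        })
    }
}
```

With safe stuck at 0 and derived blocks starting at 157479, a check of this form can never pass, which matches the repeated error in the logs.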

Notes

This is easiest to trigger when the execution provider and rollup DB are slightly out of sync at shutdown/restart, for example when the rollup DB has recorded safe L2 block rows that are above the last block actually persisted by the execution provider.

The issue is not related to S3/blob availability or P2P connectivity in this case; blobs were reachable, peers were connected, and the node resumed syncing after the forkchoice/rollup DB state was manually rewound and the pod restarted.
