rollup-node can get stuck with safe=genesis after restart during partial sync

## Summary

A `rollup-node` follower/sequencer node can get stuck, making no further L2 progress until manual intervention, after being restarted during partial sync. On restart, the node restores the execution head from the local execution provider, but the forkchoice state's `safe` and `finalized` blocks remain at genesis. The L1 watcher then continues deriving later batches, and the chain orchestrator rejects each of them with `InvalidBatchReorg { safe_block_number: 0, ... }`.

This was observed with `scrolltech/rollup-node:v1.0.7-rc6` on a custom/dev Scroll-compatible chain using persistent storage.

## Environment

- Image: `scrolltech/rollup-node:v1.0.7-rc6`
- Deployment: Kubernetes StatefulSet with persistent `/data`
- Role: follower / standby sequencer, sequencing enabled but automatic sequencing disabled
- Data source: L1 RPC + blob/S3 batch data
- Discovery: disabled, trusted peers configured
- Source checked locally at commit: `bc3d500`

No private keys or node keys are relevant to this issue.

## What happened

The node was partially synced. Before restart, it had imported historical derived L2 blocks. After a restart, startup found the last L2 block that existed in the execution node and set the local L2 head to that block:

```text
Checking for L2 head block in EN l2_head_block_number=157478
Checking for L2 head block in EN l2_head_block_number=157477
Checking for L2 head block in EN l2_head_block_number=157476
Found L2 head block in EN l2_head_block_number=157476
```

Then the engine driver started with this forkchoice state:

```text
Starting engine driver fcs="ForkchoiceState {
  head: BlockInfo { number: 157476, hash: 0xbe4fe851ea59ee7cbf959165dc7cb6b45f987fa515d36a62721d358b4fc0cc25 },
  safe: BlockInfo { number: 0, hash: 0xf9f7c524dce38b51a4d28ec2f18680773e5ba9d3f5f430d0e05f92cfeb65b1bc },
  finalized: BlockInfo { number: 0, hash: 0xf9f7c524dce38b51a4d28ec2f18680773e5ba9d3f5f430d0e05f92cfeb65b1bc }
}"
```

The L1 watcher then started from a later finalized L1 block:

```text
Starting L1 watcher l1_block_startup_info=FinalizedBlockNumber(14092206)
```

The next derived batch required continuing around L2 block `157479`, but because `safe` was still genesis, the orchestrator rejected every batch:

```text
Handling derived batch batch_info="BatchInfo { index: 799, hash: 0x57b75e730f5ae14637cababbe13721508c1f965ea019ffdc3fae9bc1938242b4 }" num_blocks=684
Reorging chain to derived block block_number=157479
Encountered error in the chain orchestrator err="InvalidBatchReorg { batch_info: BatchInfo { index: 799, hash: 0x57b75e730f5ae14637cababbe13721508c1f965ea019ffdc3fae9bc1938242b4 }, safe_block_number: 0, derived_block_number: 157479 }"
```

The same error repeated for later batches with increasing derived block numbers. The node stopped making L2 progress.

RPC status at this point showed:

```json
{
  "l1": {
    "status": "Syncing",
    "latest": 38713685,
    "finalized": 38713685,
    "processed": 15671705
  },
  "l2": {
    "status": "Synced",
    "head": {
      "number": 157476,
      "hash": "0xbe4fe851ea59ee7cbf959165dc7cb6b45f987fa515d36a62721d358b4fc0cc25"
    },
    "safe": {
      "number": 0,
      "hash": "0xf9f7c524dce38b51a4d28ec2f18680773e5ba9d3f5f430d0e05f92cfeb65b1bc"
    },
    "finalized": {
      "number": 0,
      "hash": "0xf9f7c524dce38b51a4d28ec2f18680773e5ba9d3f5f430d0e05f92cfeb65b1bc"
    }
  }
}
```

Two height samples taken 20s apart showed no progress:

```text
rpc1=157476 db1=157476
rpc2=157476 db2=157476
delta_rpc=0 delta_db=0
```

## Relevant database state

The rollup DB metadata had:

```text
l1_finalized_block|38713685
l1_latest_block|38713685
l1_processed_block|14151705
l2_head_block|157476
```

The rollup DB still had safe block records ahead of the execution node's persisted head:

```text
select max(block_number) from l2_block where reverted=0;
157478
```

However, the execution node's highest block was `157476`; blocks `157477` and `157478` were missing from the execution provider after restart:

```text
eth_getBlockByNumber(157476) -> 0xbe4fe851ea59ee7cbf959165dc7cb6b45f987fa515d36a62721d358b4fc0cc25
eth_getBlockByNumber(157477) -> null
eth_getBlockByNumber(157478) -> null
eth_getBlockByNumber(157479) -> null
```
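The mismatch above can be expressed as a small check. The following is only a sketch under stated assumptions: `highest_present_block` and `stale_safe_rows` are hypothetical helper names, and the `has_block` closure stands in for whatever lookup the node actually performs against the execution provider (for example an `eth_getBlockByNumber` call). It is not the node's real startup code.

```rust
use std::ops::RangeInclusive;

/// Walk down from `start` and return the highest block number for which
/// `has_block` reports true. `has_block` is a hypothetical stand-in for an
/// execution-provider lookup.
pub fn highest_present_block<F>(start: u64, has_block: F) -> Option<u64>
where
    F: Fn(u64) -> bool,
{
    (0..=start).rev().find(|n| has_block(*n))
}

/// Return the range of rollup DB safe block numbers that sit above the
/// execution head, or `None` when the two stores agree.
pub fn stale_safe_rows(db_max_safe: u64, en_head: u64) -> Option<RangeInclusive<u64>> {
    if db_max_safe > en_head {
        Some(en_head + 1..=db_max_safe)
    } else {
        None
    }
}
```

With the values from this report (`db_max_safe = 157478`, execution head `157476`), `stale_safe_rows` would report the range `157477..=157478`, which are the rows that exist in the rollup DB but not in the execution provider.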

## Expected behavior

After restart during partial sync, the node should recover to a consistent forkchoice state and continue syncing.

In particular, if startup rolls `l2_head_block` back to the latest block present in the execution provider, it should also make `safe`/`finalized` consistent with the recovered execution state, or prune/reconcile rollup DB safe block records that are above the recovered execution head.

The node should not continue with:

```text
head = recovered EN block
safe = genesis
finalized = genesis
```

when the L1 watcher is going to continue deriving later batches.
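One possible shape for the reconciliation described above is sketched here. The `BlockInfo` and `ForkchoiceState` structs are simplified stand-ins modelled on the log output, not the node's real types, and `reconcile_forkchoice` is a hypothetical function. The sketch also assumes that the recovered head is itself a derived block, so that it is acceptable as a safe block; that assumption would need to be confirmed against the actual startup flow.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BlockInfo {
    pub number: u64,
    pub hash: [u8; 32],
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ForkchoiceState {
    pub head: BlockInfo,
    pub safe: BlockInfo,
    pub finalized: BlockInfo,
}

/// Use `candidate` if it is not above `ceiling`; otherwise fall back to
/// `ceiling`. A missing candidate yields `fallback` (genesis in this sketch).
fn clamp_to(candidate: Option<BlockInfo>, ceiling: BlockInfo, fallback: BlockInfo) -> BlockInfo {
    match candidate {
        Some(b) if b.number <= ceiling.number => b,
        Some(_) => ceiling,
        None => fallback,
    }
}

/// Build a forkchoice state in which safe <= head and finalized <= safe,
/// using the rollup DB's records where they are consistent with the
/// recovered execution head.
pub fn reconcile_forkchoice(
    recovered_head: BlockInfo,
    db_safe: Option<BlockInfo>,
    db_finalized: Option<BlockInfo>,
    genesis: BlockInfo,
) -> ForkchoiceState {
    let safe = clamp_to(db_safe, recovered_head, genesis);
    let finalized = clamp_to(db_finalized, safe, genesis);
    ForkchoiceState { head: recovered_head, safe, finalized }
}
```

In the scenario from this report, the DB's latest safe block (`157478`) is above the recovered head (`157476`), so this sketch would set `safe` to the recovered head rather than leaving it at genesis. Whether clamping or pruning the stale DB rows is the right fix is a design decision for the maintainers.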

## Actual behavior

The node starts with a non-genesis head but genesis safe/finalized, then continuously fails with:

```text
InvalidBatchReorg { safe_block_number: 0, derived_block_number: <large number> }
```

It does not make further L2 progress without manual intervention.

## Workaround used

The node was recovered manually by:

1. Calling `rollupNodeAdmin_revertToL1Block` to rewind rollup-node state to a previous known-good L1 batch boundary.
2. Restarting the pod again so the node could replay from that clean point.

After doing this, forkchoice recovered to a consistent state and syncing resumed:

```json
{
  "l2": {
    "head": { "number": 162854 },
    "safe": { "number": 162854 },
    "finalized": { "number": 162854 }
  }
}
```

A subsequent 30s sample showed progress again:

```text
rpc1=167868 db1=167452
rpc2=174031 db2=173904
delta_rpc=6163 delta_db=6452
```

## Source locations that look related

Startup finds the latest L2 head block present in the execution provider and updates the FCS head / DB head:

- `crates/node/src/args.rs` around the startup flow that calls `ForkchoiceState::from_provider`, `prepare_l1_watcher_start_info`, and then scans for `l2_head_block_number` in the execution provider.

`prepare_l1_watcher_start_info` resets processing batches and returns L1 startup info, but does not appear to restore an L2 safe/finalized forkchoice state:

- `crates/database/db/src/operations.rs` around `prepare_l1_watcher_start_info`.

The DB has a helper to fetch the latest safe L2 block:

- `crates/database/db/src/operations.rs` around `get_latest_safe_l2_info`.

The orchestrator fails when `safe_block_number != derived_block_number - 1`:

- `crates/chain-orchestrator/src/lib.rs` around the `InvalidBatchReorg` check in `handle_derived_batch`.
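For illustration, the condition described above can be reduced to the following sketch. It is based only on the error fields visible in the logs and on the reading of `handle_derived_batch` given here; the `BatchError` enum and `check_batch_continuity` function are hypothetical, and the real error also carries a `batch_info` field that is omitted for brevity.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum BatchError {
    /// Simplified version of the error seen in the logs.
    InvalidBatchReorg {
        safe_block_number: u64,
        derived_block_number: u64,
    },
}

/// Accept a derived block only if it directly follows the current safe block.
pub fn check_batch_continuity(
    safe_block_number: u64,
    derived_block_number: u64,
) -> Result<(), BatchError> {
    // `checked_sub` avoids underflow when `derived_block_number` is 0.
    if derived_block_number.checked_sub(1) == Some(safe_block_number) {
        Ok(())
    } else {
        Err(BatchError::InvalidBatchReorg {
            safe_block_number,
            derived_block_number,
        })
    }
}
```

Under this reading, a safe block of `0` can never satisfy the check for a derived block such as `157479`, which matches the repeated failures in the logs.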

## Notes

This is easiest to trigger when the execution provider and rollup DB are slightly out of sync at shutdown/restart, for example when the rollup DB has recorded safe L2 block rows that are above the last block actually persisted by the execution provider.

The issue is not related to S3/blob availability or P2P connectivity in this case; blobs were reachable, peers were connected, and the node resumed syncing after the forkchoice/rollup DB state was manually rewound and the pod restarted.

