# Base Mainnet Flashblocks pending state lag while canonical latest stayed healthy

Date: 2026-06-11

## Summary

On a self-hosted Base mainnet Reth node, the canonical RPC path recovered and stayed healthy, but the Flashblocks path remained roughly 450 blocks stale until the node was restarted.

The key symptoms were:

- `eth_getBlockByNumber("latest")` and `newHeads` were current.
- `eth_subscribe ["newFlashblocks"]` was roughly 450 blocks behind `newHeads`.
- Reth metrics reported `reth_reth_flashblocks_pending_snapshot_height=47205334`, matching the stale `pending` block height seen by downstream RPC reads.
- Restarting the Base node immediately restored `newFlashblocks` and the pending snapshot to the canonical head.

This looks like Flashblocks pending-state production/subscription lag, not a full node crash, OOM, or canonical sync failure.

## Node setup

The node was running Base mainnet Reth with Flashblocks enabled:

- Base Reth tag: `v1.0.0`
- Base repo commit: `47b8b3690d3ef34530f8f90441bc733df01c1dda`
- Execution command included: `--websocket-url=wss://mainnet.flashblocks.base.org/ws`
- The command did not include `--engine.cross-block-cache-size`.
- Containers had been up for 7 days before restart.
- `OOMKilled=false`, `RestartCount=0`.

At the pre-restart snapshot, the machine was not under memory pressure:

- Host memory: `61 GiB` total, `45 GiB` available.
- Swap used: `3.6 GiB / 30 GiB`.
- Execution container: about `20.06 GiB / 61.91 GiB` memory.
- Execution process: `VmRSS=49935668 kB`, `VmSwap=2102636 kB`, `Threads=409`.

## Timeline UTC

### 17:28-18:02: Flashblocks upstream reconnect/reorder/reorg signatures

In the Reth execution logs during `2026-06-11T17:28Z..18:02Z`, we saw:

- `No pong response from upstream, reconnecting`: 7 times.
- `WebSocket connection established`: 7 times.
- `Received non-zero index Flashblock`: 2 times.
- `reorg detected`: 17 times.

Representative lines:

```text
2026-06-11T17:28:52.425095Z WARN No pong response from upstream, reconnecting
2026-06-11T17:28:54.203726Z INFO WebSocket connection established
2026-06-11T17:28:56.525666Z ERROR Received non-zero index Flashblock for new block
2026-06-11T17:49:25.964924Z WARN No pong response from upstream, reconnecting
2026-06-11T17:49:27.741799Z INFO WebSocket connection established
```

We did not observe these signatures in that window:

- `State root task timed out`: 0
- `could not process Flashblock`: 0
- long read transaction timeout: 0
- OOM signature: 0
- exact `missing canonical` error: 0
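The counts above come from substring matches over the execution logs. A minimal sketch of that kind of tally follows; the function name `count_signatures` is ours, and the sample input is just the five representative lines quoted above, not the full log window.

```python
from collections import Counter

# Substrings copied from the log signatures named in this report.
SIGNATURES = (
    "No pong response from upstream, reconnecting",
    "WebSocket connection established",
    "Received non-zero index Flashblock",
    "reorg detected",
    "State root task timed out",
    "could not process Flashblock",
)

def count_signatures(lines):
    """Count how many log lines contain each signature substring."""
    counts = Counter()
    for line in lines:
        for sig in SIGNATURES:
            if sig in line:
                counts[sig] += 1
    # Report every signature, including those that never matched.
    return {sig: counts[sig] for sig in SIGNATURES}

sample_log = """\
2026-06-11T17:28:52.425095Z WARN No pong response from upstream, reconnecting
2026-06-11T17:28:54.203726Z INFO WebSocket connection established
2026-06-11T17:28:56.525666Z ERROR Received non-zero index Flashblock for new block
2026-06-11T17:49:25.964924Z WARN No pong response from upstream, reconnecting
2026-06-11T17:49:27.741799Z INFO WebSocket connection established
""".splitlines()

result = count_signatures(sample_log)
for sig, n in result.items():
    print(f"{n:3d}  {sig}")
```

Run over the full `17:28Z..18:02Z` window instead of the excerpt, the same tally yields the per-signature totals listed above.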

### 18:00-18:05: canonical RPC healthy, Flashblocks stale

At `2026-06-11T18:01:12Z..18:01:45Z`, repeated `eth_getBlockByNumber("latest")` calls were current:

- First sample: block `47205762`, timestamp `2026-06-11T18:01:11Z`, age `1s`.
- Last sample: block `47205776`, timestamp `2026-06-11T18:01:39Z`, age `6s`.
- All samples were `0s..6s` old.
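For reference, "age" here is the sampling time minus the block timestamp. The sketch below reproduces the two ages above, assuming the first and last samples were taken at the start and end of the stated window (`18:01:12Z` and `18:01:45Z`); the helper names are ours.

```python
from datetime import datetime, timezone

def parse_utc(text):
    """Parse a timestamp such as '2026-06-11T18:01:11Z' into an aware UTC datetime."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

def timestamp_from_rpc(hex_timestamp):
    """Convert the hex `timestamp` field of an eth_getBlockByNumber result to UTC."""
    return datetime.fromtimestamp(int(hex_timestamp, 16), tz=timezone.utc)

def block_age_seconds(block_time, sampled_at):
    """Age of a block, in whole seconds, at the moment it was sampled."""
    return int((sampled_at - block_time).total_seconds())

# Assumption: samples were taken at the start and end of the stated window.
first_age = block_age_seconds(parse_utc("2026-06-11T18:01:11Z"),
                              parse_utc("2026-06-11T18:01:12Z"))
last_age = block_age_seconds(parse_utc("2026-06-11T18:01:39Z"),
                             parse_utc("2026-06-11T18:01:45Z"))
print(first_age, last_age)
```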

Around the same time, Reth Flashblocks metrics showed the pending snapshot was stale:

```text
reth_reth_flashblocks_upstream_messages 4817121
reth_reth_flashblocks_reconnect_attempts 418
reth_reth_flashblocks_upstream_errors 30
reth_reth_flashblocks_unexpected_block_order 227
reth_reth_flashblocks_block_processing_error 208
reth_reth_flashblocks_pending_clear_reorg 824
reth_reth_flashblocks_pending_clear_catchup 50141
reth_reth_flashblocks_pending_snapshot_height 47205334
reth_reth_flashblocks_pending_snapshot_fb_index 10
reth_sync_block_validation_state_root_task_timeout_total 0
reth_sync_block_validation_state_root_parallel_fallback_total 0
reth_sync_block_validation_state_root_task_fallback_success_total 0
```
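The staleness follows from comparing `reth_reth_flashblocks_pending_snapshot_height` with the canonical `latest` height sampled around the same time. A minimal sketch of that comparison, assuming the metrics endpoint returns plain `name value` lines as shown above (labels and trailing timestamps are not handled); `parse_metrics` is our own helper, not part of Reth:

```python
def parse_metrics(text):
    """Parse unlabeled 'name value' lines of Prometheus text output into a dict."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        parts = line.split()
        metrics[parts[0]] = float(parts[1])
    return metrics

pre_restart = """\
reth_reth_flashblocks_reconnect_attempts 418
reth_reth_flashblocks_pending_clear_reorg 824
reth_reth_flashblocks_pending_snapshot_height 47205334
reth_reth_flashblocks_pending_snapshot_fb_index 10
"""

metrics = parse_metrics(pre_restart)
pending_height = int(metrics["reth_reth_flashblocks_pending_snapshot_height"])

# Last HTTP `latest` sample from the window above.
latest_height = 47205776
pending_lag = latest_height - pending_height
print(f"pending snapshot is {pending_lag} blocks behind latest")
```

Against the last `latest` sample (`47205776`), the snapshot height `47205334` is 442 blocks behind.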

A local WebSocket probe to the node around `2026-06-11T18:04Z` showed:

```text
newHeads count=8 unique_blocks=8 first=47205873 last=47205880
newFlashblocks count=68 unique_blocks=7 first=47205422 last=47205428
errors=[]
```

So `newFlashblocks` was about `450` blocks behind `newHeads`, while `newHeads` and HTTP `latest` were current.
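The probe summary lines are a count, a distinct-block count, and the first and last block numbers seen per subscription. The sketch below shows how such a summary and the resulting lag can be derived from a list of observed block numbers. The per-block distribution of the 68 `newFlashblocks` messages is hypothetical (the probe recorded only the totals), and `summarize` is our own helper.

```python
def summarize(name, block_numbers):
    """Format a probe summary from the block numbers seen on one subscription."""
    return (f"{name} count={len(block_numbers)} "
            f"unique_blocks={len(set(block_numbers))} "
            f"first={block_numbers[0]} last={block_numbers[-1]}")

# newHeads: one notification per block.
heads = list(range(47205873, 47205881))

# newFlashblocks: several messages per block. Hypothetical split of the
# 68 messages: 10 for each of the first six blocks, 8 for the last one.
flash = [b for b in range(47205422, 47205428) for _ in range(10)] + [47205428] * 8

heads_line = summarize("newHeads", heads)
flash_line = summarize("newFlashblocks", flash)
lag = heads[-1] - flash[-1]

print(heads_line)
print(flash_line)
print(f"lag={lag}")
```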

### 18:06-18:09: restart cleared the lag

We restarted the node at `2026-06-11T18:06Z`.

After restart, HTTP `latest` stayed current:

- First sample: block `47205939`, timestamp `2026-06-11T18:07:05Z`, age `5s`.
- Last sample: block `47205954`, timestamp `2026-06-11T18:07:35Z`, age `8s`.

Reth metrics around `18:08Z` showed:

```text
reth_reth_flashblocks_upstream_messages 832
reth_reth_flashblocks_pending_snapshot_height 47205983
reth_sync_block_validation_state_root_task_timeout_total 0
reth_sync_block_validation_state_root_parallel_fallback_total 0
reth_sync_block_validation_state_root_task_fallback_success_total 0
```

The WebSocket probe around `18:08Z` showed Flashblocks caught up:

```text
newHeads count=8 unique_blocks=8 first=47205982 last=47205989
newFlashblocks count=81 unique_blocks=9 first=47205983 last=47205991
errors=[]
```

## Downstream impact

The node serves a latency-sensitive application that uses the official Flashblocks paths:

- `eth_subscribe ["pendingLogs", filter]` for ERC20 `Transfer` logs.
- `eth_subscribe ["newFlashblocks"]` probes.
- `eth_getBlockByNumber("pending")` / `BlockId::pending()` through live read paths such as `eth_call`, `eth_estimateGas`, `debug_traceCall`, `eth_getTransactionCount`, and `eth_getBalance`.

During this incident, the application saw a split-brain view of the same Base node:

- Ordinary block subscription / canonical state had advanced to blocks such as `47205325` and later `47205760`.
- Base live/pending reads still returned stale heights such as `47204902` and `47205334`.
- The stale `47205334` matched `reth_reth_flashblocks_pending_snapshot_height` before restart.

One concrete downstream failure:

- A Base sell preparation path repeatedly failed before transaction submission: exit quote calldata simulation became unavailable, and the transaction actor rejected actions where `current_block` was far ahead of the pending-derived `live_block`.
- Before restart, a manual rescue attempt found a route but was rejected with `current_block=47205760, live_block=47205334`.
- After restart, the same class of rescue action passed preparation and confirmed shortly afterward. We are omitting transaction identifiers from this public report.

This does not prove `eth_sendRawTransaction` itself was broken. The failure happened earlier: stale Flashblocks pending state polluted downstream quote/simulation/readiness logic while canonical `latest` was already healthy.

## Working hypothesis

Our current hypothesis is:

1. A Flashblocks upstream reconnect/reorder/reorg sequence caused pending-state production to fall behind.
2. Canonical sync and ordinary `newHeads` recovered, but the Flashblocks pending snapshot and `newFlashblocks` subscription did not catch up.
3. Downstream consumers that actively subscribe to `pendingLogs` and query `pending` state can observe the stale Flashblocks path even when operators checking only `latest` / `newHeads` see a healthy node.
4. Restarting the node clears the stale Flashblocks pending state.

We do not yet know whether high downstream `pendingLogs`/`pending` read load merely exposed the condition, amplified it, or is required to trigger it.

## Similar public issues we found

These issues look related or adjacent:

- `base/base#2675`: `Base v8.0.0 on mainnet stops synching`, including `No pong response from upstream`, `Received non-zero index Flashblock`, `could not process Flashblock ... missing canonical header`, and OOM-risk old-state-root signatures. https://github.com/base/base/issues/2675
- `base/base#2526`: archive node stalls after Flashblocks disconnect / timeout sequence, with `No pong response`, canonical head plateau, state-root timeout, cache mutex blocking, and missing canonical header. https://github.com/base/base/issues/2526
- `base/base#2896`: Flashblocks rebuilt twice / parent hash mismatch during possible reorg or sequencer failover. https://github.com/base/base/issues/2896
- `base/base#694`: non-sequential Flashblocks. https://github.com/base/base/issues/694
- `base/base#781`: Flashblocks processor panic on empty Flashblocks during reorg/depth-limit reconciliation. https://github.com/base/base/issues/781
- `base/base#613`: documents official Flashblocks `pendingLogs` and `newFlashblocks` subscriptions. https://github.com/base/base/issues/613

Our incident differs from the full-stall reports because canonical `latest` / `newHeads` were healthy at the final pre-restart sampling point, while the Flashblocks path remained about 450 blocks behind.

## Questions for the Base team

1. Is it expected that `newFlashblocks` and `reth_reth_flashblocks_pending_snapshot_height` can remain hundreds of blocks behind while `newHeads` / `latest` are current?
2. Is there a known condition where Flashblocks pending-state production stops catching up after upstream reconnect/reorg/order errors, without causing a full canonical sync stall?
3. Are `pendingLogs` subscribers or high-volume `pending` state reads known to affect Flashblocks pending snapshot catch-up?
4. Is there a health metric or RPC invariant we should monitor to distinguish:
   - canonical chain unhealthy,
   - Flashblocks upstream disconnected,
   - Flashblocks pending snapshot stale,
   - `pendingLogs` consumer lag?
5. Are there recommended Reth flags for Flashblocks-heavy RPC nodes, especially around `--engine.cross-block-cache-size` or RPC cache settings?
6. Is restart currently the expected recovery action when `reth_reth_flashblocks_pending_snapshot_height` remains stale while canonical latest is healthy?
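Until there is an official answer to question 4, the check we have in mind is sketched below. It is only a sketch: the thresholds `max_age_s` and `max_pending_lag` are placeholder values we have not tuned, the state names are ours, and it covers only two of the four conditions listed (it cannot see upstream disconnects or `pendingLogs` consumer lag). `fetch_height` uses the standard `eth_getBlockByNumber` JSON-RPC call and is shown for completeness.

```python
import json
import urllib.request
from enum import Enum

class Health(Enum):
    OK = "ok"
    CANONICAL_STALE = "canonical chain stale"
    PENDING_STALE = "Flashblocks pending snapshot stale"

def fetch_height(rpc_url, tag):
    """Return the block number for a tag such as 'latest' or 'pending'."""
    body = json.dumps({"jsonrpc": "2.0", "id": 1,
                       "method": "eth_getBlockByNumber",
                       "params": [tag, False]}).encode()
    req = urllib.request.Request(rpc_url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=5) as resp:
        return int(json.load(resp)["result"]["number"], 16)

def classify(latest_height, latest_age_s, pending_height,
             max_age_s=15, max_pending_lag=5):
    """Classify node health from canonical freshness and pending lag.

    Both thresholds are hypothetical defaults, not recommended values.
    """
    if latest_age_s > max_age_s:
        return Health.CANONICAL_STALE
    if latest_height - pending_height > max_pending_lag:
        return Health.PENDING_STALE
    return Health.OK

# Pre-restart values from this report: latest was 6s old, pending far behind.
before = classify(47205776, 6, 47205334)
# Post-restart values: pending snapshot at or ahead of the sampled latest.
after = classify(47205954, 8, 47205983)
print(before.value, "/", after.value)
```

With the values from this incident, the pre-restart sample classifies as a stale pending snapshot and the post-restart sample as healthy, which is the distinction a `latest`-only check misses.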

## Raw evidence retained locally

We retained:

- Reth execution logs around `2026-06-11T17:28Z..18:09Z`.
- Pre-restart Docker/container/resource snapshots.
- Pre- and post-restart Reth metrics.
- WebSocket probe outputs for `newHeads` and `newFlashblocks`.
- Downstream application logs showing stale pending-derived `live_block` values matching Reth metrics.
