Base Mainnet Flashblocks pending state lag while canonical latest stayed healthy #1129

@cyborg42

Date: 2026-06-11

Summary

We observed a self-hosted Base mainnet Reth node where the canonical RPC path recovered and stayed healthy, but the Flashblocks path remained hundreds of blocks stale until the node was restarted.

The key symptom was:

  • eth_getBlockByNumber("latest") and newHeads were current.
  • eth_subscribe ["newFlashblocks"] was roughly 450 blocks behind newHeads.
  • Reth metrics reported reth_reth_flashblocks_pending_snapshot_height=47205334, matching the stale pending block height seen by downstream RPC reads.
  • Restarting the Base node immediately restored newFlashblocks and the pending snapshot to the canonical head.

This looks like Flashblocks pending-state production/subscription lag, not a full node crash, OOM, or canonical sync failure.

Node setup

The node was running Base mainnet Reth with Flashblocks enabled:

  • Base Reth tag: v1.0.0
  • Base repo commit: 47b8b3690d3ef34530f8f90441bc733df01c1dda
  • Execution command included: --websocket-url=wss://mainnet.flashblocks.base.org/ws
  • The command did not include --engine.cross-block-cache-size.
  • Containers had been up for 7 days before restart.
  • OOMKilled=false, RestartCount=0.

At the pre-restart snapshot, the machine was not under memory pressure:

  • Host memory: 61 GiB total, 45 GiB available.
  • Swap used: 3.6 GiB / 30 GiB.
  • Execution container: about 20.06 GiB / 61.91 GiB memory.
  • Execution process: VmRSS=49935668 kB, VmSwap=2102636 kB, Threads=409.

Timeline UTC

17:28-18:02: Flashblocks upstream reconnect/reorder/reorg signatures

In the Reth execution logs during 2026-06-11T17:28Z..18:02Z, we saw:

  • No pong response from upstream, reconnecting: 7 times.
  • WebSocket connection established: 7 times.
  • Received non-zero index Flashblock: 2 times.
  • reorg detected: 17 times.

Representative lines:

2026-06-11T17:28:52.425095Z WARN No pong response from upstream, reconnecting
2026-06-11T17:28:54.203726Z INFO WebSocket connection established
2026-06-11T17:28:56.525666Z ERROR Received non-zero index Flashblock for new block
2026-06-11T17:49:25.964924Z WARN No pong response from upstream, reconnecting
2026-06-11T17:49:27.741799Z INFO WebSocket connection established

We did not observe these signatures in that window:

  • State root task timed out: 0
  • could not process Flashblock: 0
  • long read transaction timeout: 0
  • OOM signature: 0
  • exact missing canonical error: 0
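
A minimal sketch of how signature counts like the ones above could be reproduced from a log file. The signature strings are copied from this report; the function name and the case-sensitive substring matching are illustrative assumptions, not part of Reth or its tooling.

```python
# Signature substrings copied from the log lines quoted in this report.
SIGNATURES = [
    "No pong response from upstream, reconnecting",
    "WebSocket connection established",
    "Received non-zero index Flashblock",
    "reorg detected",
    "State root task timed out",
    "could not process Flashblock",
]


def count_signatures(lines, signatures=SIGNATURES):
    """Return {signature: number of lines containing it}, including zeros."""
    counts = {sig: 0 for sig in signatures}
    for line in lines:
        for sig in signatures:
            if sig in line:  # plain, case-sensitive substring match
                counts[sig] += 1
    return counts


# The representative lines shown above, used as sample input.
sample = [
    "2026-06-11T17:28:52.425095Z WARN No pong response from upstream, reconnecting",
    "2026-06-11T17:28:54.203726Z INFO WebSocket connection established",
    "2026-06-11T17:28:56.525666Z ERROR Received non-zero index Flashblock for new block",
    "2026-06-11T17:49:25.964924Z WARN No pong response from upstream, reconnecting",
    "2026-06-11T17:49:27.741799Z INFO WebSocket connection established",
]
sample_counts = count_signatures(sample)
```

Reporting explicit zeros matters here: the absence of state-root and OOM signatures is part of the evidence that the canonical path was healthy.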

18:00-18:05: canonical RPC healthy, Flashblocks stale

At 2026-06-11T18:01:12Z..18:01:45Z, repeated eth_getBlockByNumber("latest") calls were current:

  • First sample: block 47205762, timestamp 2026-06-11T18:01:11Z, age 1s.
  • Last sample: block 47205776, timestamp 2026-06-11T18:01:39Z, age 6s.
  • All samples were 0s..6s old.
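
The "age" figures above are simply the sampling time minus the block's `timestamp` field. A minimal sketch of that arithmetic, assuming the first and last samples were taken at the start and end of the quoted window; the helper names are hypothetical.

```python
from datetime import datetime, timezone


def parse_utc(text):
    """Parse a timestamp such as '2026-06-11T18:01:11Z' as timezone-aware UTC."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def block_age_seconds(block_timestamp, sampled_at):
    """Age of a block in whole seconds at the time it was sampled.

    block_timestamp: the block's `timestamp` field, either a hex string
    ("0x...") as returned over JSON-RPC or an int of Unix seconds.
    sampled_at: timezone-aware datetime of the sample.
    """
    if isinstance(block_timestamp, str):
        block_timestamp = int(block_timestamp, 16)
    return int(sampled_at.timestamp()) - block_timestamp


# First and last samples from the list above.
first_ts = hex(int(parse_utc("2026-06-11T18:01:11Z").timestamp()))
last_ts = hex(int(parse_utc("2026-06-11T18:01:39Z").timestamp()))
first_age = block_age_seconds(first_ts, parse_utc("2026-06-11T18:01:12Z"))
last_age = block_age_seconds(last_ts, parse_utc("2026-06-11T18:01:45Z"))
```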

Around the same time, Reth Flashblocks metrics showed the pending snapshot was stale:

reth_reth_flashblocks_upstream_messages 4817121
reth_reth_flashblocks_reconnect_attempts 418
reth_reth_flashblocks_upstream_errors 30
reth_reth_flashblocks_unexpected_block_order 227
reth_reth_flashblocks_block_processing_error 208
reth_reth_flashblocks_pending_clear_reorg 824
reth_reth_flashblocks_pending_clear_catchup 50141
reth_reth_flashblocks_pending_snapshot_height 47205334
reth_reth_flashblocks_pending_snapshot_fb_index 10
reth_sync_block_validation_state_root_task_timeout_total 0
reth_sync_block_validation_state_root_parallel_fallback_total 0
reth_sync_block_validation_state_root_task_fallback_success_total 0
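
A minimal sketch of turning a metrics dump like the one above into a single lag number. The metric name is taken from this report; the parser is a simplification that only handles plain `name value` lines (no trailing timestamps), and the function names are hypothetical. The canonical height used in the example is the last `latest` sample from the previous section, which was taken only approximately at the same time as the metrics.

```python
PENDING_HEIGHT_METRIC = "reth_reth_flashblocks_pending_snapshot_height"


def parse_metrics(text):
    """Parse 'name value' lines into {name: float}; skip comments and blanks."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, raw = line.rpartition(" ")
        if not name:
            continue  # no separator found, not a 'name value' line
        try:
            values[name] = float(raw)
        except ValueError:
            continue  # value was not numeric
    return values


def pending_snapshot_lag(metrics, latest_height):
    """Blocks by which the pending snapshot trails canonical latest, or None."""
    height = metrics.get(PENDING_HEIGHT_METRIC)
    if height is None:
        return None
    return latest_height - int(height)


dump = """\
reth_reth_flashblocks_reconnect_attempts 418
reth_reth_flashblocks_pending_snapshot_height 47205334
reth_reth_flashblocks_pending_snapshot_fb_index 10
"""
lag = pending_snapshot_lag(parse_metrics(dump), latest_height=47205776)
```

With these inputs the lag is 442 blocks, in line with the gap later seen on the WebSocket probe.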

A local WebSocket probe to the node around 2026-06-11T18:04Z showed:

newHeads count=8 unique_blocks=8 first=47205873 last=47205880
newFlashblocks count=68 unique_blocks=7 first=47205422 last=47205428
errors=[]

So newFlashblocks was about 450 blocks behind newHeads, while newHeads and HTTP latest were current.
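
The "about 450 blocks" figure comes from subtracting the last block seen on each subscription. A minimal sketch of that calculation over the probe summary format shown above; the probe is our own tool, so the `key=value` layout and the function names here are specific to this report rather than any standard output.

```python
def parse_probe_line(line):
    """Parse 'name key=value ...' into (name, {key: int})."""
    name, *pairs = line.split()
    fields = {}
    for pair in pairs:
        key, _, value = pair.partition("=")
        fields[key] = int(value)
    return name, fields


def flashblocks_lag(probe_text):
    """Last newHeads block minus last newFlashblocks block."""
    summary = dict(
        parse_probe_line(line)
        for line in probe_text.splitlines()
        if line.strip() and not line.startswith("errors")
    )
    return summary["newHeads"]["last"] - summary["newFlashblocks"]["last"]


pre_restart = """\
newHeads count=8 unique_blocks=8 first=47205873 last=47205880
newFlashblocks count=68 unique_blocks=7 first=47205422 last=47205428
errors=[]
"""
post_restart = """\
newHeads count=8 unique_blocks=8 first=47205982 last=47205989
newFlashblocks count=81 unique_blocks=9 first=47205983 last=47205991
errors=[]
"""
pre_lag = flashblocks_lag(pre_restart)
post_lag = flashblocks_lag(post_restart)
```

The pre-restart probe gives a lag of 452 blocks. The post-restart probe (quoted later in this report) gives -2, meaning Flashblocks were slightly ahead of the last `newHeads` block, as expected for a healthy pending path.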

18:06-18:09: restart cleared the lag

We restarted the node at 2026-06-11T18:06Z.

After restart, HTTP latest stayed current:

  • First sample: block 47205939, timestamp 2026-06-11T18:07:05Z, age 5s.
  • Last sample: block 47205954, timestamp 2026-06-11T18:07:35Z, age 8s.

Reth metrics around 18:08Z showed:

reth_reth_flashblocks_upstream_messages 832
reth_reth_flashblocks_pending_snapshot_height 47205983
reth_sync_block_validation_state_root_task_timeout_total 0
reth_sync_block_validation_state_root_parallel_fallback_total 0
reth_sync_block_validation_state_root_task_fallback_success_total 0

The WebSocket probe around 18:08Z showed Flashblocks caught up:

newHeads count=8 unique_blocks=8 first=47205982 last=47205989
newFlashblocks count=81 unique_blocks=9 first=47205983 last=47205991
errors=[]

Downstream impact

The node serves a latency-sensitive application that uses the official Flashblocks paths:

  • eth_subscribe ["pendingLogs", filter] for ERC20 Transfer logs.
  • eth_subscribe ["newFlashblocks"] probes.
  • eth_getBlockByNumber("pending") / BlockId::pending() through live read paths such as eth_call, eth_estimateGas, debug_traceCall, eth_getTransactionCount, and eth_getBalance.

During this incident, the application saw a split-brain view of the same Base node:

  • Ordinary block subscription / canonical state had advanced to blocks such as 47205325 and later 47205760.
  • Base live/pending reads still returned stale heights such as 47204902 and 47205334.
  • The stale 47205334 matched reth_reth_flashblocks_pending_snapshot_height before restart.

One concrete downstream failure:

  • A Base sell preparation path repeatedly failed before transaction submission, for two reasons: exit quote calldata simulation became unavailable, and the transaction actor rejected actions where current_block was far ahead of the pending-derived live_block.
  • Before restart, a manual rescue attempt found a route but was rejected with current_block=47205760, live_block=47205334.
  • After restart, the same class of rescue action passed preparation and confirmed shortly afterwards. We are omitting transaction identifiers from this public report.

This does not prove eth_sendRawTransaction itself was broken. The failure happened earlier: stale Flashblocks pending state polluted downstream quote/simulation/readiness logic while canonical latest was already healthy.
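
A minimal sketch of the kind of guard that rejected the rescue action, comparing the canonical height with the pending-derived height. The names `current_block` and `live_block` come from the application logs quoted above; the `BlockView` type and the 5-block threshold are hypothetical and not the application's actual configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockView:
    current_block: int  # height from canonical latest / newHeads
    live_block: int     # height derived from pending (Flashblocks) reads


def pending_lag(view):
    """Blocks by which the pending-derived view trails the canonical view."""
    return view.current_block - view.live_block


def is_pending_stale(view, max_lag_blocks=5):
    """True when pending state trails canonical by more than the threshold.

    max_lag_blocks is an illustrative default, not a recommended value.
    """
    return pending_lag(view) > max_lag_blocks


# Values from the rejected manual rescue attempt before restart.
pre_restart_view = BlockView(current_block=47205760, live_block=47205334)
rejected = is_pending_stale(pre_restart_view)
```

In the pre-restart case the lag is 426 blocks, so any small threshold rejects the action. A healthy pending path is normally at or slightly ahead of canonical latest, giving a lag of zero or below.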

Working hypothesis

Our current hypothesis is:

  1. A Flashblocks upstream reconnect/reorder/reorg sequence caused pending-state production to fall behind.
  2. Canonical sync and ordinary newHeads recovered, but the Flashblocks pending snapshot and newFlashblocks subscription did not catch up.
  3. Downstream consumers that actively subscribe to pendingLogs and query pending state can observe the stale Flashblocks path even when operators checking only latest / newHeads see a healthy node.
  4. Restarting the node clears the stale Flashblocks pending state.

We do not yet know whether high downstream pendingLogs/pending read load merely exposed the condition, amplified it, or is required to trigger it.

Similar public issues we found

These issues look related or adjacent:

Our incident differs from the full-stall reports because canonical latest / newHeads were healthy at the final pre-restart sampling point, while the Flashblocks path remained about 450 blocks behind.

Questions for the Base team

  1. Is it expected that newFlashblocks and reth_reth_flashblocks_pending_snapshot_height can remain hundreds of blocks behind while newHeads / latest are current?
  2. Is there a known condition where Flashblocks pending-state production stops catching up after upstream reconnect/reorg/order errors, without causing a full canonical sync stall?
  3. Are pendingLogs subscribers or high-volume pending state reads known to affect Flashblocks pending snapshot catch-up?
  4. Is there a health metric or RPC invariant we should monitor to distinguish:
    • canonical chain unhealthy,
    • Flashblocks upstream disconnected,
    • Flashblocks pending snapshot stale,
    • pendingLogs consumer lag?
  5. Are there recommended Reth flags for Flashblocks-heavy RPC nodes, especially around --engine.cross-block-cache-size or RPC cache settings?
  6. Is restart currently the expected recovery action when reth_reth_flashblocks_pending_snapshot_height remains stale while canonical latest is healthy?

Raw evidence retained locally

We retained:

  • Reth execution logs around 2026-06-11T17:28Z..18:09Z.
  • Pre-restart Docker/container/resource snapshots.
  • Pre- and post-restart Reth metrics.
  • WebSocket probe outputs for newHeads and newFlashblocks.
  • Downstream application logs showing stale pending-derived live_block values matching Reth metrics.
