Enforce table-diff `max_diff_rows` per node pair by danolivo · Pull Request #123 · pgEdge/ace

danolivo · 2026-06-19T12:48:39Z

Problem

table-diff enforced max_diff_rows with a single task-wide counter (totalDiffRows), incremented by every differing row regardless of which node pair it belonged to and reset once per ExecuteTask. On a cluster with more than two nodes this turned the cap into one budget shared across all C(n,2) pairs: once the combined total reached the limit, every worker's stop-check fired and enumeration halted — leaving the report truncated and the reported differences split arbitrarily across pairs by whichever concurrent workers consumed the budget first.

In the reported case (n1 = 1,100,006 rows; n2 = n3 = 100,006) the true divergence is ~1,000,000 rows on each of the two n1 pairs (n2/n3 identical). With the global cap of 1,000,000, only half was enumerated in one pass — split 510599 (n1/n3) + 489401 (n1/n2) = exactly the cap — so achieving consistency required multiple sequential diff+repair cycles. A 2-node cluster was never affected because "total" and "the pair" were the same thing; the latent assumption only surfaced for N > 2.

Fix

Make the limit per node pair:

Replace totalDiffRows atomic.Int64 with pairDiffRows sync.Map (pairKey -> *atomic.Int64).
Route every limit check and increment through shouldStopPair(pairKey) / incrementPairDiffRowsLocked(pairKey, …); add a pairKeyFor helper that centralises the canonical (lexically ordered) pair key previously computed inline.
The pre-diff initial-hash loop now uses only the error circuit-breaker (no rows are counted there, so the row limit was always dead at that point).
Remove the now-unused totalDiffRows, shouldStop, shouldStopDueToLimit, incrementDiffRowsLocked.

Each pair now enumerates up to max_diff_rows independently; a pair under the cap reports all of its differences instead of being starved by another pair's divergence. The cap is best-effort (a pair may report a handful of rows beyond the limit under concurrency — same as the prior global counter), documented on shouldStopPair.

Behaviour change (operator-visible)

On N-node clusters the worst-case report size is now max_diff_rows × number_of_pairs rather than max_diff_rows total. A genuinely huge divergence therefore produces a larger report instead of silently truncating; such a node should be resynced rather than repaired row-by-row. Docs update (per-pair semantics + report-size note) is tracked separately and not in this PR.

Zaid re-reported ACE-189 on a 3-node cluster: with n3 holding the cluster max, every diff pair involving n3 was short by exactly one row (its largest), while n1/n2 was correct. Reproduced his exact scenario faithfully against the current branch — it now produces the correct 5/10/5 counts (id=2010 included), confirming bb94210/73c3ccd/f9a3961 already cover it. His earlier observation must have predated the fix in his checkout. These two guards lock that coverage in; both exercise a MULTI-LEAF tree (row counts above the cluster min block size of 1000), the case the existing single-leaf bidirectional tests never reached — the boundary where the last closed leaf [B, max] meets the open tail [max, NULL): * TestMerkleTreeReferenceMaxInMultiLeafTail — minimal 2-node form: the reference (n2, more rows) holds the cluster max; asserts the full {2001..2010} diff including the tail row. * TestMerkleTreeThreeNodeReferenceTail — faithful 3-node reproduction. The shared test_cluster.json only registers n1/n2, so this emits its own test_cluster_3node.json via the new writeClusterConfigJSON helper and points the task at it, leaving the rest of the suite (and its Nodes="all" behaviour) untouched. Asserts n1/n3=10, n2/n3=5, n1/n2=5 — Zaid's exact shape. Full mtree suite green with both added. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

table-diff tracked max_diff_rows with a single task-wide counter (totalDiffRows), incremented by every differing row regardless of which node pair it belonged to and reset once per ExecuteTask. On a cluster with more than two nodes this turned the cap into one budget shared across all C(n,2) pairs: once the combined total reached the limit, every worker's stop-check fired and enumeration halted, leaving the report truncated and the reported differences split arbitrarily across pairs by whichever concurrent workers consumed the budget first. In the reported case (n1=1.1M rows, n2=n3=100k) the true divergence is ~1M rows on each of the two n1 pairs (n2/n3 identical). With the global cap of 1,000,000 only half was enumerated in one pass, split 510599 (n1/n3) + 489401 (n1/n2) = exactly the cap — requiring multiple diff+repair cycles to converge. A single pair (2-node cluster) was unaffected because "total" and "the pair" were the same thing; the latent assumption only surfaced for N>2. Make the limit per node pair: replace totalDiffRows with a sync.Map of pairKey -> *atomic.Int64 and route every limit check and increment through shouldStopPair(pairKey) / incrementPairDiffRowsLocked. The pre-diff initial-hash loop now uses only the error circuit-breaker (no rows are counted there, so the limit was always dead at that point). Each pair now enumerates up to max_diff_rows independently; a pair under the cap reports all its differences instead of being starved by another pair's divergence. Behaviour change worth noting: on N-node clusters the worst-case report size is now max_diff_rows x number_of_pairs rather than max_diff_rows total. A genuinely huge divergence therefore produces a larger report rather than silently truncating — the operator should resync such a node rather than repair row-by-row. (Docs follow separately.) Tests: * TestTableDiffMaxDiffRowsPerPair (new, 3-node) — n1 holds 1..20; n2/n3 each hold the same {1..5, 16..20} (10 middle rows missing, avoiding the open-tail boundary confound). With max_diff_rows=14 (> per-pair 10, < cross-pair total 20): before this fix the run stops at 14 total (n1/n3=10, n1/n2=4) and sets DiffRowLimitReached; after, both pairs report all 10 and the limit is not tripped. Uses its own test_cluster_3node.json via the new writeThreeNodeClusterConfig helper, leaving the shared 2-node cluster config untouched. * Existing 2-node MaxDiffRowsLimit tests (simple + composite PK) still pass — the per-pair cap collapses to the old behaviour for one pair. go build / go vet clean; table-diff simple/composite/until suites green. Author: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-06-19T12:48:54Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a12552d6-8e13-41bf-b57f-b7b247ede1c9

📥 Commits

Reviewing files that changed from the base of the PR and between df5b29a and 187be93.

📒 Files selected for processing (3)

internal/consistency/diff/table_diff.go
tests/integration/merkle_tree_test.go
tests/integration/table_diff_test.go

📝 Walkthrough

Walkthrough

TableDiffTask replaces its single global diff-row counter with a sync.Map tracking per-canonical-node-pair row counts. New helpers (pairKeyFor, pairCounter, shouldStopPair, incrementPairDiffRowsLocked) enforce a MaxDiffRows budget independently per pair throughout recursiveDiff. Integration tests cover the per-pair limiting behavior and a merkle-tree multi-leaf tail bug (ACE-189).

Changes

Per-pair diff row budget

Layer / File(s)	Summary
Per-pair budget field and helpers `internal/consistency/diff/table_diff.go`	Adds `pairDiffRows sync.Map` to `TableDiffTask` and introduces `pairKeyFor`, `pairCounter`, `shouldStopPair`, and `incrementPairDiffRowsLocked` for canonical-key-based per-pair budgeting with CAS-triggered `diffLimitTriggered`.
ExecuteTask and hash-worker wiring `internal/consistency/diff/table_diff.go`	Clears `pairDiffRows` at task start instead of resetting the old global counter; changes the hash-worker stop condition to `hasError()` only; adds a `shouldStopPair` guard before launching mismatch analysis for a pair.
Per-pair stop checks and row counts in `recursiveDiff` `internal/consistency/diff/table_diff.go`	Replaces every global `shouldStop`/`incrementDiffRowsLocked` call with per-pair equivalents at `recursiveDiff` entry, within `Node1OnlyRows`/`Node2OnlyRows`/`ModifiedRows` appending, at the continuation condition, and before sub-range recursion and mismatch fan-out. Removes the old local pair-key derivation in favor of `pairKeyFor`.
Integration regression test for per-pair row limiting `tests/integration/table_diff_test.go`	Adds `TestTableDiffMaxDiffRowsPerPair` on a 3-node cluster with controlled divergence (10 diffs per pair, `MaxDiffRows=14`), asserting `DiffRowLimitReached` is false and per-pair counts are exactly 10 for (N1,N2) and (N1,N3) and 0 for (N2,N3).
Merkle tree multi-leaf tail bug tests (ACE-189) `tests/integration/merkle_tree_test.go`	Adds `TestMerkleTreeReferenceMaxInMultiLeafTail` (2-node) and `TestMerkleTreeThreeNodeReferenceTail` (3-node) with JSON cluster-config helpers, asserting the reference node's max row in an open-ended tail leaf is not dropped in multi-leaf merkle diffs.

Poem

🐇 Hop hop, no more shared cap to share,
Each node-pair now gets its own fair snare!
The sync.Map keeps counters neat and true,
No pair can starve the others — who knew?
ACE-189's tail leaf stands tall,
Per-pair budgets answer the call. 🌿

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately and clearly summarizes the main change: enforcing the table-diff max_diff_rows limit on a per-node-pair basis instead of globally, which is the central objective of this PR.
Description check	✅ Passed	The description comprehensively explains the problem, the fix, and behavioral changes, directly relating to all modifications across the three changed files.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch ace-191

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codacy-production · 2026-06-19T12:54:19Z

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 0 duplication

Metric Results

Duplication 0

View in Codacy

_{NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer}
_{TIP This summary will be updated as you push new changes.}

Andrei Lepikhov and others added 2 commits June 15, 2026 15:34

danolivo added the enhancement New feature or request label Jun 19, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enforce table-diff `max_diff_rows` per node pair#123

Enforce table-diff `max_diff_rows` per node pair#123
danolivo wants to merge 2 commits into
mainfrom
ace-191

danolivo commented Jun 19, 2026

Uh oh!

coderabbitai Bot commented Jun 19, 2026 •

edited

Loading

Walkthrough

Changes

Poem

❌ Failed checks (1 warning)

Uh oh!

codacy-production Bot commented Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

danolivo commented Jun 19, 2026

Problem

Fix

Behaviour change (operator-visible)

Uh oh!

coderabbitai Bot commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Poem

❌ Failed checks (1 warning)

Uh oh!

codacy-production Bot commented Jun 19, 2026

Up to standards ✅

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented Jun 19, 2026 •

edited

Loading