Conversation
Zaid re-reported ACE-189 on a 3-node cluster: with n3 holding the cluster max, every diff pair involving n3 was short by exactly one row (its largest), while n1/n2 was correct. Reproduced his exact scenario faithfully against the current branch — it now produces the correct 5/10/5 counts (id=2010 included), confirming bb94210/73c3ccd/f9a3961 already cover it. His earlier observation must have predated the fix in his checkout. These two guards lock that coverage in; both exercise a MULTI-LEAF tree (row counts above the cluster min block size of 1000), the case the existing single-leaf bidirectional tests never reached — the boundary where the last closed leaf [B, max] meets the open tail [max, NULL): * TestMerkleTreeReferenceMaxInMultiLeafTail — minimal 2-node form: the reference (n2, more rows) holds the cluster max; asserts the full {2001..2010} diff including the tail row. * TestMerkleTreeThreeNodeReferenceTail — faithful 3-node reproduction. The shared test_cluster.json only registers n1/n2, so this emits its own test_cluster_3node.json via the new writeClusterConfigJSON helper and points the task at it, leaving the rest of the suite (and its Nodes="all" behaviour) untouched. Asserts n1/n3=10, n2/n3=5, n1/n2=5 — Zaid's exact shape. Full mtree suite green with both added. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
table-diff tracked max_diff_rows with a single task-wide counter
(totalDiffRows), incremented by every differing row regardless of which
node pair it belonged to and reset once per ExecuteTask. On a cluster
with more than two nodes this turned the cap into one budget shared
across all C(n,2) pairs: once the combined total reached the limit,
every worker's stop-check fired and enumeration halted, leaving the
report truncated and the reported differences split arbitrarily across
pairs by whichever concurrent workers consumed the budget first.
In the reported case (n1=1.1M rows, n2=n3=100k) the true divergence is
~1M rows on each of the two n1 pairs (n2/n3 identical). With the global
cap of 1,000,000 only half was enumerated in one pass, split 510599
(n1/n3) + 489401 (n1/n2) = exactly the cap — requiring multiple
diff+repair cycles to converge. A single pair (2-node cluster) was
unaffected because "total" and "the pair" were the same thing; the
latent assumption only surfaced for N>2.
Make the limit per node pair: replace totalDiffRows with a
sync.Map of pairKey -> *atomic.Int64 and route every limit check and
increment through shouldStopPair(pairKey) / incrementPairDiffRowsLocked.
The pre-diff initial-hash loop now uses only the error circuit-breaker
(no rows are counted there, so the limit was always dead at that point).
Each pair now enumerates up to max_diff_rows independently; a pair under
the cap reports all its differences instead of being starved by another
pair's divergence.
Behaviour change worth noting: on N-node clusters the worst-case report
size is now max_diff_rows x number_of_pairs rather than max_diff_rows
total. A genuinely huge divergence therefore produces a larger report
rather than silently truncating — the operator should resync such a node
rather than repair row-by-row. (Docs follow separately.)
Tests:
* TestTableDiffMaxDiffRowsPerPair (new, 3-node) — n1 holds 1..20; n2/n3
each hold the same {1..5, 16..20} (10 middle rows missing, avoiding the
open-tail boundary confound). With max_diff_rows=14 (> per-pair 10,
< cross-pair total 20): before this fix the run stops at 14 total
(n1/n3=10, n1/n2=4) and sets DiffRowLimitReached; after, both pairs
report all 10 and the limit is not tripped. Uses its own
test_cluster_3node.json via the new writeThreeNodeClusterConfig helper,
leaving the shared 2-node cluster config untouched.
* Existing 2-node MaxDiffRowsLimit tests (simple + composite PK) still
pass — the per-pair cap collapses to the old behaviour for one pair.
go build / go vet clean; table-diff simple/composite/until suites green.
Author: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthrough
ChangesPer-pair diff row budget
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Duplication | 0 |
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
Problem
table-diffenforcedmax_diff_rowswith a single task-wide counter (totalDiffRows), incremented by every differing row regardless of which node pair it belonged to and reset once perExecuteTask. On a cluster with more than two nodes this turned the cap into one budget shared across all C(n,2) pairs: once the combined total reached the limit, every worker's stop-check fired and enumeration halted — leaving the report truncated and the reported differences split arbitrarily across pairs by whichever concurrent workers consumed the budget first.In the reported case (n1 = 1,100,006 rows; n2 = n3 = 100,006) the true divergence is ~1,000,000 rows on each of the two n1 pairs (n2/n3 identical). With the global cap of 1,000,000, only half was enumerated in one pass — split
510599 (n1/n3) + 489401 (n1/n2) = exactly the cap— so achieving consistency required multiple sequential diff+repair cycles. A 2-node cluster was never affected because "total" and "the pair" were the same thing; the latent assumption only surfaced for N > 2.Fix
Make the limit per node pair:
totalDiffRows atomic.Int64withpairDiffRows sync.Map(pairKey -> *atomic.Int64).shouldStopPair(pairKey)/incrementPairDiffRowsLocked(pairKey, …); add apairKeyForhelper that centralises the canonical (lexically ordered) pair key previously computed inline.totalDiffRows,shouldStop,shouldStopDueToLimit,incrementDiffRowsLocked.Each pair now enumerates up to
max_diff_rowsindependently; a pair under the cap reports all of its differences instead of being starved by another pair's divergence. The cap is best-effort (a pair may report a handful of rows beyond the limit under concurrency — same as the prior global counter), documented onshouldStopPair.Behaviour change (operator-visible)
On N-node clusters the worst-case report size is now
max_diff_rows × number_of_pairsrather thanmax_diff_rowstotal. A genuinely huge divergence therefore produces a larger report instead of silently truncating; such a node should be resynced rather than repaired row-by-row. Docs update (per-pair semantics + report-size note) is tracked separately and not in this PR.