Skip to content

surface real exception_log cause#511

Open
mason-sharp wants to merge 3 commits into
v5_STABLEfrom
v5_exception_msg
Open

surface real exception_log cause#511
mason-sharp wants to merge 3 commits into
v5_STABLEfrom
v5_exception_msg

Conversation

@mason-sharp

Copy link
Copy Markdown
Member

spock_apply: surface real exception_log cause, not "unavailable"
Backport the exception_log message-quality work from main, adapted to
v5_STABLE. The files have diverged enough that these do not cherry-pick
cleanly, so the logic was reapplied against the current v5_STABLE code:

  • ed17523 "exception_log: replace 'unavailable' with informative discard
    message" -- collateral rows (those not themselves the failing command)
    now carry a message naming the active behaviour and the command that
    failed, instead of the opaque placeholder "unavailable".
  • b32ef95 "spock_apply: surface root cause when a discard is not
    attributable to a row" -- when failed_action is 0 (e.g. a COMMIT-time
    failure) the captured root cause is surfaced instead of a dangling
    command_counter pointer.
  • 1f5738c "fix: clean up exception_log error-propagation code" -- folded
    in for the error-propagation fix the two above build on.

Adapting was non-trivial: v5_STABLE's log_insert_exception() dropped the
message via "(failed) ? errmsg : NULL", so the matching row's
initial_error_message never reached the log. That is corrected here so the
real message actually lands.

On top of the backport:

  • Tag every captured/surfaced message with its SQLSTATE
    ([SQLSTATE xxxxx] ...), in exception_log and the "caught initial
    exception" log line. exception_log has no sqlstate column, so carrying
    it inline is the cheapest way to make the root cause unambiguous (e.g.
    to tell a real constraint violation from a transient conflict).
  • Allocate the formatted messages in ApplyOperationContext (reset per
    row) rather than the long-lived TopTransactionContext current on the
    exception path, so logging every discarded row of a large transaction
    does not accumulate.

Diagnostic change only: discard/disable/LSN behaviour is unchanged.
Connection-class errors still rethrow before this path (b75cb4f) and never
reach exception_log.

Test: 013_exception_handling.pl Part 5 updated to the new contract -- no
"unavailable"; failing row carries its SQLSTATE; bystander rows carry the
informative discard message.

Backport the exception_log message-quality work from main, adapted to
v5_STABLE.  The files have diverged enough that these do not cherry-pick
cleanly, so the logic was reapplied against the current v5_STABLE code:

  - ed17523 "exception_log: replace 'unavailable' with informative discard
    message" -- collateral rows (those not themselves the failing command)
    now carry a message naming the active behaviour and the command that
    failed, instead of the opaque placeholder "unavailable".
  - b32ef95 "spock_apply: surface root cause when a discard is not
    attributable to a row" -- when failed_action is 0 (e.g. a COMMIT-time
    failure) the captured root cause is surfaced instead of a dangling
    command_counter pointer.
  - 1f5738c "fix: clean up exception_log error-propagation code" -- folded
    in for the error-propagation fix the two above build on.

Adapting was non-trivial: v5_STABLE's log_insert_exception() dropped the
message via "(failed) ? errmsg : NULL", so the matching row's
initial_error_message never reached the log.  That is corrected here so the
real message actually lands.

On top of the backport:

  - Tag every captured/surfaced message with its SQLSTATE
    ([SQLSTATE xxxxx] ...), in exception_log and the "caught initial
    exception" log line.  exception_log has no sqlstate column, so carrying
    it inline is the cheapest way to make the root cause unambiguous (e.g.
    to tell a real constraint violation from a transient conflict).
  - Allocate the formatted messages in ApplyOperationContext (reset per
    row) rather than the long-lived TopTransactionContext current on the
    exception path, so logging every discarded row of a large transaction
    does not accumulate.

Diagnostic change only: discard/disable/LSN behaviour is unchanged.
Connection-class errors still rethrow before this path (b75cb4f) and never
reach exception_log.

Test: 013_exception_handling.pl Part 5 updated to the new contract -- no
"unavailable"; failing row carries its SQLSTATE; bystander rows carry the
informative discard message.
Reproduces a firewall-style replication outage: establish replication,
block the provider's TCP port with iptables, commit a transaction during
the outage (driven over the unix socket, which iptables does not touch),
then restore the connection.

Asserts the customer-relevant outcomes: rows committed during the outage
do not reach the subscriber while blocked; in SUB_DISABLE mode the outage
does not disable the subscription; after the block is lifted the
subscription returns to replicating with no row loss; and the blip produces
no spurious spock.exception_log entries.

Whether iptables tears the connection down or merely stalls it is host
dependent (over loopback it typically stalls), so the test reports which
occurred as diag rather than asserting a specific teardown.

Requires iptables usable as root or via passwordless sudo; skips cleanly
otherwise.  Not in the schedule -- run manually:
  PERL5LIB=t prove -v t/104_iptables_conn_block.pl
@mason-sharp mason-sharp requested a review from rasifr June 19, 2026 01:56
@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Copy link
Copy Markdown

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8ea6c273-3251-4b50-a880-f37e963f5462

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch v5_exception_msg

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@codacy-production

codacy-production Bot commented Jun 19, 2026

Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 0 complexity · 0 duplication

Metric Results
Complexity 0
Duplication 0

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

The SQLSTATE prefix added to captured/surfaced exception_log messages is
valuable for real error codes (constraint violations, deadlocks, connection
failures) but is just noise for a bare elog(ERROR), which defaults to
ERRCODE_INTERNAL_ERROR (XX000) -- e.g. spock's own "logical replication did
not find row to be updated".

errmsg_with_sqlstate() now omits the prefix when the code is XX000 and keeps
it otherwise.  The "caught initial exception" log line and the
initial_error_message capture are routed through the same helper so the rule
is applied consistently.

Updates the replication_set regress expected output: the not-found-row
message no longer carries the XX000 prefix; the collateral discard rows keep
their informative "discarded due to exception at command_counter N" text.
Comment on lines +195 to +196
Assert(error_message != NULL);
values[Anum_exception_log_error_message - 1] =

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Assert() here is not needed and should be removed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants