surface real exception_log cause#511
Conversation
Backport the exception_log message-quality work from main, adapted to v5_STABLE. The files have diverged enough that these do not cherry-pick cleanly, so the logic was reapplied against the current v5_STABLE code: - ed17523 "exception_log: replace 'unavailable' with informative discard message" -- collateral rows (those not themselves the failing command) now carry a message naming the active behaviour and the command that failed, instead of the opaque placeholder "unavailable". - b32ef95 "spock_apply: surface root cause when a discard is not attributable to a row" -- when failed_action is 0 (e.g. a COMMIT-time failure) the captured root cause is surfaced instead of a dangling command_counter pointer. - 1f5738c "fix: clean up exception_log error-propagation code" -- folded in for the error-propagation fix the two above build on. Adapting was non-trivial: v5_STABLE's log_insert_exception() dropped the message via "(failed) ? errmsg : NULL", so the matching row's initial_error_message never reached the log. That is corrected here so the real message actually lands. On top of the backport: - Tag every captured/surfaced message with its SQLSTATE ([SQLSTATE xxxxx] ...), in exception_log and the "caught initial exception" log line. exception_log has no sqlstate column, so carrying it inline is the cheapest way to make the root cause unambiguous (e.g. to tell a real constraint violation from a transient conflict). - Allocate the formatted messages in ApplyOperationContext (reset per row) rather than the long-lived TopTransactionContext current on the exception path, so logging every discarded row of a large transaction does not accumulate. Diagnostic change only: discard/disable/LSN behaviour is unchanged. Connection-class errors still rethrow before this path (b75cb4f) and never reach exception_log. Test: 013_exception_handling.pl Part 5 updated to the new contract -- no "unavailable"; failing row carries its SQLSTATE; bystander rows carry the informative discard message.
Reproduces a firewall-style replication outage: establish replication, block the provider's TCP port with iptables, commit a transaction during the outage (driven over the unix socket, which iptables does not touch), then restore the connection. Asserts the customer-relevant outcomes: rows committed during the outage do not reach the subscriber while blocked; in SUB_DISABLE mode the outage does not disable the subscription; after the block is lifted the subscription returns to replicating with no row loss; and the blip produces no spurious spock.exception_log entries. Whether iptables tears the connection down or merely stalls it is host dependent (over loopback it typically stalls), so the test reports which occurred as diag rather than asserting a specific teardown. Requires iptables usable as root or via passwordless sudo; skips cleanly otherwise. Not in the schedule -- run manually: PERL5LIB=t prove -v t/104_iptables_conn_block.pl
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Complexity | 0 |
| Duplication | 0 |
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
The SQLSTATE prefix added to captured/surfaced exception_log messages is valuable for real error codes (constraint violations, deadlocks, connection failures) but is just noise for a bare elog(ERROR), which defaults to ERRCODE_INTERNAL_ERROR (XX000) -- e.g. spock's own "logical replication did not find row to be updated". errmsg_with_sqlstate() now omits the prefix when the code is XX000 and keeps it otherwise. The "caught initial exception" log line and the initial_error_message capture are routed through the same helper so the rule is applied consistently. Updates the replication_set regress expected output: the not-found-row message no longer carries the XX000 prefix; the collateral discard rows keep their informative "discarded due to exception at command_counter N" text.
| Assert(error_message != NULL); | ||
| values[Anum_exception_log_error_message - 1] = |
There was a problem hiding this comment.
The Assert() here is not needed and should be removed.
spock_apply: surface real exception_log cause, not "unavailable"
Backport the exception_log message-quality work from main, adapted to
v5_STABLE. The files have diverged enough that these do not cherry-pick
cleanly, so the logic was reapplied against the current v5_STABLE code:
message" -- collateral rows (those not themselves the failing command)
now carry a message naming the active behaviour and the command that
failed, instead of the opaque placeholder "unavailable".
attributable to a row" -- when failed_action is 0 (e.g. a COMMIT-time
failure) the captured root cause is surfaced instead of a dangling
command_counter pointer.
in for the error-propagation fix the two above build on.
Adapting was non-trivial: v5_STABLE's log_insert_exception() dropped the
message via "(failed) ? errmsg : NULL", so the matching row's
initial_error_message never reached the log. That is corrected here so the
real message actually lands.
On top of the backport:
([SQLSTATE xxxxx] ...), in exception_log and the "caught initial
exception" log line. exception_log has no sqlstate column, so carrying
it inline is the cheapest way to make the root cause unambiguous (e.g.
to tell a real constraint violation from a transient conflict).
row) rather than the long-lived TopTransactionContext current on the
exception path, so logging every discarded row of a large transaction
does not accumulate.
Diagnostic change only: discard/disable/LSN behaviour is unchanged.
Connection-class errors still rethrow before this path (b75cb4f) and never
reach exception_log.
Test: 013_exception_handling.pl Part 5 updated to the new contract -- no
"unavailable"; failing row carries its SQLSTATE; bystander rows carry the
informative discard message.