fix: retry the schedule task on transient database connection errors instead of failing immediately #130

Merged
dengyh merged 1 commit into TencentBlueKing:master from dengyh:fix/schedule-retry-on-transient-db-error
Jun 24, 2026
@dengyh dengyh commented Jun 24, 2026

Collaborator

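Background / Problem

At the entry point of the schedule Celery task, Schedule.objects.get(trace_id=...) raises an OperationalError when the database endpoint is briefly unreachable (for example a MySQL connection timeout, errno 110: (2003, "Can't connect to MySQL server ... (110)")). The current implementation catches it with a broad except Exception and immediately sets the schedule to FAIL, so a plugin that is still being polled is wrongly marked as failed over a momentary database blip.

In addition, _set_schedule_state is itself a database write. If the DB has not recovered, that write fails as well (and is only logged), so the schedule is neither correctly marked as failed nor polled again, and it may get stuck.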

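Changes

  • Distinguish transient connection errors (OperationalError / InterfaceError) from other errors:
    • Transient connection errors: retry the current poll through Celery (exponential backoff, at most 6 retries: 5s→10s→…→60s). The error occurs on the first read at the task entry point, before any side effects, so retrying is safe.
    • After retries are exhausted: only then fall back to setting FAIL.
    • Other errors (such as DoesNotExist): keep the original fail-fast behavior unchanged.
  • Add unit tests for both the "retry" path and the "retries exhausted, fall back to FAIL" path.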
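
The decision rule described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the exception classes are local stand-ins for `django.db.OperationalError` / `django.db.InterfaceError` so the snippet runs without Django, the names `backoff_delay` and `decide` are hypothetical, and the doubling schedule capped at 60s is an assumption about what "5s→10s→…→60s" expands to.

```python
class OperationalError(Exception):
    """Local stand-in for django.db.OperationalError (assumption, for a self-contained sketch)."""


class InterfaceError(Exception):
    """Local stand-in for django.db.InterfaceError (assumption, for a self-contained sketch)."""


# Errors treated as transient connection problems, per the PR description.
TRANSIENT_DB_ERRORS = (OperationalError, InterfaceError)

MAX_RETRIES = 6   # "at most 6 retries" from the PR description
BASE_DELAY = 5    # first delay in seconds
MAX_DELAY = 60    # cap in seconds


def backoff_delay(retries: int) -> int:
    """Delay before the next attempt: doubles each time, capped at MAX_DELAY.

    Assumed schedule; the PR only states the endpoints 5s and 60s.
    """
    return min(BASE_DELAY * 2 ** retries, MAX_DELAY)


def decide(exc: Exception, retries: int):
    """Return ("retry", delay_seconds) or ("fail", None) for an error at task entry.

    `retries` is the number of retries already performed.
    """
    if isinstance(exc, TRANSIENT_DB_ERRORS) and retries < MAX_RETRIES:
        return ("retry", backoff_delay(retries))
    # Non-connection errors fail fast; transient errors fail once retries run out.
    return ("fail", None)


if __name__ == "__main__":
    for attempt in range(MAX_RETRIES + 1):
        print(attempt, decide(OperationalError("connect timeout"), attempt))
```

In a bound Celery task, the "retry" outcome would typically map onto `self.retry(exc=exc, countdown=delay, max_retries=MAX_RETRIES)` with `self.request.retries` as the counter, though the exact wiring in tasks.py may differ.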

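Tests

All 6 test cases in tests/runtime/schedule/celery/test_tasks.py pass (4 existing + 2 new).

Impact / Compatibility

  • Only the exception handling at the schedule task entry point is changed; the normal path and externally visible behavior are unchanged.
  • Retrying depends on the broker (RabbitMQ); during a DB blip the broker is usually healthy, so the task can be re-queued normally.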

Made with Cursor

When the database endpoint is briefly unreachable (e.g. MySQL connect
timeout, errno 110), `Schedule.objects.get` raised an OperationalError
that was swallowed by a broad `except` and immediately marked the
schedule as FAIL, killing in-flight plugin polls on a momentary DB blip.

Transient DB connection errors (OperationalError / InterfaceError) now
trigger a Celery retry with exponential backoff, and only fall back to
FAIL once retries are exhausted. Non-connection errors keep the original
fail-fast behavior. Add unit tests for both the retry and give-up paths.

Co-authored-by: Cursor <cursoragent@cursor.com>
@codecov-commenter


Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (master@02494cd). Learn more about missing BASE report.

Additional details and impacted files

@@            Coverage Diff            @@
##             master     #130   +/-   ##
=========================================
  Coverage          ?   91.23%           
=========================================
  Files             ?       38           
  Lines             ?     1369           
  Branches          ?        0           
=========================================
  Hits              ?     1249           
  Misses            ?      120           
  Partials          ?        0           
Files with missing lines Coverage Δ
..._plugin_framework/runtime/schedule/celery/tasks.py 89.58% <100.00%> (ø)
...mework/tests/runtime/schedule/celery/test_tasks.py 100.00% <100.00%> (ø)

Continue to review full report in Codecov by Harness.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 02494cd...c548b6c. Read the comment docs.


@dengyh dengyh merged commit b1eb2d0 into TencentBlueKing:master Jun 24, 2026
5 checks passed