fix: schedule task retries on transient database connection errors instead of failing immediately #130
Merged
dengyh merged 1 commit on Jun 24, 2026
When the database endpoint is briefly unreachable (e.g. MySQL connect timeout, errno 110), `Schedule.objects.get` raised an OperationalError that was swallowed by a broad `except` and immediately marked the schedule as FAIL, killing in-flight plugin polls on a momentary DB blip. Transient DB connection errors (OperationalError / InterfaceError) now trigger a Celery retry with exponential backoff, and only fall back to FAIL once retries are exhausted. Non-connection errors keep the original fail-fast behavior. Add unit tests for both the retry and give-up paths. Co-authored-by: Cursor <cursoragent@cursor.com>
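The control flow described above can be sketched in plain Python. This is a minimal illustration, not the code from this PR: the exception classes are stand-ins for Django's `OperationalError` / `InterfaceError` / `DoesNotExist`, `Retry` stands in for the exception Celery raises from `self.retry(...)`, and `run_schedule`, `backoff_delay`, `MAX_RETRIES` and `BASE_DELAY_SECONDS` are hypothetical names and values chosen for the example.

```python
# Stand-ins so the sketch runs without Django or Celery installed.
class OperationalError(Exception):
    """Plays the role of django.db.OperationalError."""


class InterfaceError(Exception):
    """Plays the role of django.db.InterfaceError."""


class DoesNotExist(Exception):
    """Plays the role of Schedule.DoesNotExist."""


class Retry(Exception):
    """Plays the role of the exception Celery raises to reschedule a task."""

    def __init__(self, countdown):
        super().__init__(f"retry in {countdown}s")
        self.countdown = countdown


TRANSIENT_DB_ERRORS = (OperationalError, InterfaceError)
MAX_RETRIES = 3          # assumed value, not taken from the PR
BASE_DELAY_SECONDS = 2   # assumed value, not taken from the PR


def backoff_delay(retries_so_far):
    """Exponential backoff: 2s, 4s, 8s, ... for retries 0, 1, 2, ..."""
    return BASE_DELAY_SECONDS * (2 ** retries_so_far)


def run_schedule(load_schedule, mark_failed, retries_so_far):
    """Load the schedule; retry on transient DB errors, fail fast otherwise.

    load_schedule  -- callable standing in for Schedule.objects.get(trace_id=...)
    mark_failed    -- callable standing in for setting the schedule to FAIL
    retries_so_far -- what Celery exposes as self.request.retries
    """
    try:
        return load_schedule()
    except TRANSIENT_DB_ERRORS as exc:
        if retries_so_far < MAX_RETRIES:
            # In a bound Celery task this would be:
            #   raise self.retry(exc=exc, countdown=backoff_delay(...))
            raise Retry(backoff_delay(retries_so_far)) from exc
        # Retries exhausted: fall back to FAIL.
        mark_failed(exc)
    except Exception as exc:
        # Non-connection errors keep the fail-fast behavior.
        mark_failed(exc)
    return None
```

Catching the transient classes before the broad `except Exception` is what keeps the two paths separate: a `Retry` raised inside the first handler is not intercepted by the second, so it propagates to the worker, while every other error still marks the schedule as failed on the first attempt.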
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## master #130 +/- ##
=========================================
Coverage ? 91.23%
=========================================
Files ? 38
Lines ? 1369
Branches ? 0
=========================================
Hits ? 1249
Misses ? 120
Partials ? 0
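Background / Problem
At the entry point of the `schedule` Celery task, `Schedule.objects.get(trace_id=...)` raises an `OperationalError` when the database endpoint is briefly unreachable (for example a MySQL connect timeout, errno 110: `(2003, "Can't connect to MySQL server ... (110)")`). The current implementation catches it with a broad `except Exception` and immediately sets the schedule to FAIL, so a momentary database blip causes plugins that are still being polled to be misjudged as failed.
In addition, `_set_schedule_state` is itself a database write. If the DB has not recovered, that write fails as well (it is only logged), so the schedule is neither correctly marked as failed nor polled again, and it may get stuck.
Changes
The task now distinguishes transient DB connection errors (`OperationalError`/`InterfaceError`) from other errors:
Transient connection errors: trigger a Celery retry with exponential backoff, falling back to FAIL only once retries are exhausted.
Other errors (`DoesNotExist`, etc.): keep the original fail-fast behavior unchanged.
Tests
All 6 cases in `tests/runtime/schedule/celery/test_tasks.py` pass (4 existing + 2 new).
Impact / Compatibility
The change touches the exception handling at the `schedule` task entry point; the normal path and externally visible behavior are unchanged.
Made with Cursor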