Context: what just shipped

Automation run observability now works end-to-end:

- framework feat(automation): durable run history with failure reasons (run observability) #2581 (ffafb30e8): sys_automation_run became a durable run-history table. Every terminal run (completed / failed) is mirrored via SuspendedRunStore.recordTerminal; a failed run persists its error reason; AutomationEngine.listRuns merges durable history with the in-memory buffer. History rows use a run_-prefixed id, disjoint from ADR-0019 suspended rows.
- objectui (34b92accd): the Studio flow Runs panel (FlowRunsPanel) now renders a failed run's reason. It was read as run.error?.message, but the engine sends ExecutionLog.error as a string, so the reason was silently dropped.

This tracking issue collects the follow-ups that surfaced while building and dogfooding that work. They are related enough to plan together; each can ship independently.

1. Retention / lifecycle for sys_automation_run: P1 (regression risk #2581 introduced)

Problem. Before #2581, run observability was bounded: suspended rows were deleted on completion, and terminal runs lived only in the in-memory executionLogs ring buffer (maxLogSize). #2581 makes every terminal run persist forever, so the table now grows without bound, one row per flow execution. A busy tenant with per-record-change flows will accumulate rows indefinitely. (See [[data-lifecycle-adr-0057]]: the dev.db bloat root cause was exactly "platform self-telemetry with no retention contract".)

Why it matters. This is a gap the feature introduced. Unbounded growth → DB bloat, slow listHistory reads, eventual operational pain: the same failure mode ADR-0057 exists to prevent.

Approach (sketch).

- Bring sys_automation_run terminal history under ADR-0057's declarative lifecycle: a retention policy on the object (e.g. keep N days, or last K terminal runs per flow), pruned by the existing lifecycle sweep. Suspended (status:'paused') rows are not subject to retention; they are live resumable state.
- Alternatively (or additionally), a cheap cap at write time in recordTerminal (trim to the newest K per flow) as a stop-gap until the ADR-0057 sweep covers it.
- Decide default retention (proposal: 30 days or 100 terminal runs/flow, whichever first; configurable).
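
As a rough illustration of the retention rule proposed above, the sketch below selects terminal rows to prune by age or by per-flow cap, whichever limit is hit first, and never selects paused rows. The RunRow shape, its flow and finishedAt fields, the RetentionPolicy type, and selectPrunable are hypothetical names for this example only; the real sys_automation_run schema and the ADR-0057 sweep interface may differ.

```typescript
// Hypothetical row shape: only the status values and the run_ id prefix
// come from this issue; the other field names are assumptions.
interface RunRow {
  id: string; // history rows use a run_-prefixed id
  flow: string; // owning flow (assumed field name)
  status: "completed" | "failed" | "paused";
  finishedAt: number; // epoch milliseconds (assumed field name)
}

interface RetentionPolicy {
  maxAgeDays: number; // proposal in this issue: 30
  maxRunsPerFlow: number; // proposal in this issue: 100
}

const DAY_MS = 24 * 60 * 60 * 1000;

/** Returns ids of terminal rows to delete; paused rows are never selected. */
function selectPrunable(
  rows: RunRow[],
  policy: RetentionPolicy,
  now: number,
): string[] {
  const cutoff = now - policy.maxAgeDays * DAY_MS;

  // Group terminal rows by flow, skipping live resumable state.
  const byFlow = new Map<string, RunRow[]>();
  for (const row of rows) {
    if (row.status === "paused") continue;
    const list = byFlow.get(row.flow) ?? [];
    list.push(row);
    byFlow.set(row.flow, list);
  }

  const prunable: string[] = [];
  byFlow.forEach((list) => {
    list.sort((a, b) => b.finishedAt - a.finishedAt); // newest first
    list.forEach((row, index) => {
      const tooOld = row.finishedAt < cutoff;
      const overCap = index >= policy.maxRunsPerFlow;
      if (tooOld || overCap) prunable.push(row.id); // whichever limit hits first
    });
  });
  return prunable;
}
```

The same selection could back either option: run periodically from a lifecycle sweep, or run for a single flow inside recordTerminal as the write-time stop-gap.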

Files. packages/services/service-automation/src/suspended-run-store.ts (recordTerminal / a prune path), sys-automation-run.object.ts (retention declaration), ADR-0057 lifecycle machinery.

2. Make run history a first-class, discoverable surface: P2 (objectui)

Problem. Durable run history now exists, but it is hard to find:

- FlowRunsPanel only renders inside the flow inspector preview (FlowPreview); you must open a specific flow's designer to see its runs.
- The dedicated apps/console/src/pages/developer/FlowRunsPage.tsx route (/developer/flow-runs) redirects to /home in the dev console build (route gated / not mounted), so the intended "pick a flow → see runs" page is unreachable.

Why it matters. The value of #2581 is only realized if a builder can actually see "did my automation run / fail, and why". Today the answer is buried.

Approach (sketch). Add a proper Runs tab/section to the Studio Automations pillar (StudioDesignSurface.tsx): list runs across the package's flows (status, start, duration, failure reason) and drill into one. Reuse FlowRunsPanel / the engine listRuns endpoint. Reconcile with, or un-gate, the existing /developer/flow-runs page (decide which is canonical). See [[builder-ui-pillars-studio]].

Files. objectui packages/app-shell/src/views/studio-design/StudioDesignSurface.tsx, previews/FlowRunsPanel.tsx, apps/console/src/pages/developer/FlowRunsPage.tsx (+ its route registration).

3. Durable single-run detail (getRun) + step persistence: P3 (finish the #2581 story)

Problem. AutomationEngine.listRuns is durable (merges history), but getRun(runId) is still in-memory only: after a restart, clicking a failed run to see its detail returns null. Durable history rows also carry no step detail (steps: []), so even a durable getRun would lack per-node context.

Why it matters. The list already carries status + top-level error (the primary "what/why"), so this is genuinely a follow-up, not a hole. But "open a past failed run and see which node blew up" is the natural next question, and it breaks across a restart.

Approach (sketch).

- Add loadTerminal(runId) to SuspendedRunStore and have getRun fall back to it (reuse the run_-prefixed lookup).
- Persist a compact step log on terminal rows (a steps_json column, bounded) so durable detail is meaningful. Watch row size / retention interplay with item 1.
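
The two steps above might fit together roughly as follows. This is a minimal sketch, not the engine's actual code: only getRun, loadTerminal, recordTerminal, steps_json, and the run_ prefix come from this issue. The StepLog, RunDetail, and TerminalRow shapes, the MAX_STEPS bound, and the TerminalStoreSketch class are assumptions, and the store is synchronous and in-memory purely to keep the example short.

```typescript
// Hypothetical shapes, assumed for illustration.
interface StepLog {
  nodeId: string;
  status: string;
  error: { code: string; message: string } | null;
}

interface RunDetail {
  id: string;
  status: "completed" | "failed";
  error: string | null;
  steps: StepLog[];
}

interface TerminalRow {
  status: "completed" | "failed";
  error: string | null;
  steps_json: string;
}

const MAX_STEPS = 50; // assumed bound on the persisted step log

/** Keeps only the last `max` steps, since a failure usually sits near the end. */
function serializeSteps(steps: StepLog[], max: number = MAX_STEPS): string {
  return JSON.stringify(max > 0 ? steps.slice(-max) : []);
}

// Synchronous in-memory stand-in for the durable store (a real one would likely be async).
class TerminalStoreSketch {
  private rows = new Map<string, TerminalRow>();

  recordTerminal(run: RunDetail): void {
    this.rows.set(run.id, {
      status: run.status,
      error: run.error,
      steps_json: serializeSteps(run.steps),
    });
  }

  loadTerminal(runId: string): RunDetail | null {
    const row = this.rows.get(runId);
    if (!row) return null;
    const steps = JSON.parse(row.steps_json) as StepLog[];
    return { id: runId, status: row.status, error: row.error, steps };
  }
}

/** Checks the in-memory buffer first, then falls back to durable history. */
function getRun(
  buffer: Map<string, RunDetail>,
  store: TerminalStoreSketch,
  runId: string,
): RunDetail | null {
  return buffer.get(runId) ?? store.loadTerminal(runId);
}
```

Truncating to the most recent steps keeps row size predictable, which matters for the retention interplay noted above; whether the tail or the failing node alone is the right thing to keep is an open design choice.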

Files. engine.ts (getRun), suspended-run-store.ts (loadTerminal + serialize steps), sys-automation-run.object.ts (steps column), runtime/src/http-dispatcher.ts (GET /:name/runs/:runId).

Minor / optional: normalize the error shape

The run-level error is a string (ExecutionLog.error) while a step-level error is a { code, message } object. #2230 made the UI tolerant of both, but normalizing at the engine/spec layer (one shape) would remove a recurring foot-gun. Low priority; note only.
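
One possible normalization is sketched below, assuming the single shape would be the { code, message } object. The normalizeError function, the RunError type, and the "UNKNOWN" placeholder code are hypothetical; the issue does not say which shape should win or what code a bare string should get.

```typescript
type RunError = { code: string; message: string };

/** Maps a string error or a { code, message } object onto one shape. */
function normalizeError(error: unknown): RunError | null {
  if (error === null || error === undefined) return null;

  // Run-level errors arrive as plain strings.
  if (typeof error === "string") {
    return { code: "UNKNOWN", message: error }; // placeholder code (assumption)
  }

  // Step-level errors arrive as { code, message } objects.
  if (typeof error === "object") {
    const candidate = error as { code?: unknown; message?: unknown };
    if (typeof candidate.message === "string") {
      const code = typeof candidate.code === "string" ? candidate.code : "UNKNOWN";
      return { code, message: candidate.message };
    }
  }

  // Anything else is stringified rather than silently dropped.
  return { code: "UNKNOWN", message: String(error) };
}
```

With one shape at the boundary, a reader such as run.error?.message would no longer drop a string-typed error.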

Suggested sequencing

References: framework #2581, objectui #2230; ADR-0019 (suspended-run persistence), ADR-0057 (data lifecycle).