Automation run observability — follow-ups (retention, discoverable Runs surface, durable run detail) #2585

@os-zhuang

Context — what just shipped

Automation run observability now works end-to-end.

This tracking issue collects the follow-ups that surfaced while building + dogfooding that work. They are related enough to plan together; each can ship independently.


1. Retention / lifecycle for sys_automation_run - P1 (regression risk #2581 introduced)

Problem. Before #2581, run observability was bounded: suspended rows were deleted on completion, and terminal runs lived only in the in-memory executionLogs ring buffer (maxLogSize). #2581 makes every terminal run persist forever, so the table now grows without bound, one row per flow execution. A busy tenant with per-record-change flows will accumulate rows indefinitely. (See [[data-lifecycle-adr-0057]]: the root cause of the dev.db bloat was exactly "platform self-telemetry with no retention contract".)

Why it matters. This is a gap the feature introduced. Unbounded growth → DB bloat, slow listHistory reads, eventual operational pain — the same failure mode ADR-0057 exists to prevent.

Approach (sketch).

  • Bring sys_automation_run terminal history under ADR-0057's declarative lifecycle: a retention policy on the object (e.g. keep N days, or last K terminal runs per flow), pruned by the existing lifecycle sweep. Suspended (status:'paused') rows are not subject to retention — they are live resumable state.
  • Alternatively (or additionally) a cheap cap at write time in recordTerminal (trim to the newest K per flow) as a stop-gap until the ADR-0057 sweep covers it.
  • Decide default retention (proposal: 30 days or 100 terminal runs/flow, whichever comes first; configurable).
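
The retention rule proposed above can be sketched as a pure selection function that a lifecycle sweep (or a write-time trim in recordTerminal) could call. This is a minimal sketch under assumptions: `RunRow`, `RetentionPolicy`, `selectPrunable`, and the field names are hypothetical and do not reflect the actual sys_automation_run schema or the ADR-0057 API.

```typescript
// Hypothetical row shape; the real sys_automation_run columns may differ.
type RunStatus = "paused" | "completed" | "failed";

interface RunRow {
  id: string;
  flowName: string;
  status: RunStatus;
  finishedAt: number | null; // epoch ms; null while the run is paused
}

// Hypothetical policy shape mirroring the proposal (30 days / 100 per flow).
interface RetentionPolicy {
  maxAgeDays: number;
  maxPerFlow: number;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Returns the ids of terminal rows that fall outside the policy.
// A row is prunable when it is older than maxAgeDays OR ranks beyond the
// newest maxPerFlow terminal rows of its flow ("whichever comes first").
// Paused rows are never returned: they are live resumable state.
function selectPrunable(rows: RunRow[], policy: RetentionPolicy, now: number): string[] {
  const cutoff = now - policy.maxAgeDays * DAY_MS;
  const byFlow: Record<string, RunRow[]> = {};
  for (const row of rows) {
    if (row.status === "paused" || row.finishedAt === null) continue;
    (byFlow[row.flowName] = byFlow[row.flowName] || []).push(row);
  }
  const prunable: string[] = [];
  Object.keys(byFlow).forEach((flowName) => {
    const bucket = byFlow[flowName] || [];
    // Newest first, so the array index is the row's rank within its flow.
    bucket.sort((a, b) => (b.finishedAt || 0) - (a.finishedAt || 0));
    bucket.forEach((row, rank) => {
      const tooOld = (row.finishedAt || 0) < cutoff;
      const overCap = rank >= policy.maxPerFlow;
      if (tooOld || overCap) prunable.push(row.id);
    });
  });
  return prunable;
}
```

Keeping the selection pure (rows in, ids out) means the same rule could back both the sweep and the stop-gap cap, with only the delete call differing.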

Files. packages/services/service-automation/src/suspended-run-store.ts (recordTerminal / a prune path), sys-automation-run.object.ts (retention declaration), ADR-0057 lifecycle machinery.


2. Make run history a first-class, discoverable surface - P2 (objectui)

Problem. Durable run history now exists, but it is hard to find:

  • FlowRunsPanel only renders inside the flow inspector preview (FlowPreview) — you must open a specific flow's designer to see its runs.
  • The dedicated apps/console/src/pages/developer/FlowRunsPage.tsx route (/developer/flow-runs) redirects to /home in the dev console build (route gated / not mounted), so the intended "pick a flow → see runs" page is unreachable.

Why it matters. The value of #2581 is only realized if a builder can actually see "did my automation run / fail, and why". Today the answer is buried.

Approach (sketch). Add a proper Runs tab/section to the Studio Automations pillar (StudioDesignSurface.tsx) — list runs across the package's flows (status, start, duration, failure reason), drill into one. Reuse FlowRunsPanel / the engine listRuns endpoint. Reconcile with, or un-gate, the existing /developer/flow-runs page (decide which is canonical). See [[builder-ui-pillars-studio]].
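
As an illustration of the "list runs across the package's flows" step, the sketch below flattens per-flow run lists into table rows with the four columns named above. It assumes one array of run summaries per flow; `RunSummary`, `RunRowView`, and `toRunRows` are hypothetical names, and the real listRuns payload may carry different fields.

```typescript
// Hypothetical summary shape; the real listRuns payload may differ.
interface RunSummary {
  runId: string;
  flowName: string;
  status: string;
  startedAt: number; // epoch ms
  finishedAt?: number; // absent while the run is still in flight
  error?: string; // run-level error string
}

// One row of the proposed Runs table.
interface RunRowView {
  runId: string;
  flowName: string;
  status: string;
  startedAt: number;
  durationMs: number | null;
  failureReason: string | null;
}

// Flattens per-flow run lists into a single list, newest run first.
function toRunRows(runsPerFlow: RunSummary[][]): RunRowView[] {
  const rows: RunRowView[] = [];
  runsPerFlow.forEach((runs) => {
    runs.forEach((run) => {
      rows.push({
        runId: run.runId,
        flowName: run.flowName,
        status: run.status,
        startedAt: run.startedAt,
        // Duration is only known once the run has finished.
        durationMs: run.finishedAt === undefined ? null : run.finishedAt - run.startedAt,
        failureReason: run.error === undefined ? null : run.error,
      });
    });
  });
  rows.sort((a, b) => b.startedAt - a.startedAt);
  return rows;
}
```

A view-model like this keeps the Studio tab and the /developer/flow-runs page on the same row shape, whichever ends up canonical.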

Files. objectui packages/app-shell/src/views/studio-design/StudioDesignSurface.tsx, previews/FlowRunsPanel.tsx, apps/console/src/pages/developer/FlowRunsPage.tsx (+ its route registration).


3. Durable single-run detail (getRun) + step persistence — P3 (finish the #2581 story)

Problem. AutomationEngine.listRuns is durable (merges history), but getRun(runId) is still in-memory only — after a restart, clicking a failed run to see its detail returns null. Durable history rows also carry no step detail (steps: []), so even a durable getRun would lack per-node context.

Why it matters. The list already carries status + top-level error (the primary "what/why"), so this is genuinely a follow-up, not a hole — but "open a past failed run and see which node blew up" is the natural next question and it breaks across a restart.

Approach (sketch).

  • Add loadTerminal(runId) to SuspendedRunStore and have getRun fall back to it (reuse the run_-prefixed lookup).
  • Persist a compact step log on terminal rows (a steps_json column, bounded) so durable detail is meaningful. Watch row size / retention interplay with item 1.
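
The two bullets above can be sketched together: a bounded step serializer for terminal rows, and a getRun that falls back to the durable row when the in-memory entry is gone. Everything here is an assumption for illustration: the shapes (`StepLog`, `RunDetail`, `TerminalRow`), the `MAX_STEPS` bound, and the synchronous `loadTerminal` signature are hypothetical (the real store is likely asynchronous), and keeping the most recent steps is one possible trimming choice, not a decided one.

```typescript
// Hypothetical shapes; the real engine and store types may differ.
interface StepLog {
  nodeId: string;
  status: string;
  error: string | null;
}

interface RunDetail {
  runId: string;
  status: string;
  error: string | null;
  steps: StepLog[];
}

interface TerminalRow {
  run_id: string;
  status: string;
  error: string | null;
  steps_json: string | null; // the proposed bounded column
}

interface TerminalStore {
  // Synchronous only to keep the sketch short.
  loadTerminal(runId: string): TerminalRow | null;
}

const MAX_STEPS = 50; // assumed bound, not a decided value

// Keeps only the most recent maxSteps entries so the row size stays bounded.
function serializeSteps(steps: StepLog[], maxSteps: number = MAX_STEPS): string {
  return JSON.stringify(maxSteps > 0 ? steps.slice(-maxSteps) : []);
}

// Tolerates a missing or corrupt column by returning an empty step list.
function parseSteps(stepsJson: string | null): StepLog[] {
  if (!stepsJson) return [];
  try {
    const parsed: unknown = JSON.parse(stepsJson);
    return Array.isArray(parsed) ? (parsed as StepLog[]) : [];
  } catch {
    return [];
  }
}

// In-memory first; durable terminal row as the fallback after a restart.
function getRunWithFallback(
  runId: string,
  inMemory: Record<string, RunDetail>,
  store: TerminalStore,
): RunDetail | null {
  if (Object.prototype.hasOwnProperty.call(inMemory, runId)) {
    return inMemory[runId] || null;
  }
  const row = store.loadTerminal(runId);
  if (!row) return null;
  return {
    runId: row.run_id,
    status: row.status,
    error: row.error,
    steps: parseSteps(row.steps_json),
  };
}
```

Bounding the step log at write time is what keeps this compatible with item 1: retention limits the number of rows, and the step cap limits the size of each row.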

Files. engine.ts (getRun), suspended-run-store.ts (loadTerminal + serialize steps), sys-automation-run.object.ts (steps column), runtime/src/http-dispatcher.ts (GET /:name/runs/:runId).


Minor / optional — normalize the error shape

The run-level error is a string (ExecutionLog.error) while a step-level error is a { code, message } object. #2230 made the UI tolerant of both, but normalizing at the engine/spec layer (one shape) would remove a recurring foot-gun. Low priority; note only.
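
For illustration, a single normalizer at the engine/spec boundary could accept both shapes and emit one. This is a sketch, not a proposal for the final shape: `NormalizedError`, `normalizeError`, and the `"UNKNOWN"` fallback code are hypothetical.

```typescript
// Hypothetical unified shape, modelled on the step-level { code, message }.
interface NormalizedError {
  code: string;
  message: string;
}

// Accepts the run-level string form or the step-level object form and
// returns one shape; returns null when there is no error to report.
function normalizeError(err: unknown): NormalizedError | null {
  if (err === null || err === undefined || err === "") return null;
  if (typeof err === "string") {
    return { code: "UNKNOWN", message: err };
  }
  if (typeof err === "object") {
    const candidate = err as { code?: unknown; message?: unknown };
    return {
      code: typeof candidate.code === "string" ? candidate.code : "UNKNOWN",
      message: typeof candidate.message === "string" ? candidate.message : "Unknown error",
    };
  }
  // Numbers, booleans, and other primitives are stringified as a last resort.
  return { code: "UNKNOWN", message: String(err) };
}
```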


Suggested sequencing

  1. #1 retention first: it closes the risk #2581 introduced and is mostly framework + ADR-0057-aligned.
  2. #2 discoverable Runs surface: highest user-visible payoff (objectui).
  3. #3 durable single-run detail: rounds out the backend once retention bounds row growth (so persisting steps is safe).

References: framework #2581, objectui #2230; ADR-0019 (suspended-run persistence), ADR-0057 (data lifecycle).
