Automation run observability — follow-ups (retention, discoverable Runs surface, durable run detail)

## Context — what just shipped

Automation **run observability** now works end-to-end:

- **framework #2581** (`ffafb30e8`): `sys_automation_run` became a durable **run-history** table. Every *terminal* run (completed / failed) is mirrored via `SuspendedRunStore.recordTerminal`; a failed run persists its `error` reason; `AutomationEngine.listRuns` merges durable history with the in-memory buffer. History rows use a `run_`-prefixed id, disjoint from ADR-0019 suspended rows.
- **objectui #2230** (`34b92accd`): the Studio flow **Runs** panel (`FlowRunsPanel`) now renders a failed run's reason. Previously the panel read `run.error?.message`, but the engine sends `ExecutionLog.error` as a **string**, so the reason was silently dropped.

This tracking issue collects the follow-ups that surfaced while building + dogfooding that work. They are related enough to plan together; each can ship independently.

---

## 1. Retention / lifecycle for `sys_automation_run` — **P1 (regression risk #2581 introduced)**

**Problem.** Before #2581, run observability was bounded: suspended rows were deleted on completion, and terminal runs lived only in the in-memory `executionLogs` ring buffer (`maxLogSize`). #2581 makes **every terminal run persist forever**, so the table now grows without bound, one row per flow execution. A busy tenant with per-record-change flows will accumulate rows indefinitely. (See [[data-lifecycle-adr-0057]]: the root cause of the dev.db bloat was exactly "platform self-telemetry with no retention contract".)

**Why it matters.** This is a gap the feature *introduced*. Unbounded growth → DB bloat, slow `listHistory` reads, eventual operational pain — the same failure mode ADR-0057 exists to prevent.

**Approach (sketch).**
- Bring `sys_automation_run` **terminal history** under ADR-0057's declarative lifecycle: a retention policy on the object (e.g. keep N days, or last K terminal runs per flow), pruned by the existing lifecycle sweep. Suspended (`status:'paused'`) rows are **not** subject to retention — they are live resumable state.
- Alternatively (or additionally) a cheap cap at write time in `recordTerminal` (trim to the newest K per flow) as a stop-gap until the ADR-0057 sweep covers it.
- Decide default retention (proposal: 30 days *or* 100 terminal runs per flow, whichever limit is hit first; configurable).
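
To make the proposed policy concrete, here is a minimal sketch of the selection logic a prune path could use. It is not the ADR-0057 sweep or the real `recordTerminal` code: `RunRow`, `RetentionPolicy`, and `selectRunsToPrune` are hypothetical names, and the column shape is an assumption. It prunes a terminal row when it is older than the age limit or falls outside the newest K runs of its flow, and it never touches `paused` rows.

```typescript
// Hypothetical row shape; the real sys_automation_run columns may differ.
interface RunRow {
  id: string;
  flowName: string;
  status: "completed" | "failed" | "paused";
  finishedAt: number; // epoch ms; not meaningful for paused rows
}

// Hypothetical policy shape mirroring the proposal (30 days / 100 runs per flow).
interface RetentionPolicy {
  maxAgeDays: number;
  maxRunsPerFlow: number;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Returns the ids of terminal rows that fall outside the policy.
function selectRunsToPrune(
  rows: RunRow[],
  policy: RetentionPolicy,
  now: number,
): string[] {
  const cutoff = now - policy.maxAgeDays * DAY_MS;

  // Group terminal rows by flow; paused rows are live resumable state and are skipped.
  const byFlow = new Map<string, RunRow[]>();
  for (const row of rows) {
    if (row.status === "paused") continue;
    const list = byFlow.get(row.flowName) ?? [];
    list.push(row);
    byFlow.set(row.flowName, list);
  }

  const prune: string[] = [];
  for (const list of byFlow.values()) {
    // Newest first, so the array index is the row's rank within its flow.
    list.sort((a, b) => b.finishedAt - a.finishedAt);
    list.forEach((row, index) => {
      const tooMany = index >= policy.maxRunsPerFlow;
      const tooOld = row.finishedAt < cutoff;
      if (tooMany || tooOld) prune.push(row.id);
    });
  }
  return prune;
}
```

Keeping the selection pure (rows in, ids out) would let the same function back either a write-time cap or a periodic sweep; which of the two calls it is the open design question above.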

**Files.** `packages/services/service-automation/src/suspended-run-store.ts` (`recordTerminal` / a prune path), `sys-automation-run.object.ts` (retention declaration), ADR-0057 lifecycle machinery.

---

## 2. Make run history a **first-class, discoverable surface** — **P2 (objectui)**

**Problem.** Durable run history now exists, but it is hard to find:
- `FlowRunsPanel` only renders inside the flow **inspector preview** (`FlowPreview`) — you must open a specific flow's designer to see its runs.
- The dedicated `apps/console/src/pages/developer/FlowRunsPage.tsx` route (`/developer/flow-runs`) **redirects to `/home`** in the dev console build (route gated / not mounted), so the intended "pick a flow → see runs" page is unreachable.

**Why it matters.** The value of #2581 is only realized if a builder can actually *see* "did my automation run / fail, and why". Today the answer is buried.

**Approach (sketch).** Add a proper **Runs** tab/section to the Studio **Automations** pillar (`StudioDesignSurface.tsx`) — list runs across the package's flows (status, start, duration, failure reason), drill into one. Reuse `FlowRunsPanel` / the engine `listRuns` endpoint. Reconcile with, or un-gate, the existing `/developer/flow-runs` page (decide which is canonical). See [[builder-ui-pillars-studio]].
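
As a rough illustration of the cross-flow list, the sketch below flattens per-flow runs into rows carrying the four columns named above. `FlowRun`, `RunListRow`, and `buildRunList` are hypothetical; the actual `listRuns` payload may be shaped differently, and only the string-typed run-level `error` is taken from the engine behaviour described in this issue.

```typescript
// Assumed per-run shape; only the string-typed `error` comes from this issue.
interface FlowRun {
  id: string;
  status: "completed" | "failed";
  startedAt: number;  // epoch ms
  finishedAt: number; // epoch ms
  error?: string;
}

// One row of the hypothetical package-wide Runs list.
interface RunListRow {
  flowName: string;
  runId: string;
  status: "completed" | "failed";
  startedAt: number;
  durationMs: number;
  failureReason: string | null;
}

// Flatten runs from every flow into one list, newest first.
function buildRunList(runsByFlow: Record<string, FlowRun[]>): RunListRow[] {
  const rows: RunListRow[] = [];
  for (const [flowName, runs] of Object.entries(runsByFlow)) {
    for (const run of runs) {
      rows.push({
        flowName,
        runId: run.id,
        status: run.status,
        startedAt: run.startedAt,
        durationMs: run.finishedAt - run.startedAt,
        // Only failed runs surface a reason; a missing reason becomes null.
        failureReason: run.status === "failed" ? run.error ?? null : null,
      });
    }
  }
  return rows.sort((a, b) => b.startedAt - a.startedAt);
}
```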

**Files.** objectui `packages/app-shell/src/views/studio-design/StudioDesignSurface.tsx`, `previews/FlowRunsPanel.tsx`, `apps/console/src/pages/developer/FlowRunsPage.tsx` (+ its route registration).

---

## 3. Durable **single-run detail** (`getRun`) + step persistence — **P3 (finish the #2581 story)**

**Problem.** `AutomationEngine.listRuns` is durable (merges history), but `getRun(runId)` is still **in-memory only** — after a restart, clicking a failed run to see its detail returns null. Durable history rows also carry no step detail (`steps: []`), so even a durable `getRun` would lack per-node context.

**Why it matters.** The list already carries `status` + top-level `error` (the primary "what/why"), so this is genuinely a follow-up, not a hole — but "open a past failed run and see which node blew up" is the natural next question and it breaks across a restart.

**Approach (sketch).**
- Add `loadTerminal(runId)` to `SuspendedRunStore` and have `getRun` fall back to it (reuse the `run_`-prefixed lookup).
- Persist a compact step log on terminal rows (a `steps_json` column, bounded) so durable detail is meaningful. Watch row size / retention interplay with item 1.
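
The two bullets above could look roughly like this. It is a sketch under stated assumptions, not the engine's code: `RunLookup`, `TerminalRunReader`, `StepLog`, `RunDetail`, and `boundSteps` are hypothetical names, the store call is kept synchronous for brevity (the real store is presumably async), and the mapping from a run id to its `run_`-prefixed row is left to the store.

```typescript
// Assumed shapes for a persisted run and its compact step log.
interface StepLog {
  nodeId: string;
  status: "ok" | "failed";
  error?: string;
}

interface RunDetail {
  id: string;
  status: "completed" | "failed";
  error?: string;
  steps: StepLog[];
}

// Hypothetical slice of SuspendedRunStore; `loadTerminal` is the proposed method.
interface TerminalRunReader {
  loadTerminal(runId: string): RunDetail | null;
}

class RunLookup {
  private readonly recent: Map<string, RunDetail>;
  private readonly store: TerminalRunReader;

  constructor(recent: Map<string, RunDetail>, store: TerminalRunReader) {
    this.recent = recent;
    this.store = store;
  }

  // Prefer the in-memory buffer; fall back to durable history after a restart.
  getRun(runId: string): RunDetail | null {
    return this.recent.get(runId) ?? this.store.loadTerminal(runId);
  }
}

// Bound the step log before it is serialized: keep the last `maxSteps` entries
// and truncate long error strings so a single row cannot grow without limit.
function boundSteps(steps: StepLog[], maxSteps: number, maxErrorLen: number): StepLog[] {
  if (maxSteps <= 0) return [];
  return steps.slice(-maxSteps).map((step) =>
    step.error !== undefined && step.error.length > maxErrorLen
      ? { ...step, error: step.error.slice(0, maxErrorLen) }
      : step,
  );
}
```

Keeping the tail of the step list rather than the head is a deliberate choice in this sketch: the failing node is usually among the last steps executed.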

**Files.** `engine.ts` (`getRun`), `suspended-run-store.ts` (`loadTerminal` + serialize steps), `sys-automation-run.object.ts` (steps column), `runtime/src/http-dispatcher.ts` (`GET /:name/runs/:runId`).

---

## Minor / optional — normalize the error shape

The run-level error is a **string** (`ExecutionLog.error`) while a step-level error is a `{ code, message }` object. #2230 made the UI tolerant of both, but normalizing at the engine/spec layer (one shape) would remove a recurring foot-gun. Low priority; note only.
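
One possible normalization, sketched under the assumption that `{ code?, message }` is the target shape (the issue does not decide this). `NormalizedError` and `normalizeRunError` are hypothetical names, and the `"Unknown error"` fallback is illustrative only.

```typescript
// Assumed target shape; the issue leaves the final shape undecided.
interface NormalizedError {
  code?: string;
  message: string;
}

// Covers both shapes seen today: a run-level string and a step-level object.
type RawError = string | { code?: string; message?: string } | null | undefined;

function normalizeRunError(raw: RawError): NormalizedError | undefined {
  if (raw == null) return undefined;
  if (typeof raw === "string") {
    // Treat an empty string as "no error" rather than an empty message.
    return raw.length > 0 ? { message: raw } : undefined;
  }
  const out: NormalizedError = { message: raw.message ?? "Unknown error" };
  if (raw.code !== undefined) out.code = raw.code;
  return out;
}
```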

---

## Suggested sequencing

1. **Item 1 (retention)** first: it closes the risk #2581 introduced and is mostly framework work aligned with ADR-0057.
2. **Item 2 (discoverable Runs surface)**: highest user-visible payoff (objectui).
3. **Item 3 (durable single-run detail)**: rounds out the backend once retention bounds row growth (so persisting steps is safe).

References: framework #2581, objectui #2230; ADR-0019 (suspended-run persistence), ADR-0057 (data lifecycle).

