From 8e95a8714b0e6dd2310b04cd015f3589c8b88036 Mon Sep 17 00:00:00 2001
From: Christopher Tso
Date: Fri, 3 Jul 2026 13:55:55 +0200
Subject: [PATCH 01/10] docs: plan coding-agent target runtimes

---
 ...03-coding-agent-target-runtime-contract.md | 313 ++++++++++++++++++
 1 file changed, 313 insertions(+)
 create mode 100644 docs/plans/2026-07-03-coding-agent-target-runtime-contract.md

diff --git a/docs/plans/2026-07-03-coding-agent-target-runtime-contract.md b/docs/plans/2026-07-03-coding-agent-target-runtime-contract.md
new file mode 100644
index 000000000..17de8f900
--- /dev/null
+++ b/docs/plans/2026-07-03-coding-agent-target-runtime-contract.md
@@ -0,0 +1,313 @@
+---
+artifact_contract: ce-unified-plan/v1
+artifact_readiness: implementation-ready
+product_contract_source: av-vrx8-research
+execution: code
+title: "Coding-agent target runtime contract"
+created_at: 2026-07-03
+type: feature
+bead: av-y7eq
+---
+
+# Coding-agent target runtime contract
+
+## Goal Capsule
+
+- **Objective:** Make AgentV's coding-agent targets reliable by default while
+  preserving rich transcripts and local "run the agent I use" workflows.
+- **Core decision:** Target authoring uses the compact shape
+  `label` + `provider` + `runtime` + `config`. SDK-backed coding-agent
+  providers, when retained, default to internal process isolation rather than
+  importing risky agent SDKs in the AgentV orchestrator process.
+- **Primary Bead:** `av-y7eq`
+- **Implementation Beads:** `av-y7eq.1` through `av-y7eq.5`; existing SDK
+  subprocess follow-up `av-57i` / `av-57i.1`.
+- **Non-goal:** Do not replace AgentV with Promptfoo, Symphony, Kata, Margin, or
+  Vercel agent-eval. Borrow their proven boundaries and keep AgentV's
+  repo-native run bundle model.
+
+## Summary
+
+AgentV should treat coding-agent targets as external runtimes to orchestrate,
+not as libraries to call in-process by default. The default path should be
+subprocess, protocol, or sandbox based:
+
+- Codex: `codex-app-server` first for rich protocol control, `codex-cli` as the
+  simpler process-boundary path, `codex-sdk` explicit and internally isolated.
+- Pi: `pi-rpc` or `pi-cli` first, following Kata's `pi --mode rpc` pattern;
+  `pi-coding-agent`/`pi-sdk` explicit and internally isolated if retained.
+- Claude: `claude-cli` first; `claude-sdk` explicit and internally isolated if
+  retained. There is no separate Claude app-server/RPC surface identified.
+- Copilot: prefer CLI/session-log/process-boundary paths where possible;
+  `copilot-sdk` follows the same explicit SDK isolation rule.
+
+The target schema should not expose every implementation detail as a top-level
+field. Runtime placement is a single concept:
+
+```yaml
+targets:
+  - label: codex-local
+    provider: codex-app-server
+    runtime: host
+    config:
+      command: codex
+      model: gpt-5-codex
+```
+
+Expanded form is used only when needed:
+
+```yaml
+targets:
+  - label: codex-clean
+    provider: codex-cli
+    runtime:
+      mode: profile
+      home: .agentv/profiles/codex-clean
+    config:
+      command: codex
+      model: gpt-5-codex
+```
+
+```yaml
+targets:
+  - label: pi-rpc-local
+    provider: pi-rpc
+    runtime: host
+    config:
+      command: pi
+      model: gpt-5-codex
+```
+
+## Product Contract
+
+### Stable Fields
+
+| Field | Meaning |
+| --- | --- |
+| `label` | Human and result identity for the target. Used by CLI selection, run artifacts, Dashboard, and comparisons. |
+| `provider` | Adapter/control protocol kind: `codex-cli`, `codex-app-server`, `codex-sdk`, `pi-cli`, `pi-rpc`, `claude-cli`, `claude-sdk`, etc. |
+| `runtime` | Where and how the provider runs: `host`, `profile`, or `sandbox`. May be a string shorthand or an object with `mode`. |
+| `config` | Provider-specific configuration. Keep `model`, `command`, timeouts, permission flags, and provider knobs here. |
+
+Do not add competing top-level fields such as `isolation`, `sandbox`,
+`install`, `container`, `environment`, or `profile`. Those details live under
+`runtime` or `config` only when a provider needs them.
+
+### Runtime Modes
+
+| Runtime | Boundary | Use case |
+| --- | --- | --- |
+| `host` | User's installed runtime and normal config/auth/skills/plugins. | Local research and "evaluate the exact agent I use." |
+| `profile` | Host process execution with isolated home/config/env, such as `CODEX_HOME`, `HOME`, temp dirs, and explicit auth profile. | Cleaner local evals without full container cost. |
+| `sandbox` | Separate execution substrate such as Docker, Vercel Sandbox, remote worker, or another container/sandbox backend. | CI, reproducibility, untrusted tasks, stronger crash and filesystem containment. |
+
+A sandbox may contain an internal profile, but the top-level runtime remains
+`sandbox` because the execution substrate boundary is stronger than host-side
+config isolation.
+
+### SDK Rule
+
+SDK-backed coding-agent providers are allowed only as explicit provider kinds
+and should default to internal process isolation:
+
+```yaml
+targets:
+  - label: codex-sdk-isolated
+    provider: codex-sdk
+    runtime: host
+    config:
+      model: gpt-5-codex
+```
+
+The YAML should not need an opt-in such as `sdk_isolation: process` for the
+safe path. If AgentV cannot isolate an SDK provider yet, that provider should be
+documented as explicit/non-default or temporarily rejected with an actionable
+message.
+
+The parent AgentV process must not import the risky coding-agent SDK for the
+default safe path. Instead, use a provider child runner:
+
+```text
+AgentV parent
+  -> spawn child runner with target config + provider request JSON
+  <- NDJSON events/logs
+  <- one final ProviderResponse envelope
+  <- child exit status
+```
+
+Failure mapping:
+
+- child nonzero exit before result -> target error
+- malformed child JSON -> target error
+- timeout/cancel -> kill child process group, target timeout error
+- crash after partial transcript -> failed target result with partial logs
+- parent still finalizes `index.jsonl`, summaries, transcripts, and run bundle
+
+## External Pattern Mapping
+
+| Source | Relevant pattern | AgentV decision |
+| --- | --- | --- |
+| Promptfoo | Provider object uses `id`/`label`/`config`; Codex and Claude SDK providers put `model` in `config.model`; direct SDK adapters exist. | Keep `label`/`provider`/`config` ergonomics; keep `model` under `config`; do not make in-process SDK the default. |
+| OpenAI Symphony | Codex app-server subprocess with workspace/session orchestration, approval/sandbox policy, max-turn boundaries, and structured streaming/status. | Use `codex-app-server` as the preferred rich-control Codex provider. |
+| Kata Symphony | Pi is launched as `pi --mode rpc` locally or over SSH and controlled over stdio/RPC; workers must already have the runtime installed. | Add/prefer `pi-rpc` for rich Pi control; do not import Pi coding-agent SDK into AgentV's orchestrator. |
+| Vercel agent-eval | Installs agent CLIs inside ephemeral sandboxes and captures transcripts from CLI JSON/session logs. | `runtime.mode: sandbox` should support managed/pinned CLI install and transcript capture without host config bleed. |
+| Margin Evals | Runs cases in Docker, captures PTY/runtime/control logs, optional ATIF trajectory hooks. | Treat container/sandbox as runtime substrate and preserve logs/trajectories as run artifacts. |
+| SWE-bench | Applies predictions and runs tests inside Docker containers with logs, timeouts, and cleanup. | Keep container details under runtime/harness config, not target identity. |
+| DeepEval | Pytest/metric/tracing loop that coding agents can call, not a coding-agent target orchestrator. | Useful grader/eval-loop reference, not a target runtime model. |
+
+## Provider Contract
+
+### Codex
+
+Use explicit provider kinds:
+
+- `codex-cli`: spawn `codex exec` or a user shim. Capture stdout/stderr, JSONL
+  stream, exit code, final text, and raw logs.
+- `codex-app-server`: spawn `codex app-server` or a user shim plus app-server
+  args. Prefer for rich transcript, turn/session control, cancellation, and
+  structured JSON-RPC events.
+- `codex-sdk`: explicit SDK provider. Internally isolated in a child process if
+  retained.
+
+Do not add `codex-rpc` unless Codex exposes a distinct RPC mode separate from
+app-server. For Codex, app-server is the protocol provider.
+
+`config.command` is the executable or shim, not the provider identity:
+
+```yaml
+targets:
+  - label: codex-personal
+    provider: codex-cli
+    runtime: host
+    config:
+      command: codex-personal
+```
+
+### Pi
+
+Use explicit provider kinds:
+
+- `pi-cli`: simple Pi CLI subprocess and transcript capture.
+- `pi-rpc`: Kata-style protocol subprocess that launches `pi --mode rpc` and
+  controls it over stdio/RPC.
+- `pi-coding-agent` or `pi-sdk`: explicit SDK provider only; internally
+  isolated if retained.
+
+Keep `pi-ai` for plain LLM/model calls. Do not treat `pi-ai` as the coding-agent
+runtime boundary.
+
+### Claude
+
+Use explicit provider kinds:
+
+- `claude-cli`: default subprocess path using structured stream output.
+- `claude-sdk`: explicit SDK provider using `@anthropic-ai/claude-agent-sdk`,
+  internally isolated if retained.
+
+No separate Claude app-server/RPC provider has been identified. The CLI
+structured stream is the subprocess-first rich transcript path. Claude Agent SDK
+may spawn Claude Code internally, but importing the SDK in AgentV still creates
+an in-process adapter risk unless wrapped by a child runner.
+
+### Copilot
+
+Keep provider names explicit by control boundary:
+
+- `copilot-cli`: subprocess/protocol CLI path.
+- `copilot-log`: passive transcript/log replay path.
+- `copilot-sdk`: explicit SDK path, internally isolated if retained.
+
+## Implementation Units
+
+### U1. Target Schema And Docs (`av-y7eq.1`)
+
+- Add `runtime: host` shorthand and `runtime.mode: host | profile | sandbox`.
+- Keep `model` and `command` under `config`.
+- Preserve `label` as target identity and `provider` as adapter/backend kind.
+- Reject invalid runtime modes with focused validation errors.
+- Document why `runtime` is the umbrella field.
+
+### U2. Codex Host/Profile Providers (`av-y7eq.2`)
+
+- Split current ambiguous `codex` registry behavior into explicit
+  `codex-cli`, `codex-app-server`, and `codex-sdk`.
+- Make bare `codex`, if retained at all, alias to the chosen safe default
+  (`codex-app-server`) or reject it during the cleanup. It must not silently
+  select in-process SDK.
+- Support `config.command` shims such as `codex-personal` and `codex-eng`.
+- Implement host/profile environment construction, including deliberate
+  `HOME`, `CODEX_HOME`, temp dirs, and env allowlists for profile mode.
+
+### U3. Sandbox Runtime (`av-y7eq.3`)
+
+- Implement `runtime.mode: sandbox` using the existing or smallest viable
+  sandbox/container substrate.
+- Install or locate the target CLI inside the sandbox with pinned/configurable
+  inputs.
+- Mount only explicit workspace, result, and credential paths.
+- Preserve stdout/stderr/transcript artifacts and distinguish sandbox infra
+  failure from target task failure.
+
+### U4. SDK Provider Isolation (`av-y7eq.4`, `av-57i`, `av-57i.1`)
+
+- Move retained coding-agent SDK providers behind child-runner process
+  boundaries.
+- Start with Pi SDK isolation if that remains the quickest proof slice.
+- Generalize only after the first provider proves the protocol.
+- Do not install broad parent-process exception/EPIPE swallowing.
+
+### U5. Pi RPC Runtime (`av-y7eq.5`)
+
+- Add or document `pi-rpc` as the preferred rich-control Pi provider.
+- Launch `pi --mode rpc` through a process/stdio boundary.
+- Model remote execution after Kata only where AgentV needs it; worker
+  provisioning can remain explicit and out of scope for the first slice.
+- Keep `pi-coding-agent` SDK explicit/non-default.
+
+## Result And Artifact Requirements
+
+Every coding-agent provider must return or fail through a structured result
+envelope. AgentV must preserve:
+
+- target label, provider kind, runtime mode, command, cwd, and model
+- stdout/stderr logs
+- structured event transcript when available
+- final assistant output
+- tool/file-change events when available
+- timeout, cancellation, spawn failure, nonzero exit, malformed output, and
+  crash metadata
+- partial transcript/logs on failure
+
+Target crashes are target results. They must not become AgentV orchestrator
+crashes.
+
+## Open Questions
+
+- Whether to keep a bare `codex` alias at all. If kept, it should resolve to the
+  safe default, not SDK.
+- Whether to rename `pi-coding-agent` to `pi-sdk` during the major cleanup or
+  keep the existing provider name as an explicit legacy SDK provider.
+- Which sandbox substrate should be the first implementation target if existing
+  AgentV runner support is insufficient.
+- How much transcript normalization belongs in provider adapters versus a shared
+  transcript post-processor.
+
+## Validation Plan
+
+- Schema tests for `runtime` shorthand/object forms and invalid values.
+- Provider registry tests proving explicit provider names and safe aliases.
+- Codex CLI/app-server tests for command shims, host/profile env, timeout kill,
+  nonzero exit, malformed output, and transcript capture.
+- Pi RPC tests with a fake `pi --mode rpc` process.
+- SDK child-runner tests for success, child crash before result, child crash
+  after partial events, malformed JSON, timeout, and cancellation.
+- Docs/examples validation after examples are updated.
+- Live provider dogfood before implementation PRs are marked ready, per repo
+  verification rules.
+
+## Handoff
+
+Implementation workers should start with `av-y7eq.1` before provider changes so
+the normalized contract exists. `av-y7eq.2` and `av-y7eq.5` can then proceed in
+parallel for Codex and Pi subprocess/protocol providers. `av-y7eq.4` should
+coordinate with `av-57i.1` rather than creating a second SDK isolation design.

From ab9cb40b3adddfbf61d14fd6f1f3ce22672b981a Mon Sep 17 00:00:00 2001
From: Christopher Tso
Date: Fri, 3 Jul 2026 14:04:56 +0200
Subject: [PATCH 02/10] docs: preserve target config in runtime plan

---
 ...03-coding-agent-target-runtime-contract.md | 63 ++++++++++++++++++-
 1 file changed, 61 insertions(+), 2 deletions(-)

diff --git a/docs/plans/2026-07-03-coding-agent-target-runtime-contract.md b/docs/plans/2026-07-03-coding-agent-target-runtime-contract.md
index 17de8f900..8352a5649 100644
--- a/docs/plans/2026-07-03-coding-agent-target-runtime-contract.md
+++ b/docs/plans/2026-07-03-coding-agent-target-runtime-contract.md
@@ -51,6 +51,7 @@ targets:
     runtime: host
     config:
       command: codex
+      args: ["--config", "model_reasoning_effort=high"]
       model: gpt-5-codex
 ```
 
@@ -65,6 +66,7 @@ targets:
       home: .agentv/profiles/codex-clean
     config:
       command: codex
+      args: ["--sandbox", "workspace-write"]
       model: gpt-5-codex
 ```
 
@@ -87,12 +89,57 @@ targets:
 | `label` | Human and result identity for the target. Used by CLI selection, run artifacts, Dashboard, and comparisons. |
 | `provider` | Adapter/control protocol kind: `codex-cli`, `codex-app-server`, `codex-sdk`, `pi-cli`, `pi-rpc`, `claude-cli`, `claude-sdk`, etc. |
 | `runtime` | Where and how the provider runs: `host`, `profile`, or `sandbox`. May be a string shorthand or an object with `mode`. |
-| `config` | Provider-specific configuration. Keep `model`, `command`, timeouts, permission flags, and provider knobs here. |
+| `config` | Provider-specific configuration. Keep `model`, `command`, `args`, timeouts, permission flags, and provider knobs here. |
 
 Do not add competing top-level fields such as `isolation`, `sandbox`,
 `install`, `container`, `environment`, or `profile`. Those details live under
 `runtime` or `config` only when a provider needs them.
 
+### Preserve Existing AgentV Surface
+
+This plan is a targeted provider-boundary cleanup, not a rewrite from Promptfoo
+or another framework. Preserve AgentV's current target capabilities unless an
+implementation Bead explicitly removes one.
+
+For coding-agent providers, `config.command` is the executable or shim identity,
+such as `codex`, `codex-personal`, `pi`, or an absolute binary path. It may
+also be a non-empty argv array where the first token is the executable and the
+remaining tokens are extra arguments. Normalize that form internally to
+executable plus argv tokens. `config.args` remains the explicit argv token array
+for extra provider-specific arguments. Keep the existing `executable`/`binary`
+compatibility aliases and the existing `args`/`arguments` array aliases during
+migration. Do not require users to pass shell-joined command strings for
+coding-agent providers.
+
+If `config.command` is an argv array, reject simultaneous `config.args` unless
+the provider defines an unambiguous merge order. This keeps command resolution
+predictable and avoids hidden shell parsing.
+
+The generic `provider: cli` path is different: it currently uses a command
+template string with placeholders such as `{PROMPT}` and healthcheck support.
+Keep that compatibility path intact while adding coding-agent-specific runtime
+boundaries.
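Editorial sketch (not part of the patch): the `config.command` / `config.args` normalization described above could look roughly like the following. This is a minimal Python illustration under stated assumptions, not AgentV's implementation. The names `ResolvedCommand` and `normalize_command` are hypothetical, the `executable`/`binary` and `arguments` aliases are left out, and the sketch always rejects `config.args` alongside an argv-array command rather than modeling a provider-defined merge order.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ResolvedCommand:
    """Hypothetical normalized form: executable plus argv tokens."""

    executable: str
    argv: tuple[str, ...]


def normalize_command(config: dict) -> ResolvedCommand:
    """Normalize config.command (string or argv array) plus config.args."""
    command = config.get("command")
    args = config.get("args")

    if args is not None and (
        not isinstance(args, list) or not all(isinstance(a, str) for a in args)
    ):
        raise ValueError("config.args must be an array of strings")

    if isinstance(command, str):
        if not command.strip():
            raise ValueError("config.command must not be empty")
        # No shell parsing: the string is taken as the executable or shim name.
        return ResolvedCommand(command, tuple(args or ()))

    if isinstance(command, list):
        if not command or not all(isinstance(t, str) and t for t in command):
            raise ValueError("config.command array must hold non-empty strings")
        if args is not None:
            # Assumption: no provider-specific merge order exists, so reject.
            raise ValueError("config.args cannot be combined with an argv-array command")
        return ResolvedCommand(command[0], tuple(command[1:]))

    raise ValueError("config.command must be a string or a non-empty argv array")
```

Both authored forms resolve to the same shape, so a provider adapter would only ever see an executable and a token list and would never need to split a shell string.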
+
+Also preserve the common and provider-specific knobs already used by AgentV:
+
+- common target behavior: `grader_target`, `fallback_targets`, `workers`,
+  `subagent_mode_allowed`, env interpolation, `cwd`, and `timeout_seconds`
+- artifact/log behavior: `stream_log`, `log_dir`/`log_directory`, stdout/stderr
+  capture, raw protocol events, and partial logs on failure
+- Codex knobs: `model`, `reasoning_effort`/`model_reasoning_effort`,
+  `model_verbosity`, `base_url`/`endpoint`, `api_key`, `api_format`,
+  `sandbox_mode`, `approval_policy`, and `system_prompt`
+- Pi knobs: `subprovider`, `model`/`pi_model`, `api_key`, `base_url`/`endpoint`,
+  `tools`/`pi_tools`, `thinking`/`pi_thinking`, `args`, and `system_prompt`
+- Claude knobs: `model`, `max_turns`, `max_budget_usd`,
+  `bypass_permissions`, and `system_prompt`
+- Copilot knobs: `model`, custom provider settings, GitHub token/auth knobs,
+  ACP/prompt execution behavior, `args`, and `system_prompt`
+
+Where the new normalized contract uses nested `config`, implement migration by
+normalizing the existing flat target fields into that internal shape. Do not
+drop existing accepted YAML fields as a side effect of adding `runtime`.
+
 ### Runtime Modes
 
 | Runtime | Boundary | Use case |
@@ -172,7 +219,9 @@ Use explicit provider kinds:
 Do not add `codex-rpc` unless Codex exposes a distinct RPC mode separate from
 app-server. For Codex, app-server is the protocol provider.
 
-`config.command` is the executable or shim, not the provider identity:
+`config.command` is the executable or shim, not the provider identity. Extra
+arguments may be supplied with `config.args` or, for compact argv-style input,
+as a command array:
 
 ```yaml
 targets:
@@ -181,6 +230,16 @@ targets:
     runtime: host
     config:
       command: codex-personal
+      args: ["--model", "gpt-5-codex"]
+```
+
+```yaml
+targets:
+  - label: codex-eng
+    provider: codex-cli
+    runtime: host
+    config:
+      command: ["codex-eng", "--model", "gpt-5-codex"]
 ```
 
 ### Pi

From 00fc889cdb3241ed3fedadd77c9ffbf17e59b3dd Mon Sep 17 00:00:00 2001
From: Christopher Tso
Date: Fri, 3 Jul 2026 14:06:13 +0200
Subject: [PATCH 03/10] docs: separate target runtime from scheduler policy

---
 ...2026-07-03-coding-agent-target-runtime-contract.md | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/docs/plans/2026-07-03-coding-agent-target-runtime-contract.md b/docs/plans/2026-07-03-coding-agent-target-runtime-contract.md
index 8352a5649..ac8c2098b 100644
--- a/docs/plans/2026-07-03-coding-agent-target-runtime-contract.md
+++ b/docs/plans/2026-07-03-coding-agent-target-runtime-contract.md
@@ -122,8 +122,8 @@ boundaries.
 
 Also preserve the common and provider-specific knobs already used by AgentV:
 
-- common target behavior: `grader_target`, `fallback_targets`, `workers`,
-  `subagent_mode_allowed`, env interpolation, `cwd`, and `timeout_seconds`
+- common target behavior: `use_target`, `grader_target`, `fallback_targets`,
+  env interpolation, `cwd`, and `timeout_seconds`
 - artifact/log behavior: `stream_log`, `log_dir`/`log_directory`, stdout/stderr
   capture, raw protocol events, and partial logs on failure
 - Codex knobs: `model`, `reasoning_effort`/`model_reasoning_effort`,
@@ -140,6 +140,13 @@ Where the new normalized contract uses nested `config`, implement migration by
 normalizing the existing flat target fields into that internal shape. Do not
 drop existing accepted YAML fields as a side effect of adding `runtime`.
 
+Do not promote orchestration scheduler fields into the new target runtime
+contract. `workers`, `batch_requests`, and `subagent_mode_allowed` are existing
+compatibility/runtime-policy fields, not part of the target's coding-agent
+control boundary. Continue to handle them where AgentV already accepts them,
+but prefer `--workers`, project `execution.workers`, `evaluate_options`, or
+runtime policy for new scheduling behavior.
+
 ### Runtime Modes
 
 | Runtime | Boundary | Use case |

From 9324c29d1682d53d1f6c063488f1337d168596dc Mon Sep 17 00:00:00 2001
From: Christopher Tso
Date: Fri, 3 Jul 2026 14:08:46 +0200
Subject: [PATCH 04/10] docs: clarify grader target routing

---
 ...03-coding-agent-target-runtime-contract.md | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/docs/plans/2026-07-03-coding-agent-target-runtime-contract.md b/docs/plans/2026-07-03-coding-agent-target-runtime-contract.md
index ac8c2098b..a29a63494 100644
--- a/docs/plans/2026-07-03-coding-agent-target-runtime-contract.md
+++ b/docs/plans/2026-07-03-coding-agent-target-runtime-contract.md
@@ -147,6 +147,25 @@ control boundary. Continue to handle them where AgentV already accepts them,
 but prefer `--workers`, project `execution.workers`, `evaluate_options`, or
 runtime policy for new scheduling behavior.
 
+`grader_target` is different. It is not a coding-agent runtime field, but the
+concept is not redundant: coding-agent targets usually cannot act as structured
+LLM graders, and AgentV workspaces often contain multiple LLM providers or
+endpoints. AgentV still needs a default grader target selection. Preserve the
+current resolution behavior while cleaning up provider runtimes:
+
+- CLI `--grader-target` is the strongest run-level override.
+- Per-evaluator `target` remains the specific grader override.
+- Target-level `grader_target` remains the compatibility/default grader for
+  that target until a clearer eval/project-level default is introduced.
+- If a new canonical default is added later, prefer a grader/eval policy field
+  such as `default_grader_target` over putting grader selection inside
+  `runtime` or coding-agent provider `config`.
+
+Promptfoo's comparable mechanism is assertion/test grading provider selection:
+assertions can set a `provider`, tests/defaultTest can provide fallback grading
+providers, and model-graded matchers fall back to type-specific default grading
+providers. It does not put grader selection in the target provider runtime.
+
 ### Runtime Modes
 
 | Runtime | Boundary | Use case |

From db5c39b26bba3582f5869437277b23cfff583a97 Mon Sep 17 00:00:00 2001
From: Christopher Tso
Date: Fri, 3 Jul 2026 14:16:04 +0200
Subject: [PATCH 05/10] docs: clarify target registry file layout

---
 ...03-coding-agent-target-runtime-contract.md | 45 +++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/docs/plans/2026-07-03-coding-agent-target-runtime-contract.md b/docs/plans/2026-07-03-coding-agent-target-runtime-contract.md
index a29a63494..204a45f40 100644
--- a/docs/plans/2026-07-03-coding-agent-target-runtime-contract.md
+++ b/docs/plans/2026-07-03-coding-agent-target-runtime-contract.md
@@ -166,6 +166,51 @@ assertions can set a `provider`, tests/defaultTest can provide fallback grading
 providers, and model-graded matchers fall back to type-specific default grading
 providers. It does not put grader selection in the target provider runtime.
 
+### Project File Layout
+
+Keep registries separate from policy:
+
+```text
+.agentv/
+  config.yaml
+  targets.yaml
+  graders.yaml
+```
+
+Project-local `.agentv/config.yaml` should remain the portable project policy
+file: defaults, `execution`, `eval_patterns`, `refs`, tags, result defaults, and
+other run-level settings. It may point at the default target/grader by name, but
+it should not become the registry that holds all target and grader definitions.
+
+`targets.yaml` should remain the registry of subjects under test. `graders.yaml`
+should be the registry of reusable grading providers. This keeps target runtime
+contracts reviewable, keeps grader credentials/endpoints separate from agent
+runtimes, and matches AgentV's existing artifact model where run manifests carry
+explicit `targets_path` and `graders_path` entries.
+
+The global `$AGENTV_HOME/config.yaml` is different: it owns Dashboard/operator
+state such as the `projects:` registry. Do not use the existence of global
+`projects:` as a reason to put project-local target/grader registries into
+project-local `.agentv/config.yaml`. If custom locations are needed, add
+project-local config pointers such as `targets_file` / `graders_file` rather
+than embedding both registries inline.
+
+Greenfield, the cleanest global shape would put Dashboard project registry
+state in `$AGENTV_HOME/projects.yaml` and leave `$AGENTV_HOME/config.yaml` for
+global settings. Current AgentV code and docs use `$AGENTV_HOME/config.yaml`
+with a top-level `projects:` registry, so do not migrate this as part of the
+coding-agent target-runtime work. If the team wants the cleaner split, create a
+separate migration Bead with backwards-compatible reading from current
+`projects:` locations and a clear write target.
+
+Promptfoo's comparable file-structure guidance is simpler: a main
+`promptfooconfig.yaml` commonly contains `providers`, `prompts`, `defaultTest`,
+and `tests`, while larger configs can reference external files such as provider
+YAML with `file://...`. Promptfoo does not have AgentV's separate home-scoped
+Dashboard project registry, so it is useful as a modular-config reference but
+not a direct reason to collapse AgentV's project, target, and grader registries
+into one file.
+
 ### Runtime Modes
 
 | Runtime | Boundary | Use case |

From d1e42e658f067b5034a8b9ecd8d12f8317d5879a Mon Sep 17 00:00:00 2001
From: Christopher Tso
Date: Fri, 3 Jul 2026 14:19:54 +0200
Subject: [PATCH 06/10] docs: align config references with promptfoo style

---
 ...03-coding-agent-target-runtime-contract.md | 60 ++++++++++++++++---
 1 file changed, 52 insertions(+), 8 deletions(-)

diff --git a/docs/plans/2026-07-03-coding-agent-target-runtime-contract.md b/docs/plans/2026-07-03-coding-agent-target-runtime-contract.md
index 204a45f40..986b2f5b0 100644
--- a/docs/plans/2026-07-03-coding-agent-target-runtime-contract.md
+++ b/docs/plans/2026-07-03-coding-agent-target-runtime-contract.md
@@ -181,6 +181,27 @@ Project-local `.agentv/config.yaml` should remain the portable project policy
 file: defaults, `execution`, `eval_patterns`, `refs`, tags, result defaults, and
 other run-level settings. It may point at the default target/grader by name, but
 it should not become the registry that holds all target and grader definitions.
+Following Promptfoo's modular-config idiom, use direct field references rather
+than a named import table:
+
+```yaml
+# .agentv/config.yaml
+targets: file://targets.yaml
+graders: file://graders.yaml
+
+defaults:
+  target: codex-local
+  grader: openai-grader
+
+execution:
+  workers: 3
+```
+
+Do not introduce a greenfield `files:` or `imports:` section for this unless
+AgentV needs a capability that direct field references cannot express.
+Promptfoo's pattern is `providers: file://configs/providers.yaml`,
+`tests: file://tests/`, and `defaultTest: file://configs/default-test.yaml`;
+the field being configured names the thing being loaded.
 
 `targets.yaml` should remain the registry of subjects under test. `graders.yaml`
 should be the registry of reusable grading providers. This keeps target runtime
@@ -188,20 +209,43 @@ contracts reviewable, keeps grader credentials/endpoints separate from agent
 runtimes, and matches AgentV's existing artifact model where run manifests carry
 explicit `targets_path` and `graders_path` entries.
 
+For Promptfoo-style field references, the referenced file should contain the
+value for that field. Greenfield examples:
+
+```yaml
+# .agentv/targets.yaml
+- id: codex-local
+  provider: codex-app-server
+  runtime: host
+  config:
+    command: ["codex"]
+```
+
+```yaml
+# .agentv/graders.yaml
+- id: openai-grader
+  provider: openai
+  config:
+    model: gpt-5-mini
+```
+
+For compatibility with AgentV's existing standalone `targets.yaml` convention,
+the loader can also accept wrapped forms such as `targets: [...]` and
+`graders: [...]`, but the Promptfoo-like authored shape is the bare field value.
+
 The global `$AGENTV_HOME/config.yaml` is different: it owns Dashboard/operator
 state such as the `projects:` registry. Do not use the existence of global
 `projects:` as a reason to put project-local target/grader registries into
-project-local `.agentv/config.yaml`. If custom locations are needed, add
-project-local config pointers such as `targets_file` / `graders_file` rather
-than embedding both registries inline.
+project-local `.agentv/config.yaml`.
 
 Greenfield, the cleanest global shape would put Dashboard project registry
 state in `$AGENTV_HOME/projects.yaml` and leave `$AGENTV_HOME/config.yaml` for
-global settings. Current AgentV code and docs use `$AGENTV_HOME/config.yaml`
-with a top-level `projects:` registry, so do not migrate this as part of the
-coding-agent target-runtime work. If the team wants the cleaner split, create a
-separate migration Bead with backwards-compatible reading from current
-`projects:` locations and a clear write target.
+global settings. If using Promptfoo-style references, the global config would
+say `projects: file://projects.yaml`. Current AgentV code and docs use
+`$AGENTV_HOME/config.yaml` with a top-level `projects:` registry, so do not
+migrate this as part of the coding-agent target-runtime work. If the team wants
+the cleaner split, create a separate migration Bead with backwards-compatible
+reading from current `projects:` locations and a clear write target.
 
 Promptfoo's comparable file-structure guidance is simpler: a main
 `promptfooconfig.yaml` commonly contains `providers`, `prompts`, `defaultTest`,

From 5eff05b57f724f0ae1b96e1380b3625ab3a96eea Mon Sep 17 00:00:00 2001
From: Christopher Tso
Date: Fri, 3 Jul 2026 14:23:32 +0200
Subject: [PATCH 07/10] docs: make target runtime plan greenfield

---
 ...03-coding-agent-target-runtime-contract.md | 181 ++++++++----------
 1 file changed, 79 insertions(+), 102 deletions(-)

diff --git a/docs/plans/2026-07-03-coding-agent-target-runtime-contract.md b/docs/plans/2026-07-03-coding-agent-target-runtime-contract.md
index 986b2f5b0..a7ddfa392 100644
--- a/docs/plans/2026-07-03-coding-agent-target-runtime-contract.md
+++ b/docs/plans/2026-07-03-coding-agent-target-runtime-contract.md
@@ -16,7 +16,7 @@ bead: av-y7eq
 - **Objective:** Make AgentV's coding-agent targets reliable by default while
   preserving rich transcripts and local "run the agent I use" workflows.
 - **Core decision:** Target authoring uses the compact shape
-  `label` + `provider` + `runtime` + `config`. SDK-backed coding-agent
+  `id` + `provider` + `runtime` + `config`. SDK-backed coding-agent
   providers, when retained, default to internal process isolation rather than
   importing risky agent SDKs in the AgentV orchestrator process.
 - **Primary Bead:** `av-y7eq`
@@ -46,12 +46,11 @@ field. Runtime placement is a single concept:
 
 ```yaml
 targets:
-  - label: codex-local
+  - id: codex-local
     provider: codex-app-server
     runtime: host
     config:
-      command: codex
-      args: ["--config", "model_reasoning_effort=high"]
+      command: ["codex", "--config", "model_reasoning_effort=high"]
       model: gpt-5-codex
 ```
 
@@ -59,24 +58,23 @@ Expanded form is used only when needed:
 
 ```yaml
 targets:
-  - label: codex-clean
+  - id: codex-clean
     provider: codex-cli
     runtime:
       mode: profile
       home: .agentv/profiles/codex-clean
     config:
-      command: codex
-      args: ["--sandbox", "workspace-write"]
+      command: ["codex", "--sandbox", "workspace-write"]
       model: gpt-5-codex
 ```
 
 ```yaml
 targets:
-  - label: pi-rpc-local
+  - id: pi-rpc-local
     provider: pi-rpc
     runtime: host
     config:
-      command: pi
+      command: ["pi"]
       model: gpt-5-codex
 ```
 
@@ -86,80 +84,64 @@ targets:
 
 | Field | Meaning |
 | --- | --- |
-| `label` | Human and result identity for the target. Used by CLI selection, run artifacts, Dashboard, and comparisons. |
+| `id` | Stable target identity. Used by CLI selection, run artifacts, Dashboard, and comparisons. |
 | `provider` | Adapter/control protocol kind: `codex-cli`, `codex-app-server`, `codex-sdk`, `pi-cli`, `pi-rpc`, `claude-cli`, `claude-sdk`, etc. |
 | `runtime` | Where and how the provider runs: `host`, `profile`, or `sandbox`. May be a string shorthand or an object with `mode`. |
-| `config` | Provider-specific configuration. Keep `model`, `command`, `args`, timeouts, permission flags, and provider knobs here. |
+| `config` | Provider-specific configuration. Keep `model`, `command`, timeouts, permission flags, and provider knobs here. |
 
 Do not add competing top-level fields such as `isolation`, `sandbox`,
 `install`, `container`, `environment`, or `profile`. Those details live under
 `runtime` or `config` only when a provider needs them.
 
-### Preserve Existing AgentV Surface
-
-This plan is a targeted provider-boundary cleanup, not a rewrite from Promptfoo
-or another framework. Preserve AgentV's current target capabilities unless an
-implementation Bead explicitly removes one.
-
-For coding-agent providers, `config.command` is the executable or shim identity,
-such as `codex`, `codex-personal`, `pi`, or an absolute binary path. It may
-also be a non-empty argv array where the first token is the executable and the
-remaining tokens are extra arguments. Normalize that form internally to
-executable plus argv tokens. `config.args` remains the explicit argv token array
-for extra provider-specific arguments. Keep the existing `executable`/`binary`
-compatibility aliases and the existing `args`/`arguments` array aliases during
-migration. Do not require users to pass shell-joined command strings for
-coding-agent providers.
-
-If `config.command` is an argv array, reject simultaneous `config.args` unless
-the provider defines an unambiguous merge order. This keeps command resolution
-predictable and avoids hidden shell parsing.
-
-The generic `provider: cli` path is different: it currently uses a command
-template string with placeholders such as `{PROMPT}` and healthcheck support.
-Keep that compatibility path intact while adding coding-agent-specific runtime
-boundaries.
-
-Also preserve the common and provider-specific knobs already used by AgentV:
-
-- common target behavior: `use_target`, `grader_target`, `fallback_targets`,
-  env interpolation, `cwd`, and `timeout_seconds`
-- artifact/log behavior: `stream_log`, `log_dir`/`log_directory`, stdout/stderr
-  capture, raw protocol events, and partial logs on failure
-- Codex knobs: `model`, `reasoning_effort`/`model_reasoning_effort`,
-  `model_verbosity`, `base_url`/`endpoint`, `api_key`, `api_format`,
-  `sandbox_mode`, `approval_policy`, and `system_prompt`
-- Pi knobs: `subprovider`, `model`/`pi_model`, `api_key`, `base_url`/`endpoint`,
-  `tools`/`pi_tools`, `thinking`/`pi_thinking`, `args`, and `system_prompt`
-- Claude knobs: `model`, `max_turns`, `max_budget_usd`,
-  `bypass_permissions`, and `system_prompt`
-- Copilot knobs: `model`, custom provider settings, GitHub token/auth knobs,
-  ACP/prompt execution behavior, `args`, and `system_prompt`
-
-Where the new normalized contract uses nested `config`, implement migration by
-normalizing the existing flat target fields into that internal shape. Do not
-drop existing accepted YAML fields as a side effect of adding `runtime`.
-
-Do not promote orchestration scheduler fields into the new target runtime
-contract. `workers`, `batch_requests`, and `subagent_mode_allowed` are existing
-compatibility/runtime-policy fields, not part of the target's coding-agent
-control boundary. Continue to handle them where AgentV already accepts them,
-but prefer `--workers`, project `execution.workers`, `evaluate_options`, or
-runtime policy for new scheduling behavior.
-
-`grader_target` is different. It is not a coding-agent runtime field, but the
-concept is not redundant: coding-agent targets usually cannot act as structured
-LLM graders, and AgentV workspaces often contain multiple LLM providers or
-endpoints. AgentV still needs a default grader target selection.
Preserve the -current resolution behavior while cleaning up provider runtimes: - -- CLI `--grader-target` is the strongest run-level override. -- Per-evaluator `target` remains the specific grader override. -- Target-level `grader_target` remains the compatibility/default grader for - that target until a clearer eval/project-level default is introduced. -- If a new canonical default is added later, prefer a grader/eval policy field - such as `default_grader_target` over putting grader selection inside - `runtime` or coding-agent provider `config`. +### Clean Contract + +This plan assumes a breaking cleanup. Do not preserve legacy target aliases or +compatibility-only fields in the new authored contract. + +For process-backed coding-agent providers, `config.command` is a non-empty argv +array. The first token is the executable or shim, such as `codex`, +`codex-personal`, `pi`, or an absolute binary path. Remaining tokens are extra +arguments. Do not add separate `args`, `arguments`, `executable`, or `binary` +fields to the new contract. + +```yaml +targets: file://targets.yaml +graders: file://graders.yaml + +defaults: + target: codex-local + grader: openai-grader +``` + +```yaml +# targets.yaml +- id: codex-local + provider: codex-app-server + runtime: host + config: + command: ["codex", "--config", "model_reasoning_effort=high"] + model: gpt-5-codex +``` + +Keep provider-specific knobs under `config`, using one canonical name per +concept. Examples: + +- common target runtime config: `command`, `model`, `cwd`, `timeout_seconds`, + `system_prompt`, `stream_log`, `log_dir` +- Codex config: `reasoning_effort`, `model_verbosity`, `base_url`, `api_key`, + `api_format`, `sandbox_mode`, `approval_policy` +- Pi config: `subprovider`, `tools`, `thinking` +- Claude config: `max_turns`, `max_budget_usd`, `bypass_permissions` +- Copilot config: custom provider/auth settings and ACP/prompt mode settings + +Orchestration policy is not target runtime config. 
Keep `workers`, batching, +retry policy, and subagent dispatch under project/run policy such as +`execution`, not inside target definitions. + +Grader selection is a separate registry/default concern. Do not put +`grader_target` on targets in the clean schema. Use `defaults.grader` for the +project default, CLI `--grader` / `--grader-target` for run override, and +per-evaluator `target` for a specific grader override. Promptfoo's comparable mechanism is assertion/test grading provider selection: assertions can set a `provider`, tests/defaultTest can provide fallback grading @@ -229,9 +211,9 @@ value for that field. Greenfield examples: model: gpt-5-mini ``` -For compatibility with AgentV's existing standalone `targets.yaml` convention, -the loader can also accept wrapped forms such as `targets: [...]` and -`graders: [...]`, but the Promptfoo-like authored shape is the bare field value. +Do not accept wrapped forms such as `targets: [...]` inside a file already +loaded through `targets: file://targets.yaml`. The referenced file is the field +value. The global `$AGENTV_HOME/config.yaml` is different: it owns Dashboard/operator state such as the `projects:` registry. Do not use the existence of global @@ -241,11 +223,10 @@ project-local `.agentv/config.yaml`. Greenfield, the cleanest global shape would put Dashboard project registry state in `$AGENTV_HOME/projects.yaml` and leave `$AGENTV_HOME/config.yaml` for global settings. If using Promptfoo-style references, the global config would -say `projects: file://projects.yaml`. Current AgentV code and docs use -`$AGENTV_HOME/config.yaml` with a top-level `projects:` registry, so do not -migrate this as part of the coding-agent target-runtime work. If the team wants -the cleaner split, create a separate migration Bead with backwards-compatible -reading from current `projects:` locations and a clear write target. +say `projects: file://projects.yaml`. 
+ +Do not add `dashboard.app_name` or other user-configurable AgentV branding to +the clean config contract. Dashboard product identity is not project policy. Promptfoo's comparable file-structure guidance is simpler: a main `promptfooconfig.yaml` commonly contains `providers`, `prompts`, `defaultTest`, @@ -274,7 +255,7 @@ and should default to internal process isolation: ```yaml targets: - - label: codex-sdk-isolated + - id: codex-sdk-isolated provider: codex-sdk runtime: host config: @@ -309,7 +290,7 @@ Failure mapping: | Source | Relevant pattern | AgentV decision | | --- | --- | --- | -| Promptfoo | Provider object uses `id`/`label`/`config`; Codex and Claude SDK providers put `model` in `config.model`; direct SDK adapters exist. | Keep `label`/`provider`/`config` ergonomics; keep `model` under `config`; do not make in-process SDK the default. | +| Promptfoo | Provider object uses `id` plus optional `label` and `config`; Codex and Claude SDK providers put `model` in `config.model`; direct SDK adapters exist. | Use `id` for stable identity, keep `provider`/`config` ergonomics, keep `model` under `config`, and do not make in-process SDK the default. | | OpenAI Symphony | Codex app-server subprocess with workspace/session orchestration, approval/sandbox policy, max-turn boundaries, and structured streaming/status. | Use `codex-app-server` as the preferred rich-control Codex provider. | | Kata Symphony | Pi is launched as `pi --mode rpc` locally or over SSH and controlled over stdio/RPC; workers must already have the runtime installed. | Add/prefer `pi-rpc` for rich Pi control; do not import Pi coding-agent SDK into AgentV's orchestrator. | | Vercel agent-eval | Installs agent CLIs inside ephemeral sandboxes and captures transcripts from CLI JSON/session logs. | `runtime.mode: sandbox` should support managed/pinned CLI install and transcript capture without host config bleed. 
| @@ -334,23 +315,21 @@ Use explicit provider kinds: Do not add `codex-rpc` unless Codex exposes a distinct RPC mode separate from app-server. For Codex, app-server is the protocol provider. -`config.command` is the executable or shim, not the provider identity. Extra -arguments may be supplied with `config.args` or, for compact argv-style input, -as a command array: +`config.command` is the argv array for the executable or shim. It is not the +provider identity: ```yaml targets: - - label: codex-personal + - id: codex-personal provider: codex-cli runtime: host config: - command: codex-personal - args: ["--model", "gpt-5-codex"] + command: ["codex-personal", "--model", "gpt-5-codex"] ``` ```yaml targets: - - label: codex-eng + - id: codex-eng provider: codex-cli runtime: host config: @@ -397,7 +376,7 @@ Keep provider names explicit by control boundary: - Add `runtime: host` shorthand and `runtime.mode: host | profile | sandbox`. - Keep `model` and `command` under `config`. -- Preserve `label` as target identity and `provider` as adapter/backend kind. +- Use `id` as target identity and `provider` as adapter/backend kind. - Reject invalid runtime modes with focused validation errors. - Document why `runtime` is the umbrella field. @@ -405,9 +384,8 @@ Keep provider names explicit by control boundary: - Split current ambiguous `codex` registry behavior into explicit `codex-cli`, `codex-app-server`, and `codex-sdk`. -- Make bare `codex`, if retained at all, alias to the chosen safe default - (`codex-app-server`) or reject it during the cleanup. It must not silently - select in-process SDK. +- Remove the bare `codex` provider name from the authored clean contract. Users + must choose `codex-cli`, `codex-app-server`, or `codex-sdk` explicitly. - Support `config.command` shims such as `codex-personal` and `codex-eng`. - Implement host/profile environment construction, including deliberate `HOME`, `CODEX_HOME`, temp dirs, and env allowlists for profile mode. 
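A minimal sketch of the two provider-side steps this hunk describes: treating `config.command` as an argv array, and building a deliberate environment for profile mode. The helper names `splitCommand` and `buildProfileEnv`, the `.codex` and `tmp` subdirectories, and the `TMPDIR` variable are illustrative assumptions; the plan does not define them.

```typescript
// Hypothetical helpers: names and directory layout are assumptions, not AgentV API.
type Env = Record<string, string>;

// Split a `config.command` argv array into the executable (or shim) and its
// extra arguments. No shell parsing is involved: tokens are used as given.
function splitCommand(command: unknown): { executable: string; args: string[] } {
  if (!Array.isArray(command) || command.length === 0) {
    throw new Error("config.command must be a non-empty argv array");
  }
  const tokens: string[] = [];
  for (const token of command) {
    if (typeof token !== "string" || token.length === 0) {
      throw new Error("config.command tokens must be non-empty strings");
    }
    tokens.push(token);
  }
  return { executable: tokens[0] as string, args: tokens.slice(1) };
}

// Build the child environment for `runtime.mode: profile`. Only allowlisted
// host variables are copied; HOME, CODEX_HOME, and TMPDIR are then set
// deliberately, overriding any allowlisted value with the same name.
function buildProfileEnv(hostEnv: Env, profileHome: string, allowlist: string[]): Env {
  const env: Env = {};
  for (const name of allowlist) {
    const value: string | undefined = hostEnv[name];
    if (value !== undefined) {
      env[name] = value;
    }
  }
  env["HOME"] = profileHome;
  env["CODEX_HOME"] = profileHome + "/.codex"; // assumed layout
  env["TMPDIR"] = profileHome + "/tmp"; // assumed layout
  return env;
}
```

A real adapter would probably resolve `profileHome` to an absolute path before spawning the process; the sketch keeps the authored value so the mapping from YAML to environment stays visible.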
@@ -443,7 +421,7 @@ Keep provider names explicit by control boundary: Every coding-agent provider must return or fail through a structured result envelope. AgentV must preserve: -- target label, provider kind, runtime mode, command, cwd, and model +- target id, provider kind, runtime mode, command, cwd, and model - stdout/stderr logs - structured event transcript when available - final assistant output @@ -457,10 +435,8 @@ crashes. ## Open Questions -- Whether to keep a bare `codex` alias at all. If kept, it should resolve to the - safe default, not SDK. - Whether to rename `pi-coding-agent` to `pi-sdk` during the major cleanup or - keep the existing provider name as an explicit legacy SDK provider. + replace the existing provider name with the shorter explicit SDK name. - Which sandbox substrate should be the first implementation target if existing AgentV runner support is insufficient. - How much transcript normalization belongs in provider adapters versus a shared @@ -469,7 +445,8 @@ crashes. ## Validation Plan - Schema tests for `runtime` shorthand/object forms and invalid values. -- Provider registry tests proving explicit provider names and safe aliases. +- Provider registry tests proving explicit provider names and no bare `codex` + fallback to SDK. - Codex CLI/app-server tests for command shims, host/profile env, timeout kill, nonzero exit, malformed output, and transcript capture. - Pi RPC tests with a fake `pi --mode rpc` process. 
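The schema and registry checks listed in the validation plan above could be exercised against something like the following. This is a sketch under stated assumptions: `normalizeRuntime` and `assertExplicitProvider` are hypothetical names, and the error wording is illustrative only.

```typescript
// Hypothetical validation helpers; not taken from the AgentV codebase.
type RuntimeMode = "host" | "profile" | "sandbox";

interface NormalizedRuntime {
  mode: RuntimeMode;
  home?: string;
}

const RUNTIME_MODES: readonly string[] = ["host", "profile", "sandbox"];

// Accept either the string shorthand (`runtime: host`) or the object form
// (`runtime: { mode: profile, home: ... }`) and return one normalized shape.
function normalizeRuntime(runtime: unknown): NormalizedRuntime {
  const raw: unknown = typeof runtime === "string" ? { mode: runtime } : runtime;
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) {
    throw new Error("runtime must be a string shorthand or an object with `mode`");
  }
  const obj = raw as Record<string, unknown>;
  const mode = obj["mode"];
  if (typeof mode !== "string" || RUNTIME_MODES.indexOf(mode) === -1) {
    throw new Error(
      "runtime.mode must be one of host, profile, sandbox; got " + JSON.stringify(mode),
    );
  }
  const normalized: NormalizedRuntime = { mode: mode as RuntimeMode };
  const home = obj["home"];
  if (typeof home === "string") {
    normalized.home = home;
  }
  return normalized;
}

// Reject the bare `codex` provider name so it can never silently select the
// in-process SDK; every other name is left to the provider registry.
function assertExplicitProvider(provider: string): void {
  if (provider === "codex") {
    throw new Error(
      "provider `codex` is ambiguous; choose codex-cli, codex-app-server, or codex-sdk",
    );
  }
}
```

Because both forms funnel through one function, the shorthand and object tests can assert on the same normalized value rather than on two code paths.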
From 4b5af83e4fd8f8a06abb2fc0cd726d83dce0ecb2 Mon Sep 17 00:00:00 2001 From: Christopher Tso Date: Fri, 3 Jul 2026 14:31:35 +0200 Subject: [PATCH 08/10] docs: prefer clean AgentV contracts over peer baggage --- .agents/product-boundary.md | 2 ++ AGENTS.md | 1 + 2 files changed, 3 insertions(+) diff --git a/.agents/product-boundary.md b/.agents/product-boundary.md index bb821851d..ec959166c 100644 --- a/.agents/product-boundary.md +++ b/.agents/product-boundary.md @@ -85,6 +85,8 @@ Research those references from local cloned repositories first when a clone is a Treat these as reference inputs, not dependencies. AgentV should adopt the shared lowest common denominator when it fits the repo-native artifact model, and document any intentional divergence in the relevant plan, ADR, or contract docs. +Do not copy another framework's schema baggage just because the framework is credible. When a peer contract carries historical constraints, overloaded field names, or compatibility aliases, prefer a cleaner AgentV contract if it preserves the core user need. Document the reason for diverging so future workers do not "realign" it back to the peer shape. For target/provider contracts, keep identity and backend/control boundary separate: use a stable AgentV `id` for the target registry key when `provider` already names the adapter/backend kind. Promptfoo's `label` is useful evidence but should not be copied as target identity merely because Promptfoo uses `id` for provider/backend specs. + ### 5. YAGNI - You Aren't Gonna Need It Do not build features until there is a concrete need. Start with the simplest version that satisfies current demand. diff --git a/AGENTS.md b/AGENTS.md index 6cb2ce7b6..66d672b35 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -26,6 +26,7 @@ Design guardrails: - Document composition patterns before inventing a new feature. - Match industry-standard lowest-common-denominator contracts when possible. 
- When designing AgentV contracts, check public reference standards such as Claude Skills, Vercel agent-eval, Hugging Face Datasets, and OpenInference before inventing AgentV-specific shapes. Use their shared lowest common denominator where it fits, and document any intentional divergence. +- Treat peer frameworks as evidence, not schema authority. Do not inherit baggage such as overloaded field names, compatibility aliases, or framework-specific historical constraints when AgentV can express a cleaner repo-native contract. Example: prefer `id` for stable AgentV target identity when `provider` already names the backend/control boundary, even if Promptfoo uses `label` because its `id` field is overloaded as a provider spec. - For peer-framework research, use local cloned repositories and DeepWiki MCP before broad web search. In this operator workspace, Promptfoo is cloned at `/home/entity/projects/promptfoo/promptfoo` and DeepEval is cloned at `/home/entity/projects/confident-ai/deepeval`; use DeepWiki repos `promptfoo/promptfoo` and `confident-ai/deepeval` for architecture-level orientation, then verify exact claims with `rg` and `git` in the local clone. If a public contract must be checked for currentness, use official docs and record the source URL or clone commit behind the conclusion. - Apply YAGNI aggressively and solve the current request with the smallest surface that works. - Keep extensions non-breaking unless a same-week unreleased surface should be hard-corrected. 
From c4131d4941e4953a0185c4f2fa53d32821511a7f Mon Sep 17 00:00:00 2001 From: Christopher Tso Date: Fri, 3 Jul 2026 14:45:36 +0200 Subject: [PATCH 09/10] docs: clarify composable AgentV config contract --- ...03-coding-agent-target-runtime-contract.md | 133 ++++++++++-------- 1 file changed, 77 insertions(+), 56 deletions(-) diff --git a/docs/plans/2026-07-03-coding-agent-target-runtime-contract.md b/docs/plans/2026-07-03-coding-agent-target-runtime-contract.md index a7ddfa392..9bdffda4e 100644 --- a/docs/plans/2026-07-03-coding-agent-target-runtime-contract.md +++ b/docs/plans/2026-07-03-coding-agent-target-runtime-contract.md @@ -105,24 +105,25 @@ arguments. Do not add separate `args`, `arguments`, `executable`, or `binary` fields to the new contract. ```yaml -targets: file://targets.yaml -graders: file://graders.yaml +targets: + - id: codex-local + provider: codex-app-server + runtime: host + config: + command: ["codex", "--config", "model_reasoning_effort=high"] + model: gpt-5-codex + +graders: + - id: openai-grader + provider: openai + config: + model: gpt-5-mini defaults: target: codex-local grader: openai-grader ``` -```yaml -# targets.yaml -- id: codex-local - provider: codex-app-server - runtime: host - config: - command: ["codex", "--config", "model_reasoning_effort=high"] - model: gpt-5-codex -``` - Keep provider-specific knobs under `config`, using one canonical name per concept. Examples: @@ -134,9 +135,11 @@ concept. Examples: - Claude config: `max_turns`, `max_budget_usd`, `bypass_permissions` - Copilot config: custom provider/auth settings and ACP/prompt mode settings -Orchestration policy is not target runtime config. Keep `workers`, batching, -retry policy, and subagent dispatch under project/run policy such as -`execution`, not inside target definitions. +Orchestration policy is not target runtime config. 
Keep general eval +concurrency, batching, retry policy, and subagent dispatch under project/run +policy such as `execution`, not inside target definitions. Use +`execution.max_concurrency` for general parallelism. Reserve `workers` for a +provider-specific config only when that provider truly uses worker processes. Grader selection is a separate registry/default concern. Do not put `grader_target` on targets in the clean schema. Use `defaults.grader` for the @@ -150,49 +153,70 @@ providers. It does not put grader selection in the target provider runtime. ### Project File Layout -Keep registries separate from policy: +Support composable/decomposable configuration. A single `.agentv/config.yaml` +and split files should be two authoring forms of the same config graph: ```text .agentv/ config.yaml - targets.yaml - graders.yaml ``` -Project-local `.agentv/config.yaml` should remain the portable project policy -file: defaults, `execution`, `eval_patterns`, `refs`, tags, result defaults, and -other run-level settings. It may point at the default target/grader by name, but -it should not become the registry that holds all target and grader definitions. -Following Promptfoo's modular-config idiom, use direct field references rather -than a named import table: +Project-local `.agentv/config.yaml` should be able to hold the full project +contract: targets, graders, defaults, `execution`, `eval_patterns`, refs, tags, +result defaults, and other run-level settings. This matches Promptfoo's primary +authoring model, where `promptfooconfig.yaml` commonly contains providers, +prompts, tests, defaultTest, and run options in one file. + +In other words, `.agentv/config.yaml` can technically contain every supported +field that an `eval.yaml` can contain. An eval file is a focused, shareable +slice of the same config graph, while `.agentv/config.yaml` is the project-root +manifest that can also carry project defaults and policy. 
Avoid creating two +competing top-level schemas for "project config" versus "eval config" unless a +field is intentionally scoped to one of those contexts. + +The `.agentv/` folder still matters even though Promptfoo does not have the same +project/global split. It gives AgentV a conventional project root for automatic +discovery, checked-in defaults, repo-local policy, result/artifact adjacency, +and composable config without requiring every command to pass explicit file +paths. The global AgentV config can provide operator/user defaults across +projects, while `.agentv/config.yaml` overrides or composes project-specific +targets, graders, tests, datasets, and execution policy. ```yaml # .agentv/config.yaml -targets: file://targets.yaml -graders: file://graders.yaml +targets: + - id: codex-local + provider: codex-app-server + runtime: host + config: + command: ["codex"] + model: gpt-5-codex + +graders: + - id: openai-grader + provider: openai + config: + model: gpt-5-mini defaults: target: codex-local grader: openai-grader execution: - workers: 3 + max_concurrency: 3 ``` -Do not introduce a greenfield `files:` or `imports:` section for this unless -AgentV needs a capability that direct field references cannot express. -Promptfoo's pattern is `providers: file://configs/providers.yaml`, -`tests: file://tests/`, and `defaultTest: file://configs/default-test.yaml`; -the field being configured names the thing being loaded. - -`targets.yaml` should remain the registry of subjects under test. `graders.yaml` -should be the registry of reusable grading providers. This keeps target runtime -contracts reviewable, keeps grader credentials/endpoints separate from agent -runtimes, and matches AgentV's existing artifact model where run manifests carry -explicit `targets_path` and `graders_path` entries. 
+For larger projects, generated configs, or secret-splitting workflows, any +supported config field can be decomposed into a Promptfoo-style direct field +reference whose target file contains that field's value. Do not introduce a +greenfield `files:` or `imports:` section unless AgentV needs a capability that +direct field references cannot express. Promptfoo's pattern is `providers: +file://configs/providers.yaml`, `tests: file://tests/`, and `defaultTest: +file://configs/default-test.yaml`; the field being configured names the thing +being loaded. For Promptfoo-style field references, the referenced file should contain the -value for that field. Greenfield examples: +value for that field. Optional split-file examples: ```yaml # .agentv/targets.yaml @@ -212,29 +236,26 @@ value for that field. Greenfield examples: ``` Do not accept wrapped forms such as `targets: [...]` inside a file already -loaded through `targets: file://targets.yaml`. The referenced file is the field -value. +loaded through `targets: file://targets.yaml`, or `tests: [...]` inside a file +already loaded through `tests: file://tests.yaml`. The referenced file is the +field value. -The global `$AGENTV_HOME/config.yaml` is different: it owns Dashboard/operator -state such as the `projects:` registry. Do not use the existence of global -`projects:` as a reason to put project-local target/grader registries into -project-local `.agentv/config.yaml`. +The global `$AGENTV_HOME/config.yaml` can also use the same direct-field style, +including inline `projects:` for small installations or `projects: +file://projects.yaml` for larger registries. Do not add a separate import table +for global config either. -Greenfield, the cleanest global shape would put Dashboard project registry -state in `$AGENTV_HOME/projects.yaml` and leave `$AGENTV_HOME/config.yaml` for -global settings. If using Promptfoo-style references, the global config would -say `projects: file://projects.yaml`. 
+Greenfield, the cleanest default is one readable config graph. Inline and split +forms should normalize to the same internal shape. Do not add `dashboard.app_name` or other user-configurable AgentV branding to the clean config contract. Dashboard product identity is not project policy. -Promptfoo's comparable file-structure guidance is simpler: a main -`promptfooconfig.yaml` commonly contains `providers`, `prompts`, `defaultTest`, -and `tests`, while larger configs can reference external files such as provider -YAML with `file://...`. Promptfoo does not have AgentV's separate home-scoped -Dashboard project registry, so it is useful as a modular-config reference but -not a direct reason to collapse AgentV's project, target, and grader registries -into one file. +Promptfoo's comparable file-structure guidance is the closest reference here: +a main `promptfooconfig.yaml` commonly contains `providers`, `prompts`, +`defaultTest`, `tests`, and run options, while larger configs can reference +external files with `file://...`. AgentV should follow that authoring posture +while keeping cleaner AgentV field names. 
### Runtime Modes From 7c486f06fee9ef3ca846b5ee9505b7ad8417f293 Mon Sep 17 00:00:00 2001 From: Christopher Tso Date: Fri, 3 Jul 2026 14:49:11 +0200 Subject: [PATCH 10/10] docs: split config contract from target orchestration plan --- .../2026-07-03-agentv-config-contract.md | 232 ++++++++++++++++++ ...03-coding-agent-target-runtime-contract.md | 115 +-------- 2 files changed, 239 insertions(+), 108 deletions(-) create mode 100644 docs/plans/2026-07-03-agentv-config-contract.md diff --git a/docs/plans/2026-07-03-agentv-config-contract.md b/docs/plans/2026-07-03-agentv-config-contract.md new file mode 100644 index 000000000..41af7ddca --- /dev/null +++ b/docs/plans/2026-07-03-agentv-config-contract.md @@ -0,0 +1,232 @@ +--- +artifact_contract: ce-unified-plan/v1 +artifact_readiness: implementation-ready +product_contract_source: av-vrx8-research +execution: code +title: "AgentV composable config contract" +created_at: 2026-07-03 +type: feature +bead: av-y7eq.1 +--- + +# AgentV composable config contract + +## Goal Capsule + +- **Objective:** Give AgentV one clean config graph that works as project + manifest, eval definition, and composable split-file config without copying + Promptfoo's legacy naming baggage. +- **Core decision:** `.agentv/config.yaml` and `eval.yaml` use the same eval + config graph for eval-definition fields. `.agentv/config.yaml` is the + project-root manifest and can additionally carry project defaults and policy. +- **Primary Bead:** `av-y7eq.1` +- **Related Beads:** `av-y7eq`, `av-y7eq.8` +- **Non-goal:** Do not create separate competing schemas for project config and + eval config unless a field is intentionally scoped to one context. + +## Summary + +AgentV should have one composable/decomposable config graph. + +Small projects can keep everything in `.agentv/config.yaml`. Larger projects can +split any supported field into a `file://...` reference whose target file +contains that field's value. 
Both forms normalize to the same internal shape. + +This follows Promptfoo's useful authoring posture without copying all Promptfoo +field names. Promptfoo commonly lets `promptfooconfig.yaml` contain providers, +prompts, tests, defaultTest, and run options directly, and also lets those fields +point at files. AgentV should do the same at the graph level while preserving +AgentV terms such as targets, graders, projects, and run bundles. + +## Contract + +### Config Graph + +`.agentv/config.yaml` can technically contain every supported field that an +`eval.yaml` can contain: + +```yaml +targets: + - id: codex-local + provider: codex-app-server + runtime: host + config: + command: ["codex", "app-server"] + model: gpt-5-codex + +graders: + - id: openai-grader + provider: openai + config: + model: gpt-5-mini + +tests: + - id: smoke + input: "Fix the failing test" + +defaults: + target: codex-local + grader: openai-grader + +execution: + max_concurrency: 3 +``` + +An `eval.yaml` is a focused, shareable slice of the same graph. It may contain +targets, graders, tests/evaluators, datasets, defaults, execution overrides, and +other eval-definition fields. `.agentv/config.yaml` is the project-root +manifest, so it may also own persistent project defaults and policy. + +### Scope Distinction + +The schemas should be shared where the field meaning is shared, but the file +roles are not identical: + +| File | Role | +| --- | --- | +| `.agentv/config.yaml` | Project-root manifest. Provides automatic discovery, checked-in defaults, repo-local policy, result/artifact adjacency, and composition against global defaults. | +| `eval.yaml` | Portable eval slice. Good for sharing, one-off suites, examples, or benchmark-specific overrides. | +| `$AGENTV_HOME/config.yaml` | User/operator defaults across projects. May include project registry, default result locations, or global provider defaults. | + +Do not pretend every field is valid in every context. 
Project identity, +Dashboard project registry, and persistent operator defaults belong in +`.agentv/config.yaml` or global config, not an eval slice. Eval-definition +fields should remain shared. + +### Field References + +Any supported config field can be decomposed into a direct `file://...` reference +whose target file contains that field's value: + +```yaml +targets: file://targets.yaml +graders: file://graders.yaml +tests: file://tests.yaml + +defaults: + target: codex-local + grader: openai-grader +``` + +Referenced array-valued fields contain a bare array: + +```yaml +# .agentv/targets.yaml +- id: codex-local + provider: codex-app-server + runtime: host + config: + command: ["codex"] +``` + +```yaml +# .agentv/tests.yaml +- id: smoke + input: "Fix the failing test" +``` + +Referenced object-valued fields contain a bare object: + +```yaml +# .agentv/defaults.yaml +target: codex-local +grader: openai-grader +``` + +Do not introduce a separate `files:` or `imports:` table unless AgentV needs a +capability direct field references cannot express. The field being configured +names the value being loaded. + +Do not accept wrapped forms such as `targets: [...]` inside a file already +loaded through `targets: file://targets.yaml`, or `tests: [...]` inside a file +loaded through `tests: file://tests.yaml`. The referenced file is the field +value. + +### Target And Grader Fields + +Target objects use: + +| Field | Meaning | +| --- | --- | +| `id` | Stable AgentV identity for selection, artifacts, dashboard, and comparisons. | +| `provider` | Adapter/control boundary such as `codex-cli`, `codex-app-server`, `pi-rpc`, `claude-cli`, or `openai`. | +| `runtime` | Coding-agent execution placement: `host`, `profile`, or `sandbox`. | +| `config` | Provider-specific configuration such as `model`, `command`, timeouts, env, protocol, and provider knobs. | + +Use `defaults.target` and `defaults.grader` for run defaults. Do not put +`grader_target` on targets. 
+ +Use `config.command` as a non-empty argv array for process-backed providers: + +```yaml +config: + command: ["codex-personal", "app-server"] +``` + +Do not add parallel `args`, `arguments`, `executable`, or `binary` fields in the +authored contract. + +### Execution Policy + +Use `execution.max_concurrency` for general eval parallelism: + +```yaml +execution: + max_concurrency: 3 +``` + +Promptfoo evidence checked on 2026-07-03: + +- DeepWiki for `promptfoo/promptfoo` reports general concurrency through + `evaluateOptions.maxConcurrency`, `commandLineOptions.maxConcurrency`, and + CLI `--max-concurrency` / `-j`. +- Local Promptfoo clone + `/home/entity/projects/promptfoo/promptfoo` at + `6bfc5a0c7f16f9c4717ac731d276b578e63d0769` verifies that `src/node/doEval.ts` + resolves `maxConcurrency` from CLI, `commandLineOptions`, `evaluateOptions`, + then default, and that Python `config.workers` is provider-specific in + `src/providers/pythonCompletion.ts`. + +Therefore, `workers` should not be AgentV's general run-policy field. Reserve it +for provider-specific config only when a provider truly manages worker +processes. + +## Rejected Baggage + +Do not include these in the greenfield authored contract: + +- `label` or `name` as target identity. +- bare ambiguous provider aliases such as `provider: codex`. +- target-level `grader_target`. +- user-configurable `dashboard.app_name`. +- process field variants `executable`, `binary`, `args`, `arguments`. +- target-level `workers`, batching, retry, or subagent-dispatch controls. +- compatibility-only wrapper files for direct field refs. + +## Implementation Notes + +- Implement refs as field-level resolution before schema normalization. +- Keep wire-format keys `snake_case`; translate to internal TypeScript + `camelCase` only at boundaries. +- Ensure inline and split forms produce identical normalized objects. +- Validation errors should point to the authored path, including the referenced + file path when applicable. 
+- Public docs should show both inline and split-file forms, without presenting
+  split files as mandatory.
+- Migration text is unnecessary unless a later decision requires backward
+  compatibility.
+
+## Acceptance Criteria
+
+- `.agentv/config.yaml` can inline targets, graders, tests/evaluators, defaults,
+  and execution policy.
+- `eval.yaml` can contain the same eval-definition fields and normalize through
+  the same schema path.
+- Any supported field can be a `file://...` ref whose file contains that field's
+  value.
+- Inline and split forms normalize identically.
+- Context-scoped fields are validated according to file role, so project/global
+  identity and registry fields do not accidentally become portable eval-slice
+  fields.
+- `execution.max_concurrency` is the general concurrency field.
+- Removed Promptfoo/legacy baggage fields are rejected with focused errors.
diff --git a/docs/plans/2026-07-03-coding-agent-target-runtime-contract.md b/docs/plans/2026-07-03-coding-agent-target-runtime-contract.md
index 9bdffda4e..c61d9c0e5 100644
--- a/docs/plans/2026-07-03-coding-agent-target-runtime-contract.md
+++ b/docs/plans/2026-07-03-coding-agent-target-runtime-contract.md
@@ -20,8 +20,9 @@ bead: av-y7eq
   providers, when retained, default to internal process isolation rather than
   importing risky agent SDKs in the AgentV orchestrator process.
 - **Primary Bead:** `av-y7eq`
-- **Implementation Beads:** `av-y7eq.1` through `av-y7eq.5`; existing SDK
-  subprocess follow-up `av-57i` / `av-57i.1`.
+- **Implementation Beads:** `av-y7eq.2` through `av-y7eq.7`; config contract
+  prerequisite `av-y7eq.1`; existing SDK subprocess follow-up `av-57i` /
+  `av-57i.1`.
 - **Non-goal:** Do not replace AgentV with Promptfoo, Symphony, Kata, Margin, or
   Vercel agent-eval. Borrow their proven boundaries and keep AgentV's
   repo-native run bundle model.
@@ -78,6 +79,10 @@ targets:
       model: gpt-5-codex
 ```
 
+For config graph, file layout, `eval.yaml` relationship, and field-level
+`file://...` references, see
+[AgentV composable config contract](2026-07-03-agentv-config-contract.md).
+
 ## Product Contract
 
 ### Stable Fields
@@ -151,112 +156,6 @@ assertions can set a `provider`, tests/defaultTest can provide fallback grading
 providers, and model-graded matchers fall back to type-specific default grading
 providers. It does not put grader selection in the target provider runtime.
 
-### Project File Layout
-
-Support composable/decomposable configuration. A single `.agentv/config.yaml`
-and split files should be two authoring forms of the same config graph:
-
-```text
-.agentv/
-  config.yaml
-```
-
-Project-local `.agentv/config.yaml` should be able to hold the full project
-contract: targets, graders, defaults, `execution`, `eval_patterns`, refs, tags,
-result defaults, and other run-level settings. This matches Promptfoo's primary
-authoring model, where `promptfooconfig.yaml` commonly contains providers,
-prompts, tests, defaultTest, and run options in one file.
-
-In other words, `.agentv/config.yaml` can technically contain every supported
-field that an `eval.yaml` can contain. An eval file is a focused, shareable
-slice of the same config graph, while `.agentv/config.yaml` is the project-root
-manifest that can also carry project defaults and policy. Avoid creating two
-competing top-level schemas for "project config" versus "eval config" unless a
-field is intentionally scoped to one of those contexts.
-
-The `.agentv/` folder still matters even though Promptfoo does not have the same
-project/global split. It gives AgentV a conventional project root for automatic
-discovery, checked-in defaults, repo-local policy, result/artifact adjacency,
-and composable config without requiring every command to pass explicit file
-paths.
The global AgentV config can provide operator/user defaults across
-projects, while `.agentv/config.yaml` overrides or composes project-specific
-targets, graders, tests, datasets, and execution policy.
-
-```yaml
-# .agentv/config.yaml
-targets:
-  - id: codex-local
-    provider: codex-app-server
-    runtime: host
-    config:
-      command: ["codex"]
-      model: gpt-5-codex
-
-graders:
-  - id: openai-grader
-    provider: openai
-    config:
-      model: gpt-5-mini
-
-defaults:
-  target: codex-local
-  grader: openai-grader
-
-execution:
-  max_concurrency: 3
-```
-
-For larger projects, generated configs, or secret-splitting workflows, any
-supported config field can be decomposed into a Promptfoo-style direct field
-reference whose target file contains that field's value. Do not introduce a
-greenfield `files:` or `imports:` section unless AgentV needs a capability that
-direct field references cannot express. Promptfoo's pattern is `providers:
-file://configs/providers.yaml`, `tests: file://tests/`, and `defaultTest:
-file://configs/default-test.yaml`; the field being configured names the thing
-being loaded.
-
-For Promptfoo-style field references, the referenced file should contain the
-value for that field. Optional split-file examples:
-
-```yaml
-# .agentv/targets.yaml
-- id: codex-local
-  provider: codex-app-server
-  runtime: host
-  config:
-    command: ["codex"]
-```
-
-```yaml
-# .agentv/graders.yaml
-- id: openai-grader
-  provider: openai
-  config:
-    model: gpt-5-mini
-```
-
-Do not accept wrapped forms such as `targets: [...]` inside a file already
-loaded through `targets: file://targets.yaml`, or `tests: [...]` inside a file
-already loaded through `tests: file://tests.yaml`. The referenced file is the
-field value.
-
-The global `$AGENTV_HOME/config.yaml` can also use the same direct-field style,
-including inline `projects:` for small installations or `projects:
-file://projects.yaml` for larger registries. Do not add a separate import table
-for global config either.
-
-Greenfield, the cleanest default is one readable config graph. Inline and split
-forms should normalize to the same internal shape.
-
-Do not add `dashboard.app_name` or other user-configurable AgentV branding to
-the clean config contract. Dashboard product identity is not project policy.
-
-Promptfoo's comparable file-structure guidance is the closest reference here:
-a main `promptfooconfig.yaml` commonly contains `providers`, `prompts`,
-`defaultTest`, `tests`, and run options, while larger configs can reference
-external files with `file://...`. AgentV should follow that authoring posture
-while keeping cleaner AgentV field names.
-
 ### Runtime Modes
 
 | Runtime | Boundary | Use case |