feat(results): implement ADR-0017 bundle content by christso · Pull Request #1623 · EntityProcess/agentv

christso · 2026-07-03T13:08:10Z

Summary

Implements the ADR-0017 run-bundle output CONTENT contract for local run bundles, result readers, docs, and tests. This PR intentionally preserves the av-i3zw scope boundary: it does not remove or redesign the published Git results branch top-level wrapper beyond reader compatibility for the new bundle contents.

Output format decision

Recommended canonical local run-bundle tree:

.agentv/results/
  .indexes/
    runs.jsonl
    cases.jsonl
  <run_id>/
    summary.json
    .internal/
      index.jsonl
      progress.json
      events.jsonl
      bundle.json
    <case_id>/
      summary.json
      test/
      sample-1/
        result.json
        grading.json
        metrics.json
        transcript.json
        transcript-raw.jsonl
        outputs/
      sample-2/
        result.json
        grading.json
        metrics.json
        transcript.json
        transcript-raw.jsonl
        outputs/
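
The tree above can be expressed as a small path helper. This is a minimal sketch, not AgentV's actual API: the names `joinPath`, `perRunIndexPath`, and `samplePaths` are hypothetical, and the rule that sample folders are numbered from 1 is an assumption read off the `sample-1/` and `sample-2/` entries in the tree.

```typescript
// Hypothetical helpers for illustration only; names are not taken from AgentV.
interface SamplePaths {
  result: string;
  grading: string;
  metrics: string;
  transcript: string;
  transcriptRaw: string;
  outputsDir: string;
}

// Join segments with "/" after trimming trailing slashes (real code would use node:path).
function joinPath(...parts: string[]): string {
  return parts.map((p) => p.replace(/\/+$/, "")).join("/");
}

// Per-run index lives under the hidden .internal/ folder of the run bundle.
function perRunIndexPath(resultsRoot: string, runId: string): string {
  return joinPath(resultsRoot, runId, ".internal", "index.jsonl");
}

// Sidecar files for one sample of one case. Assumes folders are numbered from 1.
function samplePaths(
  resultsRoot: string,
  runId: string,
  caseId: string,
  sampleNumber: number,
): SamplePaths {
  if (Math.floor(sampleNumber) !== sampleNumber || sampleNumber < 1) {
    throw new RangeError("sampleNumber must be a positive integer");
  }
  const dir = joinPath(resultsRoot, runId, caseId, "sample-" + sampleNumber);
  return {
    result: joinPath(dir, "result.json"),
    grading: joinPath(dir, "grading.json"),
    metrics: joinPath(dir, "metrics.json"),
    transcript: joinPath(dir, "transcript.json"),
    transcriptRaw: joinPath(dir, "transcript-raw.jsonl"),
    outputsDir: joinPath(dir, "outputs"),
  };
}
```

For example, `samplePaths(".agentv/results", "run-a", "case-x", 2).metrics` yields `.agentv/results/run-a/case-x/sample-2/metrics.json`, where `run-a` and `case-x` are placeholder ids.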

Recommended published Git results branch tree: the durable run bundle shape should be mirrored at the results branch run root. This PR keeps av-i3zw's separation intact: av-i3zw owns the published branch wrapper decision and any removal/reader changes for a legacy top-level runs/ wrapper. This PR makes the content under each run bundle canonical regardless of local or published transport.

Source of truth vs projections:

Source of truth: <run_id>/summary.json, <run_id>/.internal/index.jsonl, per-case summary.json, per-sample result.json, grading.json, metrics.json, transcripts, outputs, and test bundle sidecars.
Rebuildable projections: .indexes/runs.jsonl, .indexes/cases.jsonl, local caches, Dashboard views, SQLite/search/report/export projections.
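
Because `.indexes/runs.jsonl` is a projection, it should be reproducible from the run-root summaries alone. The sketch below illustrates that idea under stated assumptions: the `RunSummary` shape, the projected row fields (`summary_path`, the run-relative `index_path`), and the function name `buildRunsJsonl` are all hypothetical, and only `index_path` on `summary.json` is named in this PR.

```typescript
// Hypothetical minimal view of a run-root summary.json; real field names may differ.
interface RunSummary {
  run_id: string;
  index_path: string; // relative to the run root, e.g. ".internal/index.jsonl"
}

// Rebuild the cross-run projection purely from source-of-truth summaries.
// Rows are sorted by run_id so that rebuilding twice gives identical output.
function buildRunsJsonl(summaries: RunSummary[]): string {
  const rows = summaries
    .slice()
    .sort((a, b) => (a.run_id < b.run_id ? -1 : a.run_id > b.run_id ? 1 : 0))
    .map((s) =>
      JSON.stringify({
        run_id: s.run_id,
        summary_path: s.run_id + "/summary.json",
        index_path: s.run_id + "/" + s.index_path,
      }),
    );
  // JSONL: one JSON object per line, newline-terminated; empty input gives an empty file.
  return rows.length > 0 ? rows.join("\n") + "\n" : "";
}
```

Deleting the projection and calling the function again over the same summaries gives byte-identical output, which is the property that makes it safe to treat as a cache.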

Naming decisions:

Per-run index: .internal/index.jsonl, referenced by summary.json.index_path.
Cross-run indexes: .indexes/runs.jsonl and .indexes/cases.jsonl; do not move these into .internal/ and do not rename .indexes to .index.
Summary: run root summary.json is the jq-friendly aggregate with run id, status breakdown, counts, pass@k, usage/cost/tokens, infra failures, per-case and per-instance outcomes.
Grading: grading.json owns assertion/rubric verdict evidence.
Metrics: metrics.json owns top-level duration, tokens, cost, execution, and trajectory; new writers do not emit timing.json, timing_path, metrics.timing, or source_artifacts.timing_path.
Transcripts: normalized transcript.json plus raw evidence transcript-raw.jsonl stay in each sample-N/ folder.
Samples/retries: repeated outputs use sample-N/ folders plus explicit sample_index and retry_index row metadata; folder names are not semantic comparison dimensions.
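
The metrics rule above names four legacy timing outputs that new writers must not emit. A reader or validator could flag the three field-level ones with a check like the following. This is a sketch under assumptions: it assumes the legacy fields would appear on a single parsed row or sidecar object, and `findLegacyTimingFields` is a hypothetical name, not an AgentV function.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

function isObject(v: Json | undefined): v is { [key: string]: Json } {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Returns the legacy timing fields present on one parsed object.
// Field names come from the naming decision above; where they live is an assumption.
function findLegacyTimingFields(row: Json): string[] {
  const found: string[] = [];
  if (!isObject(row)) return found;
  if ("timing_path" in row) found.push("timing_path");
  const metrics = row["metrics"];
  if (isObject(metrics) && "timing" in metrics) found.push("metrics.timing");
  const sources = row["source_artifacts"];
  if (isObject(sources) && "timing_path" in sources) {
    found.push("source_artifacts.timing_path");
  }
  return found;
}
```

An empty result means the object carries none of the legacy fields; the presence of a `timing.json` file itself would need a separate filesystem check.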

Intentional divergences from references:

Margin evals (/home/entity/projects/Margin-Lab/evals at 53fb2fd080689efaf7934573d8759d14fc1043e4) uses results.json, internal/, instances/<id>/result.json, and a richer operational RunStore. AgentV keeps ADR-0017 summary.json, hidden .internal/, and case/sample sidecars to keep the bundle small and AI-readable without importing Margin's store abstraction.
Vercel agent-eval (/home/entity/projects/vercel-labs/agent-eval at a9dcc9a8c53dbc22ececc967ded7ab3963f18e67) writes per-eval summary.json and run-N/ attempts with transcripts. AgentV uses sample-N/ because run already means the top-level run bundle.
Agent Skills evaluating-skills (https://raw.githubusercontent.com/agentskills/agentskills/main/docs/skill-creation/evaluating-skills.mdx, accessed 2026-07-03) separates grading.json and timing.json. AgentV keeps the grading evidence idea but merges timing into metrics.json so one usage/behavior file owns duration, tokens, cost, execution, and trajectory.

Bead/ADR wording: no ADR rename is needed. The Bead wording should record this decision explicitly, including that branch-wrapper changes remain av-i3zw-owned and that live dogfood is currently blocked by provider connectivity.

Implementation

Writes .internal/index.jsonl and rich summary.json with index_path, status/count/pass@k/usage/infra failure/case/instance data.
Writes sample-N/ folders with result.json, grading.json, metrics.json, transcripts, outputs, and test bundle links.
Adds rebuildable .indexes/runs.jsonl and .indexes/cases.jsonl projections.
Updates result readers, validation, Dashboard/results serving, pipeline outputs, docs, and migration references to consume the new content contract while tolerating old bundles where needed.

Verification

bun install
bun test ./apps/cli/test/commands/eval/artifact-writer.test.ts ./apps/cli/test/commands/eval/result-layout.test.ts ./apps/cli/test/commands/results/validate.test.ts
bun test ./packages/core/test/evaluation/results-repo.test.ts ./packages/core/src/evaluation/results-repo-cache.test.ts
bun run build
git diff --check
Simple dogfood: a mock CLI eval writing to a temporary canonical .agentv/results/smoke-run output directory wrote .internal/index.jsonl, sample-1/metrics.json, .indexes/runs.jsonl, and .indexes/cases.jsonl, and did not write a root index.jsonl or timing.json.
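
The dogfood check above amounts to asserting a set of required and forbidden files. A minimal sketch of that assertion, assuming file paths are listed relative to the `.agentv/results` root; `checkSmokeBundle` and `SmokeReport` are hypothetical names and this is not the script used for the PR.

```typescript
interface SmokeReport {
  missing: string[];   // required artifacts that were not found
  forbidden: string[]; // legacy artifacts that should no longer be written
}

// files: paths relative to the results root, using "/" separators.
function checkSmokeBundle(files: string[], runId: string): SmokeReport {
  const has = (path: string): boolean => files.indexOf(path) !== -1;
  const missing: string[] = [];
  if (!has(runId + "/.internal/index.jsonl")) missing.push(".internal/index.jsonl");
  const hasSampleMetrics = files.some(
    (f) => f.indexOf(runId + "/") === 0 && /\/sample-1\/metrics\.json$/.test(f),
  );
  if (!hasSampleMetrics) missing.push("sample-1/metrics.json");
  if (!has(".indexes/runs.jsonl")) missing.push(".indexes/runs.jsonl");
  if (!has(".indexes/cases.jsonl")) missing.push(".indexes/cases.jsonl");
  // Legacy outputs: a root-level index.jsonl in the run, or any timing.json.
  const forbidden = files.filter(
    (f) => f === runId + "/index.jsonl" || /(^|\/)timing\.json$/.test(f),
  );
  return { missing, forbidden };
}
```

A passing smoke run is one where both `missing` and `forbidden` come back empty.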

Live dogfood blocker

Copied .env from the primary checkout at /home/entity/projects/EntityProcess/agentv/.env. The quickstart live target env vars were unavailable, so I created a temporary target from OPENAI_ENDPOINT, OPENAI_API_KEY, and OPENAI_MODEL. The live provider attempt reached execution but failed with "pi-ai call failed: Connection error". No successful run combining a live provider with real grader evidence was produced; this is recorded as the exact blocker.

Review follow-up

Subagent review found two issues before merge: --rerun-failed <run>/.internal/index.jsonl resolved to <run>/.internal, and a few next-doc examples still referenced root index.jsonl or duplicated metrics.json. Commit e367d698 fixes both and adds/updates CLI integration coverage for canonical .internal/index.jsonl output/rerun paths.
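
The rerun bug comes from treating the index file's parent directory as the run root, which is no longer true once the index lives under `.internal/`. A minimal sketch of the corrected resolution follows; `resolveRunRoot` is a hypothetical name and the actual fix in commit e367d698 may be structured differently.

```typescript
// Given a path to an index.jsonl, return the run root that sidecar paths are relative to.
// Canonical layout: <run>/.internal/index.jsonl -> <run>
// Legacy layout:    <run>/index.jsonl           -> <run>
function resolveRunRoot(indexPath: string): string {
  const normalized = indexPath.replace(/\\/g, "/");
  const leadingSlash = normalized.charAt(0) === "/" ? "/" : "";
  const parts = normalized.split("/").filter((p) => p.length > 0);
  parts.pop(); // drop the index file name
  if (parts[parts.length - 1] === ".internal") {
    parts.pop(); // step out of the hidden .internal/ folder
  }
  const joined = leadingSlash + parts.join("/");
  return joined.length > 0 ? joined : ".";
}
```

With this, `resolveRunRoot(".agentv/results/run-a/.internal/index.jsonl")` gives `.agentv/results/run-a` rather than `.agentv/results/run-a/.internal`, and a legacy root `index.jsonl` still resolves to its own directory.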

Additional verification after review fixes:

bun test ./apps/cli/test/eval.integration.test.ts
bun test ./apps/cli/test/commands/eval/artifact-writer.test.ts ./apps/cli/test/commands/eval/result-layout.test.ts ./apps/cli/test/commands/results/validate.test.ts
bun run build
git diff --check

Final review follow-up

Second subagent pass confirmed the rerun fix and found remaining root-index examples in compare.mdx and wip-checkpoints.mdx. Commit 10c1052f updates those examples to .internal/index.jsonl; git diff --check remains clean.

Rebase follow-up

Rebased onto origin/main and resolved results-repo reader/test conflicts by keeping the branch-root results layout from main while preserving this PR's .internal/index.jsonl content contract. Current commit: b2fea3e7.

Post-rebase verification:

bun test ./packages/core/test/evaluation/results-repo.test.ts ./packages/core/src/evaluation/results-repo-cache.test.ts
bun run build
bun test ./apps/cli/test/eval.integration.test.ts
bun test ./apps/cli/test/commands/eval/artifact-writer.test.ts ./apps/cli/test/commands/eval/result-layout.test.ts ./apps/cli/test/commands/results/validate.test.ts
git diff --check

Lint follow-up

CI lint failed on formatting. Commit 599465c1 applies Biome formatting to the affected files.

Additional verification:

bun run lint
bun test ./apps/cli/test/commands/eval/artifact-writer.test.ts ./apps/cli/test/commands/eval/result-layout.test.ts ./apps/cli/test/commands/results/validate.test.ts

Update 2026-07-03 after CI fallout fixes (1d327bef):

Kept the ADR-0017 content contract after rebase onto origin/main branch-root results publishing work: local/published bundle content still uses .internal/index.jsonl, .indexes/, sample-N, and metrics.json; av-i3zw remains owner of the published branch wrapper/top-level layout.
Fixed results export/report readers so .internal/index.jsonl resolves sidecars from the run root, not .internal.
Updated grade-prepared, rerun, report, pipeline, aggregate, run-cache, export, and programmatic/orchestrator tests for .internal/index.jsonl, sample-1, metrics_path, and the new summary/metrics contract.
Added a guard so that rebuildable .indexes projections are emitted only for canonical .agentv/results roots; this prevents runs written to arbitrary programmatic output directories from scanning their temporary parent directory.
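
The guard in the last item can be pictured as a predicate on the results root. The sketch below assumes "canonical" means the directory path ends in `.agentv/results`; that criterion and the name `isCanonicalResultsRoot` are assumptions for illustration, and the real check may differ.

```typescript
// Assumption: a results root is canonical when its last two segments are ".agentv/results".
function isCanonicalResultsRoot(dir: string): boolean {
  const parts = dir
    .replace(/\\/g, "/")
    .split("/")
    .filter((p) => p.length > 0 && p !== ".");
  const n = parts.length;
  return n >= 2 && parts[n - 2] === ".agentv" && parts[n - 1] === "results";
}

// Projections are written only when the guard passes; other output
// directories get the run bundle itself and nothing else.
function shouldWriteIndexProjections(resultsRoot: string): boolean {
  return isCanonicalResultsRoot(resultsRoot);
}
```

Under this assumption, `/repo/.agentv/results` passes and a temporary directory such as `/tmp/programmatic-out` does not, so no cross-run scan of the temp parent is attempted.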

Latest local verification:

bun run lint
bun run typecheck
bun run test
bun run build (serial; an earlier parallel build/test run had a dist race and was rerun serially successfully)
Focused coverage: results repo/cache, eval integration, artifact-writer/result-layout/validate, results export/export-e2e, grade-prepared, report, rerun, aggregate, run-cache, bundle, pipeline bench/e2e, orchestrator, and programmatic API tests.
Simple dogfood remains passing: mock CLI eval under temp canonical .agentv/results/smoke-run produced .internal/index.jsonl, sample-1/metrics.json, .indexes/runs.jsonl, .indexes/cases.jsonl, no root index.jsonl, and no timing.json.
Live provider dogfood remains blocked after copying .env from the primary checkout: the quickstart LOCAL_OPENAI_PROXY_* values are absent, and a temp target using OPENAI_ENDPOINT/OPENAI_API_KEY/OPENAI_MODEL reached execution but failed with "pi-ai call failed: Connection error".

cloudflare-workers-and-pages · 2026-07-03T13:08:39Z

Deploying agentv with Cloudflare Pages

Latest commit:	`1d327be`
Status:	✅ Deploy successful!
Preview URL:	https://3cb90630.agentv.pages.dev
Branch Preview URL:	https://feat-output-content-contract.agentv.pages.dev

christso force-pushed the feat/output-content-contract branch 4 times, most recently from b2fea3e to 599465c · July 3, 2026 13:37

feat(results): implement ADR-0017 bundle content

1d327be

christso force-pushed the feat/output-content-contract branch from 599465c to 1d327be · July 3, 2026 14:13

christso marked this pull request as ready for review July 3, 2026 14:15

christso merged commit ad619ef into main Jul 3, 2026
8 checks passed

christso deleted the feat/output-content-contract branch July 3, 2026 14:16