feat(results): implement ADR-0017 bundle content#1623

Merged: christso merged 1 commit into main from feat/output-content-contract on Jul 3, 2026

@christso christso commented Jul 3, 2026

Collaborator

Summary

Implements the ADR-0017 run-bundle output CONTENT contract for local run bundles, result readers, docs, and tests. This PR intentionally preserves the av-i3zw scope boundary: it does not remove or redesign the published Git results branch top-level wrapper beyond reader compatibility for the new bundle contents.

Output format decision

Recommended canonical local run-bundle tree:

.agentv/results/
  .indexes/
    runs.jsonl
    cases.jsonl
  <run_id>/
    summary.json
    .internal/
      index.jsonl
      progress.json
      events.jsonl
      bundle.json
    <case_id>/
      summary.json
      test/
      sample-1/
        result.json
        grading.json
        metrics.json
        transcript.json
        transcript-raw.jsonl
        outputs/
      sample-2/
        result.json
        grading.json
        metrics.json
        transcript.json
        transcript-raw.jsonl
        outputs/
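
The tree above can be expressed as a small path helper. This is a minimal sketch, not AgentV's actual writer: `bundlePaths` and the property names of its return value are hypothetical, and only the file and folder names are taken from the tree.

```typescript
import * as path from "node:path";

// Hypothetical helper: folder and file names follow the tree above;
// the function name and the returned property names are illustrative assumptions.
function bundlePaths(resultsRoot: string, runId: string, caseId: string, sampleIndex: number) {
  // POSIX joins keep the output identical on every platform.
  const runRoot = path.posix.join(resultsRoot, runId);
  const caseDir = path.posix.join(runRoot, caseId);
  const sampleDir = path.posix.join(caseDir, `sample-${sampleIndex}`);
  return {
    runSummary: path.posix.join(runRoot, "summary.json"),
    runIndex: path.posix.join(runRoot, ".internal", "index.jsonl"),
    caseSummary: path.posix.join(caseDir, "summary.json"),
    result: path.posix.join(sampleDir, "result.json"),
    grading: path.posix.join(sampleDir, "grading.json"),
    metrics: path.posix.join(sampleDir, "metrics.json"),
    transcript: path.posix.join(sampleDir, "transcript.json"),
    transcriptRaw: path.posix.join(sampleDir, "transcript-raw.jsonl"),
    outputsDir: path.posix.join(sampleDir, "outputs"),
  };
}
```

Calling `bundlePaths(".agentv/results", "run-1", "case-a", 2)` yields paths under `.agentv/results/run-1/case-a/sample-2/`, with the per-run index under `.agentv/results/run-1/.internal/`.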

Recommended published Git results branch tree: the durable run bundle shape should be mirrored at the results branch run root. This PR keeps av-i3zw's separation intact: av-i3zw owns the published branch wrapper decision and any removal/reader changes for a legacy top-level runs/ wrapper. This PR makes the content under each run bundle canonical regardless of local or published transport.

Source of truth vs projections:

  • Source of truth: <run_id>/summary.json, <run_id>/.internal/index.jsonl, per-case summary.json, per-sample result.json, grading.json, metrics.json, transcripts, outputs, and test bundle sidecars.
  • Rebuildable projections: .indexes/runs.jsonl, .indexes/cases.jsonl, local caches, Dashboard views, SQLite/search/report/export projections.
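
Because the cross-run indexes are projections, they can in principle be regenerated from the per-run summaries alone. The sketch below illustrates that idea under stated assumptions: `buildRunsIndex`, the `run_id` and `summary_path` field names, and the row shape are hypothetical; only `index_path` is named in this PR.

```typescript
// Assumed minimal shape of a run-root summary.json; real summaries carry more fields.
interface RunSummary {
  run_id: string; // assumed field name for the run id
  index_path: string; // per-run index path, relative to the run root
  [key: string]: unknown;
}

// Rebuilds the text of .indexes/runs.jsonl from source-of-truth summaries.
// The row shape is an illustrative assumption, not the shipped schema.
function buildRunsIndex(summaries: RunSummary[]): string {
  const rows = [...summaries]
    // Sort by run id so repeated rebuilds produce identical output.
    .sort((a, b) => (a.run_id < b.run_id ? -1 : a.run_id > b.run_id ? 1 : 0))
    .map((s) =>
      JSON.stringify({
        run_id: s.run_id,
        summary_path: `${s.run_id}/summary.json`,
        index_path: `${s.run_id}/${s.index_path}`,
      }),
    );
  // JSONL: one JSON object per line, with a trailing newline when non-empty.
  return rows.length > 0 ? rows.join("\n") + "\n" : "";
}
```

Keeping the rebuild a pure function of the summaries is what makes deleting `.indexes/` safe: nothing in it is information that the run bundles do not already hold.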

Naming decisions:

  • Per-run index: .internal/index.jsonl, referenced by summary.json.index_path.
  • Cross-run indexes: .indexes/runs.jsonl and .indexes/cases.jsonl; do not move these into .internal/ and do not rename .indexes to .index.
  • Summary: run root summary.json is the jq-friendly aggregate with run id, status breakdown, counts, pass@k, usage/cost/tokens, infra failures, per-case and per-instance outcomes.
  • Grading: grading.json owns assertion/rubric verdict evidence.
  • Metrics: metrics.json owns top-level duration, tokens, cost, execution, and trajectory; new writers do not emit timing.json, timing_path, metrics.timing, or source_artifacts.timing_path.
  • Transcripts: normalized transcript.json plus raw evidence transcript-raw.jsonl stay in each sample-N/ folder.
  • Samples/retries: repeated outputs use sample-N/ folders plus explicit sample_index and retry_index row metadata; folder names are not semantic comparison dimensions.
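
The "no legacy timing fields" rule for new writers lends itself to a mechanical check. The following is a sketch only, not AgentV's validator: `findLegacyTimingKeys` is a hypothetical name, and the key list is copied from the Metrics bullet above (the `timing.json` file itself would be checked at the filesystem level, not here).

```typescript
// Legacy key locations named in the Metrics bullet; dotted entries are nested keys.
const LEGACY_TIMING_KEYS = ["timing_path", "metrics.timing", "source_artifacts.timing_path"];

// Returns every legacy key present in a parsed JSON row.
// Hypothetical check for illustration; the real validation may differ.
function findLegacyTimingKeys(row: Record<string, unknown>): string[] {
  return LEGACY_TIMING_KEYS.filter((dotted) => {
    let node: unknown = row;
    for (const part of dotted.split(".")) {
      // Stop as soon as the path leaves object territory or the key is absent.
      if (typeof node !== "object" || node === null || !(part in node)) return false;
      node = (node as Record<string, unknown>)[part];
    }
    return true;
  });
}
```

An empty result means the row carries none of the retired fields; any returned key points at output from an older writer.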

Intentional divergences from references:

  • Margin evals (/home/entity/projects/Margin-Lab/evals at 53fb2fd080689efaf7934573d8759d14fc1043e4) uses results.json, internal/, instances/<id>/result.json, and a richer operational RunStore. AgentV keeps ADR-0017 summary.json, hidden .internal/, and case/sample sidecars to keep the bundle small and AI-readable without importing Margin's store abstraction.
  • Vercel agent-eval (/home/entity/projects/vercel-labs/agent-eval at a9dcc9a8c53dbc22ececc967ded7ab3963f18e67) writes per-eval summary.json and run-N/ attempts with transcripts. AgentV uses sample-N/ because run already means the top-level run bundle.
  • Agent Skills evaluating-skills (https://raw.githubusercontent.com/agentskills/agentskills/main/docs/skill-creation/evaluating-skills.mdx, accessed 2026-07-03) separates grading.json and timing.json. AgentV keeps the grading evidence idea but merges timing into metrics.json so one usage/behavior file owns duration, tokens, cost, execution, and trajectory.

Bead/ADR wording: no ADR rename is needed. The Bead wording should record this decision explicitly, including that branch-wrapper changes remain av-i3zw-owned and that live dogfood is currently blocked by provider connectivity.

Implementation

  • Writes .internal/index.jsonl and rich summary.json with index_path, status/count/pass@k/usage/infra failure/case/instance data.
  • Writes sample-N/ folders with result.json, grading.json, metrics.json, transcripts, outputs, and test bundle links.
  • Adds rebuildable .indexes/runs.jsonl and .indexes/cases.jsonl projections.
  • Updates result readers, validation, Dashboard/results serving, pipeline outputs, docs, and migration references to consume the new content contract while tolerating old bundles where needed.
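
Tolerating old bundles in a reader could look roughly like the sketch below. The lookup order is an assumption: this PR only states that `summary.json.index_path` references the per-run index and that older layouts used a root `index.jsonl`. `resolveIndexPath` and the injected `exists` callback are hypothetical.

```typescript
// Picks the per-run index for a bundle, newest contract first.
// `exists` is injected so the sketch makes no filesystem assumptions.
function resolveIndexPath(
  summary: { index_path?: string } | undefined,
  exists: (relPath: string) => boolean,
): string | undefined {
  const candidates = [
    summary?.index_path, // explicit pointer from summary.json, when present
    ".internal/index.jsonl", // ADR-0017 location
    "index.jsonl", // assumed legacy root location
  ];
  return candidates.find((p): p is string => typeof p === "string" && exists(p));
}
```

All returned paths are relative to the run root, so a caller can treat old and new bundles the same way once the index is located.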

Verification

  • bun install
  • bun test ./apps/cli/test/commands/eval/artifact-writer.test.ts ./apps/cli/test/commands/eval/result-layout.test.ts ./apps/cli/test/commands/results/validate.test.ts
  • bun test ./packages/core/test/evaluation/results-repo.test.ts ./packages/core/src/evaluation/results-repo-cache.test.ts
  • bun run build
  • git diff --check
  • Simple dogfood: mock CLI eval under a temporary canonical .agentv/results/smoke-run output wrote .internal/index.jsonl, sample-1/metrics.json, .indexes/runs.jsonl, .indexes/cases.jsonl, and did not write root index.jsonl or timing.json.
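
The dogfood expectations in the last bullet can be written down as a layout check over a list of relative paths. This is a hedged sketch: `checkSmokeLayout` is hypothetical, paths are assumed to be relative to `.agentv/results`, and the case folder name is unknown, so sample files are matched by suffix.

```typescript
// Files the smoke run should produce, relative to .agentv/results.
const REQUIRED_SUFFIXES = [
  "smoke-run/.internal/index.jsonl",
  "sample-1/metrics.json",
  ".indexes/runs.jsonl",
  ".indexes/cases.jsonl",
];

// Returns a list of problems; an empty list means the layout matches expectations.
function checkSmokeLayout(files: string[]): string[] {
  const problems: string[] = [];
  for (const suffix of REQUIRED_SUFFIXES) {
    const found = files.some((f) => f === suffix || f.endsWith(`/${suffix}`));
    if (!found) problems.push(`missing ${suffix}`);
  }
  for (const f of files) {
    // A root-level index or any timing.json indicates the old layout.
    if (f === "smoke-run/index.jsonl") problems.push(`unexpected root index: ${f}`);
    if (f === "timing.json" || f.endsWith("/timing.json")) problems.push(`unexpected timing file: ${f}`);
  }
  return problems;
}
```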

Live dogfood blocker

Copied .env from the primary checkout at /home/entity/projects/EntityProcess/agentv/.env. The quickstart live target env vars were unavailable, so I created a temporary target from OPENAI_ENDPOINT, OPENAI_API_KEY, and OPENAI_MODEL. The live provider attempt reached execution but failed with pi-ai call failed: Connection error. No successful live-provider run with real grader evidence was produced; the connection error is recorded as the exact blocker.

Review follow-up

Subagent review found two issues before merge: --rerun-failed <run>/.internal/index.jsonl resolved to <run>/.internal, and a few next-doc examples still referenced root index.jsonl or duplicated metrics.json. Commit e367d698 fixes both and adds/updates CLI integration coverage for canonical .internal/index.jsonl output/rerun paths.
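
The rerun bug comes down to which directory counts as the run root when the index lives one level down in `.internal/`. A minimal sketch of the intended resolution follows; `runRootFromIndexPath` is a hypothetical name and this is not the code from commit e367d698.

```typescript
import * as path from "node:path";

// Given a path to a per-run index, return the run root that sidecar paths are relative to.
// Hypothetical helper: handles both <run>/.internal/index.jsonl and a root-level <run>/index.jsonl.
function runRootFromIndexPath(indexPath: string): string {
  const dir = path.posix.dirname(indexPath);
  // Taking only dirname() of the canonical path yields <run>/.internal,
  // which is the failure described above; step up one more level in that case.
  return path.posix.basename(dir) === ".internal" ? path.posix.dirname(dir) : dir;
}
```

With this rule, `results/run-1/.internal/index.jsonl` and `results/run-1/index.jsonl` both resolve to `results/run-1`.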

Additional verification after review fixes:

  • bun test ./apps/cli/test/eval.integration.test.ts
  • bun test ./apps/cli/test/commands/eval/artifact-writer.test.ts ./apps/cli/test/commands/eval/result-layout.test.ts ./apps/cli/test/commands/results/validate.test.ts
  • bun run build
  • git diff --check

Final review follow-up

Second subagent pass confirmed the rerun fix and found remaining root-index examples in compare.mdx and wip-checkpoints.mdx. Commit 10c1052f updates those examples to .internal/index.jsonl; git diff --check remains clean.

Rebase follow-up

Rebased onto origin/main and resolved results-repo reader/test conflicts by keeping the branch-root results layout from main while preserving this PR's .internal/index.jsonl content contract. Current commit: b2fea3e7.

Post-rebase verification:

  • bun test ./packages/core/test/evaluation/results-repo.test.ts ./packages/core/src/evaluation/results-repo-cache.test.ts
  • bun run build
  • bun test ./apps/cli/test/eval.integration.test.ts
  • bun test ./apps/cli/test/commands/eval/artifact-writer.test.ts ./apps/cli/test/commands/eval/result-layout.test.ts ./apps/cli/test/commands/results/validate.test.ts
  • git diff --check

Lint follow-up

CI lint failed on formatting. Commit 599465c1 applies Biome formatting to the affected files.

Additional verification:

  • bun run lint
  • bun test ./apps/cli/test/commands/eval/artifact-writer.test.ts ./apps/cli/test/commands/eval/result-layout.test.ts ./apps/cli/test/commands/results/validate.test.ts

Update 2026-07-03 after CI fallout fixes (1d327bef):

  • Kept the ADR-0017 content contract after rebase onto origin/main branch-root results publishing work: local/published bundle content still uses .internal/index.jsonl, .indexes/, sample-N, and metrics.json; av-i3zw remains owner of the published branch wrapper/top-level layout.
  • Fixed results export/report readers so .internal/index.jsonl resolves sidecars from the run root, not .internal.
  • Updated grade-prepared, rerun, report, pipeline, aggregate, run-cache, export, and programmatic/orchestrator tests for .internal/index.jsonl, sample-1, metrics_path, and the new summary/metrics contract.
  • Added a guard so rebuildable .indexes projections are emitted only for canonical .agentv/results roots, which prevents runs written to arbitrary programmatic output directories from scanning their temp parent directory.
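
The guard in the last bullet can be pictured as a predicate on the output directory. How the real guard decides is not shown in this PR, so the path-suffix test below is an assumption, and `isCanonicalResultsRoot` is a hypothetical name.

```typescript
import * as path from "node:path";

// Returns true only when `dir` ends in .agentv/results.
// Assumed rule for illustration; the shipped guard may use different criteria.
function isCanonicalResultsRoot(dir: string): boolean {
  // Normalize, then drop empty segments left by leading or trailing slashes.
  const parts = path.posix.normalize(dir).split("/").filter(Boolean);
  return (
    parts.length >= 2 &&
    parts[parts.length - 2] === ".agentv" &&
    parts[parts.length - 1] === "results"
  );
}
```

A writer would emit `.indexes/runs.jsonl` and `.indexes/cases.jsonl` only when this returns true for the parent of the run directory, so a run written to a temp directory never triggers a scan of its siblings.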

Latest local verification:

  • bun run lint
  • bun run typecheck
  • bun run test
  • bun run build (serial; an earlier parallel build/test run had a dist race and was rerun serially successfully)
  • Focused coverage: results repo/cache, eval integration, artifact-writer/result-layout/validate, results export/export-e2e, grade-prepared, report, rerun, aggregate, run-cache, bundle, pipeline bench/e2e, orchestrator, and programmatic API tests.
  • Simple dogfood remains passing: mock CLI eval under temp canonical .agentv/results/smoke-run produced .internal/index.jsonl, sample-1/metrics.json, .indexes/runs.jsonl, .indexes/cases.jsonl, no root index.jsonl, and no timing.json.
  • Live provider dogfood remains blocked after copying .env from the primary checkout: the quickstart LOCAL_OPENAI_PROXY_* values are absent, and a temp target using OPENAI_ENDPOINT/OPENAI_API_KEY/OPENAI_MODEL reached execution but failed with pi-ai call failed: Connection error.

@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented Jul 3, 2026

Deploying agentv with Cloudflare Pages

Latest commit: 1d327be
Status: ✅  Deploy successful!
Preview URL: https://3cb90630.agentv.pages.dev
Branch Preview URL: https://feat-output-content-contract.agentv.pages.dev

@christso christso force-pushed the feat/output-content-contract branch 4 times, most recently from b2fea3e to 599465c, July 3, 2026 13:37
@christso christso force-pushed the feat/output-content-contract branch from 599465c to 1d327be, July 3, 2026 14:13
@christso christso marked this pull request as ready for review July 3, 2026 14:15
@christso christso merged commit ad619ef into main Jul 3, 2026
8 checks passed
@christso christso deleted the feat/output-content-contract branch July 3, 2026 14:16