feat(results): implement ADR-0017 bundle content #1623
Merged
Deploying agentv with Cloudflare Pages

| Latest commit: | 1d327be |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://3cb90630.agentv.pages.dev |
| Branch Preview URL: | https://feat-output-content-contract.agentv.pages.dev |
Summary
Implements the ADR-0017 run-bundle output CONTENT contract for local run bundles, result readers, docs, and tests. This PR intentionally preserves the av-i3zw scope boundary: it does not remove or redesign the published Git results branch top-level wrapper beyond reader compatibility for the new bundle contents.
Output format decision
Recommended canonical local run-bundle tree:
Recommended published Git results branch tree: the durable run bundle shape should be mirrored at the results branch run root. This PR keeps av-i3zw's separation intact: av-i3zw owns the published branch wrapper decision and any removal or reader changes for a legacy top-level `runs/` wrapper. This PR makes the content under each run bundle canonical regardless of local or published transport.
Source of truth vs projections:
- Source of truth: `<run_id>/summary.json`, `<run_id>/.internal/index.jsonl`, per-case `summary.json`, per-sample `result.json`, `grading.json`, `metrics.json`, transcripts, outputs, and test bundle sidecars.
- Projections: `.indexes/runs.jsonl`, `.indexes/cases.jsonl`, local caches, Dashboard views, SQLite/search/report/export projections.
Naming decisions:
- `.internal/index.jsonl`, referenced by `summary.json.index_path`.
- `.indexes/runs.jsonl` and `.indexes/cases.jsonl`; do not move these into `.internal/` and do not rename `.indexes` to `.index`.
- `summary.json` is the jq-friendly aggregate with run id, status breakdown, counts, pass@k, usage/cost/tokens, infra failures, per-case and per-instance outcomes.
- `grading.json` owns assertion/rubric verdict evidence.
- `metrics.json` owns top-level `duration`, `tokens`, `cost`, `execution`, and `trajectory`; new writers do not emit `timing.json`, `timing_path`, `metrics.timing`, or `source_artifacts.timing_path`.
- `transcript.json` plus raw evidence `transcript-raw.jsonl` stay in each `sample-N/` folder.
- `sample-N/` folders plus explicit `sample_index` and `retry_index` row metadata; folder names are not semantic comparison dimensions.
Intentional divergences from references:
- `/home/entity/projects/Margin-Lab/evals` (at `53fb2fd080689efaf7934573d8759d14fc1043e4`) uses `results.json`, `internal/`, `instances/<id>/result.json`, and a richer operational RunStore. AgentV keeps ADR-0017 `summary.json`, hidden `.internal/`, and case/sample sidecars to keep the bundle small and AI-readable without importing Margin's store abstraction.
- `/home/entity/projects/vercel-labs/agent-eval` (at `a9dcc9a8c53dbc22ececc967ded7ab3963f18e67`) writes per-eval `summary.json` and `run-N/` attempts with transcripts. AgentV uses `sample-N/` because `run` already means the top-level run bundle.
- `https://raw.githubusercontent.com/agentskills/agentskills/main/docs/skill-creation/evaluating-skills.mdx` (accessed 2026-07-03) separates `grading.json` and `timing.json`. AgentV keeps the grading evidence idea but merges timing into `metrics.json` so one usage/behavior file owns duration, tokens, cost, execution, and trajectory.
Bead/ADR wording: no ADR rename is needed. The Bead wording should record this decision explicitly, including that branch-wrapper changes remain av-i3zw-owned and that live dogfood is currently blocked by provider connectivity.
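The naming decisions above can be sketched as a few path helpers. This is a minimal illustration, not AgentV's actual writer code: the function names (`runBundlePaths`, `samplePaths`, `hasLegacyTiming`) are hypothetical, and the per-case folder passed in as `caseDir` is an assumption, since this description does not spell out where case folders sit inside the bundle. Only the file and folder names are taken from the text.

```typescript
import { posix as path } from "node:path";

// Hypothetical helpers: only the file and folder names come from the description.

// Run-level files: the aggregate summary and the internal row index it references.
function runBundlePaths(runRoot: string) {
  return {
    summary: path.join(runRoot, "summary.json"),
    index: path.join(runRoot, ".internal", "index.jsonl"),
  };
}

// Per-sample sidecars. `caseDir` is an assumed per-case folder; `sampleNumber`
// is the N in `sample-N` and carries no comparison meaning by itself.
function samplePaths(caseDir: string, sampleNumber: number) {
  const dir = path.join(caseDir, `sample-${sampleNumber}`);
  return {
    result: path.join(dir, "result.json"),
    grading: path.join(dir, "grading.json"),
    metrics: path.join(dir, "metrics.json"),
    transcript: path.join(dir, "transcript.json"),
    transcriptRaw: path.join(dir, "transcript-raw.jsonl"),
  };
}

// New writers do not emit `metrics.timing`; this flags a metrics object that still has it.
function hasLegacyTiming(metrics: Record<string, unknown>): boolean {
  return "timing" in metrics;
}
```

Keeping the sample number as a plain folder suffix matches the note that folder names are not semantic comparison dimensions: anything that needs to compare samples would read `sample_index` and `retry_index` from the index rows instead.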
Implementation
- `.internal/index.jsonl` and rich `summary.json` with `index_path`, status/count/pass@k/usage/infra failure/case/instance data.
- `sample-N/` folders with `result.json`, `grading.json`, `metrics.json`, transcripts, outputs, and test bundle links.
- `.indexes/runs.jsonl` and `.indexes/cases.jsonl` projections.
Verification
- `bun install`
- `bun test ./apps/cli/test/commands/eval/artifact-writer.test.ts ./apps/cli/test/commands/eval/result-layout.test.ts ./apps/cli/test/commands/results/validate.test.ts`
- `bun test ./packages/core/test/evaluation/results-repo.test.ts ./packages/core/src/evaluation/results-repo-cache.test.ts`
- `bun run build`
- `git diff --check`
- `.agentv/results/smoke-run` output wrote `.internal/index.jsonl`, `sample-1/metrics.json`, `.indexes/runs.jsonl`, `.indexes/cases.jsonl`, and did not write root `index.jsonl` or `timing.json`.
Live dogfood blocker
Copied `.env` from the primary checkout at `/home/entity/projects/EntityProcess/agentv/.env`. The quickstart live target env vars were unavailable, so I created a temporary target from `OPENAI_ENDPOINT`, `OPENAI_API_KEY`, and `OPENAI_MODEL`. The live provider attempt reached execution but failed with `pi-ai call failed: Connection error.` No successful live provider plus real grader evidence was produced; this is recorded as the exact blocker.
Review follow-up
Subagent review found two issues before merge: `--rerun-failed <run>/.internal/index.jsonl` resolved to `<run>/.internal`, and a few next-doc examples still referenced root `index.jsonl` or duplicated `metrics.json`. Commit `e367d698` fixes both and adds/updates CLI integration coverage for canonical `.internal/index.jsonl` output/rerun paths.
Additional verification after review fixes:
- `bun test ./apps/cli/test/eval.integration.test.ts`
- `bun test ./apps/cli/test/commands/eval/artifact-writer.test.ts ./apps/cli/test/commands/eval/result-layout.test.ts ./apps/cli/test/commands/results/validate.test.ts`
- `bun run build`
- `git diff --check`
Final review follow-up
Second subagent pass confirmed the rerun fix and found remaining root-index examples in `compare.mdx` and `wip-checkpoints.mdx`. Commit `10c1052f` updates those examples to `.internal/index.jsonl`; `git diff --check` remains clean.
Rebase follow-up
Rebased onto `origin/main` and resolved results-repo reader/test conflicts by keeping the branch-root results layout from main while preserving this PR's `.internal/index.jsonl` content contract. Current commit: `b2fea3e7`.
Post-rebase verification:
- `bun test ./packages/core/test/evaluation/results-repo.test.ts ./packages/core/src/evaluation/results-repo-cache.test.ts`
- `bun run build`
- `bun test ./apps/cli/test/eval.integration.test.ts`
- `bun test ./apps/cli/test/commands/eval/artifact-writer.test.ts ./apps/cli/test/commands/eval/result-layout.test.ts ./apps/cli/test/commands/results/validate.test.ts`
- `git diff --check`
Lint follow-up
CI lint failed on formatting. Commit `599465c1` applies Biome formatting to the affected files.
Additional verification:
- `bun run lint`
- `bun test ./apps/cli/test/commands/eval/artifact-writer.test.ts ./apps/cli/test/commands/eval/result-layout.test.ts ./apps/cli/test/commands/results/validate.test.ts`
Update 2026-07-03 after CI fallout fixes (`1d327bef`):
- `origin/main` branch-root results publishing work: local/published bundle content still uses `.internal/index.jsonl`, `.indexes/`, `sample-N`, and `metrics.json`; av-i3zw remains owner of the published branch wrapper/top-level layout.
- `.internal/index.jsonl` resolves sidecars from the run root, not `.internal`.
- `.internal/index.jsonl`, `sample-1`, `metrics_path`, and the new summary/metrics contract.
- `.indexes` projections are emitted only for canonical `.agentv/results` roots, avoiding arbitrary programmatic output directories scanning their temp parent.
Latest local verification:
- `bun run lint`
- `bun run typecheck`
- `bun run test`
- `bun run build` (serial; an earlier parallel build/test run had a `dist` race and was rerun serially successfully)
- `.agentv/results/smoke-run` produced `.internal/index.jsonl`, `sample-1/metrics.json`, `.indexes/runs.jsonl`, `.indexes/cases.jsonl`, no root `index.jsonl`, and no `timing.json`.
- `.env` from the primary checkout: quickstart `LOCAL_OPENAI_PROXY_*` values are absent, and a temp target using `OPENAI_ENDPOINT`/`OPENAI_API_KEY`/`OPENAI_MODEL` reached execution but failed with `pi-ai call failed: Connection error.`
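Two of the fixes described in the follow-ups, resolving sidecars from the run root rather than `.internal`, and emitting `.indexes` projections only for canonical `.agentv/results` roots, can be sketched as below. This is an illustration under stated assumptions, not the code merged in this PR: `resolveRunRoot` and `isCanonicalResultsRoot` are hypothetical names, the fallback for a root-level `index.jsonl` is assumed, and "canonical" is taken to mean a directory named `results` directly inside `.agentv`.

```typescript
import { posix as path } from "node:path";

// Hypothetical helper: map a run reference (a run directory or its index file)
// to the run root that sidecar paths are resolved against.
function resolveRunRoot(target: string): string {
  const normalized = path.normalize(target);
  if (path.basename(normalized) === "index.jsonl") {
    const parent = path.dirname(normalized);
    // Canonical layout: <run>/.internal/index.jsonl resolves to <run>, not <run>/.internal.
    if (path.basename(parent) === ".internal") return path.dirname(parent);
    // Assumed fallback for an index file sitting directly in the run root.
    return parent;
  }
  return normalized;
}

// Assumption: a canonical results root is a `results` directory directly under `.agentv`.
function isCanonicalResultsRoot(dir: string): boolean {
  const normalized = path.normalize(dir);
  return (
    path.basename(normalized) === "results" &&
    path.basename(path.dirname(normalized)) === ".agentv"
  );
}
```

A writer built this way would check `isCanonicalResultsRoot` on the parent of the run directory before touching `.indexes/`, so a programmatic run written to a temporary directory never scans or indexes its siblings.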