feat: level-3 program graphs — native CFG/PDG/SDG and backward slicing (-a 3)#25
Open
rahlk wants to merge 2 commits into
…icing (-a 3)

Implements the dataflow half of #2: whole-program dependence graphs built in-process from the ts-morph AST, emitted as a schema-versioned program_graphs section of analysis.json, gated by -a 3 / --graphs.

- src/dataflow/cfg.ts: exceptional statement-level CFG per callable (ENTRY/param/statement/EXIT nodes in source-span order; true/false, loop_back, switch_case, exception, await_resume, yield edge kinds; region-spliced try/catch/finally; a synthetic loop-exit edge keeps EXIT the post-dominance root).
- src/dataflow/dominance.ts: CHK iterative post-dominators + Ferrante–Ottenstein–Warren control dependence (CDG).
- src/dataflow/defuse.ts: k-limited access paths (declaration-keyed bases: local/param/this/captured/module), copy-alias union-find MVP, forward reaching definitions, DDG extraction, capture-at-declaration for closures, EXIT-as-formal-out routing.
- src/dataflow/summaries.ts: Tarjan SCC condensation of the provenance-merged call graph; bottom-up relational summaries (param→return, transitive global reads/writes) co-defined to a monotone fixpoint inside SCCs; persisted with dependency edges to graphs_summaries.json for later incrementality.
- src/dataflow/sdg.ts: HRB stitching with CALL, PARAM_IN (args by position, globals as extra params), PARAM_OUT, and SUMMARY edges, all keyed by (signature, node_id) with no dangling endpoints.
- src/dataflow/slice.ts: two-phase context-sensitive backward slicing as an SDG query.
- CLI: -a 3, --graphs cfg,dfg,pdg,sdg, --graph-field-depth (strictly validated); -a 1/-a 2 are untouched (level 3 is fully flag-gated).
- test/fixtures/dataflow-app + test/dataflow.test.ts: every contract gate with exact hand-computed expected sets (CFG reachability, CDG sets, loop-carried/shadowing/aliasing DDG, intraprocedural and interprocedural slices, SUMMARY for the a→b→c chain, global flow, mutual-recursion fixpoint, byte-identical determinism).
Follow-ups staged in #2: taint models-as-data (PR E), Jelly points-to aliasing (PR F), CPG Neo4j projection (PR G), incremental re-analysis (PR H).
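To make the dominance.ts entry above more concrete, here is a minimal sketch of post-dominators and Ferrante–Ottenstein–Warren control dependence on a toy CFG. It is not the PR's code: it uses the plain set-intersection dataflow formulation rather than the CHK algorithm the commit names, and the `Cfg` type, `postDominators`, `controlDependence`, and the numeric node ids are all hypothetical names chosen for illustration.

```typescript
// Hypothetical toy CFG: numeric node ids, successor lists, one EXIT node.
type Cfg = { nodes: number[]; succ: Map<number, number[]>; exit: number };

function sameSet(a: Set<number>, b: Set<number>): boolean {
  if (a.size !== b.size) return false;
  let same = true;
  a.forEach((x) => { if (!b.has(x)) same = false; });
  return same;
}

// pdom(n) = {n} ∪ ⋂ pdom(s) over successors s, iterated to a fixpoint.
// (Simple set-based formulation, not the CHK algorithm.)
function postDominators(cfg: Cfg): Map<number, Set<number>> {
  const pdom = new Map<number, Set<number>>();
  for (const n of cfg.nodes) {
    pdom.set(n, n === cfg.exit ? new Set([n]) : new Set(cfg.nodes));
  }
  let changed = true;
  while (changed) {
    changed = false;
    for (const n of cfg.nodes) {
      if (n === cfg.exit) continue;
      let meet: Set<number> | null = null;
      for (const s of cfg.succ.get(n) ?? []) {
        const ps = pdom.get(s)!;
        meet = meet === null
          ? new Set(ps)
          : new Set(Array.from(meet).filter((x) => ps.has(x)));
      }
      const next = meet ?? new Set<number>();
      next.add(n);
      if (!sameSet(next, pdom.get(n)!)) {
        pdom.set(n, next);
        changed = true;
      }
    }
  }
  return pdom;
}

// m is control dependent on n if, for some edge n -> s, m post-dominates s
// but does not strictly post-dominate n.
function controlDependence(
  cfg: Cfg,
  pdom: Map<number, Set<number>>,
): Map<number, Set<number>> {
  const cd = new Map<number, Set<number>>();
  for (const n of cfg.nodes) {
    for (const s of cfg.succ.get(n) ?? []) {
      pdom.get(s)!.forEach((m) => {
        const strict = m !== n && pdom.get(n)!.has(m);
        if (!strict) {
          if (!cd.has(m)) cd.set(m, new Set());
          cd.get(m)!.add(n);
        }
      });
    }
  }
  return cd;
}

// Diamond: 0 branches to 1 and 2, both join at 3 (EXIT).
const diamond: Cfg = {
  nodes: [0, 1, 2, 3],
  succ: new Map([[0, [1, 2]], [1, [3]], [2, [3]], [3, []]]),
  exit: 3,
};
const pd = postDominators(diamond);
const cdg = controlDependence(diamond, pd);
console.log(Array.from(pd.get(0)!).sort((a, b) => a - b)); // [0, 3]
console.log(Array.from(cdg.get(1)!)); // [0]: node 1 is control dependent on the branch at 0
```

The sketch also shows why the commit insists on EXIT staying the post-dominance root: the fixpoint is seeded from `exit`, so a node with no path to it would never get a meaningful post-dominator set.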
Implements the contract's parallel model on top of the sequential oracle:

- Stage split: fact extraction (AST-bound, once per callable) is hoisted out of the summary fixpoint; the reaching-defs solve (defuse.solveDefUse) is now pure data and re-runs without touching the AST. CallableGraphData is the serializable per-callable projection that crosses the worker boundary.
- Stages 1-4 (CFG, dominance, def-use facts, PDG) fan out per callable over a Bun worker pool, partitioned by file; each worker materializes its own whole-program ts-morph project (ASTs cannot be structured-cloned) and returns plain data.
- The call-graph solve overlaps extraction: at -a 3, core.ts posts the extraction to the pool BEFORE the provider (tsc resolver + Jelly subprocess) runs on the main thread, and joins before summaries.
- Stages 6-7 run as a Kahn-style ready-queue wavefront over the Tarjan SCC condensation DAG (per-SCC dependency counters; the SCC and its internal fixpoint are the atomic unit, one worker each).
- Determinism: --jobs N output is byte-identical to --jobs 1 (span-ordered ids, collect-then-sort emission, sccFixpoint pure), enforced by a differential test. --jobs 1 is the sequential debug mode.
- Failure discipline: a dying worker is retired and its queue never strands (a stranded queue previously let the process exit 0 without emitting output); extraction failure closes the pool so the wavefront degrades sequentially too. The result is a warning, never wrong or missing output.
- Compiled binary: the worker is a second bun build --compile entrypoint, embedded as /$bunfs/root/dataflow/worker.js; the pool resolves the URL per runtime. Verified: dist/cants -a 3 -j 2 runs workers with byte-identical output.
- Default is sequential: measurement (self-analysis: 36 files, 211 callables) shows per-worker project load dominates the parallelizable graph math at small/mid scale (2.5x slower at -j 14), so -j N is an explicit opt-in for large codebases.
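The wavefront over the SCC condensation DAG described in the list above can be sketched roughly as follows. This is an illustrative simplification, not the PR's scheduler: it computes level-synchronized waves on one thread instead of dispatching to workers as soon as a counter hits zero, and `wavefronts`, the `deps` map, and the `scc*` names are hypothetical. It assumes every SCC appears as a key and that dependency lists contain no duplicates.

```typescript
// deps maps each SCC id to the SCC ids it depends on (its callees), so
// summaries are computed bottom-up. Hypothetical shape for illustration only.
function wavefronts(deps: Map<string, string[]>): string[][] {
  const pending = new Map<string, number>();      // per-SCC dependency counter
  const dependents = new Map<string, string[]>(); // reverse edges: callee -> callers
  deps.forEach((callees, node) => {
    pending.set(node, callees.length);
    for (const callee of callees) {
      const list = dependents.get(callee) ?? [];
      list.push(node);
      dependents.set(callee, list);
    }
  });

  // SCCs with no unresolved dependencies form the first wave.
  let ready: string[] = [];
  pending.forEach((count, node) => {
    if (count === 0) ready.push(node);
  });
  ready.sort(); // sort within a wave so the order is deterministic

  const waves: string[][] = [];
  while (ready.length > 0) {
    waves.push(ready);
    const next: string[] = [];
    for (const done of ready) {
      for (const parent of dependents.get(done) ?? []) {
        const left = pending.get(parent)! - 1;
        pending.set(parent, left);
        if (left === 0) next.push(parent); // all callees summarized: now ready
      }
    }
    ready = next.sort();
  }
  return waves;
}

// sccA calls sccB, which calls sccC; sccD is independent.
const condensation = new Map<string, string[]>([
  ["sccA", ["sccB"]],
  ["sccB", ["sccC"]],
  ["sccC", []],
  ["sccD", []],
]);
console.log(wavefronts(condensation));
// [["sccC", "sccD"], ["sccB"], ["sccA"]]
```

Everything inside one wave is independent, which is what would let each SCC (with its internal fixpoint) go to a separate worker; sorting each wave mirrors the collect-then-sort idea behind the byte-identical output claim.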
Also fixes a latent bug: TSCallable.path is absolute, so the summary cache's symbol_table[c.path] content-hash lookup always missed; now keyed by the project-relative file key.
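A small sketch of why that lookup always missed, under stated assumptions: the table shape, the `fileKey` helper, the `/repo` root, and the hash value are all hypothetical, and paths are assumed to be POSIX-style. Only the idea (a table keyed by project-relative path, queried with an absolute path) comes from the description above.

```typescript
// Hypothetical helper: strip the project root to get a project-relative key.
function fileKey(projectRoot: string, absolutePath: string): string {
  const root = projectRoot.endsWith("/") ? projectRoot : projectRoot + "/";
  return absolutePath.startsWith(root)
    ? absolutePath.slice(root.length)
    : absolutePath; // outside the project: leave unchanged
}

// Hypothetical table keyed by project-relative file path.
const symbolTable: Record<string, { contentHash: string }> = {
  "src/dataflow/cfg.ts": { contentHash: "abc123" },
};

const callablePath = "/repo/src/dataflow/cfg.ts"; // absolute, like TSCallable.path

const miss = symbolTable[callablePath];                  // no such key at runtime
const hit = symbolTable[fileKey("/repo", callablePath)]; // found

console.log(miss === undefined); // true
console.log(hit.contentHash);    // "abc123"
```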
Part of #2 (PRs C′+D′ of the amended staging) — the full SDG: native, whole-program dependence graphs built in-process from the ts-morph AST, per the CLDK level-3 dataflow contract. No external engines; Jelly stays a frozen call-graph oracle.
What ships
A new pipeline stage (src/dataflow/), fully flag-gated behind -a 3 (levels 1/2 pay nothing):

- cfg.ts: exceptional statement-level CFG per callable with ENTRY/EXIT, param nodes, ids in source-span order; true/false, loop_back, switch_case, exception, break/continue/return, and TS-native await_resume/yield edge kinds; region-spliced try/catch/finally; while (true) keeps its (dead) loop-exit edge so EXIT stays the unique post-dominance root
- dominance.ts: CHK iterative post-dominators + Ferrante–Ottenstein–Warren control dependence (CDG)
- defuse.ts: k-limited access paths (--graph-field-depth, default 3) over declaration-keyed bases (local/param/this/captured/module); copy-alias union-find (the MVP substrate; Jelly points-to is staged PR F); forward reaching defs → labeled DDG; capture-at-declaration for closures; EXIT doubles as the HRB formal-out
- summaries.ts: Tarjan SCC condensation and bottom-up relational summaries, persisted to graphs_summaries.json (write-only today; PR H consumes it)
- index.ts, sdg.ts: HRB stitching with CALL, PARAM_IN (positional args + module globals as extra params), PARAM_OUT, SUMMARY, all keyed by (signature, node_id) with the call-graph no-dangling rule extended to graphs
- slice.ts: two-phase context-sensitive backward slicing as an SDG query

Emission is a schema-versioned program_graphs section of analysis.json, scoped by --graphs cfg,dfg,pdg,sdg with strict flag validation (unknown values exit non-zero, never a silent fallback). Node identity uses the same signatureOf() canonicalizer as symbol_table/call_graph, so everything joins.

Verification (every contract gate, exact sets)
test/fixtures/dataflow-app + test/dataflow.test.ts (36 tests, 1.5k assertions):

- CFG/CDG: reachability and control-dependence sets for classify and early, exact.
- DDG: loop-carried acc→acc, shadowed scopes don't leak, write-through-alias reaches read-through-original, closure capture edges.
- Intraprocedural slice: the backward slice of classify's return equals the hand-computed node set (correctly excluding the strongly-killed initializer).
- SDG: no dangling (signature, node_id) endpoints; CALL targets ENTRY; positional PARAM_IN targets param nodes; the composed SUMMARY arg0 for the a→b→c chain; cross-file edges; a global written by one callee and read by the next materializing as caller-local DDG; mutual recursion reaching fixpoint.
- Interprocedural slice: main's return slices to exactly {main, chain.a, chain.b, chain.c, state.bump, state.readCounter} with exact per-function node sets, context-sensitively.
- […] program_graphs; -a 1 output contains no trace of the section.

bun test: 47 pass / 0 fail (5 Docker-gated skips) · tsc --noEmit clean · bun run build compiles the standalone binary with the new stage bundled.

Parallel execution model (second commit)
The contract's parallel model, built on the sequential run as the differential oracle:
- Stage split: fact extraction is hoisted out of the summary fixpoint and the reaching-defs solve is pure data (CallableGraphData is the serializable per-callable projection), so fixpoints re-run without re-walking the AST, which is a sequential win on its own.
- Overlap: at -a 3 the extraction is posted to the pool before the provider (tsc resolver + Jelly subprocess) runs on the main thread (the "points-to solve concurrent with stages 1–4" slot) and is joined before summaries.
- Determinism: --jobs N is byte-identical to --jobs 1 (span-ordered ids, collect-then-sort, pure sccFixpoint), enforced by a differential test; --jobs 1 remains the debug mode.
- Compiled binary: the worker is a second bun build --compile entrypoint (embedded as $bunfs/root/dataflow/worker.js); verified dist/cants -a 3 -j 2 runs workers byte-identically.
- Default is sequential: on the self-analysis measurement (36 files, 211 callables), -j 14 is 2.5× slower wall-clock because per-worker project load dominates the parallelizable graph math; -j N is the explicit opt-in for large codebases.

Also fixes a latent bug: TSCallable.path is absolute, so the summary cache's content-hash lookup (symbol_table[c.path]) always missed; now keyed by the project-relative file key.

Deliberate scope cuts (staged in #2)
- Taint models-as-data (PR E): […] taint_flows (the fixture already carries the source/sink/sanitizer pair).
- CPG Neo4j projection (PR G): […] schema.neo4j.json bump.
- […] this flow) is recorded in .claude/SCHEMA_DECISIONS.md and the README.
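For readers unfamiliar with the two-phase slicing that slice.ts is described as performing, the following is a minimal textbook-style sketch of a Horwitz–Reps–Binkley backward slice over a toy SDG. It is an assumption-laden illustration, not the PR's implementation: the edge kinds CALL, PARAM_IN, PARAM_OUT, and SUMMARY come from the description above, while the CONTROL/DATA kind names, the `Edge` shape, the string node ids, and both function names are hypothetical.

```typescript
type EdgeKind = "CONTROL" | "DATA" | "CALL" | "PARAM_IN" | "PARAM_OUT" | "SUMMARY";
interface Edge { from: string; to: string; kind: EdgeKind } // `to` depends on `from`

// Reverse reachability from the seeds, ignoring the given edge kinds.
function backwardReach(edges: Edge[], seeds: string[], skip: EdgeKind[]): Set<string> {
  const seen = new Set<string>(seeds);
  const work = seeds.slice();
  while (work.length > 0) {
    const node = work.pop()!;
    for (const e of edges) {
      if (e.to !== node || skip.indexOf(e.kind) !== -1 || seen.has(e.from)) continue;
      seen.add(e.from);
      work.push(e.from);
    }
  }
  return seen;
}

function backwardSlice(edges: Edge[], criterion: string): string[] {
  // Phase 1: stay in the criterion's procedure and its callers; SUMMARY edges
  // stand in for callee bodies, so PARAM_OUT is not followed.
  const phase1 = backwardReach(edges, [criterion], ["PARAM_OUT"]);
  // Phase 2: descend into callees, but never climb back out to other callers.
  const phase2 = backwardReach(edges, Array.from(phase1), ["CALL", "PARAM_IN"]);
  return Array.from(phase2).sort();
}

// Toy SDG: main and other both call f; we slice on main's actual-out.
const sdg: Edge[] = [
  { from: "main:call", to: "main:arg", kind: "CONTROL" },
  { from: "main:call", to: "main:ret", kind: "CONTROL" },
  { from: "main:arg", to: "main:ret", kind: "SUMMARY" },
  { from: "main:call", to: "f:entry", kind: "CALL" },
  { from: "main:arg", to: "f:param", kind: "PARAM_IN" },
  { from: "f:exit", to: "main:ret", kind: "PARAM_OUT" },
  { from: "f:entry", to: "f:param", kind: "CONTROL" },
  { from: "f:entry", to: "f:body", kind: "CONTROL" },
  { from: "f:entry", to: "f:exit", kind: "CONTROL" },
  { from: "f:param", to: "f:body", kind: "DATA" },
  { from: "f:body", to: "f:exit", kind: "DATA" },
  { from: "other:call", to: "other:arg", kind: "CONTROL" },
  { from: "other:call", to: "other:ret", kind: "CONTROL" },
  { from: "other:call", to: "f:entry", kind: "CALL" },
  { from: "other:arg", to: "f:param", kind: "PARAM_IN" },
  { from: "f:exit", to: "other:ret", kind: "PARAM_OUT" },
];

console.log(backwardSlice(sdg, "main:ret"));
// ["f:body", "f:entry", "f:exit", "f:param", "main:arg", "main:call", "main:ret"]
```

The slice pulls in all of f but none of the `other:*` nodes, even though `other` also calls f. A plain reverse reachability over every edge kind would wrongly include them, which is the imprecision the two-phase, context-sensitive traversal avoids.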