Level-3: native dataflow graphs (CFG/PDG/SDG/CPG) and backward slicing#68

Open
rahlk wants to merge 11 commits into main from feat/level3-dataflow-sdg


rahlk commented Jul 2, 2026


Closes #67.

Adds analysis level 3: native, whole-program dependence graphs (CFG → PDG → SDG) built in-process from the stdlib ast, emitted as the program_graphs section of analysis.json, projected into Neo4j as the CPG overlay, and queryable with a context-sensitive backward slicer. Built stage-per-commit along the CLDK dataflow construction ladder, each stage gated by its contract tests before the next began.

What each commit delivers

  1. CFG — statement-level, exceptional CFG per callable: synthetic ENTRY/EXIT, multi-exit normalized, full Python lowering (try/except/else/finally, with, yield/await resume kinds, break/continue, synthetic escape edge for infinite loops), dead-code pruning, source-span-ordered node ids.
  2. Dominance — Cooper–Harper–Kennedy post-dominators; Ferrante–Ottenstein–Warren control dependence with ENTRY as region root.
  3. Def-use — k-limited access paths with per-scope base classification (local/param/self/global/capture); reaching definitions; DDG via textual interference + the type-based may-alias oracle (the locked MVP points-to substrate).
  4. PDG — CDG ∪ DDG; exact hand-computed intraprocedural slice gates.
  5. SCC + qualification — iterative Tarjan condensation of the frozen call-graph oracle; suffixed call-mutation defs; module::name global qualification.
  6. Summaries + SDG — relational formal-in → formal-out summaries composed bottom-up over the condensation DAG (monotone fixpoint within SCCs); HRB parameter nodes; CALL/PARAM_IN/PARAM_OUT/SUMMARY edges; globals as extra formals; closure captures bound at def sites.
  7. Slicing — two-phase context-sensitive HRB backward slice; exact interprocedural expected-set gate.
  8. Emission + CLI — PyProgramGraphs schema section (own schema_version 1.0.0); -a 3; --graphs cfg,dfg,pdg,sdg and --graph-field-depth with strict validation (unknown values / use below -a 3 exit non-zero).
  9. CPG — PyCFGNode + PY_HAS_CFG_NODE/PY_CFG_NEXT/PY_CDG/PY_DDG/PY_PARAM_IN/PY_PARAM_OUT/PY_SUMMARY through the existing neo4j/ row machinery — cross-language in shape, PY_-namespaced like every other row family so multi-language databases never mingle analyzers' dependence edges (the JSON program_graphs section keeps the unprefixed shared contract; each analysis.json is its own namespace); additive schema.neo4j.json bump to 1.2.0; conformance tests extended.
  10. Docs — README analysis-level table + locked Architecture & Tooling decisions + regenerated --help; CHANGELOG; .claude/SCHEMA_DECISIONS.md (the SDK-model input).

Verification

  • Every construction-ladder gate passes with exact expected sets (CFG reachability/vocabulary/stability, hand-computed CDG sets, loop-carried DDG + scope shadowing + alias pair, exact intra- and inter-procedural slices, SUMMARY-edge/arity/no-dangling SDG gates, CPG count parity).
  • analysis.json at -a 3 round-trips through the Pydantic models; -a 1/-a 2 emit no program_graphs and their pipeline is untouched (level 3 is a single gated block after app assembly).
  • Full suite: 98 passed; the one failure (test_cli_call_symbol_table_with_json, the xarray --ray test) fails identically on main — a pre-existing local Ray worker-startup issue, unrelated to this branch.

Scope notes (per the epic)

  • Taint is deliberately not emitted: post-SDG it is language-independent labeled reachability and belongs in the CLDK SDK, with per-language source/sink model packs as data.
  • Points-to is the type-based MVP stub (sound-leaning; unknown types alias); upgrading to a real substrate, per-callable parallel fan-out, and incremental re-analysis are staged follow-ups.
  • Precision posture is sound-leaning/over-approximate; known unsoundness (eval/exec, reflection, monkey-patching, C extensions, module top-level) is documented in the README, not silently absorbed.

rahlk added 11 commits July 1, 2026 21:34
Level-3 groundwork (#67): hand-built CFG from the stdlib ast with the
shared node/edge vocabulary, Python lowering rules (try/except/else/
finally, with, yield/await resume kinds, break/continue, synthetic
escape edge for infinite loops), dead-code pruning, and source-span-
ordered node ids (ENTRY=0, EXIT=last). Dataflow fixture project and
CFG gate tests included.
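A minimal sketch of the lowering idea, not the branch's implementation: it handles only straight-line statements, `if`, and `return`, and leaves out loops, try/finally, with, and yield/await. The function name `build_cfg`, the `Kind@line` labels, and the sample source are assumptions made for illustration. It shows ENTRY=0, EXIT=last, source-ordered ids, multi-exit normalization, and pruning of code after a `return`.

```python
import ast
import textwrap

def build_cfg(source):
    """Lower the first function in `source` to (node labels, sorted edge list)."""
    func = ast.parse(source).body[0]
    nodes = ["ENTRY"]          # node id is the list index, so ENTRY is 0
    edges = set()
    returns = []               # ids of return statements, wired to EXIT at the end

    def lower(stmts, preds):
        """Lower a statement list; `preds` are node ids flowing into it.
        Returns the node ids that fall through past the list."""
        for stmt in stmts:
            if not preds:      # nothing reaches this statement: prune dead code
                break
            nodes.append(f"{type(stmt).__name__}@{stmt.lineno}")
            nid = len(nodes) - 1
            edges.update((p, nid) for p in preds)
            if isinstance(stmt, ast.Return):
                returns.append(nid)
                preds = []
            elif isinstance(stmt, ast.If):
                then_out = lower(stmt.body, [nid])
                else_out = lower(stmt.orelse, [nid]) if stmt.orelse else [nid]
                preds = then_out + else_out
            else:
                preds = [nid]
        return preds

    fallthrough = lower(func.body, [0])
    exit_id = len(nodes)       # EXIT takes the last id
    nodes.append("EXIT")
    edges.update((p, exit_id) for p in fallthrough + returns)
    return nodes, sorted(edges)

SOURCE = textwrap.dedent("""\
    def f(x):
        if x < 0:
            return -1
        y = x * 2
        return y
        print("dead")
    """)

cfg_nodes, cfg_edges = build_cfg(SOURCE)
```

Because ids are assigned in a pre-order walk of the statement lists, they follow source order without a separate renumbering pass in this simplified setting.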
Cooper–Harper–Kennedy iterative post-dominators over the reverse CFG
(unique root EXIT, guaranteed by stage 1's synthetic escape edges) and
Ferrante–Ottenstein–Warren control dependence with ENTRY as the region
root. Gate tests pin exact hand-computed CDG sets for the fixture's
if/loop/early-return functions. (#67)
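The following sketch illustrates the two steps on a toy CFG. It deliberately swaps in the simpler set-based iterative post-dominator computation rather than the Cooper–Harper–Kennedy immediate-dominator intersection the commit describes, and it models "ENTRY as the region root" with an assumed virtual ENTRY→EXIT edge. All function names and the toy graph are hypothetical.

```python
from collections import defaultdict

def post_dominators(succ, exit_node):
    """Greatest fixpoint of pdom(n) = {n} | intersection of pdom(s) over successors s.
    Assumes every non-exit node has at least one successor."""
    nodes = set(succ)
    pdom = {n: set(nodes) for n in nodes}
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes - {exit_node}:
            new = {n} | set.intersection(*(pdom[s] for s in succ[n]))
            if new != pdom[n]:
                pdom[n] = new
                changed = True
    return pdom

def immediate_post_dominators(pdom, exit_node):
    """The immediate post-dominator of n is the strict post-dominator whose
    own pdom set equals n's strict post-dominator set."""
    ipdom = {}
    for n, doms in pdom.items():
        if n != exit_node:
            strict = doms - {n}
            ipdom[n] = next(m for m in strict if pdom[m] == strict)
    return ipdom

def control_dependence(succ, ipdom):
    """For each CFG edge a->b, every node from b up the post-dominator tree
    to (but excluding) ipdom(a) is control dependent on a."""
    cd = defaultdict(set)
    for a, targets in succ.items():
        for b in targets:
            runner = b
            while runner != ipdom[a]:
                cd[runner].add(a)
                runner = ipdom[runner]
    return dict(cd)

# Toy CFG: 0=ENTRY, 1=if, 2=then, 3=else, 4=join, 5=EXIT.
# The 0->5 edge is the assumed virtual edge that makes ENTRY the region root.
SUCC = {0: [1, 5], 1: [2, 3], 2: [4], 3: [4], 4: [5], 5: []}

pdom = post_dominators(SUCC, exit_node=5)
ipdom = immediate_post_dominators(pdom, exit_node=5)
cdg = control_dependence(SUCC, ipdom)
```

In this toy graph the two branch arms come out control dependent on the `if` node, while the `if` and the join are control dependent on ENTRY.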
k-limited access-path model with per-scope base classification (local/
param/self/global/capture), header-only facts for compound statements,
comprehension scoping, closure-capture and call-mutation rules; classic
worklist reaching definitions with strong kills on exact non-wildcard
paths; DDG edges via textual interference plus the type-based may-alias
oracle (the locked MVP points-to substrate — unknown types conservatively
alias, incompatible types don't). Gate tests cover the loop-carried
dependency, scope shadowing, and the aliased write/read pair. (#67)
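As a rough illustration of the worklist step only, the sketch below solves reaching definitions over plain variable names with strong kills and derives DDG edges from them. It omits access paths, wildcards, weak defs, and the may-alias oracle; the data layout (`defs`, `uses`) and the toy loop are assumptions, not the branch's representation.

```python
from collections import deque

def reaching_definitions(succ, defs):
    """succ: node -> successors; defs: node -> variable defined there.
    Returns node -> set of (variable, defining node) reaching its entry."""
    pred = {n: [] for n in succ}
    for n, targets in succ.items():
        for t in targets:
            pred[t].append(n)
    in_sets = {n: set() for n in succ}
    out_sets = {n: set() for n in succ}
    work = deque(succ)
    while work:
        n = work.popleft()
        in_sets[n] = set().union(*(out_sets[p] for p in pred[n]))
        var = defs.get(n)
        if var is None:
            new_out = set(in_sets[n])
        else:
            # Strong kill: drop earlier defs of the same variable, then gen this one.
            new_out = {d for d in in_sets[n] if d[0] != var} | {(var, n)}
        if new_out != out_sets[n]:
            out_sets[n] = new_out
            work.extend(succ[n])   # successors must be re-evaluated
    return in_sets

def data_dependence(in_sets, uses):
    """DDG edge (def node, use node) when a reaching def's variable is read."""
    return {(d, n) for n, read in uses.items()
            for (var, d) in in_sets[n] if var in read}

# Toy function: 1: i = 0   2: while i < n:   3: i = i + 1   4: return i
# 0 is ENTRY, 5 is EXIT; the parameter n has no def node in this toy.
SUCC = {0: [1], 1: [2], 2: [3, 4], 3: [2], 4: [5], 5: []}
DEFS = {1: "i", 3: "i"}
USES = {2: {"i", "n"}, 3: {"i"}, 4: {"i"}}

ddg = data_dependence(reaching_definitions(SUCC, DEFS), USES)
```

The edges (3, 2) and (3, 3) are the loop-carried dependencies: the increment feeds both the next loop test and its own next iteration.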
PDG = CDG ∪ DDG per callable over the same node ids; intraprocedural
backward slice as reverse reachability. Gate pins hand-computed exact
slices: the early-return arm is excluded from the other arm's slice,
loop slices close over the loop-carried dependency. (#67)
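Reverse reachability over the union of the two edge sets can be sketched in a few lines. The edge sets below are hand-written for a hypothetical early-return function and are not taken from the fixture project.

```python
def backward_slice(dep_edges, criterion):
    """Nodes that can reach `criterion` along dependence edges (src, dst)."""
    preds = {}
    for src, dst in dep_edges:
        preds.setdefault(dst, set()).add(src)
    seen = {criterion}
    stack = [criterion]
    while stack:
        node = stack.pop()
        for p in preds.get(node, ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

# Toy function (0 = ENTRY, which also stands in for the def of parameter x):
#   1: if x < 0:
#   2:     return -1
#   3: y = x * 2
#   4: return y
CDG = {(0, 1), (1, 2), (1, 3), (1, 4)}   # 3 and 4 run only if the early return is not taken
DDG = {(0, 1), (0, 3), (3, 4)}           # x flows to 1 and 3; y flows from 3 to 4

slice_on_return_y = backward_slice(CDG | DDG, criterion=4)
```

Node 2, the early-return arm, does not appear in the slice on node 4, which matches the exclusion the gate describes.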
…lification

Iterative Tarjan SCC condensation of the frozen call-graph oracle
(reverse topological schedule for bottom-up summaries); call mutations
become suffixed weak defs so caller-visible mutation is distinguishable
from local rebinding; global bases gain module::name qualification for
the interprocedural build. (#67)
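A sketch of an iterative (explicit-stack) Tarjan pass, assuming the call graph is a plain dict of caller to callees. The function and node names are hypothetical. Tarjan emits components sinks-first, which is the reverse topological order a bottom-up summary schedule needs: callees before callers.

```python
def tarjan_sccs(graph):
    """Iterative Tarjan. Returns SCCs in reverse topological order of the
    condensation (for caller -> callee edges: callees first)."""
    index, low = {}, {}
    on_stack, stack, sccs = set(), [], []
    counter = 0
    for root in graph:
        if root in index:
            continue
        index[root] = low[root] = counter
        counter += 1
        stack.append(root)
        on_stack.add(root)
        work = [(root, iter(graph[root]))]   # explicit DFS stack instead of recursion
        while work:
            node, successors = work[-1]
            descended = False
            for nxt in successors:
                if nxt not in index:
                    index[nxt] = low[nxt] = counter
                    counter += 1
                    stack.append(nxt)
                    on_stack.add(nxt)
                    work.append((nxt, iter(graph[nxt])))
                    descended = True
                    break
                if nxt in on_stack:
                    low[node] = min(low[node], index[nxt])
            if descended:
                continue
            work.pop()                        # node is fully explored
            if work:
                parent = work[-1][0]
                low[parent] = min(low[parent], low[node])
            if low[node] == index[node]:      # node is the root of a component
                component = []
                while True:
                    member = stack.pop()
                    on_stack.discard(member)
                    component.append(member)
                    if member == node:
                        break
                sccs.append(sorted(component))
    return sccs

# Toy call graph: a and b are mutually recursive.
CALLS = {
    "main": ["a", "helper"],
    "a": ["b"],
    "b": ["a", "leaf"],
    "helper": ["leaf"],
    "leaf": [],
}

schedule = tarjan_sccs(CALLS)
```

Processing `schedule` front to back visits `leaf` before the `a`/`b` cycle and `helper`, and `main` last, so each component's callees already have summaries when it is reached.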
Relational summaries (params/captures/read-globals → return/mutations/
written-globals) composed bottom-up over the Tarjan condensation DAG,
monotone fixpoint within SCCs, callee global footprints injected at
callsites and reaching definitions re-solved; HRB parameter structure
(formal/actual in/out nodes in the owning function's id space after
EXIT), CALL/PARAM_IN/PARAM_OUT edges, SUMMARY edges from composed
flows, globals as extra formals, closure captures bound at definition
sites; builder maps symbol-table signatures to AST by (file, line) and
treats the call graph and Jedi callsite resolutions as frozen oracles.
Gates: arity, no dangling endpoints, transitive-chain SUMMARY, cross-
file global flow, deterministic double-run. (#67)

Classic HRB traversal over the assembled SDG: phase 1 ascends and skips
across callsites via SUMMARY edges (never PARAM_OUT), phase 2 descends
(never PARAM_IN/CALL) — call–return matching without re-descent. Gate
pins an exact hand-computed interprocedural slice (caller_of_mutate →
mutate) plus cross-file global descent and no-reascend properties. (#67)
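The two phases can be sketched as two filtered reverse-reachability passes. The edge-kind names follow the vocabulary above; the tiny SDG (an identity function called from two sites) and all node names are invented for illustration.

```python
def backward_reach(edges, starts, skip_kinds):
    """Reverse reachability over (src, dst, kind) edges, ignoring skip_kinds."""
    preds = {}
    for src, dst, kind in edges:
        if kind not in skip_kinds:
            preds.setdefault(dst, []).append(src)
    seen = set(starts)
    stack = list(starts)
    while stack:
        node = stack.pop()
        for p in preds.get(node, ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def hrb_backward_slice(edges, criterion):
    # Phase 1: stay in the current procedure or ascend to callers; callsites
    # are crossed via SUMMARY edges, never by entering callees (no PARAM_OUT).
    phase1 = backward_reach(edges, {criterion}, {"PARAM_OUT"})
    # Phase 2: descend into callees from everything found so far, never
    # climbing back out (no PARAM_IN, no CALL).
    return backward_reach(edges, phase1, {"PARAM_IN", "CALL"})

def callsite(i, arg):
    """Edges for one hypothetical call `r{i} = ident(arg)`."""
    return [
        (arg, f"ain{i}", "DDG"), (f"call{i}", f"ain{i}", "CDG"),
        (f"call{i}", f"aout{i}", "CDG"), (f"ain{i}", "fin", "PARAM_IN"),
        (f"call{i}", "ident.entry", "CALL"), ("fout", f"aout{i}", "PARAM_OUT"),
        (f"ain{i}", f"aout{i}", "SUMMARY"), (f"aout{i}", f"r{i}", "DDG"),
    ]

# Callee body: def ident(p): return p
CALLEE = [
    ("ident.entry", "fin", "CDG"), ("ident.entry", "ret", "CDG"),
    ("ident.entry", "fout", "CDG"), ("fin", "ret", "DDG"), ("ret", "fout", "DDG"),
]
SDG = callsite(1, "x") + callsite(2, "y") + CALLEE

slice_r2 = hrb_backward_slice(SDG, "r2")
naive_r2 = backward_reach(SDG, {"r2"}, set())
```

Plain reachability (`naive_r2`) leaks through the shared callee and pulls in `x` from the other callsite, while the two-phase traversal keeps the slice on `r2` restricted to `y`, its own callsite, and the callee body.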
…d-depth

program_graphs schema section (PyProgramGraphs and friends, versioned
1.0.0 independently of the application schema) attached to
PyApplication; -a extended to 3 (cumulative: level 3 keeps PyCG
enrichment); --graphs cfg,dfg,pdg,sdg selector with strict validation
(unknown values and level<3 usage exit non-zero, never silently fall
back); --graph-field-depth k-limit knob recorded in the output. -a 1/2
emit no program_graphs and their pipeline is untouched. (#67)
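The strict-validation behavior could be approximated with argparse as below. This is a hypothetical stand-in, not the project's CLI code: the program name, `parse_cli`, the `dest` names, and the error messages are assumptions, and only the flag spellings (-a, --graphs, --graph-field-depth) come from the commit message. No default for --graph-field-depth is assumed.

```python
import argparse

VALID_GRAPHS = ("cfg", "dfg", "pdg", "sdg")

def parse_graphs(value):
    """Parse a comma-separated selector; reject unknown or empty entries."""
    names = [part.strip() for part in value.split(",")]
    unknown = [n for n in names if n not in VALID_GRAPHS]
    if unknown:
        raise argparse.ArgumentTypeError(
            f"unknown graph kind(s) {unknown}; expected a subset of {VALID_GRAPHS}")
    return names

def parse_cli(argv):
    parser = argparse.ArgumentParser(prog="analyzer-sketch")  # hypothetical name
    parser.add_argument("-a", dest="level", type=int, choices=(1, 2, 3), default=1)
    parser.add_argument("--graphs", type=parse_graphs, default=None)
    parser.add_argument("--graph-field-depth", type=int, default=None)
    args = parser.parse_args(argv)
    if args.graphs is not None and args.level < 3:
        # parser.error prints usage and exits with status 2: no silent fallback.
        parser.error("--graphs requires -a 3")
    return args

ok_args = parse_cli(["-a", "3", "--graphs", "cfg,sdg", "--graph-field-depth", "2"])
```

Routing both failure modes through argparse's own error path means an unknown graph name and a `--graphs` selector below level 3 both terminate with a non-zero status instead of being ignored.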
CFGNode label (merge key id = <signature>#<node_id>) carrying both CFG
statements and HRB parameter nodes, plus the shared cross-language edge
vocabulary HAS_CFG_NODE / CFG_NEXT / CDG / DDG / PARAM_IN / PARAM_OUT /
SUMMARY (deliberately unprefixed — parity clause). Additive
schema.neo4j.json bump to 1.2.0; sample app extended so the conformance
tests exercise every new row family; count-parity and no-dangling gates
on the real fixture at -a 3. CALL stays at the callable level (PY_CALLS
twin). (#67)

…ions

README gains the level table, the locked level-3 substrate decisions
(CFG from stdlib ast, hand-built reaching defs, type-based may-alias
MVP, documented unsoundness), a level-3 usage example, and a
regenerated --help block; CHANGELOG Unreleased entry; level-3 schema
decision log tracked at .claude/SCHEMA_DECISIONS.md as SDK-model
input (un-ignored past the global .claude exclude). (#67)

…dges)

Unprefixed CFGNode/CFG_NEXT/CDG/DDG/PARAM_IN/PARAM_OUT/SUMMARY would
mingle analyzers' dependence edges in a Neo4j database holding more
than one language's graph — SDK backends scope queries by label/type
prefix. The vocabulary stays cross-language in shape (same suffixes,
props, semantics) but is PY_-namespaced in the projection like every
other row family; the JSON program_graphs section keeps the unprefixed
contract since each analysis.json is its own namespace. Decision
recorded in .claude/SCHEMA_DECISIONS.md. (#67)
Development

Successfully merging this pull request may close these issues.

Level-3: native dataflow graphs (CFG/DFG/PDG/SDG/CPG) for Python
