Level-3: native dataflow graphs (CFG/DFG/PDG/SDG/CPG) for Python

> ### ⚠️ Scope amendment (2026-07-02): analyzer is a pure graph provider — slicing moves to the SDK too
>
> This issue already (correctly) put **taint** out of scope — see "OUT OF SCOPE" below. The family-wide boundary now standard (cldk-forge PR #7, `dataflow-graphs.md § provider/client boundary`; reference codellm-devkit/codeanalyzer-java#171) extends that to **backward slicing** as well: slicing is a reachability *query* over the SDG, same as taint, so it lives in the SDK, not the analyzer.
>
> **Superseded here → filed as codellm-devkit/python-sdk#229** (the shared client-analysis engine): **GOAL 3** (backward slicing as an SDG query), **PART 3 item 12** (backward slicing + its expected-set gate), the "client gates" in the DoD, and the "slicing" step in DELIVERY. (The "OUT OF SCOPE → SDK-side taint" note already anticipated this — #229 is that SDK home, now covering slicing too.)
>
> **Stays in this analyzer:** the full graph substrate — **GOAL 1** (`program_graphs`: CFG/PDG/SDG), the **transitive `SUMMARY` edges** (PART 2 item 9), and the CPG projection (**GOAL 2**). The intraprocedural PDG **construction** and its correctness are still built and gated here; only the *slice query* over the finished SDG moves to the SDK. Retitled: dropped "and slicing".

---

PROBLEM

codeanalyzer-python today emits the level-1 symbol table and Jedi resolver call
graph, plus level-2 PyCG call-graph enrichment. It has no dataflow: no CFG, no
dependence graphs, no way to answer "what does this value affect". This issue
adds level 3 — native, whole-program dependence graphs built from Python's own
`ast` module, per the CLDK dataflow-graphs contract — and exposes backward
slicing as a query over them.

Native is the constraint: everything runs in-process in the analyzer's own
ecosystem. No external analysis engines, no subprocess to a foreign toolchain.

GOALS (the contract, in one list)

1. Emit CFG, PDG (CDG+DDG), and SDG as first-class sections of analysis.json
   (`program_graphs`, schema_version'd, keyed by canonical (signature, node_id)),
   gated by `-a 3` / `--graphs`.
2. Project the CPG (AST+CFG+PDG overlay) through the existing Neo4j emitter as
   new node labels / edge types; additive schema.neo4j.json bump.
3. Expose backward slicing (two-phase context-sensitive HRB traversal) as an
   SDG query.
4. Hold the cross-language parity clause: shared node kinds / edge types /
   JSON shapes; Python-specific additions are additive and recorded in
   .claude/SCHEMA_DECISIONS.md.
5. Keep `-a 1`/`-a 2` timings unaffected; summaries are content-hash-cacheable
   with recorded dependency structure (incrementality is later, but its hooks
   are now).
6. Deterministic output: implement sequentially first and pass every gate;
   per-callable parallel fan-out (the repo already ships Ray for exactly this
   shape) is staged as follow-up work, differential-tested against the
   sequential output.

OUT OF SCOPE (deliberately)

- Taint analysis: once the SDG is emitted, taint is labeled reachability over
  DDG ∪ PARAM_IN/OUT ∪ SUMMARY with sanitizer blocking — language-independent
  graph traversal that the CLDK SDK will own once, shared across analyzers.
  Only the source/sink/sanitizer model packs are language-specific, and those
  ship as data on the SDK side too.
- Incremental re-analysis: aspirational; dependency structure is recorded from
  the start so it can be switched on later without a rewrite.

SUBSTRATE DECISIONS (locked)

  - CFG source:        hand-built from the stdlib `ast` module — the same parse
                       the symbol-table builder already uses; no second parse
                       tree, no second identity mapping.
  - Def-use source:    hand-built reaching definitions (classic forward
                       worklist); no usable SSA library exists for Python that
                       matches our node-identity needs.
  - Points-to oracle:  type-based may-alias MVP stub — two access paths may
                       alias iff their types are compatible, using Jedi's
                       already-inferred types where available; unknown types
                       conservatively may-alias. Sound-leaning but imprecise;
                       upgrading to a real points-to substrate is staged as
                       follow-up work. Call dispatch comes from the existing
                       merged Jedi(+PyCG) call graph, treated as a frozen
                       oracle.
  - Identity mapping:  graph nodes are keyed by (signature, node_id) — the same
                       PyCallable.signature keys as symbol_table / call_graph;
                       node_id is the source-span-order statement index within
                       the callable (synthetic ENTRY = 0, EXIT = last CFG node).
  - Precision posture: sound-leaning, over-approximate; flow-sensitive on
                       locals, heap precision capped by the type-based oracle;
                       k-limited access paths (--graph-field-depth, default 3).

PART 1 — INTRAPROCEDURAL GRAPHS

  1. Exceptional CFG per callable: statement-level, synthetic ENTRY/EXIT,
     multi-exit normalized; explicit lowering rules for the Python checklist:
     try/except/else/finally, with (implicit try/finally), comprehensions
     (own scope, modeled atomically), generators/yield, async/await,
     decorators (call-site fact, not CFG), raise, assert, short-circuit
     boolean operators, break/continue.
  2. Dominators + post-dominators (Cooper–Harper–Kennedy iterative); synthetic
     edge for infinite loops; control dependence via the post-dominance
     frontier walk (Ferrante–Ottenstein–Warren).
  3. Access-path variable model (k-limited, bases tagged local/param/self/
     global/capture); reaching definitions → DDG edges.
  4. PDG assembly; the intraprocedural backward-slice gate on the fixture
     (hand-computed expected node set, exact match).

PART 2 — INTERPROCEDURAL

  5. Frozen oracles: the merged Jedi(+PyCG) call graph for dispatch; the
     type-based may-alias stub for heap writes. Tarjan SCC condensation of the
     call graph as the bottom-up processing order.
  6. Global/module state (module bindings read/written across functions)
     recorded as extra summary inputs/outputs.
  7. Function summaries: per-callable formal-in → formal-out reachability over
     PDG + current callee summaries, composed bottom-up over the condensation
     DAG; monotone fixpoint within SCCs; k-limiting mandatory for termination.
     (Whole-function reachability instead of hammock-region decomposition —
     regions are an incrementality optimization, staged with incrementality.)
  8. External/library callees default to conservative pass-through: every
     argument flows to the return value and every mutable argument.
  9. SDG assembly: actual/formal in/out nodes, CALL / PARAM_IN / PARAM_OUT
     edges, SUMMARY edges from the composed summaries; globals ride as extra
     formals; closure captures modeled as capture formals bound at the
     definition site.

PART 3 — EMISSION AND CLIENTS

 10. `program_graphs` section in analysis.json per the contract; `--graphs
     cfg,dfg,pdg,sdg` selector and `--graph-field-depth` with strict flag
     validation; `-a 3` extends the existing cumulative level semantics
     (level 3 ⊇ level 2's call graph, which the SDG needs).
 11. CPG projection: CFGNode label + CFG_NEXT/CDG/DDG/PARAM_IN/PARAM_OUT/
     SUMMARY/HAS_CFG_NODE in the neo4j/ subpackage; additive schema.neo4j.json
     bump; conformance test extended.
 12. Backward slicing (two-phase context-sensitive traversal) as an SDG query
     with an exact expected-set gate on the fixture.

CAVEATS AND KNOWN RISKS

  - Python is the weakest-oracle language for points-to: no in-process
    Andersen library exists. The type-based stub is intentionally
    over-approximate; do not trade soundness for a lower false-positive rate —
    ranking/pruning is downstream's job.
  - Inherited unsoundness in Python: eval/exec, reflection (getattr/setattr
    with dynamic names), monkey-patching, C extensions. Documented in the
    README, not silently absorbed.
  - Termination: the interprocedural fixpoint requires k-limiting (mandatory
    knob); summary domains are finite formal-in/formal-out sets.
  - Cost: whole-program summary construction is orders of magnitude slower
    than level 1; everything is gated behind -a 3 and -a 1/-a 2 timings are
    CI-checked to be unaffected.
  - Determinism: node ids are assigned in source-span order during sequential
    construction; emission is collect-then-sort. Parallel fan-out lands later,
    differential-tested against sequential output.

DELIVERY

Single feature branch with stage-per-commit history (fixture → CFG → dominance
→ def-use → PDG → oracles → summaries/SDG → slicing → schema/CLI emission →
CPG projection → docs), one PR referencing this issue. Follow-ups staged as
separate issues/PRs: real points-to substrate, per-callable parallel fan-out,
incremental re-analysis, SDK-side taint.

VERIFICATION / DEFINITION OF DONE

  - Every gate in the dataflow-construction spec passes on the fixture (CFG,
    dominance, DFG, PDG-slice, summary, SDG, client gates) — exact expected
    sets, not "non-empty".
  - Fixture covers the full Python lowering checklist plus the shared fixture
    minimums (aliasing, SCC recursion, multi-file flow, closure capture,
    module-global flow).
  - analysis.json with -a 3 validates against the program_graphs models;
    parity clause holds (no renamed/repurposed shared vocabulary).
  - Cypher snapshot with graphs loads clean into empty Neo4j; CFGNode count
    matches JSON; no dangling edges (deferred-edge gate).
  - -a 1 / -a 2 wall-clock unchanged within noise on the benchmark fixture,
    and their output is byte-identical to pre-change output.



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Level-3: native dataflow graphs (CFG/DFG/PDG/SDG/CPG) for Python #67

⚠️ Scope amendment (2026-07-02): analyzer is a pure graph provider — slicing moves to the SDK too

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Level-3: native dataflow graphs (CFG/DFG/PDG/SDG/CPG) for Python #67

Description

⚠️ Scope amendment (2026-07-02): analyzer is a pure graph provider — slicing moves to the SDK too

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions