Skip to content

Level-3: native dataflow graphs (CFG/DFG/PDG/SDG/CPG) for Python #67

Description

@rahlk

⚠️ Scope amendment (2026-07-02): analyzer is a pure graph provider — slicing moves to the SDK too

This issue already (correctly) put taint out of scope — see "OUT OF SCOPE" below. The family-wide boundary now standard (cldk-forge PR #7, dataflow-graphs.md § provider/client boundary; reference codellm-devkit/codeanalyzer-java#171) extends that to backward slicing as well: slicing is a reachability query over the SDG, same as taint, so it lives in the SDK, not the analyzer.

Superseded here → filed as codellm-devkit/python-sdk#229 (the shared client-analysis engine): GOAL 3 (backward slicing as an SDG query), PART 3 item 12 (backward slicing + its expected-set gate), the "client gates" in the DoD, and the "slicing" step in DELIVERY. (The "OUT OF SCOPE → SDK-side taint" note already anticipated this — #229 is that SDK home, now covering slicing too.)

Stays in this analyzer: the full graph substrate — GOAL 1 (program_graphs: CFG/PDG/SDG), the transitive SUMMARY edges (PART 2 item 9), and the CPG projection (GOAL 2). The intraprocedural PDG construction and its correctness are still built and gated here; only the slice query over the finished SDG moves to the SDK. Retitled: dropped "and slicing".


PROBLEM

codeanalyzer-python today emits the level-1 symbol table and Jedi resolver call
graph, plus level-2 PyCG call-graph enrichment. It has no dataflow: no CFG, no
dependence graphs, no way to answer "what does this value affect". This issue
adds level 3 — native, whole-program dependence graphs built from Python's own
ast module, per the CLDK dataflow-graphs contract — and exposes backward
slicing as a query over them.

Native is the constraint: everything runs in-process in the analyzer's own
ecosystem. No external analysis engines, no subprocess to a foreign toolchain.

GOALS (the contract, in one list)

  1. Emit CFG, PDG (CDG+DDG), and SDG as first-class sections of analysis.json
    (program_graphs, schema_version'd, keyed by canonical (signature, node_id)),
    gated by -a 3 / --graphs.
  2. Project the CPG (AST+CFG+PDG overlay) through the existing Neo4j emitter as
    new node labels / edge types; additive schema.neo4j.json bump.
  3. Expose backward slicing (two-phase context-sensitive HRB traversal) as an
    SDG query.
  4. Hold the cross-language parity clause: shared node kinds / edge types /
    JSON shapes; Python-specific additions are additive and recorded in
    .claude/SCHEMA_DECISIONS.md.
  5. Keep -a 1/-a 2 timings unaffected; summaries are content-hash-cacheable
    with recorded dependency structure (incrementality is later, but its hooks
    are now).
  6. Deterministic output: implement sequentially first and pass every gate;
    per-callable parallel fan-out (the repo already ships Ray for exactly this
    shape) is staged as follow-up work, differential-tested against the
    sequential output.

OUT OF SCOPE (deliberately)

  • Taint analysis: once the SDG is emitted, taint is labeled reachability over
    DDG ∪ PARAM_IN/OUT ∪ SUMMARY with sanitizer blocking — language-independent
    graph traversal that the CLDK SDK will own once, shared across analyzers.
    Only the source/sink/sanitizer model packs are language-specific, and those
    ship as data on the SDK side too.
  • Incremental re-analysis: aspirational; dependency structure is recorded from
    the start so it can be switched on later without a rewrite.

SUBSTRATE DECISIONS (locked)

  • CFG source: hand-built from the stdlib ast module — the same parse
    the symbol-table builder already uses; no second parse
    tree, no second identity mapping.
  • Def-use source: hand-built reaching definitions (classic forward
    worklist); no usable SSA library exists for Python that
    matches our node-identity needs.
  • Points-to oracle: type-based may-alias MVP stub — two access paths may
    alias iff their types are compatible, using Jedi's
    already-inferred types where available; unknown types
    conservatively may-alias. Sound-leaning but imprecise;
    upgrading to a real points-to substrate is staged as
    follow-up work. Call dispatch comes from the existing
    merged Jedi(+PyCG) call graph, treated as a frozen
    oracle.
  • Identity mapping: graph nodes are keyed by (signature, node_id) — the same
    PyCallable.signature keys as symbol_table / call_graph;
    node_id is the source-span-order statement index within
    the callable (synthetic ENTRY = 0, EXIT = last CFG node).
  • Precision posture: sound-leaning, over-approximate; flow-sensitive on
    locals, heap precision capped by the type-based oracle;
    k-limited access paths (--graph-field-depth, default 3).

PART 1 — INTRAPROCEDURAL GRAPHS

  1. Exceptional CFG per callable: statement-level, synthetic ENTRY/EXIT,
    multi-exit normalized; explicit lowering rules for the Python checklist:
    try/except/else/finally, with (implicit try/finally), comprehensions
    (own scope, modeled atomically), generators/yield, async/await,
    decorators (call-site fact, not CFG), raise, assert, short-circuit
    boolean operators, break/continue.
  2. Dominators + post-dominators (Cooper–Harper–Kennedy iterative); synthetic
    edge for infinite loops; control dependence via the post-dominance
    frontier walk (Ferrante–Ottenstein–Warren).
  3. Access-path variable model (k-limited, bases tagged local/param/self/
    global/capture); reaching definitions → DDG edges.
  4. PDG assembly; the intraprocedural backward-slice gate on the fixture
    (hand-computed expected node set, exact match).

PART 2 — INTERPROCEDURAL

  1. Frozen oracles: the merged Jedi(+PyCG) call graph for dispatch; the
    type-based may-alias stub for heap writes. Tarjan SCC condensation of the
    call graph as the bottom-up processing order.
  2. Global/module state (module bindings read/written across functions)
    recorded as extra summary inputs/outputs.
  3. Function summaries: per-callable formal-in → formal-out reachability over
    PDG + current callee summaries, composed bottom-up over the condensation
    DAG; monotone fixpoint within SCCs; k-limiting mandatory for termination.
    (Whole-function reachability instead of hammock-region decomposition —
    regions are an incrementality optimization, staged with incrementality.)
  4. External/library callees default to conservative pass-through: every
    argument flows to the return value and every mutable argument.
  5. SDG assembly: actual/formal in/out nodes, CALL / PARAM_IN / PARAM_OUT
    edges, SUMMARY edges from the composed summaries; globals ride as extra
    formals; closure captures modeled as capture formals bound at the
    definition site.

PART 3 — EMISSION AND CLIENTS

  1. program_graphs section in analysis.json per the contract; --graphs cfg,dfg,pdg,sdg selector and --graph-field-depth with strict flag
    validation; -a 3 extends the existing cumulative level semantics
    (level 3 ⊇ level 2's call graph, which the SDG needs).
  2. CPG projection: CFGNode label + CFG_NEXT/CDG/DDG/PARAM_IN/PARAM_OUT/
    SUMMARY/HAS_CFG_NODE in the neo4j/ subpackage; additive schema.neo4j.json
    bump; conformance test extended.
  3. Backward slicing (two-phase context-sensitive traversal) as an SDG query
    with an exact expected-set gate on the fixture.

CAVEATS AND KNOWN RISKS

  • Python is the weakest-oracle language for points-to: no in-process
    Andersen library exists. The type-based stub is intentionally
    over-approximate; do not trade soundness for a lower false-positive rate —
    ranking/pruning is downstream's job.
  • Inherited unsoundness in Python: eval/exec, reflection (getattr/setattr
    with dynamic names), monkey-patching, C extensions. Documented in the
    README, not silently absorbed.
  • Termination: the interprocedural fixpoint requires k-limiting (mandatory
    knob); summary domains are finite formal-in/formal-out sets.
  • Cost: whole-program summary construction is orders of magnitude slower
    than level 1; everything is gated behind -a 3 and -a 1/-a 2 timings are
    CI-checked to be unaffected.
  • Determinism: node ids are assigned in source-span order during sequential
    construction; emission is collect-then-sort. Parallel fan-out lands later,
    differential-tested against sequential output.

DELIVERY

Single feature branch with stage-per-commit history (fixture → CFG → dominance
→ def-use → PDG → oracles → summaries/SDG → slicing → schema/CLI emission →
CPG projection → docs), one PR referencing this issue. Follow-ups staged as
separate issues/PRs: real points-to substrate, per-callable parallel fan-out,
incremental re-analysis, SDK-side taint.

VERIFICATION / DEFINITION OF DONE

  • Every gate in the dataflow-construction spec passes on the fixture (CFG,
    dominance, DFG, PDG-slice, summary, SDG, client gates) — exact expected
    sets, not "non-empty".
  • Fixture covers the full Python lowering checklist plus the shared fixture
    minimums (aliasing, SCC recursion, multi-file flow, closure capture,
    module-global flow).
  • analysis.json with -a 3 validates against the program_graphs models;
    parity clause holds (no renamed/repurposed shared vocabulary).
  • Cypher snapshot with graphs loads clean into empty Neo4j; CFGNode count
    matches JSON; no dangling edges (deferred-edge gate).
  • -a 1 / -a 2 wall-clock unchanged within noise on the benchmark fixture,
    and their output is byte-identical to pre-change output.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions