Level-3: native dataflow graphs (CFG/PDG/SDG/CPG) and backward slicing #68
Open
rahlk wants to merge 11 commits into
Level-3 groundwork (#67): hand-built CFG from the stdlib ast with the shared node/edge vocabulary, Python lowering rules (try/except/else/finally, with, yield/await resume kinds, break/continue, synthetic escape edge for infinite loops), dead-code pruning, and source-span-ordered node ids (ENTRY=0, EXIT=last). Dataflow fixture project and CFG gate tests included.
Cooper–Harvey–Kennedy iterative post-dominators over the reverse CFG (unique root EXIT, guaranteed by stage 1's synthetic escape edges) and Ferrante–Ottenstein–Warren control dependence with ENTRY as the region root. Gate tests pin exact hand-computed CDG sets for the fixture's if/loop/early-return functions. (#67)
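As a rough illustration of the two steps this commit describes, the sketch below computes post-dominators with the textbook set-based fixpoint (a simpler stand-in for the iterative algorithm named above, not the PR's code) and derives control dependence from them. The function names and the toy CFG are hypothetical. Making ENTRY the region root is done here by adding a virtual ENTRY → EXIT edge, which is the usual augmentation and an assumption about how the PR achieves it.

```python
def post_dominators(succ, exit_node):
    """Set-based fixpoint: pdom(n) = {n} | intersection of pdom(s) over successors s.

    Assumes every node can reach exit_node.
    """
    nodes = set(succ)
    pdom = {n: set(nodes) for n in nodes}
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes - {exit_node}:
            new = set(nodes)
            for s in succ[n]:
                new &= pdom[s]
            new |= {n}
            if new != pdom[n]:
                pdom[n] = new
                changed = True
    return pdom


def control_dependence(succ, entry, exit_node):
    """Map each controlling node to the set of nodes control dependent on it."""
    # Virtual ENTRY -> EXIT edge so top-level statements depend on ENTRY.
    aug = {n: list(targets) for n, targets in succ.items()}
    if exit_node not in aug[entry]:
        aug[entry].append(exit_node)
    pdom = post_dominators(aug, exit_node)
    cdg = {}
    for a, targets in aug.items():
        strict = pdom[a] - {a}  # strict post-dominators of a
        deps = set()
        for b in targets:
            # y depends on a if y post-dominates a successor of a
            # but does not strictly post-dominate a itself.
            deps |= pdom[b] - strict
        if deps:
            cdg[a] = deps
    return cdg


# Toy CFG: 0=ENTRY, 1=if, 2=then, 3=else, 4=join, 5=EXIT
cfg = {0: [1], 1: [2, 3], 2: [4], 3: [4], 4: [5], 5: []}
cdg = control_dependence(cfg, entry=0, exit_node=5)
print(cdg)
```

In this toy graph the two branch arms depend on the `if` node, while the `if` and the join depend only on ENTRY.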
k-limited access-path model with per-scope base classification (local/param/self/global/capture), header-only facts for compound statements, comprehension scoping, closure-capture and call-mutation rules; classic worklist reaching definitions with strong kills on exact non-wildcard paths; DDG edges via textual interference plus the type-based may-alias oracle (the locked MVP points-to substrate: unknown types conservatively alias, incompatible types don't). Gate tests cover the loop-carried dependency, scope shadowing, and the aliased write/read pair. (#67)
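The worklist reaching-definitions step can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the PR's implementation: access paths are plain strings, each node defines at most one path, a boolean marks whether the definition is strong (exact path, kills earlier definitions) or weak (for example a wildcard path, which only adds), and the may-alias oracle is omitted. All names are hypothetical.

```python
from collections import deque


def reaching_definitions(succ, defs):
    """Return IN sets of (path, def_node) pairs for every CFG node.

    defs maps node -> (path, is_strong); nodes without a definition are absent.
    """
    pred = {n: [] for n in succ}
    for n, targets in succ.items():
        for t in targets:
            pred[t].append(n)
    in_sets = {n: set() for n in succ}
    out_sets = {n: set() for n in succ}
    work = deque(succ)
    while work:
        n = work.popleft()
        in_sets[n] = set().union(*(out_sets[p] for p in pred[n]))
        new_out = set(in_sets[n])
        if n in defs:
            path, is_strong = defs[n]
            if is_strong:
                # Strong kill: drop every earlier definition of the same path.
                new_out = {d for d in new_out if d[0] != path}
            new_out.add((path, n))
        if new_out != out_sets[n]:
            out_sets[n] = new_out
            work.extend(succ[n])
    return in_sets


def data_dependence_edges(in_sets, uses):
    """DDG edge (def_node, use_node) when a reaching definition's path is used."""
    return {(d, n) for n, used in uses.items() for (path, d) in in_sets[n] if path in used}


# 0=ENTRY, 1: total = 0, 2: loop test, 3: total = total + x, 4: return total, 5=EXIT
cfg = {0: [1], 1: [2], 2: [3, 4], 3: [2], 4: [5], 5: []}
defs = {1: ("total", True), 3: ("total", True)}
uses = {3: {"total"}, 4: {"total"}}
rd_in = reaching_definitions(cfg, defs)
ddg = data_dependence_edges(rd_in, uses)
print(sorted(ddg))
```

The edge from node 3 to itself is the loop-carried dependency: the definition inside the loop body reaches its own use on the next iteration.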
PDG = CDG ∪ DDG per callable over the same node ids; intraprocedural backward slice as reverse reachability. Gate pins hand-computed exact slices: the early-return arm is excluded from the other arm's slice, loop slices close over the loop-carried dependency. (#67)
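A backward slice as reverse reachability can be sketched in a few lines. The PDG below is hand-written for a hypothetical early-return function (it is not taken from the PR's fixture), with CDG and DDG edges merged into one edge set as the commit message describes.

```python
from collections import defaultdict


def backward_slice(edges, criterion):
    """All nodes from which `criterion` is reachable along PDG edges."""
    rev = defaultdict(set)
    for src, dst in edges:
        rev[dst].add(src)
    seen = {criterion}
    stack = [criterion]
    while stack:
        node = stack.pop()
        for parent in rev[node]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# def f(x):            # 0 ENTRY (defines parameter x)
#     if x < 0:        # 1
#         return -1    # 2
#     y = x * 2        # 3
#     return y         # 4
cdg_edges = {(0, 1), (1, 2), (1, 3), (1, 4)}  # the early return makes 3 and 4 depend on 1
ddg_edges = {(0, 1), (0, 3), (3, 4)}
pdg_edges = cdg_edges | ddg_edges

slice_of_return_y = backward_slice(pdg_edges, 4)
print(sorted(slice_of_return_y))
```

The early-return arm (node 2) is absent from the slice of `return y`, which is the property the gate test above pins.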
…lification Iterative Tarjan SCC condensation of the frozen call-graph oracle (reverse topological schedule for bottom-up summaries); call mutations become suffixed weak defs so caller-visible mutation is distinguishable from local rebinding; global bases gain module::name qualification for the interprocedural build. (#67)
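For readers unfamiliar with the non-recursive form, here is a minimal sketch of iterative Tarjan SCC over a call graph given as an adjacency dict. It is an illustration only; the function name and the toy call graph are hypothetical. Tarjan emits components in reverse topological order of the condensation, so callees come out before their callers, which is the schedule bottom-up summaries need.

```python
def tarjan_scc(graph):
    """Iterative Tarjan; returns SCCs with callees before callers."""
    index, low = {}, {}
    on_stack, stack, sccs = set(), [], []
    counter = 0

    for root in graph:
        if root in index:
            continue
        index[root] = low[root] = counter
        counter += 1
        stack.append(root)
        on_stack.add(root)
        work = [(root, iter(graph.get(root, ())))]
        while work:
            node, successors = work[-1]
            descended = False
            for nxt in successors:
                if nxt not in index:
                    # Unvisited: descend, emulating the recursive call.
                    index[nxt] = low[nxt] = counter
                    counter += 1
                    stack.append(nxt)
                    on_stack.add(nxt)
                    work.append((nxt, iter(graph.get(nxt, ()))))
                    descended = True
                    break
                if nxt in on_stack:
                    low[node] = min(low[node], index[nxt])
            if descended:
                continue
            # All successors done: emulate the return from the recursive call.
            work.pop()
            if work:
                parent = work[-1][0]
                low[parent] = min(low[parent], low[node])
            if low[node] == index[node]:
                component = []
                while True:
                    member = stack.pop()
                    on_stack.discard(member)
                    component.append(member)
                    if member == node:
                        break
                sccs.append(sorted(component))
    return sccs


call_graph = {"main": ["a"], "a": ["b"], "b": ["a", "leaf"], "leaf": []}
schedule = tarjan_scc(call_graph)
print(schedule)
```

Here `a` and `b` are mutually recursive and collapse into one component, scheduled after `leaf` and before `main`.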
Relational summaries (params/captures/read-globals → return/mutations/written-globals) composed bottom-up over the Tarjan condensation DAG, monotone fixpoint within SCCs, callee global footprints injected at callsites and reaching definitions re-solved; HRB parameter structure (formal/actual in/out nodes in the owning function's id space after EXIT), CALL/PARAM_IN/PARAM_OUT edges, SUMMARY edges from composed flows, globals as extra formals, closure captures bound at definition sites; builder maps symbol-table signatures to AST by (file, line) and treats the call graph and Jedi callsite resolutions as frozen oracles. Gates: arity, no dangling endpoints, transitive-chain SUMMARY, cross-file global flow, deterministic double-run. (#67)
Classic HRB traversal over the assembled SDG: phase 1 ascends and skips across callsites via SUMMARY edges (never PARAM_OUT), phase 2 descends (never PARAM_IN/CALL) — call–return matching without re-descent. Gate pins an exact hand-computed interprocedural slice (caller_of_mutate → mutate) plus cross-file global descent and no-reascend properties. (#67)
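The two-phase traversal can be sketched as two reverse-reachability passes that differ only in which edge kinds they refuse to cross. The SDG below is a hand-built toy with hypothetical node names (a caller, a second unrelated caller, and one callee); it is not the PR's fixture, and edges are simplified to (source, target, kind) triples.

```python
from collections import defaultdict


def _reach_back(edges, starts, skip):
    """Reverse reachability from `starts`, ignoring edge kinds in `skip`."""
    rev = defaultdict(list)
    for src, dst, kind in edges:
        if kind not in skip:
            rev[dst].append(src)
    seen = set(starts)
    stack = list(starts)
    while stack:
        node = stack.pop()
        for parent in rev[node]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def two_phase_backward_slice(edges, criterion):
    # Phase 1: ascend into callers and step over calls via SUMMARY; never descend.
    phase1 = _reach_back(edges, {criterion}, skip={"PARAM_OUT"})
    # Phase 2: descend into callees; never climb back out.
    return _reach_back(edges, phase1, skip={"PARAM_IN", "CALL"})


sdg = [
    # caller: xs = []; mutate(xs); return xs
    ("c_entry", "c_def", "CDG"), ("c_entry", "c_call", "CDG"), ("c_entry", "c_use", "CDG"),
    ("c_call", "c_ain", "CDG"), ("c_call", "c_aout", "CDG"),
    ("c_def", "c_ain", "DDG"), ("c_aout", "c_use", "DDG"),
    ("c_ain", "c_aout", "SUMMARY"),
    # callee: def mutate(xs): xs.append(1)
    ("m_entry", "m_fin", "CDG"), ("m_entry", "m_body", "CDG"), ("m_entry", "m_fout", "CDG"),
    ("m_fin", "m_body", "DDG"), ("m_body", "m_fout", "DDG"),
    # linkage for the caller
    ("c_call", "m_entry", "CALL"), ("c_ain", "m_fin", "PARAM_IN"), ("m_fout", "c_aout", "PARAM_OUT"),
    # a second, unrelated caller of the same callee
    ("o_call", "o_ain", "CDG"), ("o_call", "o_aout", "CDG"),
    ("o_call", "m_entry", "CALL"), ("o_ain", "m_fin", "PARAM_IN"), ("m_fout", "o_aout", "PARAM_OUT"),
]

sliced = two_phase_backward_slice(sdg, "c_use")
print(sorted(sliced))
```

Because phase 2 never follows PARAM_IN or CALL backwards, the slice enters `mutate` but does not leak into the unrelated caller's `o_*` nodes.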
…d-depth program_graphs schema section (PyProgramGraphs and friends, versioned 1.0.0 independently of the application schema) attached to PyApplication; -a extended to 3 (cumulative: level 3 keeps PyCG enrichment); --graphs cfg,dfg,pdg,sdg selector with strict validation (unknown values and level<3 usage exit non-zero, never silently fall back); --graph-field-depth k-limit knob recorded in the output. -a 1/2 emit no program_graphs and their pipeline is untouched. (#67)
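The strict-validation behaviour described here could look roughly like the argparse sketch below. Only the flag spellings (`-a`, `--graphs`, `--graph-field-depth`) and the accepted graph names come from the commit message; the parser name, defaults, error wording, and the assumption that `--graph-field-depth` is also rejected below level 3 are illustrative guesses, not the project's actual CLI code.

```python
import argparse

VALID_GRAPHS = ("cfg", "dfg", "pdg", "sdg")


def parse_graphs(value):
    """Parse a comma-separated selector, rejecting unknown or empty entries."""
    names = [part.strip() for part in value.split(",")]
    unknown = [name for name in names if name not in VALID_GRAPHS]
    if unknown:
        raise argparse.ArgumentTypeError(
            f"unknown graph kind(s): {unknown}; expected any of {VALID_GRAPHS}"
        )
    return names


def parse_args(argv):
    parser = argparse.ArgumentParser(prog="analyzer-sketch")  # hypothetical name
    parser.add_argument("-a", dest="level", type=int, choices=(1, 2, 3), default=1)
    parser.add_argument("--graphs", type=parse_graphs, default=None)
    parser.add_argument("--graph-field-depth", type=int, default=None)
    args = parser.parse_args(argv)
    graph_flags_used = args.graphs is not None or args.graph_field_depth is not None
    if args.level < 3 and graph_flags_used:
        # parser.error prints usage and exits with status 2: no silent fallback.
        parser.error("--graphs and --graph-field-depth require -a 3")
    return args


ok = parse_args(["-a", "3", "--graphs", "cfg,pdg", "--graph-field-depth", "2"])
print(ok.level, ok.graphs, ok.graph_field_depth)
```

Routing both failure modes through argparse's own error path keeps the exit status non-zero without any custom exit handling.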
CFGNode label (merge key id = <signature>#<node_id>) carrying both CFG statements and HRB parameter nodes, plus the shared cross-language edge vocabulary HAS_CFG_NODE / CFG_NEXT / CDG / DDG / PARAM_IN / PARAM_OUT / SUMMARY (deliberately unprefixed — parity clause). Additive schema.neo4j.json bump to 1.2.0; sample app extended so the conformance tests exercise every new row family; count-parity and no-dangling gates on the real fixture at -a 3. CALL stays at the callable level (PY_CALLS twin). (#67)
…ions README gains the level table, the locked level-3 substrate decisions (CFG from stdlib ast, hand-built reaching defs, type-based may-alias MVP, documented unsoundness), a level-3 usage example, and a regenerated --help block; CHANGELOG Unreleased entry; level-3 schema decision log tracked at .claude/SCHEMA_DECISIONS.md as SDK-model input (un-ignored past the global .claude exclude). (#67)
…dges) Unprefixed CFGNode/CFG_NEXT/CDG/DDG/PARAM_IN/PARAM_OUT/SUMMARY would mingle analyzers' dependence edges in a Neo4j database holding more than one language's graph — SDK backends scope queries by label/type prefix. The vocabulary stays cross-language in shape (same suffixes, props, semantics) but is PY_-namespaced in the projection like every other row family; the JSON program_graphs section keeps the unprefixed contract since each analysis.json is its own namespace. Decision recorded in .claude/SCHEMA_DECISIONS.md. (#67)
Closes #67.
Adds analysis level 3: native, whole-program dependence graphs (CFG → PDG → SDG) built in-process from the stdlib ast, emitted as the program_graphs section of analysis.json, projected into Neo4j as the CPG overlay, and queryable with a context-sensitive backward slicer. Built stage-per-commit along the CLDK dataflow construction ladder, each stage gated by its contract tests before the next began.

What each commit delivers
module::name global qualification.
PyProgramGraphs schema section (own schema_version 1.0.0); -a 3; --graphs cfg,dfg,pdg,sdg and --graph-field-depth with strict validation (unknown values / use below -a 3 exit non-zero).
PyCFGNode + PY_HAS_CFG_NODE / PY_CFG_NEXT / PY_CDG / PY_DDG / PY_PARAM_IN / PY_PARAM_OUT / PY_SUMMARY through the existing neo4j/ row machinery: cross-language in shape, PY_-namespaced like every other row family so multi-language databases never mingle analyzers' dependence edges (the JSON program_graphs section keeps the unprefixed shared contract; each analysis.json is its own namespace); additive schema.neo4j.json bump to 1.2.0; conformance tests extended.
--help; CHANGELOG; .claude/SCHEMA_DECISIONS.md (the SDK-model input).

Verification
analysis.json at -a 3 round-trips through the Pydantic models; -a 1 / -a 2 emit no program_graphs and their pipeline is untouched (level 3 is a single gated block after app assembly).
test_cli_call_symbol_table_with_json, the xarray --ray test) fails identically on main: a pre-existing local Ray worker-startup issue, unrelated to this branch.

Scope notes (per the epic)