codeanalyzer-clang (canclang)

A C/C++ static-analysis toolkit — the CLDK backend that emits a canonical symbol table and call graph, as analysis.json or a Neo4j property graph, using LLVM/Clang.


canclang is a static analyzer for C and C++ built on LLVM/Clang via libclang. It produces the canonical CodeLLM-DevKit (CLDK) analysis.json — a symbol table plus a call graph — and can project that same analysis into a Neo4j property graph. It is the C/C++ backend behind CLDK, mirroring its Python (canpy), TypeScript (cants), and Java siblings — so output-shape parity with them is a first-class concern.

Because libclang exposes the Clang front end, structural parsing and type, overload, and virtual-dispatch resolution all come from one tool: the resolver call graph (enabled with -a 2) is read directly from the Clang AST (cursor.referenced) rather than inferred from a shallow name match.
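As a rough illustration of AST-based resolution, the sketch below walks a cursor tree and reads cursor.referenced on each call expression. It is not canclang's implementation: the helper names (resolved_calls, calls_in_file) and the CALLABLE_KINDS filter are assumptions made for this example, and only the clang.cindex calls themselves (Index.create, parse, walk_preorder, kind, referenced, displayname) come from the libclang Python bindings.

```python
# Cursor kinds that name a concrete callable declaration (assumed filter).
CALLABLE_KINDS = {
    "FUNCTION_DECL", "CXX_METHOD", "CONSTRUCTOR",
    "DESTRUCTOR", "CONVERSION_FUNCTION", "FUNCTION_TEMPLATE",
}


def resolved_calls(root):
    """Yield (call spelling, callee display name or None) for each call.

    `root` is a clang.cindex.Cursor, or any object exposing the same
    walk_preorder(), kind.name, spelling, and referenced attributes.
    """
    for node in root.walk_preorder():
        if node.kind.name != "CALL_EXPR":
            continue
        callee = node.referenced  # the declaration Clang bound this call to
        if callee is None or callee.kind.name not in CALLABLE_KINDS:
            # e.g. a call through a function-pointer variable: no callable decl
            yield node.spelling, None
        else:
            yield node.spelling, callee.displayname


def calls_in_file(path, args=()):
    """Parse one source file with libclang and list its calls."""
    from clang import cindex  # lazy import: requires the libclang wheel

    tu = cindex.Index.create().parse(path, args=list(args))
    return list(resolved_calls(tu.cursor))
```

With the libclang wheel installed, calls_in_file("src/main.cpp", ["-std=c++17"]) would return one tuple per call expression. Entries whose second element is None correspond to calls that this sketch treats as indirect.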

Features

  • Symbol table — translation units, classes/structs/unions, methods, free functions, constructors/destructors, fields, globals, enums, typedefs, macros, #includes, and doc comments, with precise source spans and rich C/C++ flags (virtual, pure_virtual, const, static, inline, variadic, storage class, access specifier, templates, namespaces).
  • Call graph — resolved directly from the Clang AST: identity-only edges whose endpoints are real symbol-table signatures, with provenance=["clang"]. Constructors, member/virtual dispatch, and cross-file (out-of-line) method definitions are handled; indirect (function-pointer) calls are flagged and skipped.
  • Neo4j output — project the analysis into a labeled property graph: a self-contained graph.cypher snapshot, or an incremental push to a live database over Bolt.
  • Versioned schema — a machine-readable, version-stamped Neo4j schema contract (--emit schema).
  • Caching — a content-hash per-file cache under .codeanalyzer-clang/, so re-analysis only touches what changed.
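The caching idea can be sketched in a few lines. This is only an illustration of content-hash invalidation, not the layout canclang actually uses under .codeanalyzer-clang/: the SHA-256 choice, the JSON manifest, and the function names are all assumptions.

```python
import hashlib
import json
from pathlib import Path


def content_hash(path: Path) -> str:
    """Hash file bytes, so touching a file without editing it stays a cache hit."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def stale_files(files, manifest_path: Path):
    """Return (files whose hash changed, fresh manifest to persist).

    The manifest is a hypothetical {path: sha256} JSON map.
    """
    old = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    new = {str(f): content_hash(f) for f in files}
    changed = [f for f in files if old.get(str(f)) != new[str(f)]]
    return changed, new
```

Only the files in the first return value would need re-analysis. Writing the second return value back to the manifest makes the next run a no-op for unchanged sources.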

Architecture & Tooling

These are the load-bearing backend decisions for this analyzer (see .claude/SCHEMA_DECISIONS.md for the full node-by-node schema rationale):

codeanalyzer-clang — architecture & tooling
  depth:          level 1 — symbol table + libclang resolver call graph; level 2 (SVF) stubbed
  runtime:        Python (libclang bindings), invoked in-process by the SDK
  structural:     libclang / Clang AST (clang.cindex)
  resolution:     libclang / Clang AST — SAME tool (cursor.referenced resolves callees,
                  incl. C++ overloads and virtual dispatch)
  framework (L2): LLVM-IR + SVF points-to — OFF by default, behind --svf, stubbed in this build
  build/deps:     optional compile_commands.json compilation database (accurate include paths/flags);
                  degrades to a language default (-x c/-x c++ -std=…) when absent
  packaging:      pip package `codeanalyzer-clang`, invoked in-process (the Python-analyzer
                  exception to the self-contained-binary rule); `pip install libclang` bundles
                  the native lib, so SDK users need no separate LLVM
  extra nodes:    struct/union (record_kind), enum, typedef, macro, namespace (as signature scope)

Rationale for the non-default choices. The self-contained-binary packaging rule is waived for one reason: this analyzer is written in Python (mirroring codeanalyzer-python), so it ships as a pip package invoked in-process, with no subprocess and no cross-compiled binary. The heavy level-2 backend is SVF (Andersen/Steensgaard points-to analysis over LLVM bitcode). It is more precise than RTA, and C/C++ is the one new-language case (like Java with WALA) where a true heavyweight call-graph builder is available. It is scaffolded but stubbed, since level 1 is the default depth.
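The compilation-database fallback in the summary above can be pictured as follows. This is a minimal sketch, not the analyzer's code: the function names are hypothetical, and the default standards (c11, c++17) are assumptions, since the summary elides the actual value. The entry fields (directory, file, arguments, command) follow the JSON Compilation Database format.

```python
import json
import shlex
from pathlib import Path


def default_args(source: Path, std=None):
    """Language default used when no compilation database entry applies."""
    is_c = source.suffix.lower() == ".c"
    lang = "c" if is_c else "c++"
    std = std or ("c11" if is_c else "c++17")  # assumed defaults
    return ["-x", lang, f"-std={std}"]


def args_for(source: Path, build_dir=None, std=None):
    """Prefer flags from compile_commands.json; otherwise degrade to defaults."""
    db = Path(build_dir) / "compile_commands.json" if build_dir else None
    if db and db.is_file():
        for entry in json.loads(db.read_text()):
            entry_file = (Path(entry["directory"]) / entry["file"]).resolve()
            if entry_file == source.resolve():
                argv = entry.get("arguments") or shlex.split(entry["command"])
                # Drop the compiler executable. A real consumer would also
                # strip the source path and -c/-o before handing these on.
                return argv[1:]
    return default_args(source, std)
```

The design point is that a missing or incomplete database never blocks analysis. It only lowers accuracy for include paths and macros.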

Installation

Prerequisites

  • Python ≥ 3.9 (tested on 3.14).
  • libclang — the native Clang library. The libclang PyPI wheel (a declared dependency) bundles it, so pip install codeanalyzer-clang is normally self-sufficient. If you prefer a system LLVM:
    • macOS: brew install llvm
    • Debian/Ubuntu: apt-get install libclang-dev
  • (optional) a compile_commands.json compilation database for accurate include paths — generate it with cmake -DCMAKE_EXPORT_COMPILE_COMMANDS=ON … or Bear.
  • (optional, --emit neo4j --neo4j-uri) the neo4j driver: pip install 'codeanalyzer-clang[neo4j]'.

Install via pip (PyPI)

pip install codeanalyzer-clang
# with the live Neo4j push extra:
pip install 'codeanalyzer-clang[neo4j]'

Install via Homebrew

brew tap codellm-devkit/tap
brew install codeanalyzer-clang

Build from source

git clone https://github.com/codellm-devkit/codeanalyzer-clang.git
cd codeanalyzer-clang
python -m venv .venv && . .venv/bin/activate
pip install -e ".[dev,neo4j]"
canclang --version

Usage

# symbol table only (level 1, default), JSON to a temp dir
canclang -i /path/to/project -o /tmp/out -a 1

# symbol table + resolver call graph (level 2)
canclang -i /path/to/project -o /tmp/out -a 2 -c ~/.cldk/clang-cache

# single-file incremental analysis, compact JSON to stdout
canclang -i /path/to/project --file-name src/main.cpp

# with an accurate compilation database
canclang -i /path/to/project --compile-commands build -a 2 -o /tmp/out

# project into a Neo4j graph snapshot
canclang -i /path/to/project -a 2 --emit neo4j -o /tmp/out    # writes graph.cypher

Options

Usage: canclang [OPTIONS]

  Static analysis for C and C++ using LLVM/Clang (libclang).

Options:
  --version                       Show the canclang version and exit.
  -i, --input PATH                Path to the C/C++ project root (not required for --emit schema).
  -o, --output PATH               Output directory for artifacts. Omit to print compact JSON to stdout.
  -f, --format [json|msgpack]     Output format for --emit json (default: json).
  --emit [json|neo4j|schema]      Output target (default: json).
  -a, --analysis-level INT 1..2   1 = symbol table only; 2 = + libclang resolver call graph.
  --svf / --no-svf                Add the heavy level-2 SVF points-to call graph (stubbed).
  -t, --target-files PATH         Restrict analysis to these files (relative to --input).
  --file-name PATH                Analyze only this single file (relative to --input).
  --skip-tests / --include-tests  Skip test trees (default: skip).
  --compile-commands PATH         Directory containing compile_commands.json (auto-detected).
  --std TEXT                      Override the C/C++ standard (e.g. c11, c++20).
  --eager / --lazy                Force a clean rebuild vs reuse cache (default: lazy).
  -c, --cache-dir PATH            Directory for the analysis cache.
  --clear-cache / --keep-cache    Clear cache after analysis (default: keep).
  --app-name TEXT                 :ClangApplication anchor name (default: input dir name).
  --neo4j-uri TEXT                Live Bolt push target; omit to write graph.cypher. [env: NEO4J_URI]
  --neo4j-user TEXT               Neo4j username. [env: NEO4J_USERNAME]
  --neo4j-password TEXT           Neo4j password. [env: NEO4J_PASSWORD]
  --neo4j-database TEXT           Neo4j database name. [env: NEO4J_DATABASE]
  -v                              Increase verbosity: -v, -vv.
  --help                          Show this message and exit.

Examples

# emit the static Neo4j schema contract (no project needed)
canclang --emit schema -o /tmp/out          # writes schema.json

# push into a live Neo4j incrementally
NEO4J_URI=bolt://localhost:7687 NEO4J_PASSWORD=secret \
  canclang -i /path/to/project -a 2 --emit neo4j

Analysis levels

  • Level 1 (-a 1, default): the symbol table only. Call sites are recorded on each callable with callee_signature == null, and call_graph is [].
  • Level 2 (-a 2): adds the resolver-based call graph. The same Clang AST resolves each call site, backfills callee_signature in place, and emits identity-only edges. This is still cheap, because the resolver is already loaded.
  • --svf: the heavy, framework-based backend (LLVM-IR + SVF points-to). Stubbed in this build: the seams (semantic_analysis/svf/) exist and the flag is wired, but no extra edges are produced yet. The resolver edges produced by -a 2 are unaffected.
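To make the step from level 1 to level 2 concrete, here is a toy version of the backfill pass. Everything about the data layout is hypothetical: the call_sites list, the resolve callback, and the edge dictionaries are stand-ins chosen for this example. Only the field names callee_signature and provenance, and the source/target edge endpoints, come from this README.

```python
def backfill_and_link(callables, resolve):
    """Fill callee_signature on each call site and emit identity-only edges.

    callables: {signature: {"call_sites": [{"callee_signature": None, ...}]}}
    resolve:   callback (caller_signature, site) -> callee signature or None
    """
    edges, seen = [], set()
    for caller, body in callables.items():
        for site in body["call_sites"]:
            target = resolve(caller, site)
            # Keep only endpoints that exist in the symbol table.
            if target is None or target not in callables:
                continue  # unresolved: callee_signature stays None
            site["callee_signature"] = target  # backfilled in place
            if (caller, target) not in seen:
                seen.add((caller, target))
                edges.append(
                    {"source": caller, "target": target, "provenance": ["clang"]}
                )
    return edges
```

At level 1 this pass would simply not run, which matches the documented state: every callee_signature is null and call_graph is empty.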

Output targets

  • analysis.json (default) — the canonical symbol table + call graph. Written to -o, or printed as compact JSON to stdout when -o is omitted (the SDK facade contract).
  • Neo4j graph (--emit neo4j) — a self-contained graph.cypher snapshot, or a live Bolt push with --neo4j-uri. An alternative projection of the same in-memory IR, not an ingestion of the JSON.
  • Schema contract (--emit schema) — the machine-readable, version-stamped Neo4j schema (schema.json).
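For callers that prefer the CLI over the in-process API, the stdout behavior above can be consumed from Python roughly like this. The wrapper name and its runner parameter are inventions for this sketch (the parameter exists only so the function can be exercised without canclang installed). The flags -i and -a are the ones documented under Options.

```python
import json
import subprocess


def analyze_to_dict(project, level=2, runner=subprocess.run):
    """Run canclang without -o and parse the compact JSON it prints."""
    argv = ["canclang", "-i", str(project), "-a", str(level)]
    done = runner(argv, capture_output=True, text=True, check=True)
    return json.loads(done.stdout)
```

A call such as analyze_to_dict("/path/to/project") should return a dictionary with symbol_table and call_graph keys, assuming nothing else is written to stdout.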

Output schema

The output validates against the CLDK ClangApplication contract: { symbol_table: { <relative file path>: ClangModule }, call_graph: [ClangCallEdge], ... }, with identity-only edges whose source/target byte-match a real ClangCallable.signature. Signatures are human-readable, fully-qualified, and overload-disambiguated (app::Point::add(int)); one signature_of() canonicalizer produces every id. See .claude/SCHEMA_DECISIONS.md for the field-by-field contract and the C/C++-specific extensions.
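The identity-only property lends itself to a quick sanity check on any analysis.json. The sketch below assumes only what is stated above: edges carry source and target, and callables carry a signature field. Because the nesting of ClangModule is not described here, it collects every signature key it finds anywhere in the symbol table, which may be more permissive than the real contract. Both function names are hypothetical.

```python
def collect_signatures(node, found=None):
    """Recursively gather every string stored under a 'signature' key."""
    found = set() if found is None else found
    if isinstance(node, dict):
        sig = node.get("signature")
        if isinstance(sig, str):
            found.add(sig)
        for value in node.values():
            collect_signatures(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_signatures(item, found)
    return found


def dangling_edges(app):
    """Return call-graph edges whose endpoints match no known signature."""
    known = collect_signatures(app.get("symbol_table", {}))
    return [
        edge for edge in app.get("call_graph", [])
        if edge["source"] not in known or edge["target"] not in known
    ]
```

An empty result from dangling_edges is what the contract implies. Any entry in the list points at a signature that was not produced by the same canonicalizer as the symbol table.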

SDK integration

The CLDK SDKs bind this analyzer — in the Python SDK via CLDK.clang(project_path=...) (with the legacy CLDK(language="clang").analysis(...) shim), and the other SDKs as they come online. Because this analyzer is a Python package, the Python SDK invokes it in-process (imports Codeanalyzer / AnalysisOptions, calls .analyze(), gets the ClangApplication back with no JSON round-trip). The SDK wiring is done by the cldk-sdk-frontend skill.

Development

pip install -e ".[dev,neo4j]"
pytest                       # symbol-table, call-graph, caching, CLI, and Neo4j-conformance gates
canclang -i testdata/fixture -a 2 -o /tmp/out && cat /tmp/out/analysis.json

The analyzer is a modular package mirroring codeanalyzer-python: a delegating core.analyze(), a node-kind-split syntactic_analysis/symbol_table_builder.py, an isolated semantic_analysis/svf/ level-2 subpackage, a pluggable analysis/ (pass registry) + frameworks/ (entrypoint-finder base) layer, and the neo4j/ projection.

License

Apache-2.0. See LICENSE.
