AI · Knowledge Graphs · LLM · Developer Tools

Graphify — Turn Any Codebase into a Knowledge Graph

Inspired by Andrej Karpathy’s LLM knowledge base workflow, graphify builds knowledge graphs from code, docs, papers, and images — giving AI assistants structural understanding instead of brute-force search. 71.5x token reduction on real-world corpora.

2026-04-06

The Problem: Raw Data Doesn't Scale

Andrej Karpathy recently described his workflow for building personal knowledge bases with LLMs: drop papers, tweets, screenshots, and notes into a /raw folder, then use an LLM to “compile” a wiki of interlinked markdown files. The LLM writes and maintains the wiki. You query it, file the answers back, and the knowledge base compounds over time.

It's a powerful idea, but it's held together by scripts. There's no structural understanding of what connects to what, no confidence scores on inferred relationships, and no way to ask “what path connects concept A to concept B?” without the LLM re-reading everything.

graphify is an open-source tool that takes this idea and gives it structure. Instead of a flat wiki, it builds a knowledge graph — with typed nodes, weighted edges, community detection, and confidence scoring — from any mix of code, docs, papers, and images.

How It Works: Two-Pass Extraction

graphify runs in two passes over your corpus:

Pass 1: Deterministic AST (Code Files)

For .py, .ts, .go, .rs, .java, and 8 more languages, tree-sitter parses the AST to extract classes, functions, imports, call graphs, docstrings, and rationale comments (# WHY:, # HACK:, etc.). No LLM needed — no tokens spent, no file contents leave your machine.
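
To illustrate what Pass 1 collects, here is a self-contained sketch. graphify itself uses tree-sitter across 13 languages; the example below uses Python's stdlib ast module instead, so it runs with no dependencies, but the extraction targets (classes, functions, imports, calls, docstrings, rationale comments) are the same:

```python
import ast
import re

SOURCE = '''
# WHY: retries smooth over flaky network calls
import time

class Fetcher:
    """Downloads resources with retry."""
    def fetch(self, url):
        return self._retry(url)

    def _retry(self, url):
        time.sleep(0)
        return url
'''

def extract_nodes(source):
    """Collect classes, functions, imports, calls, and rationale comments."""
    tree = ast.parse(source)
    graph = {"classes": [], "functions": [], "imports": [], "calls": []}
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            graph["classes"].append((node.name, ast.get_docstring(node)))
        elif isinstance(node, ast.FunctionDef):
            graph["functions"].append(node.name)
        elif isinstance(node, ast.Import):
            graph["imports"].extend(alias.name for alias in node.names)
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            graph["calls"].append(node.func.attr)
    # Rationale comments (# WHY:, # HACK:) live outside the AST, so scan the raw text.
    graph["rationale"] = re.findall(r"#\s*(WHY|HACK):\s*(.+)", source)
    return graph

g = extract_nodes(SOURCE)
```

Everything here is deterministic string and tree processing, which is why this pass costs zero tokens.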

Pass 2: Semantic Extraction (Docs, Papers, Images)

Claude subagents run in parallel over markdown, PDFs, and images (including screenshots, diagrams, and whiteboard photos in any language) to extract concepts, relationships, and design rationale. The results merge with the AST graph into a unified NetworkX structure.

The merged graph is then clustered using Leiden community detection — a graph-topology algorithm that finds communities by edge density. No embeddings, no vector database. The graph structure is the similarity signal.
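
graphify uses Leiden, but NetworkX ships the closely related Louvain algorithm, which is enough to sketch the idea: communities fall out of edge density alone, no embeddings required. The node names below are made up for illustration:

```python
import networkx as nx

G = nx.Graph()
# Merged graph: AST edges (code) plus semantic edges (docs/papers)
G.add_edges_from([
    ("Fetcher", "retry"), ("retry", "backoff"),
    ("attention.md", "SwinTransformer"), ("SwinTransformer", "window_shift"),
    ("retry", "attention.md"),  # a cross-corpus semantic edge
])

# Partition nodes into communities by edge density (seeded for determinism)
communities = nx.community.louvain_communities(G, seed=42)
```

Each community becomes a color in the interactive graph and a cluster in the report.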

What You Get

graphify-out/
├── graph.html       # interactive graph — click, search, filter by community
├── GRAPH_REPORT.md  # god nodes, surprising connections, suggested questions
├── graph.json       # persistent graph — query weeks later without re-reading
└── cache/           # SHA256 cache — re-runs only process changed files

Interactive graph output from graphify — communities are color-coded, node size reflects connectivity

God nodes — highest-degree concepts that everything connects through. The architectural backbone of your codebase or research corpus.
Surprising connections — ranked by composite score. Code-paper edges rank higher than code-code. Each includes a plain-English “why.”
Confidence scores — every edge is tagged EXTRACTED (found in source, always 1.0), INFERRED (with a 0.0–1.0 confidence), or AMBIGUOUS (flagged for review). You always know what was found vs. guessed.
Semantic similarity edges — cross-file conceptual links with no structural connection. Two functions solving the same problem without calling each other, or a class in code and a concept in a paper.
The “why” — docstrings, rationale comments, and design reasoning are extracted as dedicated nodes. Not just what the code does — why it was written that way.
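
A minimal sketch of how that provenance tagging might be modeled. The class and field names below are hypothetical, not graphify's actual schema:

```python
from dataclasses import dataclass
from enum import Enum

class Provenance(Enum):
    EXTRACTED = "extracted"   # found verbatim in source; confidence fixed at 1.0
    INFERRED = "inferred"     # proposed during extraction; confidence in [0.0, 1.0]
    AMBIGUOUS = "ambiguous"   # flagged for human review

@dataclass
class Edge:
    src: str
    dst: str
    provenance: Provenance
    confidence: float = 1.0

    def __post_init__(self):
        # Enforce the invariant: EXTRACTED edges are always certain.
        if self.provenance is Provenance.EXTRACTED and self.confidence != 1.0:
            raise ValueError("EXTRACTED edges always have confidence 1.0")

edge = Edge("DigestAuth", "Response", Provenance.INFERRED, confidence=0.8)
```

Keeping provenance on every edge is what lets a report separate "found" from "guessed."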

Key Features

Fully Multimodal

Drop in code, PDFs, markdown, screenshots, diagrams, whiteboard photos, and images with text in any language — graphify uses Claude vision to extract concepts from all of it and connects them into one graph. This is what makes it a true knowledge compiler, not just a code indexer.

71.5x Token Reduction

On a mixed corpus (Karpathy repos + papers + images, 52 files), querying the graph uses 71.5x fewer tokens than reading raw files. The first run extracts and builds the graph (this costs tokens). Every subsequent query reads the compact graph — that's where the savings compound. Token reduction scales with corpus size.

Always-On Assistant Integration

After building a graph, a single command installs a hook that makes your AI assistant read the graph report before answering architecture questions:

graphify claude install   # Claude Code: CLAUDE.md + PreToolUse hook
graphify codex install    # Codex: AGENTS.md
graphify opencode install # OpenCode: AGENTS.md

Think of it this way: the always-on hook gives your assistant a map. The /graphify query commands let it navigate the map precisely:

/graphify query "what connects attention to the optimizer?"
/graphify query "what connects attention to the optimizer?" --dfs   # trace a specific path
/graphify query "what connects attention to the optimizer?" --budget 1500  # cap at N tokens
/graphify path "DigestAuth" "Response"
/graphify explain "SwinTransformer"
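
Under the hood, a path query can be as simple as a breadth-first search over the persisted graph. A sketch assuming a hypothetical edge-list schema for graph.json (the real file format may differ):

```python
import json
import networkx as nx

def query_path(graph_json, src, dst):
    """Find the chain of concepts linking two nodes in a saved graph."""
    data = json.loads(graph_json)
    G = nx.Graph()
    G.add_edges_from((e["src"], e["dst"]) for e in data["edges"])
    # Unweighted shortest_path is a BFS: the fewest-hop explanation wins.
    return nx.shortest_path(G, src, dst)

GRAPH = json.dumps({"edges": [
    {"src": "DigestAuth", "dst": "Auth"},
    {"src": "Auth", "dst": "Client"},
    {"src": "Client", "dst": "Response"},
]})
```

Because the graph is already on disk, answering this costs graph-sized tokens, not corpus-sized tokens.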

Incremental Updates & Auto-Sync

A SHA256 cache means re-runs only process changed files. The --watch flag auto-syncs the graph as your codebase changes (code saves trigger instant rebuild via AST, doc changes notify for LLM re-pass). Git hooks rebuild the graph on every commit and branch switch.
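
The incremental cache boils down to comparing content hashes. A minimal sketch of the idea — the cache layout and helper names here are assumptions, not graphify's actual implementation:

```python
import hashlib
import json
from pathlib import Path

def file_digest(path):
    """SHA256 of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def changed_files(paths, cache_file):
    """Return only the files whose digest differs from the cached one."""
    cache_path = Path(cache_file)
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    changed = [p for p in paths if cache.get(str(p)) != file_digest(p)]
    # Record the new digests so the next run skips unchanged files.
    cache.update({str(p): file_digest(p) for p in changed})
    cache_path.write_text(json.dumps(cache))
    return changed
```

A second run over an unchanged corpus returns an empty list, so no re-extraction happens.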

Flexible Export

/graphify ./raw --obsidian    # generate Obsidian vault
/graphify ./raw --wiki        # agent-crawlable wiki
/graphify ./raw --svg         # export graph.svg
/graphify ./raw --graphml     # Gephi, yEd
/graphify ./raw --neo4j       # generate Cypher for Neo4j
/graphify ./raw --neo4j-push bolt://localhost:7687
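
Generating Cypher for the --neo4j export can be as simple as emitting a MERGE statement per edge. A hypothetical sketch (the node label, relationship handling, and quote escaping are simplified for illustration):

```python
def to_cypher(edges):
    """Emit idempotent MERGE statements Neo4j can ingest."""
    lines = []
    for src, dst, rel in edges:
        src, dst = src.replace("'", "\\'"), dst.replace("'", "\\'")
        lines.append(
            f"MERGE (a:Concept {{name: '{src}'}}) "
            f"MERGE (b:Concept {{name: '{dst}'}}) "
            f"MERGE (a)-[:{rel}]->(b);"
        )
    return "\n".join(lines)

script = to_cypher([("Fetcher", "retry", "CALLS")])
```

MERGE (rather than CREATE) keeps re-exports idempotent: pushing the same graph twice doesn't duplicate nodes.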

Benchmark Results

Corpus                                  Files   Token Reduction
Karpathy repos + 5 papers + 4 images    52      71.5x
graphify source + Transformer paper     4       5.4x
httpx (synthetic Python library)        6       ~1x

At 6 files the corpus fits in a context window anyway, so graph value is structural clarity, not compression. At 52 files (code + papers + images), you get 71x+ reduction.

Where It Helps

Codebase onboarding — understand architecture, god nodes, and cross-cutting concerns before writing a single line.
Research knowledge management — Karpathy's workflow, productized. Drop papers, tweets, screenshots, and notes into a folder; get a navigable, queryable knowledge graph.
Architecture auditing — find god nodes (single points of coupling), surprising connections between modules, and design rationale buried in comments.
AI-assisted development — give your coding assistant structural understanding instead of brute-force grep. Answers are faster, cheaper, and more accurate.
Multi-format analysis — connect code to the papers that describe its algorithms, to the diagrams that visualize its architecture, to the tickets that motivated its design.

Getting Started

pip install graphify && graphify install

Then open your AI coding assistant and type:

/graphify .

Note

Requires Python 3.10+ and one of: Claude Code, Codex, OpenCode, or OpenClaw. Code files are processed locally via tree-sitter — no file contents leave your machine. Only docs, papers, and images are sent to your platform's model API for semantic extraction.

The project is open-source and available on GitHub. It runs entirely locally (NetworkX + Leiden + tree-sitter + vis.js), with no Neo4j or server required.

Building AI-powered developer tools?

We integrate LLMs, knowledge graphs, and intelligent automation into developer workflows. Let’s talk about your use case.

