Graph extraction

Graph extraction is a pipeline stage that turns extracted text into a knowledge graph. It runs after text extraction and stores the graph in a Neo4j backend, so you can experiment with GraphRAG and graph-aware retrieval without changing ingestion or extraction.

Graph extraction is not retrieval. It is a separate stage that produces a graph artifact you can query from a graph-aware retriever or analysis tool.

Where graph extraction sits

Corpus → Extraction pipeline → Extraction snapshot
                                   ↓
                        Graph extraction (pluggable)
                                   ↓
                            Neo4j (namespaced)
                                   ↓
                        Graph-aware retrievers

Why use graph extraction

Use graph extraction when you need:

  • Entity and relationship signals beyond lexical similarity.

  • Graph traversal and community structure for retrieval expansion.

  • Side-by-side comparison of extraction methods.

  • A stable graph layer for GraphRAG experiments.

Core concepts

Graph extractor: A pluggable component that receives extracted text for one item and returns nodes and edges.

Graph snapshot: A versioned record of a graph extraction run tied to a corpus and extraction snapshot.

Graph identifier: A deterministic identifier derived from the extractor ID and configuration, used to namespace graphs in Neo4j.

Deterministic identifiers

Graph extraction favors deterministic identifiers so runs are reproducible:

  • graph_id: {extractor_id}:{config_hash}

  • node_id: {node_type}:{canonical_form}

  • edge_id: {src}|{edge_type}|{dst}

These IDs make it possible to deduplicate and compare graphs across extraction methods.
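For illustration, identifiers like these could be derived as follows. The hashing scheme and helper names are assumptions for the sketch, not Biblicus internals:

```python
import hashlib
import json


def config_hash(config: dict) -> str:
    """Hash a configuration dict deterministically (sorted keys, compact JSON)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:8]


def graph_id(extractor_id: str, config: dict) -> str:
    return f"{extractor_id}:{config_hash(config)}"


def node_id(node_type: str, canonical_form: str) -> str:
    return f"{node_type}:{canonical_form}"


def edge_id(src: str, edge_type: str, dst: str) -> str:
    return f"{src}|{edge_type}|{dst}"
```

Because the hash is computed over key-sorted JSON, reordering keys in a configuration file yields the same graph_id, which is what makes runs comparable.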

Baseline graph extractors

Biblicus includes deterministic baseline extractors for graph experiments:

  • simple-entities: tokenized entity extraction without external NLP dependencies.

  • ner-entities: named entity recognition using an NLP model.

  • dependency-relations: subject–verb–object relation extraction using an NLP model.

Use these baselines to compare graph-aware retrieval approaches with a shared corpus and extraction snapshot.

Graph storage model

Graph data is stored in a single Neo4j instance. Every node and edge is namespaced with properties that identify which corpus and graph they belong to.

Example node properties:

{
  "corpus_id": "research-corpus",
  "graph_id": "simple-entities:abc123",
  "extraction_snapshot_id": "pipeline:abc123",
  "item_id": "doc-42",
  "node_type": "entity",
  "node_id": "entity:person:john_smith",
  "label": "John Smith",
  "properties_json": "{\"canonical\": \"john_smith\"}"
}

Example edge properties:

{
  "corpus_id": "research-corpus",
  "graph_id": "simple-entities:abc123",
  "extraction_snapshot_id": "pipeline:abc123",
  "item_id": "doc-42",
  "edge_type": "mentions",
  "edge_id": "entity:person:john_smith|mentions|item:doc-42",
  "properties_json": "{}"
}

Graph properties are stored as JSON strings in the properties_json field so they remain stable across Neo4j versions. If you need to query nested properties, parse the JSON in your client or load it into a structured view before querying.
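A client-side helper for that parsing step might look like this (the record shape mirrors the node example above; the helper name is illustrative):

```python
import json


def expand_properties(record: dict) -> dict:
    """Return a copy of a node/edge record with properties_json parsed into a dict."""
    expanded = dict(record)
    expanded["properties"] = json.loads(expanded.pop("properties_json", "{}"))
    return expanded


node = {
    "node_id": "entity:person:john_smith",
    "label": "John Smith",
    "properties_json": "{\"canonical\": \"john_smith\"}",
}
expanded = expand_properties(node)  # expanded["properties"]["canonical"] == "john_smith"
```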

Graph extractor interface

Graph extractors follow a per-item API. They validate configuration with Pydantic and return a list of nodes and edges for each catalog item.

Conceptual interface:

from abc import ABC, abstractmethod
from typing import Any, Dict

from pydantic import BaseModel


class GraphExtractor(ABC):
    extractor_id: str

    @abstractmethod
    def validate_config(self, config: Dict[str, Any]) -> BaseModel:
        ...

    @abstractmethod
    def extract_graph(
        self,
        *,
        corpus: Corpus,
        item: CatalogItem,
        extracted_text: ExtractedText,
        config: BaseModel,
    ) -> GraphExtractionResult:
        ...
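As a self-contained sketch of the per-item contract, a toy extractor might look like the following. The dataclasses stand in for Biblicus's ExtractedText and GraphExtractionResult types, and the simplified signature drops the corpus and config arguments; nothing here is the real implementation:

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ExtractedText:
    text: str


@dataclass
class GraphExtractionResult:
    nodes: List[str] = field(default_factory=list)
    edges: List[Tuple[str, str, str]] = field(default_factory=list)


class WordCooccurrenceExtractor:
    """Toy per-item extractor: one node per distinct term, plus two edge types."""

    extractor_id = "toy-cooccurrence"

    def extract_graph(self, *, item_id: str, extracted_text: ExtractedText) -> GraphExtractionResult:
        words = extracted_text.text.lower().split()
        result = GraphExtractionResult()
        result.nodes = sorted({f"term:{w}" for w in words})
        # One mentions edge from the item node to each distinct term.
        result.edges = [(f"item:{item_id}", "mentions", n) for n in result.nodes]
        # Adjacency edges between consecutive terms.
        for a, b in zip(words, words[1:]):
            result.edges.append((f"term:{a}", "cooccurs_with", f"term:{b}"))
        return result
```

The key property of the contract is that each call sees exactly one item's extracted text and returns its nodes and edges; merging and deduplication happen later via the deterministic IDs.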

Running graph extraction

Graph extraction runs against an extraction snapshot. The command below builds a graph using a co-occurrence extractor and stores it in Neo4j.

python -m biblicus graph extract \
  --corpus corpora/example \
  --extractor cooccurrence \
  --extraction-snapshot pipeline:RUN_ID \
  --configuration configurations/graph/cooccurrence.yml

If you omit --extraction-snapshot, Biblicus uses the latest extraction snapshot and emits a reproducibility warning.

Example configurations

Minimal co-occurrence configuration:

schema_version: 1
window_size: 4
min_cooccurrence: 2
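The two settings map onto straightforward windowing logic; a rough sketch of what they control (not the Biblicus implementation):

```python
from collections import Counter


def cooccurrence_edges(tokens, window_size=4, min_cooccurrence=2):
    """Count unordered token pairs appearing within `window_size` positions of
    each other, keeping pairs seen at least `min_cooccurrence` times."""
    counts = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + window_size]:
            if left != right:
                counts[tuple(sorted((left, right)))] += 1
    return {pair: n for pair, n in counts.items() if n >= min_cooccurrence}
```

Raising window_size admits looser associations; raising min_cooccurrence prunes incidental pairs.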

Minimal simple-entities configuration:

schema_version: 1
min_entity_length: 3
max_entity_words: 4
include_item_node: true

Simple entities extractor

The simple-entities extractor emits entity nodes based on capitalized phrases and acronyms. It also emits:

  • mentions edges from item nodes to entities

  • related_to edges for entity co-occurrence within a sentence

It is deterministic and works without external dependencies, so it is a good baseline for GraphRAG experiments.
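A heuristic along these lines can be sketched with a couple of regular expressions. This is an illustrative approximation, not the extractor's actual rules (for instance, it also catches sentence-initial capitalized words):

```python
import re

# Capitalized phrases: one or more Capitalized words in a row (e.g. "John Smith").
CAP_PHRASE = re.compile(r"\b(?:[A-Z][a-z]+)(?:\s+[A-Z][a-z]+)*\b")
# Acronyms: two or more consecutive uppercase letters (e.g. "NASA").
ACRONYM = re.compile(r"\b[A-Z]{2,}\b")


def simple_entities(text, min_entity_length=3, max_entity_words=4):
    entities = set()
    for match in CAP_PHRASE.finditer(text):
        phrase = match.group()
        if len(phrase) >= min_entity_length and len(phrase.split()) <= max_entity_words:
            entities.add(phrase)
    entities.update(m.group() for m in ACRONYM.finditer(text))
    return sorted(entities)
```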

Example command:

python -m biblicus graph extract \
  --corpus corpora/example \
  --extractor simple-entities \
  --extraction-snapshot pipeline:RUN_ID \
  --configuration configurations/graph/simple-entities.yml

NER entities extractor

The ner-entities extractor uses a named entity recognition model to emit entity nodes. It emits:

  • mentions edges from item nodes to entities

  • typed entity nodes via the entity_type property

This extractor is deterministic for a fixed model and configuration and provides a stronger baseline than the simple-entities heuristic.
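Conceptually, each recognized span becomes a typed node plus a mentions edge. A sketch of that mapping from (text, label) entity spans, as a spaCy NER pipeline would produce them, into namespaced records (helper and field names are illustrative):

```python
import re


def canonicalize(text: str) -> str:
    """Lowercase and underscore-join an entity surface form."""
    return re.sub(r"\s+", "_", text.strip().lower())


def spans_to_nodes(spans, item_id, min_entity_length=3):
    """Turn (text, label) entity spans into typed nodes plus mentions edges."""
    nodes, edges = [], []
    for text, label in spans:
        if len(text) < min_entity_length:
            continue
        node_id = f"entity:{label.lower()}:{canonicalize(text)}"
        nodes.append({"node_id": node_id, "label": text, "entity_type": label})
        edges.append(f"item:{item_id}|mentions|{node_id}")
    return nodes, edges
```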

Install spaCy and the model referenced in your configuration before running:

python -m pip install spacy
python -m spacy download en_core_web_sm

Example command:

python -m biblicus graph extract \
  --corpus corpora/example \
  --extractor ner-entities \
  --extraction-snapshot pipeline:RUN_ID \
  --configuration configurations/graph/ner-entities.yml

Minimal configuration:

schema_version: 1
model: en_core_web_sm
min_entity_length: 3
include_item_node: true

Dependency relations extractor

The dependency-relations extractor builds edges from dependency parses (subject–verb–object and similar patterns). It emits:

  • mentions edges from item nodes to entities

  • related_to edges between entities with a predicate property for the verb lemma

This extractor provides relation-centric baselines that are still deterministic and non-LLM.
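The core subject–verb–object pattern can be sketched over a pre-parsed token list, where each token carries a dependency label and a head index, as a spaCy parse would provide. This is a simplified illustration, not the extractor itself:

```python
def svo_relations(tokens):
    """Extract (subject, verb_lemma, object) triples from dependency-parsed tokens.

    Each token is a dict with 'lemma', 'dep' (dependency label), and 'head'
    (index of the governing token).
    """
    triples = []
    for tok in tokens:
        if tok["dep"] != "nsubj":
            continue
        verb = tokens[tok["head"]]
        # Pair the subject with any direct object governed by the same verb.
        for other in tokens:
            if other["dep"] == "dobj" and other["head"] == tok["head"]:
                triples.append((tok["lemma"], verb["lemma"], other["lemma"]))
    return triples
```

In the real extractor, the verb lemma would become the predicate property on a related_to edge between the two entity nodes.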

Install spaCy and the model referenced in your configuration before running:

python -m pip install spacy
python -m spacy download en_core_web_sm

Example command:

python -m biblicus graph extract \
  --corpus corpora/example \
  --extractor dependency-relations \
  --extraction-snapshot pipeline:RUN_ID \
  --configuration configurations/graph/dependency-relations.yml

Minimal configuration:

schema_version: 1
model: en_core_web_sm
min_entity_length: 3
include_item_node: true

Querying a logical graph

Neo4j queries are scoped by corpus_id and graph_id so you can store multiple graphs side by side:

MATCH (n {corpus_id: $corpus, graph_id: $graph})-[r]->(m)
WHERE n.node_type = 'entity'
RETURN n, r, m

Graph-aware retrieval

Graph-aware retrievers implement the existing retriever interface. They can:

  1. Extract entities from the query.

  2. Match those entities to graph nodes.

  3. Traverse the graph for expansion.

  4. Score items by graph proximity.

  5. Return evidence with graph stage scores.

This lets you combine graph signals with lexical or embedding retrievers inside a hybrid configuration.
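The traversal-and-scoring steps can be sketched end-to-end on a toy in-memory graph. The adjacency structure and the 1/(1 + hops) scoring rule are illustrative stand-ins for Neo4j traversal, not the retriever's actual scoring:

```python
from collections import deque


def proximity_scores(graph, seeds, max_hops=2):
    """Score nodes by graph proximity: 1 / (1 + hops) from the nearest seed.

    `graph` maps node IDs to lists of neighbor IDs; `seeds` are node IDs
    matched from the query.
    """
    scores = {}
    queue = deque((seed, 0) for seed in seeds if seed in graph)
    seen = {node for node, _ in queue}
    while queue:
        node, hops = queue.popleft()
        scores[node] = max(scores.get(node, 0.0), 1.0 / (1 + hops))
        if hops < max_hops:
            for neighbor in graph.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append((neighbor, hops + 1))
    return scores
```

Item nodes in the result can then be ranked by score and returned as evidence with a graph stage score attached.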

Local Neo4j setup

For local development, Biblicus can auto-start a local Neo4j container when graph extraction runs. The container is started only if it is not already running. You can also run Neo4j manually via Docker:

python -m pip install neo4j
docker run --rm --name biblicus-neo4j \
  -p 7474:7474 -p 7687:7687 \
  -e NEO4J_AUTH=neo4j/testpassword \
  neo4j:5

Set Neo4j connection details in user configuration before running graph extraction.

Repeatable integration script

Use the integration script to download a small Wikipedia corpus, run extraction, and build a Neo4j graph snapshot. The script logs each phase so you can narrate the workflow in a demo or integration run.

python scripts/graph_extraction_integration.py \
  --corpus corpora/wiki_graph_demo \
  --force \
  --verify \
  --report-path reports/graph_extraction_story.md

Narrative demo script

Use the narrative demo script when you want to show inputs and outputs for a specific extractor. It prints input text previews, sampled entities or terms, and example edges.

python scripts/graph_extraction_extractor_demo.py \
  --corpus corpora/wiki_graph_demo \
  --force \
  --extractor simple-entities

Narrative demo script (all extractors)

Run all extractors in sequence and write per-extractor reports:

python scripts/graph_extraction_demo_all.py \
  --corpus corpora/wiki_graph_demo \
  --force \
  --limit 5 \
  --report-dir reports

Reproducibility checklist

  • Record the extraction snapshot reference used to build the graph.

  • Record the graph extractor ID and configuration hash.

  • Keep graph queries scoped by corpus_id and graph_id.

Next steps

  • Add additional extractors for entity and relation extraction.

  • Build a graph-aware retriever and compare it against lexical baselines.

  • Evaluate graph extraction approaches using shared datasets.