Application Programming Interface Reference

Core

Corpus storage and ingestion for Biblicus.

class biblicus.corpus.Corpus(root)[source]

Local corpus manager for Biblicus.

Variables:

root (Path) – Corpus root directory.
meta_dir (Path) – Metadata directory under the corpus root.
raw_dir (Path) – Raw item directory under the corpus root.
config (CorpusConfig or None) – Parsed corpus config, if present.

Parameters:

root (Path)

property analysis_dir: Path

Location of analysis artifacts for the corpus.

Returns: Analysis artifacts directory.
Return type: Path

analysis_run_dir(*, analysis_id, snapshot_id)[source]

Resolve an analysis snapshot directory.

Parameters:

analysis_id (str) – Analysis backend identifier.
snapshot_id (str) – Analysis snapshot identifier.

Returns:

Analysis snapshot directory.

Return type:

Path

property analysis_runs_dir: Path

Location of analysis snapshot artifacts.

Returns: Path to the analysis snapshots directory.
Return type: Path

catalog_generated_at()[source]

Return the catalog generation timestamp.

Returns: International Organization for Standardization 8601 timestamp.
Return type: str
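
Because the timestamp is returned as a string, callers who want to compare it against a point in time need to parse it first. The following is a minimal sketch using only the standard library; parse_iso8601 and is_stale are hypothetical helper names, not part of Biblicus, and the sketch assumes the timestamp carries a UTC offset or a trailing "Z".

```python
from datetime import datetime, timezone


def parse_iso8601(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    # datetime.fromisoformat() only accepts 'Z' from Python 3.11 onward,
    # so normalize it to an explicit offset first.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def is_stale(catalog_generated_at: str, reference: datetime) -> bool:
    """Return True when the catalog was generated before the reference time.

    Assumption: both values are timezone-aware; comparing an aware and a
    naive datetime raises TypeError.
    """
    return parse_iso8601(catalog_generated_at) < reference
```

The string passed to is_stale would typically be the value returned by catalog_generated_at().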

property catalog_path: Path

Return the path to the corpus catalog file.

Returns: Catalog file path.
Return type: Path

create_crawl_id()[source]

Create a new crawl identifier.

Returns: Crawl identifier.
Return type: str

delete_extraction_snapshot(*, extractor_id, snapshot_id)[source]

Delete an extraction snapshot directory and its derived artifacts.

Parameters:

extractor_id (str) – Extractor plugin identifier.
snapshot_id (str) – Extraction snapshot identifier.

Returns:

None.

Return type:

None

Raises:

FileNotFoundError – If the extraction snapshot directory does not exist.

property extracted_dir: Path

Location of extraction artifacts for the corpus.

Returns: Extracted artifacts directory.
Return type: Path

extraction_snapshot_dir(*, extractor_id, snapshot_id)[source]

Resolve an extraction snapshot directory.

Parameters:

extractor_id (str) – Extractor plugin identifier.
snapshot_id (str) – Extraction snapshot identifier.

Returns:

Extraction snapshot directory.

Return type:

Path

property extraction_snapshots_dir: Path

Location of extraction snapshot artifacts.

Returns: Path to the extraction snapshots directory.
Return type: Path

classmethod find(start)[source]

Locate a corpus by searching upward from a path.

Parameters: start (Path) – Starting path to search.
Returns: Located corpus instance.
Return type: Corpus
Raises: FileNotFoundError – If no corpus config is found.
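
The upward search can be pictured with a short standard-library sketch. This is not the Biblicus implementation: the marker path below is a hypothetical placeholder, because this reference does not state where the corpus config file lives, and find_corpus_root is an illustrative name.

```python
from pathlib import Path

# Hypothetical marker; the real config location is not stated in this reference.
MARKER = Path(".biblicus") / "config.json"


def find_corpus_root(start: Path, marker: Path = MARKER) -> Path:
    """Walk from start toward the filesystem root until marker is found."""
    current = start.resolve()
    # Check the starting directory first, then each ancestor in turn.
    for candidate in (current, *current.parents):
        if (candidate / marker).is_file():
            return candidate
    raise FileNotFoundError(f"No corpus config found at or above {start}")
```

With the real API, the equivalent call would be Corpus.find(start), which returns a Corpus instance rather than a bare path.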

get_item(item_id)[source]

Fetch a catalog item by identifier.

Parameters: item_id (str) – Item identifier.
Returns: Catalog item.
Return type: CatalogItem
Raises: KeyError – If the item identifier is unknown.

property graph_dir: Path

Location of graph artifacts for the corpus.

Returns: Graph artifacts directory.
Return type: Path

graph_snapshot_dir(*, extractor_id, snapshot_id)[source]

Resolve a graph snapshot directory.

Parameters:

extractor_id (str) – Graph extractor identifier.
snapshot_id (str) – Graph snapshot identifier.

Returns:

Graph snapshot directory.

Return type:

Path

property graph_snapshots_dir: Path

Location of graph snapshot artifacts.

Returns: Path to the graph snapshots directory.
Return type: Path

has_items()[source]

Return whether the corpus catalog contains any items.

Returns: True when the catalog has at least one item.
Return type: bool

import_tree(source_root, *, tags=())[source]

Import a folder tree into the corpus, preserving relative paths and provenance.

Imported content must already live under the corpus root. The import registers files in place and writes sidecars when needed.

Parameters:

source_root (Path) – Root directory of the folder tree to import.
tags (Sequence[str]) – Tags to associate with imported items.

Returns:

Import statistics.

Return type:

dict[str, int]

Raises:

FileNotFoundError – If the source_root does not exist.
ValueError – If the source root is outside the corpus root.
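
The two error conditions above amount to an existence check followed by a containment check. A minimal sketch of that kind of check, using only pathlib, might look as follows; relative_to_corpus is a hypothetical helper and the real method may validate differently.

```python
from pathlib import Path


def relative_to_corpus(corpus_root: Path, source_root: Path) -> Path:
    """Return source_root relative to corpus_root, or raise if it is unusable."""
    root = corpus_root.resolve()
    source = source_root.resolve()
    if not source.exists():
        raise FileNotFoundError(f"{source} does not exist")
    try:
        # relative_to() raises ValueError when source is not under root.
        return source.relative_to(root)
    except ValueError:
        raise ValueError(f"{source} is outside the corpus root {root}") from None
```

Resolving both paths first means that symbolic links and ".." segments cannot make an outside folder look like it is inside the corpus.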

ingest_crawled_payload(*, crawl_id, relative_path, data, filename, media_type, source_uri, tags)[source]

Ingest a crawled payload under a crawl import namespace.

Parameters:

crawl_id (str) – Crawl identifier used to group crawled artifacts.
relative_path (str) – Relative path within the crawl prefix.
data (bytes) – Raw payload bytes.
filename (str) – Suggested filename from the payload metadata.
media_type (str) – Internet Assigned Numbers Authority media type.
source_uri (str) – Source uniform resource identifier (typically an http or https uniform resource locator).
tags (Sequence[str]) – Tags to attach to the stored item.

Returns:

None.

Return type:

None

ingest_item(data, *, filename=None, media_type='application/octet-stream', title=None, tags=(), metadata=None, source_uri='unknown', storage_subdir='imports')[source]

Ingest a single raw item into the corpus.

This is the modality-neutral primitive: callers provide bytes and a media type. Higher-level conveniences (ingest_note, ingest_source, and related methods) build on top.

Parameters:

data (bytes) – Raw item bytes.
filename (str or None) – Optional filename for the stored item.
media_type (str) – Internet Assigned Numbers Authority media type for the item.
title (str or None) – Optional title metadata.
tags (Sequence[str]) – Tags to associate with the item.
metadata (dict[str, Any] or None) – Optional metadata mapping.
source_uri (str) – Source uniform resource identifier for provenance.
storage_subdir (str or None) – Optional subdirectory under the raw root.

Returns:

Ingestion result summary.

Return type:

IngestResult

Raises:

ValueError – If markdown is not Unicode Transformation Format 8.
IngestCollisionError – If a source uniform resource identifier is already ingested.

ingest_item_stream(stream, *, filename=None, media_type='application/octet-stream', tags=(), metadata=None, source_uri='unknown', storage_subdir='imports')[source]

Ingest a binary item from a readable stream.

This method is intended for large non-markdown items. It writes bytes to disk incrementally while computing a checksum.

Parameters:

stream (object) – Readable binary stream.
filename (str or None) – Optional filename for the stored item.
media_type (str) – Internet Assigned Numbers Authority media type for the item.
tags (Sequence[str]) – Tags to associate with the item.
metadata (dict[str, Any] or None) – Optional metadata mapping.
source_uri (str) – Source uniform resource identifier for provenance.
storage_subdir (str or None) – Optional subdirectory under the raw root.

Returns:

Ingestion result summary.

Return type:

IngestResult

Raises:

ValueError – If the media_type is text/markdown.
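
Writing bytes incrementally while computing a checksum is a common pattern for large payloads. The sketch below shows one way to do it with the standard library; copy_stream_with_sha256 and its chunk size are illustrative assumptions, not the code Biblicus uses.

```python
import hashlib
from pathlib import Path
from typing import BinaryIO, Tuple


def copy_stream_with_sha256(
    stream: BinaryIO, destination: Path, chunk_size: int = 1024 * 1024
) -> Tuple[str, int]:
    """Copy stream to destination in chunks; return (hex digest, byte count)."""
    digest = hashlib.sha256()
    total = 0
    with destination.open("wb") as handle:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:  # An empty read signals end of stream.
                break
            handle.write(chunk)
            digest.update(chunk)  # Hash the same bytes that were written.
            total += len(chunk)
    return digest.hexdigest(), total
```

Only one chunk is held in memory at a time, so the memory cost stays bounded regardless of item size.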

ingest_note(text, *, title=None, tags=(), source_uri=None)[source]

Ingest a text note as Markdown.

Parameters:

text (str) – Note content.
title (str or None) – Optional title metadata.
tags (Sequence[str]) – Tags to associate with the note.
source_uri (str or None) – Optional source uniform resource identifier for provenance.

Returns:

Ingestion result summary.

Return type:

IngestResult

ingest_source(source, *, tags=(), source_uri=None, allow_external=False)[source]

Ingest a file path or uniform resource locator source.

Parameters:

source (str or Path) – File path or uniform resource locator.
tags (Sequence[str]) – Tags to associate with the item.
source_uri (str or None) – Optional override for the source uniform resource identifier.
allow_external (bool) – Whether to ingest files outside the corpus root by copying them into imports.

Returns:

Ingestion result summary.

Return type:

IngestResult

classmethod init(root, *, force=False)[source]

Initialize a new corpus on disk.

Parameters:

root (Path) – Corpus root directory.
force (bool) – Whether to overwrite existing config.

Returns:

Initialized corpus instance.

Return type:

Corpus

Raises:

FileExistsError – If the corpus already exists and force is False.
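
As a usage sketch built only from the signatures documented in this reference, the function below creates a corpus (or opens an existing one) and ingests a note. It has not been checked against a particular Biblicus release; create_corpus_with_note is a hypothetical name, and the import is deferred so the function can be defined without Biblicus installed.

```python
from pathlib import Path


def create_corpus_with_note(root: Path):
    """Initialize (or reopen) a corpus at root and ingest one note."""
    # Deferred import: requires the biblicus package at call time.
    from biblicus.corpus import Corpus

    try:
        corpus = Corpus.init(root)
    except FileExistsError:
        # Documented behavior: init raises when the corpus exists and force is False.
        corpus = Corpus.open(root)

    # ingest_note returns an IngestResult (item_id, relpath, sha256).
    result = corpus.ingest_note("First note.", title="Hello", tags=["example"])
    return corpus, result
```

Falling back to Corpus.open avoids passing force=True, which would overwrite the existing config.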

latest_extraction_snapshot_reference(*, extractor_id=None)[source]

Return the most recent extraction snapshot reference.

Parameters: extractor_id (str or None) – Optional extractor identifier filter.
Returns: Latest extraction snapshot reference or None when no snapshots exist.
Return type: biblicus.models.ExtractionSnapshotReference or None

property latest_snapshot_id: str | None

Latest retrieval snapshot identifier recorded in the catalog.

Returns: Latest snapshot identifier or None.
Return type: str or None

list_extraction_snapshots(*, extractor_id=None)[source]

List extraction snapshots stored under the corpus.

Parameters: extractor_id (str or None) – Optional extractor identifier filter.
Returns: Summary list entries for each snapshot.
Return type: list[biblicus.models.ExtractionSnapshotListEntry]

list_items(*, limit=50)[source]

List items from the catalog.

Parameters: limit (int) – Maximum number of items to return.
Returns: Catalog items ordered by recency.
Return type: list[CatalogItem]

load_catalog()[source]

Load the current corpus catalog.

Returns:

Parsed corpus catalog.

Return type:

CorpusCatalog

Raises:

FileNotFoundError – If the catalog file does not exist.
ValueError – If the catalog schema is invalid.

load_extraction_snapshot_manifest(*, extractor_id, snapshot_id)[source]

Load an extraction snapshot manifest from the corpus.

Parameters:

extractor_id (str) – Extractor plugin identifier.
snapshot_id (str) – Extraction snapshot identifier.

Returns:

Parsed extraction snapshot manifest.

Return type:

biblicus.extraction.ExtractionSnapshotManifest

Raises:

FileNotFoundError – If the manifest file does not exist.
ValueError – If the manifest data is invalid.

load_snapshot(snapshot_id)[source]

Load a retrieval snapshot manifest by identifier.

Parameters: snapshot_id (str) – Snapshot identifier.
Returns: Parsed snapshot manifest.
Return type: RetrievalSnapshot
Raises: FileNotFoundError – If the snapshot manifest does not exist.

property name: str

Return the corpus name (directory basename).

Returns: Corpus name.
Return type: str

classmethod open(ref)[source]

Open a corpus from a path or uniform resource identifier reference.

Parameters: ref (str or Path) – Filesystem path or file:// uniform resource identifier.
Returns: Opened corpus instance.
Return type: Corpus
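
Accepting either a filesystem path or a file:// uniform resource identifier implies a normalization step before the corpus is opened. The following sketch shows one plausible normalization with the standard library; resolve_corpus_ref is a hypothetical helper, and the way Biblicus itself handles other schemes is not covered by this reference.

```python
from pathlib import Path
from typing import Union
from urllib.parse import urlparse
from urllib.request import url2pathname


def resolve_corpus_ref(ref: Union[str, Path]) -> Path:
    """Turn a path or file:// reference into an absolute filesystem path."""
    if isinstance(ref, str) and ref.startswith("file://"):
        # url2pathname() undoes percent-encoding such as %20 for spaces.
        return Path(url2pathname(urlparse(ref).path)).resolve()
    # Anything else is treated as a plain filesystem path (an assumption).
    return Path(ref).resolve()
```

The inverse direction is available from pathlib: Path.as_uri() produces a file:// form for an absolute path.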

pull_source(*, tag_resolver=None)[source]

Mirror a remote source into the corpus.

Parameters: tag_resolver (Callable[[str], List[str]] | None)
Returns: Pull summary.
Return type: RemoteSourcePullResult
Raises: ValueError – If the corpus has no configured remote source.

purge(*, confirm)[source]

Delete all ingested items and derived files, preserving corpus identity/config.

Parameters: confirm (str) – Confirmation string matching the corpus name.
Returns: None.
Return type: None
Raises: ValueError – If the confirmation does not match.

read_extracted_text(*, extractor_id, snapshot_id, item_id)[source]

Read extracted text for an item from an extraction snapshot, when present.

Parameters:

extractor_id (str) – Extractor plugin identifier.
snapshot_id (str) – Extraction snapshot identifier.
item_id (str) – Item identifier.

Returns:

Extracted text or None if the artifact does not exist.

Return type:

str or None

Raises:

OSError – If the file exists but cannot be read.

reindex()[source]

Rebuild/refresh the corpus catalog from the current on-disk corpus contents.

This is the core “mutable corpus with re-indexing” loop: edit raw files or sidecars, then reindex to refresh the derived catalog.

Returns: Reindex statistics.
Return type: dict[str, int]
Raises: ValueError – If a markdown file cannot be decoded as Unicode Transformation Format 8.

property retrieval_dir: Path

Location of retrieval artifacts for the corpus.

Returns: Retrieval artifacts directory.
Return type: Path

property snapshots_dir: Path

Location of retrieval snapshot manifests.

Returns: Path to the snapshots directory.
Return type: Path

property uri: str

Return the canonical uniform resource identifier for the corpus root.

Returns: Corpus uniform resource identifier.
Return type: str

write_snapshot(snapshot)[source]

Persist a retrieval snapshot manifest and update the catalog pointer.

Parameters: snapshot (RetrievalSnapshot) – Snapshot manifest to persist.
Returns: None.
Return type: None

High-level knowledge base workflow for turnkey usage.

class biblicus.knowledge_base.KnowledgeBase(corpus, retriever_id, snapshot, defaults, _temp_dir)[source]

High-level knowledge base wrapper for turnkey workflows.

Variables:

corpus (Corpus) – Corpus instance that stores the ingested items.
retriever_id (str) – Retriever identifier used for retrieval.
snapshot (RetrievalSnapshot) – Retrieval snapshot manifest associated with the knowledge base.
defaults (KnowledgeBaseDefaults) – Default configuration used for this knowledge base.

Parameters:

corpus (Corpus)
retriever_id (str)
snapshot (RetrievalSnapshot)
defaults (KnowledgeBaseDefaults)
_temp_dir (TemporaryDirectory | None)

context_pack(result, *, join_with='\n\n', max_tokens=None)[source]

Build a context pack from a retrieval result.

Parameters:

result (RetrievalResult) – Retrieval result to convert into context.
join_with (str) – Join string for evidence blocks.
max_tokens (int or None) – Optional token budget for the context pack.

Returns:

Context pack text and metadata.

Return type:

ContextPack

corpus: Corpus

defaults: KnowledgeBaseDefaults

classmethod from_folder(folder, *, retriever_id=None, configuration_name=None, query_budget=None, tags=None, corpus_root=None)[source]

Build a knowledge base from a folder of files.

Parameters:

folder (str or Path) – Folder containing source files.
retriever_id (str or None) – Optional retriever identifier override.
configuration_name (str or None) – Optional configuration name override.
query_budget (QueryBudget or None) – Optional query budget override.
tags (Sequence[str] or None) – Optional tags to apply during import.
corpus_root (str or Path or None) – Optional corpus root override. Must contain the source folder.

Returns:

Knowledge base instance.

Return type:

KnowledgeBase

Raises:

FileNotFoundError – If the folder does not exist.
NotADirectoryError – If the folder is not a directory.

query(query_text, *, budget=None)[source]

Query the knowledge base for evidence.

Parameters:

query_text (str) – Query text to execute.
budget (QueryBudget or None) – Optional budget override.

Returns:

Retrieval result containing evidence.

Return type:

RetrievalResult

retriever_id: str

snapshot: RetrievalSnapshot
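
The three methods above are meant to be chained: build the knowledge base, query it, then pack the evidence into context. The sketch below strings them together using only the signatures listed in this reference; it has not been run against a particular Biblicus release, build_context is a hypothetical name, and the budget values are arbitrary.

```python
def build_context(folder: str, question: str):
    """Build a knowledge base from folder and return a context pack for question."""
    # Deferred imports: requires the biblicus package at call time.
    from biblicus.knowledge_base import KnowledgeBase
    from biblicus.models import QueryBudget

    knowledge_base = KnowledgeBase.from_folder(folder, tags=["docs"])

    # Override the default budget for this query only.
    result = knowledge_base.query(question, budget=QueryBudget(max_total_items=5))

    # max_tokens is an optional cap on the size of the returned context pack.
    return knowledge_base.context_pack(result, max_tokens=1000)
```

Omitting retriever_id and configuration_name leaves both at the values supplied by KnowledgeBaseDefaults.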

class biblicus.knowledge_base.KnowledgeBaseDefaults(*, retriever_id='scan', configuration_name='Knowledge base', query_budget=<factory>, tags=<factory>)[source]

Default configuration for a knowledge base workflow.

Variables:

retriever_id (str) – Retriever identifier to use for retrieval.
configuration_name (str) – Human-readable retrieval configuration name.
query_budget (QueryBudget) – Default query budget to apply to retrieval.
tags (list[str]) – Tags to apply when importing the folder.

Parameters:

retriever_id (str)
configuration_name (str)
query_budget (QueryBudget)
tags (List[str])

configuration_name: str

model_config = {'extra': 'forbid'}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

query_budget: QueryBudget

retriever_id: str

tags: List[str]

Pydantic models for Biblicus domain concepts.

class biblicus.models.CatalogItem(*, id, relpath, sha256, bytes, media_type, title=None, tags=<factory>, metadata=<factory>, created_at, source_uri=None)[source]

Catalog entry derived from a raw corpus item.

Variables:

id (str) – Universally unique identifier of the item.
relpath (str) – Relative path to the raw item file.
sha256 (str) – Secure Hash Algorithm 256 digest of the stored bytes.
bytes (int) – Size of the raw item in bytes.
media_type (str) – Internet Assigned Numbers Authority media type for the item.
title (str or None) – Optional human title extracted from metadata.
tags (list[str]) – Tags extracted or supplied for the item.
metadata (dict[str, Any]) – Merged front matter or sidecar metadata.
created_at (str) – International Organization for Standardization 8601 timestamp when the item was first indexed.
source_uri (str or None) – Optional source uniform resource identifier used at ingestion time.

Parameters:

id (str)
relpath (str)
sha256 (str)
bytes (Annotated[int, Ge(ge=0)])
media_type (str)
title (str | None)
tags (List[str])
metadata (Dict[str, Any])
created_at (str)
source_uri (str | None)

bytes: int

created_at: str

id: str

media_type: str

metadata: Dict[str, Any]

model_config = {'extra': 'forbid'}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

relpath: str

sha256: str

source_uri: str | None

tags: List[str]

title: str | None

class biblicus.models.CollectionMembership(*, collection_name, corpus_name)[source]

Collection membership metadata for a corpus.

Variables:

collection_name (str) – Collection name.
corpus_name (str) – Corpus name within the collection.

Parameters:

collection_name (Annotated[str, MinLen(min_length=1)])
corpus_name (Annotated[str, MinLen(min_length=1)])

collection_name: str

corpus_name: str

model_config = {'extra': 'forbid'}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class biblicus.models.ConfigurationManifest(*, configuration_id, retriever_id, name, created_at, configuration=<factory>, description=None)[source]

Reproducible configuration for a retriever.

Variables:

configuration_id (str) – Deterministic configuration identifier.
retriever_id (str) – Retriever identifier for the configuration.
name (str) – Human-readable name for the configuration.
created_at (str) – International Organization for Standardization 8601 timestamp for configuration creation.
configuration (dict[str, Any]) – Retriever-specific configuration values.
description (str or None) – Optional human description.

Parameters:

configuration_id (str)
retriever_id (str)
name (str)
created_at (str)
configuration (Dict[str, Any])
description (str | None)

configuration: Dict[str, Any]

configuration_id: str

created_at: str

description: str | None

model_config = {'extra': 'forbid'}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

name: str

retriever_id: str

class biblicus.models.CorpusCatalog(*, schema_version, generated_at, corpus_uri, raw_dir='.', latest_run_id=None, latest_snapshot_id=None, items=<factory>, order=<factory>)[source]

Snapshot of the derived corpus catalog.

Variables:

schema_version (int) – Version of the catalog schema.
generated_at (str) – International Organization for Standardization 8601 timestamp of catalog generation.
corpus_uri (str) – Canonical uniform resource identifier for the corpus root.
raw_dir (str) – Relative path to the raw items folder.
latest_run_id (str or None) – Latest extraction run identifier, if any.
latest_snapshot_id (str or None) – Latest retrieval snapshot identifier, if any.
items (dict[str, CatalogItem]) – Mapping of item IDs to catalog entries.
order (list[str]) – Display order of item IDs (most recent first).

Parameters:

schema_version (Annotated[int, Ge(ge=1)])
generated_at (str)
corpus_uri (str)
raw_dir (str)
latest_run_id (str | None)
latest_snapshot_id (str | None)
items (Dict[str, CatalogItem])
order (List[str])

corpus_uri: str

generated_at: str

items: Dict[str, CatalogItem]

latest_run_id: str | None

latest_snapshot_id: str | None

model_config = {'extra': 'forbid'}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

order: List[str]

raw_dir: str

schema_version: int

class biblicus.models.CorpusConfig(**data)[source]

Canonical on-disk config for a local Biblicus corpus.

Variables:

schema_version (int) – Version of the corpus config schema.
created_at (str) – International Organization for Standardization 8601 timestamp for corpus creation.
corpus_uri (str) – Canonical uniform resource identifier for the corpus root.
raw_dir (str) – Relative path to the raw items folder.
notes (dict[str, Any] or None) – Optional free-form notes for operators.
hooks (list[HookSpec] or None) – Optional hook specifications for corpus lifecycle events.
collection (CollectionMembership or None) – Optional collection membership metadata.

Parameters:

data (Any)

collection: 'CollectionMembership' | None

corpus_uri: str

created_at: str

hooks: List[HookSpec] | None

model_config = {'extra': 'forbid'}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

notes: Dict[str, Any] | None

raw_dir: str

schema_version: int

source: 'RemoteCorpusSourceConfig' | None

class biblicus.models.Evidence(*, item_id, source_uri=None, media_type, score, rank, text=None, content_ref=None, span_start=None, span_end=None, stage, stage_scores=None, configuration_id, snapshot_id, metadata=<factory>, hash=None)[source]

Structured retrieval evidence returned from a retriever.

Variables:

item_id (str) – Item identifier that produced the evidence.
source_uri (str or None) – Source uniform resource identifier from ingestion metadata.
media_type (str) – Media type for the evidence item.
score (float) – Retrieval score (higher is better).
rank (int) – Rank within the final evidence list (1-based).
text (str or None) – Optional text payload for the evidence.
content_ref (str or None) – Optional reference for non-text content.
span_start (int or None) – Optional start offset in the source text.
span_end (int or None) – Optional end offset in the source text.
stage (str) – Retrieval stage label (for example, scan, full-text search, rerank).
stage_scores (dict[str, float] or None) – Optional per-stage scores for multi-stage retrieval.
configuration_id (str) – Configuration identifier used to create the snapshot.
snapshot_id (str) – Retrieval snapshot identifier.
metadata (dict[str, Any]) – Optional metadata payload from the catalog item.
hash (str or None) – Optional content hash for provenance.

Parameters:

item_id (str)
source_uri (str | None)
media_type (str)
score (float)
rank (Annotated[int, Ge(ge=1)])
text (str | None)
content_ref (str | None)
span_start (int | None)
span_end (int | None)
stage (str)
stage_scores (Dict[str, float] | None)
configuration_id (str)
snapshot_id (str)
metadata (Dict[str, Any])
hash (str | None)

configuration_id: str

content_ref: str | None

hash: str | None

item_id: str

media_type: str

metadata: Dict[str, Any]

model_config = {'extra': 'forbid'}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

rank: int

score: float

snapshot_id: str

source_uri: str | None

span_end: int | None

span_start: int | None

stage: str

stage_scores: Dict[str, float] | None

text: str | None

class biblicus.models.ExtractedText(*, text, producer_extractor_id, source_stage_index=None, confidence=None, metadata=<factory>)[source]

Text payload produced by an extractor plugin.

Variables:

text (str) – Extracted text content.
producer_extractor_id (str) – Extractor identifier that produced this text.
source_stage_index (int or None) – Optional pipeline stage index where this text originated.
confidence (float or None) – Optional confidence score from 0.0 to 1.0.
metadata (dict[str, Any]) – Optional structured metadata for passing data between pipeline stages.

Parameters:

text (str)
producer_extractor_id (Annotated[str, MinLen(min_length=1)])
source_stage_index (Annotated[int | None, Ge(ge=1)])
confidence (Annotated[float | None, Ge(ge=0.0), Le(le=1.0)])
metadata (Dict[str, Any])

confidence: float | None

metadata: Dict[str, Any]

model_config = {'extra': 'forbid'}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

producer_extractor_id: str

source_stage_index: int | None

text: str

class biblicus.models.ExtractionSnapshotListEntry(*, extractor_id, snapshot_id, configuration_id, configuration_name, catalog_generated_at, created_at, stats=<factory>)[source]

Summary entry for an extraction snapshot stored in a corpus.

Variables:

extractor_id (str) – Extractor plugin identifier.
snapshot_id (str) – Extraction snapshot identifier.
configuration_id (str) – Deterministic configuration identifier.
configuration_name (str) – Human-readable configuration name.
catalog_generated_at (str) – Catalog timestamp used for the snapshot.
created_at (str) – International Organization for Standardization 8601 timestamp for snapshot creation.
stats (dict[str, object]) – Snapshot statistics.

Parameters:

extractor_id (Annotated[str, MinLen(min_length=1)])
snapshot_id (Annotated[str, MinLen(min_length=1)])
configuration_id (Annotated[str, MinLen(min_length=1)])
configuration_name (Annotated[str, MinLen(min_length=1)])
catalog_generated_at (Annotated[str, MinLen(min_length=1)])
created_at (Annotated[str, MinLen(min_length=1)])
stats (Dict[str, object])

catalog_generated_at: str

configuration_id: str

configuration_name: str

created_at: str

extractor_id: str

model_config = {'extra': 'forbid'}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

snapshot_id: str

stats: Dict[str, object]

class biblicus.models.ExtractionSnapshotReference(*, extractor_id, snapshot_id)[source]

Reference to an extraction snapshot.

Variables:

extractor_id (str) – Extractor plugin identifier.
snapshot_id (str) – Extraction snapshot identifier.

Parameters:

extractor_id (Annotated[str, MinLen(min_length=1)])
snapshot_id (Annotated[str, MinLen(min_length=1)])

as_string()[source]

Serialize the reference as a single string.

Returns: Reference in the form extractor_id:snapshot_id.
Return type: str
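
The serialized form is compact enough to pass on a command line or store in a config value. The sketch below mirrors the documented extractor_id:snapshot_id format with a plain dataclass and adds a parser for the reverse direction. SnapshotReference and parse_reference are hypothetical stand-ins, not Biblicus classes, and the parser assumes extractor identifiers never contain a colon.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnapshotReference:
    """Stand-in for an extraction snapshot reference (illustrative only)."""

    extractor_id: str
    snapshot_id: str

    def as_string(self) -> str:
        # Documented form: extractor_id:snapshot_id
        return f"{self.extractor_id}:{self.snapshot_id}"


def parse_reference(value: str) -> SnapshotReference:
    """Split a serialized reference on its first colon."""
    extractor_id, separator, snapshot_id = value.partition(":")
    if not separator or not extractor_id or not snapshot_id:
        raise ValueError(f"Expected extractor_id:snapshot_id, got {value!r}")
    return SnapshotReference(extractor_id, snapshot_id)
```

Rejecting empty halves matches the minimum-length constraint of 1 that the model places on both fields.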

extractor_id: str

model_config = {'extra': 'forbid'}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

snapshot_id: str

class biblicus.models.ExtractionStageOutput(*, stage_index, extractor_id, status, text=None, text_characters=0, producer_extractor_id=None, source_stage_index=None, confidence=None, metadata=<factory>, error_type=None, error_message=None)[source]

In-memory representation of a pipeline stage output for a single item.

Variables:

stage_index (int) – One-based pipeline stage index.
extractor_id (str) – Extractor identifier for the stage.
status (str) – Stage status: extracted, skipped, or errored.
text (str or None) – Extracted text content, when produced.
text_characters (int) – Character count of the extracted text.
producer_extractor_id (str or None) – Extractor identifier that produced the text content.
source_stage_index (int or None) – Optional stage index that supplied the text for selection-style extractors.
confidence (float or None) – Optional confidence score from 0.0 to 1.0.
metadata (dict[str, Any]) – Optional structured metadata for passing data between pipeline stages.
error_type (str or None) – Optional error type name for errored stages.
error_message (str or None) – Optional error message for errored stages.

Parameters:

stage_index (Annotated[int, Ge(ge=1)])
extractor_id (str)
status (str)
text (str | None)
text_characters (Annotated[int, Ge(ge=0)])
producer_extractor_id (str | None)
source_stage_index (Annotated[int | None, Ge(ge=1)])
confidence (Annotated[float | None, Ge(ge=0.0), Le(le=1.0)])
metadata (Dict[str, Any])
error_type (str | None)
error_message (str | None)

confidence: float | None

error_message: str | None

error_type: str | None

extractor_id: str

metadata: Dict[str, Any]

model_config = {'extra': 'forbid'}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

producer_extractor_id: str | None

source_stage_index: int | None

stage_index: int

status: str

text: str | None

text_characters: int

class biblicus.models.IngestResult(*, item_id, relpath, sha256)[source]

Minimal summary for an ingestion event.

Variables:

item_id (str) – Universally unique identifier assigned to the ingested item.
relpath (str) – Relative path to the raw item file.
sha256 (str) – Secure Hash Algorithm 256 digest of the stored bytes.

Parameters:

item_id (str)
relpath (str)
sha256 (str)

item_id: str

model_config = {'extra': 'forbid'}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

relpath: str

sha256: str

class biblicus.models.PipelineAnalysisConfig(*, kind, configuration)[source]

Analysis configuration for a pipeline recipe.

Variables:

kind (str) – Analysis kind identifier.
configuration (str) – Path to analysis configuration file.

Parameters:

kind (Annotated[str, MinLen(min_length=1)])
configuration (Annotated[str, MinLen(min_length=1)])

configuration: str

kind: str

model_config = {'extra': 'forbid'}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class biblicus.models.PipelineCorpusSelector(*, path=None, collection=None, selector=None)[source]

Corpus selection for a pipeline recipe.

Variables:

path (str or None) – Optional corpus path.
collection (str or None) – Optional collection name or path.
selector (str or None) – Optional selector pattern for collection corpora.

Parameters:

path (str | None)
collection (str | None)
selector (str | None)

collection: str | None

model_config = {'extra': 'forbid'}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

path: str | None

selector: str | None

class biblicus.models.PipelineExtractionConfig(*, recipe)[source]

Extraction configuration for a pipeline recipe.

Variables: recipe (str) – Path to extraction recipe YAML.
Parameters: recipe (Annotated[str, MinLen(min_length=1)])

model_config = {'extra': 'forbid'}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

recipe: str

class biblicus.models.PipelineMirrorConfig(*, collection)[source]

Mirror configuration for a pipeline recipe.

Variables: collection (str) – Collection path or name to mirror before running.
Parameters: collection (Annotated[str, MinLen(min_length=1)])

collection: str

model_config = {'extra': 'forbid'}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class biblicus.models.PipelineRecipeConfig(*, corpus, mirror=None, extraction=None, retrieval=None, analysis=None)[source]

Pipeline recipe configuration.

Variables:

corpus (PipelineCorpusSelector) – Corpus selection information.
mirror (PipelineMirrorConfig or None) – Optional mirror configuration.
extraction (PipelineExtractionConfig or None) – Optional extraction configuration.
retrieval (PipelineRetrievalConfig or None) – Optional retrieval configuration.
analysis (list[PipelineAnalysisConfig] or None) – Optional analysis configuration list.

Parameters:

corpus (PipelineCorpusSelector)
mirror (PipelineMirrorConfig | None)
extraction (PipelineExtractionConfig | None)
retrieval (PipelineRetrievalConfig | None)
analysis (List[PipelineAnalysisConfig] | None)

analysis: List[PipelineAnalysisConfig] | None

corpus: PipelineCorpusSelector

extraction: PipelineExtractionConfig | None

mirror: PipelineMirrorConfig | None

model_config = {'extra': 'forbid'}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

retrieval: PipelineRetrievalConfig | None

class biblicus.models.PipelineRetrievalConfig(*, retriever, configuration)[source]

Retrieval configuration for a pipeline recipe.

Variables:

retriever (str) – Retriever identifier.
configuration (str) – Path to retriever configuration file.

Parameters:

retriever (Annotated[str, MinLen(min_length=1)])
configuration (Annotated[str, MinLen(min_length=1)])

configuration: str

model_config = {'extra': 'forbid'}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

retriever: str

class biblicus.models.QueryBudget(*, max_total_items, offset=0, maximum_total_characters=None, max_items_per_source=None)[source]

Evidence selection budget for retrieval.

The budget constrains the returned evidence. It intentionally does not change how a backend scores candidates, only how many evidence items are selected and how much text is allowed through.

Variables:

max_total_items (int) – Maximum number of evidence items to return.
offset (int) – Number of ranked candidates to skip before selecting evidence. This enables simple pagination by re-running the same query with a higher offset.
maximum_total_characters (int or None) – Optional maximum total characters across evidence text.
max_items_per_source (int or None) – Optional cap per source uniform resource identifier.

Parameters:

max_total_items (Annotated[int, Ge(ge=1)])
offset (Annotated[int, Ge(ge=0)])
maximum_total_characters (Annotated[int | None, Ge(ge=1)])
max_items_per_source (Annotated[int | None, Ge(ge=1)])

max_items_per_source: int | None

max_total_items: int

maximum_total_characters: int | None

model_config = {'extra': 'forbid'}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

offset: int

class biblicus.models.RemoteCollectionPullResult(*, discovered=0, created=0, mirrored=0, archived=0, errored=0)[source]

Summary of a collection pull operation.

Variables:

discovered (int) – Number of discovered subfolders or partitions.
created (int) – Number of corpora created.
mirrored (int) – Number of corpora mirrored.
archived (int) – Number of corpora archived.
errored (int) – Number of errors.

Parameters:

discovered (Annotated[int, Ge(ge=0)])
created (Annotated[int, Ge(ge=0)])
mirrored (Annotated[int, Ge(ge=0)])
archived (Annotated[int, Ge(ge=0)])
errored (Annotated[int, Ge(ge=0)])

archived: int

created: int

discovered: int

errored: int

mirrored: int

model_config = {'extra': 'forbid'}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class biblicus.models.RemoteCorpusCollectionConfig(*, schema_version, created_at, collection_name, source, discovery, corpus_root, auto_create=True, deletion_policy='archive')[source]

Configuration for a remote corpus collection.

Variables:

schema_version (int) – Version of the collection config schema.
created_at (str) – International Organization for Standardization 8601 timestamp.
collection_name (str) – Collection name.
source (RemoteCorpusSourceConfig) – Remote source configuration.
discovery (RemoteCorpusCollectionDiscovery) – Discovery configuration.
corpus_root (str) – Filesystem path to the corpus root directory.
auto_create (bool) – Whether to auto-create discovered corpora.
deletion_policy (str) – Policy for missing remote folders (archive or delete).

Parameters:

schema_version (Annotated[int, Ge(ge=1)])
created_at (str)
collection_name (Annotated[str, MinLen(min_length=1)])
source (RemoteCorpusSourceConfig)
discovery (RemoteCorpusCollectionDiscovery)
corpus_root (Annotated[str, MinLen(min_length=1)])
auto_create (bool)
deletion_policy (Annotated[str, MinLen(min_length=1)])

auto_create: bool

collection_name: str

corpus_root: str

created_at: str

deletion_policy: str

discovery: RemoteCorpusCollectionDiscovery

model_config = {'extra': 'forbid'}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

schema_version: int

source: RemoteCorpusSourceConfig

class biblicus.models.RemoteCorpusCollectionDiscovery(*, mode, depth=1, include_root_files=False)[source]

Discovery configuration for a remote collection.

Variables:

mode (str) – Discovery mode (subfolder or partition).
depth (int) – Subfolder depth to discover.
include_root_files (bool) – Whether to include root files under a reserved corpus.

Parameters:

mode (Annotated[str, MinLen(min_length=1)])
depth (Annotated[int, Ge(ge=1)])
include_root_files (bool)

depth: int

include_root_files: bool

mode: str

model_config = {'extra': 'forbid'}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class biblicus.models.RemoteCorpusSourceConfig(*, kind, profile, name=None, bucket=None, container=None, prefix='')[source]

Configuration for a remote corpus source.

Variables:

kind (str) – Remote source kind (s3 or azure-blob).
profile (str) – Source profile name in user configuration.
name (str or None) – Optional local namespace for storage.
bucket (str or None) – S3 bucket name.
container (str or None) – Azure Blob container name.
prefix (str) – Optional remote prefix to scope the mirror.

Parameters:

kind (Annotated[str, MinLen(min_length=1)])
profile (Annotated[str, MinLen(min_length=1)])
name (str | None)
bucket (str | None)
container (str | None)
prefix (str)

bucket: str | None

container: str | None

kind: str

model_config = {'extra': 'forbid'}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

name: str | None

prefix: str

profile: str

class biblicus.models.RemoteSourceItem(*, key, source_uri, etag=None, last_modified=None, size, content_type=None)[source]

Remote source object metadata.

Variables:

key (str) – Remote object key or blob name.
source_uri (str) – Source uniform resource identifier.
etag (str or None) – Optional entity tag for change detection.
last_modified (str or None) – Optional International Organization for Standardization 8601 timestamp.
size (int) – Size of the object in bytes.
content_type (str or None) – Optional media type.

Parameters:

key (str)
source_uri (str)
etag (str | None)
last_modified (str | None)
size (Annotated[int, Ge(ge=0)])
content_type (str | None)

content_type: str | None

etag: str | None

key: str

last_modified: str | None

model_config = {'extra': 'forbid'}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

size: int

source_uri: str

class biblicus.models.RemoteSourcePullResult(*, listed=0, downloaded=0, updated=0, skipped=0, pruned=0, errored=0)[source]

Summary of a remote source pull operation.

Variables:

listed (int) – Number of remote items listed.
downloaded (int) – Number of new items downloaded.
updated (int) – Number of existing items updated.
skipped (int) – Number of items skipped (no change).
pruned (int) – Number of local items pruned.
errored (int) – Number of items that failed to process.

Parameters:

listed (Annotated[int, Ge(ge=0)])
downloaded (Annotated[int, Ge(ge=0)])
updated (Annotated[int, Ge(ge=0)])
skipped (Annotated[int, Ge(ge=0)])
pruned (Annotated[int, Ge(ge=0)])
errored (Annotated[int, Ge(ge=0)])

downloaded: int

errored: int

listed: int

model_config = {'extra': 'forbid'}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

pruned: int

skipped: int

updated: int

class biblicus.models.RetrievalResult(*, query_text, budget, snapshot_id, configuration_id, retriever_id, generated_at, evidence=<factory>, stats=<factory>)[source]

Retrieval result bundle returned from a retriever query.

Variables:

query_text (str) – Query text issued against the backend.
budget (QueryBudget) – Evidence selection budget applied to results.
snapshot_id (str) – Retrieval snapshot identifier.
configuration_id (str) – Configuration identifier used for this query.
retriever_id (str) – Retriever identifier used for this query.
generated_at (str) – International Organization for Standardization 8601 timestamp for the query result.
evidence (list[Evidence]) – Evidence objects selected under the budget.
stats (dict[str, Any]) – Backend-specific query statistics.

Parameters:

query_text (str)
budget (QueryBudget)
snapshot_id (str)
configuration_id (str)
retriever_id (str)
generated_at (str)
evidence (List[Evidence])
stats (Dict[str, Any])

budget: QueryBudget

configuration_id: str

evidence: List[Evidence]

generated_at: str

model_config = {'extra': 'forbid'}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

query_text: str

retriever_id: str

snapshot_id: str

stats: Dict[str, Any]

class biblicus.models.RetrievalSnapshot(*, snapshot_id, configuration, corpus_uri, catalog_generated_at, created_at, snapshot_artifacts=<factory>, stats=<factory>)[source]

Immutable record of a retrieval snapshot.

Variables:

snapshot_id (str) – Unique snapshot identifier.
configuration (ConfigurationManifest) – Configuration manifest for this snapshot.
corpus_uri (str) – Canonical uniform resource identifier for the corpus root.
catalog_generated_at (str) – Catalog timestamp used for the snapshot.
created_at (str) – International Organization for Standardization 8601 timestamp for snapshot creation.
snapshot_artifacts (list[str]) – Relative paths to materialized artifacts.
stats (dict[str, Any]) – Retriever-specific snapshot statistics.

Parameters:

snapshot_id (str)
configuration (ConfigurationManifest)
corpus_uri (str)
catalog_generated_at (str)
created_at (str)
snapshot_artifacts (List[str])
stats (Dict[str, Any])

catalog_generated_at: str

configuration: ConfigurationManifest

corpus_uri: str

created_at: str

model_config = {'extra': 'forbid'}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

snapshot_artifacts: List[str]

snapshot_id: str

stats: Dict[str, Any]

biblicus.models.parse_extraction_snapshot_reference(value)[source]

Parse an extraction snapshot reference in the form extractor_id:snapshot_id.

Parameters:: value (str) – Raw reference string.
Returns:: Parsed extraction snapshot reference.
Return type:: ExtractionSnapshotReference
Raises:: ValueError – If the reference is not well formed.
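The reference format can be illustrated with a small standalone sketch. The `SnapshotReference` dataclass and `parse_reference` function below are hypothetical stand-ins, not the Biblicus implementation; in particular, splitting on the first colon and stripping whitespace are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnapshotReference:
    """Hypothetical stand-in for ExtractionSnapshotReference."""

    extractor_id: str
    snapshot_id: str


def parse_reference(value: str) -> SnapshotReference:
    """Split 'extractor_id:snapshot_id' and reject malformed input."""
    # Assumption: only the first colon separates the two parts.
    extractor_id, separator, snapshot_id = value.partition(":")
    extractor_id = extractor_id.strip()
    snapshot_id = snapshot_id.strip()
    if not separator or not extractor_id or not snapshot_id:
        raise ValueError(f"Expected extractor_id:snapshot_id, got {value!r}")
    return SnapshotReference(extractor_id=extractor_id, snapshot_id=snapshot_id)
```

A caller would typically wrap the call in a `try`/`except ValueError` block to report a malformed reference to the user.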

Shared retrieval helpers for Biblicus retrievers.

biblicus.retrieval.apply_budget(evidence, budget)[source]

Apply a query budget to a ranked evidence list.

Parameters:

evidence (Iterable[Evidence]) – Ranked evidence iterable (highest score first).
budget (QueryBudget) – Budget constraints to enforce.

Returns:

Evidence list respecting the budget.

Return type:

list[Evidence]
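How the four QueryBudget fields might interact can be sketched without the library. The `Item` dataclass and `apply_budget_sketch` function are hypothetical; the order of the checks, and the choice to stop at the first item that would exceed the character limit, are assumptions rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class Item:
    """Hypothetical stand-in for an Evidence object."""

    source_uri: str
    text: str


def apply_budget_sketch(
    evidence: Iterable[Item],
    *,
    max_total_items: int,
    offset: int = 0,
    maximum_total_characters: Optional[int] = None,
    max_items_per_source: Optional[int] = None,
) -> List[Item]:
    """Select items from a ranked iterable under budget constraints."""
    selected: List[Item] = []
    per_source: Dict[str, int] = {}
    total_characters = 0
    for index, item in enumerate(evidence):
        if index < offset:
            continue  # skip ranked candidates before the requested page
        if len(selected) >= max_total_items:
            break
        seen = per_source.get(item.source_uri, 0)
        if max_items_per_source is not None and seen >= max_items_per_source:
            continue  # this source has already used its share
        if (
            maximum_total_characters is not None
            and total_characters + len(item.text) > maximum_total_characters
        ):
            break  # assumption: stop rather than look for a shorter item
        selected.append(item)
        per_source[item.source_uri] = seen + 1
        total_characters += len(item.text)
    return selected
```

Scoring is untouched in this sketch: the input order is taken as given, and only selection is constrained.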

biblicus.retrieval.create_configuration_manifest(*, retriever_id, name, configuration, description=None)[source]

Create a deterministic configuration manifest from a retriever configuration.

Parameters:

retriever_id (str) – Retriever identifier for the configuration.
name (str) – Human-readable configuration name.
configuration (dict[str, Any]) – Retriever-specific configuration values.
description (str or None) – Optional configuration description.

Returns:

Deterministic configuration manifest.

Return type:

ConfigurationManifest
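One common way to make a configuration identifier deterministic is to hash a canonical serialization of its inputs. The sketch below illustrates that general technique only; the function name `configuration_fingerprint` and the choice of fields are assumptions, and the reference does not state how Biblicus derives its identifiers.

```python
import hashlib
import json
from typing import Any, Dict


def configuration_fingerprint(retriever_id: str, configuration: Dict[str, Any]) -> str:
    """Return a stable hex digest for a retriever configuration."""
    # Sorting keys and fixing separators makes the serialization canonical,
    # so logically equal dictionaries hash to the same value.
    canonical = json.dumps(
        {"retriever_id": retriever_id, "configuration": configuration},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

With this approach, key order in the input dictionary does not affect the result, while any change to a value does.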

biblicus.retrieval.create_snapshot_manifest(corpus, *, configuration, stats, snapshot_artifacts=None)[source]

Create a retrieval snapshot manifest tied to the current catalog snapshot.

Parameters:

corpus (Corpus) – Corpus used to generate the snapshot.
configuration (ConfigurationManifest) – Configuration manifest for the snapshot.
stats (dict[str, Any]) – Retriever-specific snapshot statistics.
snapshot_artifacts (list[str] or None) – Optional relative paths to materialized artifacts.

Returns:

Snapshot manifest.

Return type:

RetrievalSnapshot

biblicus.retrieval.hash_text(text)[source]

Hash a text payload for provenance.

Parameters:: text (str) – Text to hash.
Returns:: Secure Hash Algorithm 256 hex digest.
Return type:: str
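A minimal equivalent using only the standard library is shown below. The name `hash_text_sketch` is hypothetical, and encoding the text as UTF-8 before hashing is an assumption.

```python
import hashlib


def hash_text_sketch(text: str) -> str:
    """Return the SHA-256 hex digest of a text payload."""
    # Assumption: the text is encoded as UTF-8 before hashing.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

The digest is always 64 hexadecimal characters, so it is convenient to store alongside evidence as a fixed-width provenance field.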

Context pack building for Biblicus.

A context pack is the text that your application sends to a large language model. Biblicus produces a context pack from structured retrieval results so that evidence remains a stable contract while context formatting remains an explicit policy surface.

class biblicus.context.CharacterBudget(*, max_characters)[source]

Character budget for a context pack.

Variables:: max_characters (int) – Maximum characters permitted for the final context pack text.
Parameters:: max_characters (Annotated[int, Ge(ge=1)])

max_characters: int

model_config = {'extra': 'forbid'}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class biblicus.context.ContextPack(**data)[source]

Context pack derived from retrieval evidence.

Variables:

text (str) – Context pack text suitable for inclusion in a model call.
evidence_count (int) – Number of evidence blocks included in the context pack.
blocks (list[ContextPackBlock]) – Structured blocks that produced the context pack.

Parameters:

data (Any)

blocks: List['ContextPackBlock']

evidence_count: int

model_config = {'extra': 'forbid'}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

text: str

class biblicus.context.ContextPackBlock(*, evidence_item_id, text, metadata=None)[source]

A single context pack block derived from one evidence item.

Variables:

evidence_item_id (str) – Item identifier that produced this block.
text (str) – Text included in this block.
metadata (dict[str, object] or None) – Optional metadata included with the block.

Parameters:

evidence_item_id (Annotated[str, MinLen(min_length=1)])
text (Annotated[str, MinLen(min_length=1)])
metadata (Dict[str, object] | None)

evidence_item_id: str

metadata: Dict[str, object] | None

model_config = {'extra': 'forbid'}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

text: str

class biblicus.context.ContextPackPolicy(*, join_with='\n\n', ordering='rank', include_metadata=False, metadata_fields=None)[source]

Policy that controls how evidence becomes context pack text.

Variables:

join_with (str) – Separator inserted between evidence text blocks.
ordering (str) – Evidence ordering policy (rank, score, or source).
include_metadata (bool) – Whether to include evidence metadata lines in each block.
metadata_fields (list[str] or None) – Optional evidence metadata fields to include.

Parameters:

join_with (str)
ordering (Annotated[str, MinLen(min_length=1)])
include_metadata (bool)
metadata_fields (List[str] | None)

include_metadata: bool

join_with: str

metadata_fields: List[str] | None

model_config = {'extra': 'forbid'}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

ordering: str

class biblicus.context.TokenBudget(*, max_tokens)[source]

Token budget for a context pack.

Variables:: max_tokens (int) – Maximum tokens permitted for the final context pack text.
Parameters:: max_tokens (Annotated[int, Ge(ge=1)])

max_tokens: int

model_config = {'extra': 'forbid'}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class biblicus.context.TokenCounter(*, tokenizer_id='naive-whitespace')[source]

Token counter configuration for token budget fitting.

This is a lightweight model wrapper so token fitting remains explicit and testable even when the underlying tokenizer is provided by an optional dependency.

Variables:: tokenizer_id (str) – Tokenizer identifier (for example, naive-whitespace).
Parameters:: tokenizer_id (Annotated[str, MinLen(min_length=1)])

model_config = {'extra': 'forbid'}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

tokenizer_id: str

biblicus.context.build_context_pack(result, *, policy)[source]

Build a context pack from a retrieval result using an explicit policy.

Parameters:

result (RetrievalResult) – Retrieval result containing ranked evidence.
policy (ContextPackPolicy) – Policy controlling how evidence text is joined.

Returns:

Context pack containing concatenated evidence text.

Return type:

ContextPack
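The effect of the join_with and ordering policy fields can be sketched with a hypothetical `EvidenceStub` type. The meaning given to each ordering value here (rank keeps the incoming order, score sorts by descending score, source sorts by source URI) is an assumption for illustration, not documented behavior.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EvidenceStub:
    """Hypothetical stand-in for an Evidence object."""

    item_id: str
    text: str
    score: float
    source_uri: str


def build_text_sketch(
    evidence: List[EvidenceStub], *, join_with: str = "\n\n", ordering: str = "rank"
) -> str:
    """Order evidence by policy and join the non-empty texts."""
    if ordering == "rank":
        ordered = list(evidence)  # assumption: input is already rank-ordered
    elif ordering == "score":
        ordered = sorted(evidence, key=lambda e: e.score, reverse=True)
    elif ordering == "source":
        ordered = sorted(evidence, key=lambda e: e.source_uri)
    else:
        raise ValueError(f"Unknown ordering: {ordering!r}")
    blocks = [e.text.strip() for e in ordered if e.text.strip()]
    return join_with.join(blocks)
```

Keeping the ordering and separator as explicit arguments mirrors the idea that context formatting is a policy decision separate from retrieval.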

biblicus.context.count_tokens(text, *, tokenizer_id)[source]

Count tokens in a text using a tokenizer identifier.

The default tokenizer is naive-whitespace, which counts whitespace-separated tokens.

Parameters:

text (str) – Text payload to count.
tokenizer_id (str) – Tokenizer identifier.

Returns:

Token count.

Return type:

int

Raises:

KeyError – If the tokenizer identifier is unknown.
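The naive-whitespace behavior and the KeyError for unknown identifiers can be sketched with a small registry. The registry name `_TOKENIZERS` and the function `count_tokens_sketch` are hypothetical; only the tokenizer identifier comes from the reference.

```python
from typing import Callable, Dict

# Hypothetical registry mapping tokenizer identifiers to counting functions.
_TOKENIZERS: Dict[str, Callable[[str], int]] = {
    # str.split() with no arguments splits on any run of whitespace.
    "naive-whitespace": lambda text: len(text.split()),
}


def count_tokens_sketch(text: str, *, tokenizer_id: str) -> int:
    """Count tokens; an unknown identifier raises KeyError from the lookup."""
    return _TOKENIZERS[tokenizer_id](text)
```

Because whitespace runs are collapsed, repeated spaces and newlines do not inflate the count.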

biblicus.context.fit_context_pack_to_character_budget(context_pack, *, policy, character_budget)[source]

Fit a context pack to a character budget by dropping trailing blocks.

Parameters:

context_pack (ContextPack) – Context pack to fit.
policy (ContextPackPolicy) – Policy controlling how blocks are joined into text.
character_budget (CharacterBudget) – Character budget to enforce.

Returns:

Fitted context pack.

Return type:

ContextPack

biblicus.context.fit_context_pack_to_token_budget(context_pack, *, policy, token_budget, token_counter=None)[source]

Fit a context pack to a token budget by dropping trailing blocks.

This function is deterministic. It never rewrites block text. It only removes blocks from the end of the block list until the token budget is met.

Parameters:

context_pack (ContextPack) – Context pack to fit.
policy (ContextPackPolicy) – Policy controlling how blocks are joined into text.
token_budget (TokenBudget) – Token budget to enforce.
token_counter (TokenCounter or None) – Optional token counter configuration.

Returns:

Fitted context pack.

Return type:

ContextPack
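The drop-trailing-blocks strategy described above can be sketched on plain strings. The function `fit_blocks_to_token_budget` is hypothetical, and counting tokens by whitespace on the joined text is an assumption standing in for the configured token counter.

```python
from typing import List


def fit_blocks_to_token_budget(
    blocks: List[str], *, join_with: str, max_tokens: int
) -> List[str]:
    """Drop blocks from the end until the joined text fits the budget."""
    kept = list(blocks)  # copy so the caller's list is not mutated
    # Block text is never rewritten; only whole trailing blocks are removed.
    while kept and len(join_with.join(kept).split()) > max_tokens:
        kept.pop()
    return kept
```

The same input always yields the same output, and the result may be empty when even the first block exceeds the budget.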

Evaluation and benchmarking tools for extraction pipelines.

This module provides tools for quantifying the performance of OCR and other extraction pipelines against ground truth data.

class biblicus.evaluation.BenchmarkConfig(benchmark_name, categories, pipelines, aggregate_weights, output_dir=PosixPath('results'))[source]

Configuration for a complete benchmark run.

Parameters:

benchmark_name (str)
categories (Dict[str, CategoryConfig])
pipelines (List[Path])
aggregate_weights (Dict[str, float])
output_dir (Path)

aggregate_weights: Dict[str, float]

benchmark_name: str

categories: Dict[str, CategoryConfig]

classmethod load(config_path)[source]

Load benchmark configuration from YAML file.

Parameters:: config_path (Path) – Path to configuration file.
Returns:: Loaded configuration.
Return type:: BenchmarkConfig

output_dir: Path = PosixPath('results')

pipelines: List[Path]

class biblicus.evaluation.BenchmarkReport(evaluation_timestamp, corpus_path, pipeline_configuration, total_documents, avg_precision, avg_recall, avg_f1, median_precision, median_recall, median_f1, min_f1, max_f1, avg_word_error_rate, avg_sequence_accuracy, avg_lcs_ratio, median_word_error_rate, median_sequence_accuracy, median_lcs_ratio, avg_bigram_overlap, avg_trigram_overlap, processing_time_seconds, per_document_results)[source]

Aggregate benchmark results across multiple documents.

Parameters:

evaluation_timestamp (str)
corpus_path (str)
pipeline_configuration (Dict[str, Any])
total_documents (int)
avg_precision (float)
avg_recall (float)
avg_f1 (float)
median_precision (float)
median_recall (float)
median_f1 (float)
min_f1 (float)
max_f1 (float)
avg_word_error_rate (float)
avg_sequence_accuracy (float)
avg_lcs_ratio (float)
median_word_error_rate (float)
median_sequence_accuracy (float)
median_lcs_ratio (float)
avg_bigram_overlap (float)
avg_trigram_overlap (float)
processing_time_seconds (float)
per_document_results (List[Dict[str, Any]])

avg_bigram_overlap: float

avg_f1: float

avg_lcs_ratio: float

avg_precision: float

avg_recall: float

avg_sequence_accuracy: float

avg_trigram_overlap: float

avg_word_error_rate: float

corpus_path: str

evaluation_timestamp: str

max_f1: float

median_f1: float

median_lcs_ratio: float

median_precision: float

median_recall: float

median_sequence_accuracy: float

median_word_error_rate: float

min_f1: float

per_document_results: List[Dict[str, Any]]

pipeline_configuration: Dict[str, Any]

print_summary()[source]: Print console summary.

processing_time_seconds: float

to_csv(path)[source]

Export per-document results as CSV.

Parameters:: path (Path)

to_json(path)[source]

Export report as JSON.

Parameters:: path (Path)

total_documents: int

class biblicus.evaluation.BenchmarkResult(benchmark_version='1.0.0', benchmark_name='', timestamp='', categories=<factory>, aggregate=<factory>, recommendations=<factory>, total_documents=0, total_processing_time_seconds=0.0)[source]

Complete benchmark results across all categories.

Parameters:

benchmark_version (str)
benchmark_name (str)
timestamp (str)
categories (Dict[str, CategoryResult])
aggregate (Dict[str, float])
recommendations (Dict[str, str])
total_documents (int)
total_processing_time_seconds (float)

aggregate: Dict[str, float]

benchmark_name: str = ''

benchmark_version: str = '1.0.0'

categories: Dict[str, CategoryResult]

print_summary()[source]

Print summary to console.

Return type:: None

recommendations: Dict[str, str]

timestamp: str = ''

to_json(path)[source]

Export results to JSON file.

Parameters:: path (Path)
Return type:: None

to_markdown(path)[source]

Export results to Markdown file.

Parameters:: path (Path)
Return type:: None

total_documents: int = 0

total_processing_time_seconds: float = 0.0

class biblicus.evaluation.BenchmarkRunner(config)[source]

Orchestrates multi-category benchmarking.

Usage:

    config = BenchmarkConfig.load("configs/benchmark/standard.yaml")
    runner = BenchmarkRunner(config)
    results = runner.run_all()
    results.to_json(Path("results/benchmark.json"))

Parameters:: config (BenchmarkConfig)

run_all()[source]

Run benchmark across all configured categories.

Returns:: Complete benchmark results.
Return type:: BenchmarkResult

run_category(cat_config)[source]

Run benchmark for a single category.

Parameters:: cat_config (CategoryConfig) – Category configuration.
Returns:: Category results.
Return type:: CategoryResult

class biblicus.evaluation.CategoryConfig(name, dataset, primary_metric, pipelines=<factory>, corpus_path=None, ground_truth_subdir=None, subset_size=None, tags=<factory>)[source]

Configuration for a single benchmark category.

Parameters:

name (str)
dataset (str)
primary_metric (str)
pipelines (List[object])
corpus_path (Path | None)
ground_truth_subdir (str | None)
subset_size (int | None)
tags (List[str])

corpus_path: Path | None = None

dataset: str

ground_truth_subdir: str | None = None

name: str

pipelines: List[object]

primary_metric: str

subset_size: int | None = None

tags: List[str]

class biblicus.evaluation.CategoryResult(category_name, dataset, documents_evaluated, pipelines, best_pipeline, best_score, primary_metric, primary_score, processing_time_seconds)[source]

Results for a single category.

Parameters:

category_name (str)
dataset (str)
documents_evaluated (int)
pipelines (List[Dict[str, Any]])
best_pipeline (str)
best_score (float)
primary_metric (str)
primary_score (float)
processing_time_seconds (float)

best_pipeline: str

best_score: float

category_name: str

dataset: str

documents_evaluated: int

pipelines: List[Dict[str, Any]]

primary_metric: str

primary_score: float

processing_time_seconds: float

class biblicus.evaluation.OCRBenchmark(corpus)[source]

Runs OCR evaluation across multiple documents.

Evaluates extraction snapshots against ground truth data and generates comprehensive reports with per-document and aggregate metrics.

Parameters:: corpus (Corpus)

evaluate_extraction(snapshot_reference, ground_truth_dir=None, pipeline_config=None)[source]

Evaluate an extraction snapshot against ground truth.

Parameters:

snapshot_reference (str) – Snapshot ID or reference.
ground_truth_dir (Path | None) – Directory containing ground truth files (defaults to corpus/metadata/funsd_ground_truth).
pipeline_config (Dict | None) – Configuration used to create the snapshot.

Returns:

BenchmarkReport with detailed results.

Return type:

BenchmarkReport

class biblicus.evaluation.OCREvaluationResult(document_id, image_path, ground_truth_text, extracted_text, precision, recall, f1_score, character_accuracy, true_positives, false_positives, false_negatives, word_count_gt, word_count_ocr, word_error_rate, sequence_accuracy, lcs_ratio, normalized_edit_distance, bigram_overlap, trigram_overlap)[source]

Results for evaluating a single document.

Parameters:

document_id (str)
image_path (str)
ground_truth_text (str)
extracted_text (str)
precision (float)
recall (float)
f1_score (float)
character_accuracy (float)
true_positives (int)
false_positives (int)
false_negatives (int)
word_count_gt (int)
word_count_ocr (int)
word_error_rate (float)
sequence_accuracy (float)
lcs_ratio (float)
normalized_edit_distance (float)
bigram_overlap (float)
trigram_overlap (float)

bigram_overlap: float

character_accuracy: float

document_id: str

extracted_text: str

f1_score: float

false_negatives: int

false_positives: int

ground_truth_text: str

image_path: str

lcs_ratio: float

normalized_edit_distance: float

precision: float

print_summary()[source]: Print a summary of this result.

recall: float

sequence_accuracy: float

to_dict()[source]

Convert to dictionary for serialization.

Return type:: Dict[str, Any]

trigram_overlap: float

true_positives: int

word_count_gt: int

word_count_ocr: int

word_error_rate: float

biblicus.evaluation.calculate_character_accuracy(ground_truth, extracted)[source]

Calculate character-level accuracy using edit distance.

Uses Levenshtein distance to compute how similar the strings are. Returns 1.0 - (distance / max_length).

Parameters:

ground_truth (str) – Expected text.
extracted (str) – Actual OCR output.

Returns:

Accuracy between 0.0 and 1.0.

Return type:

float
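The formula 1.0 - (distance / max_length) can be worked through with a standalone sketch. The function names are hypothetical, and returning 1.0 when both strings are empty is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Compute the edit distance using a two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def character_accuracy_sketch(ground_truth: str, extracted: str) -> float:
    """Return 1.0 - (distance / max_length), clamped by construction to [0, 1]."""
    max_length = max(len(ground_truth), len(extracted))
    if max_length == 0:
        return 1.0  # assumption: two empty strings count as a perfect match
    return 1.0 - levenshtein(ground_truth, extracted) / max_length
```

Because the edit distance can never exceed the length of the longer string, the result stays between 0.0 and 1.0.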

biblicus.evaluation.calculate_ngram_overlap(ground_truth, extracted, n=2)[source]

Calculate n-gram overlap to measure local word ordering.

N-grams capture short sequences of words. High n-gram overlap means the extracted text preserves local word ordering, even if global order differs.

Parameters:

ground_truth (str) – Expected text.
extracted (str) – Actual OCR output.
n (int) – N-gram size (default 2 for bigrams).

Returns:

N-gram overlap ratio (0.0 to 1.0).

Return type:

float
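A small sketch shows why n-gram overlap rewards locally correct word order. The definition used here (the share of distinct ground truth n-grams that also appear in the extracted text, after lowercasing and whitespace splitting) is an assumption; the library may normalize or weight differently.

```python
from typing import List, Set, Tuple


def word_ngrams(words: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of distinct n-grams in a word list."""
    return {tuple(words[i : i + n]) for i in range(len(words) - n + 1)}


def ngram_overlap_sketch(ground_truth: str, extracted: str, n: int = 2) -> float:
    """Share of ground truth n-grams that also occur in the extracted text."""
    truth = word_ngrams(ground_truth.lower().split(), n)
    found = word_ngrams(extracted.lower().split(), n)
    if not truth:
        return 0.0  # assumption: no n-grams to match yields zero overlap
    return len(truth & found) / len(truth)
```

Moving one word to the end of a sentence breaks only the n-grams that touched it, so the score degrades gradually rather than collapsing.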

biblicus.evaluation.calculate_word_metrics(ground_truth, extracted)[source]

Calculate word-level precision, recall, and F1 score.

Compares word sets after normalization (lowercase, remove punctuation).

Parameters:

ground_truth (str) – Expected text.
extracted (str) – Actual OCR output.

Returns:

Dictionary with precision, recall, f1_score, and counts.

Return type:

Dict[str, Any]
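Set-based word metrics can be sketched as follows. The normalization shown (lowercasing and removing ASCII punctuation) and the dictionary keys other than precision, recall, and f1_score are assumptions.

```python
import string
from typing import Any, Dict, Set

_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def normalize_words(text: str) -> Set[str]:
    """Lowercase, strip ASCII punctuation, and return the distinct words."""
    return set(text.lower().translate(_PUNCTUATION_TABLE).split())


def word_metrics_sketch(ground_truth: str, extracted: str) -> Dict[str, Any]:
    """Compute precision, recall, and F1 over word sets."""
    truth = normalize_words(ground_truth)
    found = normalize_words(extracted)
    true_positives = len(truth & found)
    false_positives = len(found - truth)
    false_negatives = len(truth - found)
    predicted = true_positives + false_positives
    expected = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / expected if expected else 0.0
    total = precision + recall
    f1_score = 2 * precision * recall / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1_score": f1_score,
        "true_positives": true_positives,
        "false_positives": false_positives,
        "false_negatives": false_negatives,
    }
```

Because sets discard duplicates and order, these metrics say nothing about reading sequence; the order-aware metrics cover that.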

biblicus.evaluation.calculate_word_order_metrics(ground_truth, extracted)[source]

Calculate order-aware metrics that measure reading sequence quality.

These metrics are critical for evaluating layout-aware OCR where the goal is to preserve correct reading order (e.g., left column before right column).

Metrics:

Word Error Rate (WER) – Edit distance on word sequences (insertions, deletions, substitutions).
Sequence accuracy – Percentage of word sequences that match exactly.
Longest Common Subsequence (LCS) – Longest sequence of words in correct order.
Normalized edit distance – Word-level Levenshtein distance normalized by length.

Parameters:

ground_truth (str) – Expected text in correct reading order.
extracted (str) – Actual OCR output.

Returns:

Dictionary with order-aware metrics.

Return type:

Dict[str, Any]
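Two of these metrics, word error rate and the LCS ratio, can be sketched on whitespace-split words. The normalization (lowercasing), the denominators (ground truth word count), and the handling of empty ground truth are assumptions; the library may define them differently.

```python
from typing import Dict, List


def word_edit_distance(reference: List[str], hypothesis: List[str]) -> int:
    """Levenshtein distance over word lists."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two word lists."""
    previous = [0] * (len(b) + 1)
    for word_a in a:
        current = [0]
        for j, word_b in enumerate(b, start=1):
            if word_a == word_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def word_order_metrics_sketch(ground_truth: str, extracted: str) -> Dict[str, float]:
    """Return word error rate and LCS ratio relative to the ground truth."""
    reference = ground_truth.lower().split()
    hypothesis = extracted.lower().split()
    if not reference:
        # Assumption: with no reference words, any output counts as all errors.
        return {"word_error_rate": 1.0 if hypothesis else 0.0, "lcs_ratio": 0.0}
    return {
        "word_error_rate": word_edit_distance(reference, hypothesis) / len(reference),
        "lcs_ratio": lcs_length(reference, hypothesis) / len(reference),
    }
```

Swapping two columns leaves every word present, so word-set metrics stay perfect while the LCS ratio drops, which is the failure these metrics are meant to expose.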

biblicus.evaluation.evaluate_snapshot(*, corpus, snapshot, dataset, budget)[source]

Evaluate a retrieval snapshot against a dataset.

Parameters:

corpus (Corpus) – Corpus associated with the snapshot.
snapshot (RetrievalSnapshot) – Retrieval snapshot manifest.
dataset (EvaluationDataset) – Evaluation dataset.
budget (QueryBudget) – Evidence selection budget.

Returns:

Evaluation result bundle.

Return type:

EvaluationResult

biblicus.evaluation.load_dataset(path)[source]

Load an evaluation dataset from JavaScript Object Notation.

Parameters:: path (Path) – Path to the dataset JavaScript Object Notation file.
Returns:: Parsed evaluation dataset.
Return type:: EvaluationDataset

Extraction

Text extraction snapshots for Biblicus.

class biblicus.extraction.ExtractionConfigurationManifest(*, configuration_id, extractor_id, name, created_at, configuration=<factory>)[source]

Reproducible configuration for an extraction plugin snapshot.

Variables:

configuration_id (str) – Deterministic configuration identifier.
extractor_id (str) – Extractor plugin identifier.
name (str) – Human-readable configuration name.
created_at (str) – International Organization for Standardization 8601 timestamp.
configuration (dict[str, Any]) – Extractor-specific configuration values.

Parameters:

configuration_id (str)
extractor_id (str)
name (str)
created_at (str)
configuration (Dict[str, Any])

configuration: Dict[str, Any]

configuration_id: str

created_at: str

extractor_id: str

model_config = {'extra': 'forbid'}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

name: str

class biblicus.extraction.ExtractionItemResult(*, item_id, status, final_text_relpath=None, final_metadata_relpath=None, final_stage_index=None, final_stage_extractor_id=None, final_producer_extractor_id=None, final_source_stage_index=None, error_type=None, error_message=None, stage_results=<factory>)[source]

Per-item result record for an extraction snapshot.

Variables:

item_id (str) – Item identifier.
status (str) – Final result status, extracted, skipped, or errored.
final_text_relpath (str or None) – Relative path to the final extracted text artifact, when extracted.
final_metadata_relpath (str or None) – Relative path to the final metadata artifact, when present.
final_stage_index (int or None) – Pipeline stage index that produced the final text.
final_stage_extractor_id (str or None) – Extractor identifier of the stage that produced the final text.
final_producer_extractor_id (str or None) – Extractor identifier that produced the final text content.
final_source_stage_index (int or None) – Optional stage index that supplied the final text for selection-style extractors.
error_type (str or None) – Optional error type name when no extracted text was produced.
error_message (str or None) – Optional error message when no extracted text was produced.
stage_results (list[ExtractionStageResult]) – Per-stage results recorded for this item.

Parameters:

item_id (str)
status (str)
final_text_relpath (str | None)
final_metadata_relpath (str | None)
final_stage_index (Annotated[int | None, Ge(ge=1)])
final_stage_extractor_id (str | None)
final_producer_extractor_id (str | None)
final_source_stage_index (Annotated[int | None, Ge(ge=1)])
error_type (str | None)
error_message (str | None)
stage_results (List[ExtractionStageResult])

error_message: str | None

error_type: str | None

final_metadata_relpath: str | None

final_producer_extractor_id: str | None

final_source_stage_index: int | None

final_stage_extractor_id: str | None

final_stage_index: int | None

final_text_relpath: str | None

item_id: str

model_config = {'extra': 'forbid'}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

stage_results: List[ExtractionStageResult]

status: str

class biblicus.extraction.ExtractionSnapshotManifest(*, snapshot_id, configuration, corpus_uri, catalog_generated_at, created_at, items=<factory>, stats=<factory>)[source]

Immutable record describing an extraction snapshot.

Variables:

snapshot_id (str) – Unique snapshot identifier.
configuration (ExtractionConfigurationManifest) – Configuration manifest for this snapshot.
corpus_uri (str) – Canonical uniform resource identifier for the corpus root.
catalog_generated_at (str) – Catalog timestamp used for the snapshot.
created_at (str) – International Organization for Standardization 8601 timestamp for snapshot creation.
items (list[ExtractionItemResult]) – Per-item results.
stats (dict[str, Any]) – Snapshot statistics.

Parameters:

snapshot_id (str)
configuration (ExtractionConfigurationManifest)
corpus_uri (str)
catalog_generated_at (str)
created_at (str)
items (List[ExtractionItemResult])
stats (Dict[str, Any])

catalog_generated_at: str

configuration: ExtractionConfigurationManifest

corpus_uri: str

created_at: str

items: List[ExtractionItemResult]

model_config = {'extra': 'forbid'}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

snapshot_id: str

stats: Dict[str, Any]

class biblicus.extraction.ExtractionStageResult(*, stage_index, extractor_id, status, text_relpath=None, text_characters=0, producer_extractor_id=None, source_stage_index=None, confidence=None, metadata_relpath=None, error_type=None, error_message=None)[source]

Per-item result record for a single pipeline stage.

Variables:

stage_index (int) – One-based pipeline stage index.
extractor_id (str) – Extractor identifier for the stage.
status (str) – Stage status: extracted, skipped, or errored.
text_relpath (str or None) – Relative path to the stage text artifact, when extracted.
text_characters (int) – Character count of the extracted text.
producer_extractor_id (str or None) – Extractor identifier that produced the text content.
source_stage_index (int or None) – Optional stage index that supplied the text for selection-style extractors.
confidence (float or None) – Optional confidence score from 0.0 to 1.0.
metadata_relpath (str or None) – Relative path to the stage metadata artifact, when present.
error_type (str or None) – Optional error type name for errored stages.
error_message (str or None) – Optional error message for errored stages.

Parameters:

stage_index (Annotated[int, Ge(ge=1)])
extractor_id (str)
status (str)
text_relpath (str | None)
text_characters (Annotated[int, Ge(ge=0)])
producer_extractor_id (str | None)
source_stage_index (Annotated[int | None, Ge(ge=1)])
confidence (Annotated[float | None, Ge(ge=0.0), Le(le=1.0)])
metadata_relpath (str | None)
error_type (str | None)
error_message (str | None)

confidence: float | None

error_message: str | None

error_type: str | None

extractor_id: str

metadata_relpath: str | None

model_config = {'extra': 'forbid'}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

producer_extractor_id: str | None

source_stage_index: int | None

stage_index: int

status: str

text_characters: int

text_relpath: str | None
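The annotated parameter types above imply numeric bounds: a one-based stage_index, a non-negative text_characters, and a confidence between 0.0 and 1.0. The sketch below mirrors those bounds with a standard-library dataclass. It is a stand-in for illustration only; the real model relies on Pydantic for validation, and the class name StageResultSketch and the extractor identifier are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StageResultSketch:
    """Stand-in for ExtractionStageResult that enforces the documented bounds."""

    stage_index: int
    extractor_id: str
    status: str
    text_characters: int = 0
    confidence: Optional[float] = None

    def __post_init__(self) -> None:
        if self.stage_index < 1:
            raise ValueError("stage_index is one-based and must be >= 1")
        if self.text_characters < 0:
            raise ValueError("text_characters must be >= 0")
        if self.confidence is not None and not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")


# A record inside the documented bounds is accepted.
ok = StageResultSketch(
    stage_index=1,
    extractor_id="example-extractor",  # hypothetical identifier
    status="extracted",
    text_characters=120,
    confidence=0.9,
)

# A zero stage index violates the one-based constraint.
try:
    StageResultSketch(stage_index=0, extractor_id="example-extractor", status="skipped")
    rejected = False
except ValueError:
    rejected = True
```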

biblicus.extraction.build_extraction_snapshot(corpus, *, extractor_id, configuration_name, configuration, force=False, max_workers=1)[source]

Build an extraction snapshot for a corpus using the pipeline extractor.

Parameters:

corpus (Corpus) – Corpus to extract from.
extractor_id (str) – Extractor plugin identifier (must be pipeline).
configuration_name (str) – Human-readable configuration name.
configuration (dict[str, Any]) – Extractor configuration mapping.
force (bool) – Whether to reprocess items even if artifacts already exist.
max_workers (int) – Maximum number of concurrent workers.

Returns:

Extraction snapshot manifest describing the build.

Return type:

ExtractionSnapshotManifest

Raises:

KeyError – If the extractor identifier is unknown.
ValueError – If the extractor configuration is invalid.
OSError – If the snapshot directory or artifacts cannot be written.
ExtractionSnapshotFatalError – If the extractor is not the pipeline extractor.

biblicus.extraction.create_extraction_configuration_manifest(*, extractor_id, name, configuration)[source]

Create a deterministic extraction configuration manifest.

Parameters:

extractor_id (str) – Extractor plugin identifier.
name (str) – Human-readable configuration name.
configuration (dict[str, Any]) – Extractor configuration.

Returns:

Configuration manifest.

Return type:

ExtractionConfigurationManifest

biblicus.extraction.create_extraction_snapshot_manifest(corpus, *, configuration)[source]

Create a new extraction snapshot manifest for a corpus.

Parameters:

corpus (Corpus) – Corpus associated with the snapshot.
configuration (ExtractionConfigurationManifest) – Configuration manifest.

Returns:

Snapshot manifest.

Return type:

ExtractionSnapshotManifest

biblicus.extraction.load_or_build_extraction_snapshot(corpus, *, extractor_id, configuration_name, configuration, max_workers=1)[source]

Load an extraction snapshot if it exists or build it when missing.

Parameters:

corpus (Corpus) – Corpus to extract from.
extractor_id (str) – Extractor plugin identifier (must be pipeline).
configuration_name (str) – Human-readable configuration name.
configuration (dict[str, Any]) – Extractor configuration mapping.
max_workers (int) – Maximum number of concurrent workers.

Returns:

Extraction snapshot manifest for the loaded or newly built snapshot.

Return type:

ExtractionSnapshotManifest
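The load-or-build behavior described above follows a common caching pattern: reuse a stored manifest when one exists, otherwise build it and persist it. The sketch below shows that pattern in isolation with JSON files. It does not call Biblicus; the function name load_or_build, the manifest.json file name, and the directory layout are assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict, List


def load_or_build(manifest_path: Path, build: Callable[[], Dict[str, Any]]) -> Dict[str, Any]:
    """Return the stored manifest if present; otherwise build and persist it."""
    if manifest_path.exists():
        return json.loads(manifest_path.read_text(encoding="utf-8"))
    manifest = build()
    manifest_path.parent.mkdir(parents=True, exist_ok=True)
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest


build_calls: List[int] = []


def fake_build() -> Dict[str, Any]:
    """Hypothetical builder that records how often it runs."""
    build_calls.append(1)
    return {"snapshot_id": "example-snapshot", "items": []}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "snapshots" / "example-snapshot" / "manifest.json"
    first = load_or_build(path, fake_build)   # builds and writes the file
    second = load_or_build(path, fake_build)  # loads the stored file
```

The builder runs only on the first call, which is the property that makes repeated invocations cheap.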

biblicus.extraction.write_extracted_metadata_artifact(*, snapshot_dir, item, metadata)[source]

Write an extracted metadata artifact for an item into the snapshot directory.

Parameters:

snapshot_dir (Path) – Extraction snapshot directory.
item (CatalogItem) – Catalog item being extracted.
metadata (dict[str, Any]) – Metadata dictionary to persist.

Returns:

Relative path to the stored metadata artifact, or None if empty.

Return type:

str or None
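The return contract above (a relative path, or None when there is nothing to store) can be sketched as follows. This is not the library's implementation: the helper name, the metadata/ subdirectory, and the JSON file naming are hypothetical, and an item identifier string stands in for the CatalogItem argument.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Optional


def write_metadata_sketch(snapshot_dir: Path, item_id: str, metadata: Dict[str, Any]) -> Optional[str]:
    """Write metadata as JSON and return its path relative to snapshot_dir, or None if empty."""
    if not metadata:
        return None  # nothing to persist, so no artifact is created
    relpath = Path("metadata") / f"{item_id}.json"  # hypothetical layout
    target = snapshot_dir / relpath
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(metadata, sort_keys=True), encoding="utf-8")
    return relpath.as_posix()


with tempfile.TemporaryDirectory() as tmp:
    snapshot_dir = Path(tmp)
    empty_result = write_metadata_sketch(snapshot_dir, "item-1", {})
    stored_result = write_metadata_sketch(snapshot_dir, "item-1", {"language": "en"})
    stored_exists = (snapshot_dir / stored_result).exists()
```

Returning a path relative to the snapshot directory keeps a manifest valid when the corpus is moved to another location.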

biblicus.extraction.write_extracted_text_artifact(*, snapshot_dir, item, text)[source]

Write an extracted text artifact for an item into the snapshot directory.

Parameters:

snapshot_dir (Path) – Extraction snapshot directory.
item (CatalogItem) – Catalog item being extracted.
text (str) – Extracted text.

Returns:

Relative path to the stored text artifact.

Return type:

str

biblicus.extraction.write_extraction_latest_pointer(*, extractor_dir, manifest)[source]

Persist the latest pointer for an extractor.

Parameters:

extractor_dir (Path) – Extractor directory containing snapshots.
manifest (ExtractionSnapshotManifest) – Snapshot manifest used for the pointer.

Returns:

None.

Return type:

None
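A latest pointer is typically a small file that names the most recent snapshot so readers do not have to scan the directory. The sketch below shows one way such a pointer could be written safely, using a temporary file followed by a rename. The file name latest.json, its contents, and the helper name are assumptions; the pointer format that Biblicus actually writes is not described here.

```python
import json
import os
import tempfile
from pathlib import Path


def write_latest_pointer_sketch(extractor_dir: Path, snapshot_id: str) -> Path:
    """Write latest.json through a temporary file and a rename."""
    extractor_dir.mkdir(parents=True, exist_ok=True)
    pointer = extractor_dir / "latest.json"  # hypothetical file name
    temporary = extractor_dir / "latest.json.tmp"
    temporary.write_text(json.dumps({"snapshot_id": snapshot_id}), encoding="utf-8")
    os.replace(temporary, pointer)  # overwrites any existing pointer in one step
    return pointer


with tempfile.TemporaryDirectory() as tmp:
    extractor_dir = Path(tmp) / "pipeline"
    write_latest_pointer_sketch(extractor_dir, "snapshot-1")
    pointer_path = write_latest_pointer_sketch(extractor_dir, "snapshot-2")
    latest = json.loads(pointer_path.read_text(encoding="utf-8"))
```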

biblicus.extraction.write_extraction_snapshot_manifest(*, snapshot_dir, manifest)[source]

Persist an extraction snapshot manifest to a snapshot directory.

Parameters:

snapshot_dir (Path) – Extraction snapshot directory.
manifest (ExtractionSnapshotManifest) – Snapshot manifest to write.

Returns:

None.

Return type:

None

biblicus.extraction.write_pipeline_stage_metadata_artifact(*, snapshot_dir, stage_index, extractor_id, item, metadata)[source]

Write a pipeline stage metadata artifact for an item.

Parameters:

snapshot_dir (Path) – Extraction snapshot directory.
stage_index (int) – One-based pipeline stage index.
extractor_id (str) – Extractor identifier for the stage.
item (CatalogItem) – Catalog item being extracted.
metadata (dict[str, Any]) – Metadata dictionary to persist.

Returns:

Relative path to the stored stage metadata artifact, or None if empty.

Return type:

str or None

biblicus.extraction.write_pipeline_stage_text_artifact(*, snapshot_dir, stage_index, extractor_id, item, text)[source]

Write a pipeline stage text artifact for an item.

Parameters:

snapshot_dir (Path) – Extraction snapshot directory.
stage_index (int) – One-based pipeline stage index.
extractor_id (str) – Extractor identifier for the stage.
item (CatalogItem) – Catalog item being extracted.
text (str) – Extracted text content.

Returns:

Relative path to the stored stage text artifact.

Return type:

str

Text extraction plugins for Biblicus.

biblicus.extractors.get_extractor(extractor_id)[source]

Resolve a built-in text extractor by identifier.

Parameters:: extractor_id (str) – Extractor identifier.
Returns:: Extractor plugin instance.
Return type:: TextExtractor
Raises:: KeyError – If the extractor identifier is not known.

Graph

Graph extraction snapshots for Biblicus.

biblicus.graph.extraction.build_graph_snapshot(corpus, *, extractor_id, configuration_name, configuration, extraction_snapshot)[source]

Build a graph extraction snapshot for a corpus.

Parameters:

corpus (Corpus) – Corpus to process.
extractor_id (str) – Graph extractor identifier.
configuration_name (str) – Human-readable configuration name.
configuration (dict[str, Any]) – Extractor configuration values.
extraction_snapshot (ExtractionSnapshotReference) – Extraction snapshot reference.

Returns:

Graph snapshot manifest.

Return type:

GraphSnapshotManifest

biblicus.graph.extraction.create_graph_configuration_manifest(*, extractor_id, name, configuration)[source]

Create a deterministic graph extraction configuration manifest.

Parameters:

extractor_id (str) – Graph extractor identifier.
name (str) – Human-readable configuration name.
configuration (dict[str, Any]) – Extractor configuration.

Returns:

Configuration manifest.

Return type:

GraphConfigurationManifest

biblicus.graph.extraction.create_graph_id(*, extractor_id, configuration)[source]

Create a deterministic graph identifier from extractor and configuration.

Parameters:

extractor_id (str) – Graph extractor identifier.
configuration (dict[str, Any]) – Extractor configuration.

Returns:

Graph identifier.

Return type:

str
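One common way to obtain a deterministic identifier from an extractor identifier and a configuration mapping is to hash a canonical serialization of both. The sketch below does this with sorted-key JSON and SHA-256. The hashing scheme, the helper name graph_id_sketch, and the sample configuration keys are assumptions; the scheme used by create_graph_id is not specified in this reference.

```python
import hashlib
import json
from typing import Any, Dict


def graph_id_sketch(extractor_id: str, configuration: Dict[str, Any]) -> str:
    """Hash the extractor identifier together with canonical JSON of the configuration."""
    canonical = json.dumps(
        {"extractor_id": extractor_id, "configuration": configuration},
        sort_keys=True,          # key order must not influence the identifier
        separators=(",", ":"),   # fixed separators avoid whitespace differences
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical extractor identifier and configuration keys.
id_one = graph_id_sketch("example-graph-extractor", {"window": 3, "min_count": 2})
id_two = graph_id_sketch("example-graph-extractor", {"min_count": 2, "window": 3})
id_other = graph_id_sketch("example-graph-extractor", {"window": 4, "min_count": 2})
```

Equal configurations yield the same identifier regardless of key order, while any change in a value yields a different one.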

biblicus.graph.extraction.create_graph_snapshot_manifest(corpus, *, configuration, extraction_snapshot, graph_id)[source]

Create a new graph snapshot manifest for a corpus.

Parameters:

corpus (Corpus) – Corpus associated with the snapshot.
configuration (GraphConfigurationManifest) – Configuration manifest.
extraction_snapshot (ExtractionSnapshotReference) – Extraction snapshot reference.
graph_id (str) – Graph identifier.

Returns:

Graph snapshot manifest.

Return type:

GraphSnapshotManifest

biblicus.graph.extraction.latest_graph_snapshot_reference(corpus, *, extractor_id=None)[source]

Return the most recent graph snapshot reference.

Parameters:

corpus (Corpus) – Corpus containing the snapshots.
extractor_id (str or None) – Optional extractor identifier filter.

Returns:

Latest graph snapshot reference or None when no snapshots exist.

Return type:

GraphSnapshotReference or None

biblicus.graph.extraction.list_graph_snapshots(corpus, *, extractor_id=None)[source]

List graph snapshots stored under the corpus.

Parameters:

corpus (Corpus) – Corpus containing the snapshots.
extractor_id (str or None) – Optional extractor identifier filter.

Returns:

Summary list entries for each snapshot.

Return type:

list[GraphSnapshotListEntry]
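The two lookups above both accept an optional extractor_id filter. A minimal sketch of that filtering, combined with choosing the newest entry, is shown below using dictionaries shaped like GraphSnapshotListEntry. Ordering by created_at is an assumption about how "most recent" is decided, and the helper name and sample values are hypothetical.

```python
from datetime import datetime
from typing import Dict, List, Optional


def latest_entry_sketch(
    entries: List[Dict[str, str]], extractor_id: Optional[str] = None
) -> Optional[Dict[str, str]]:
    """Return the entry with the newest created_at, optionally filtered by extractor."""
    candidates = [
        entry
        for entry in entries
        if extractor_id is None or entry["extractor_id"] == extractor_id
    ]
    if not candidates:
        return None
    # Parse the ISO 8601 timestamps so that comparison is chronological.
    return max(candidates, key=lambda entry: datetime.fromisoformat(entry["created_at"]))


sample_entries = [
    {"extractor_id": "example-a", "snapshot_id": "s1", "created_at": "2024-01-01T00:00:00+00:00"},
    {"extractor_id": "example-b", "snapshot_id": "s2", "created_at": "2024-03-01T00:00:00+00:00"},
    {"extractor_id": "example-a", "snapshot_id": "s3", "created_at": "2024-02-01T00:00:00+00:00"},
]

latest_any = latest_entry_sketch(sample_entries)
latest_a = latest_entry_sketch(sample_entries, extractor_id="example-a")
latest_missing = latest_entry_sketch(sample_entries, extractor_id="example-c")
```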

biblicus.graph.extraction.load_graph_snapshot_manifest(corpus, *, extractor_id, snapshot_id)[source]

Load a graph snapshot manifest from the corpus.

Parameters:

corpus (Corpus) – Corpus containing the snapshot.
extractor_id (str) – Graph extractor identifier.
snapshot_id (str) – Graph snapshot identifier.

Returns:

Parsed snapshot manifest.

Return type:

GraphSnapshotManifest

Raises:

FileNotFoundError – If the manifest file does not exist.
ValueError – If the manifest data is invalid.

biblicus.graph.extraction.resolve_graph_snapshot_reference(corpus, *, raw)[source]

Resolve a graph snapshot reference from a raw string.

Parameters:

corpus (Corpus) – Corpus containing the snapshots.
raw (str) – Raw snapshot reference.

Returns:

Parsed graph snapshot reference.

Return type:

GraphSnapshotReference

biblicus.graph.extraction.write_graph_latest_pointer(*, extractor_dir, manifest)[source]

Persist the latest pointer for a graph extractor.

Parameters:

extractor_dir (Path) – Extractor directory containing snapshots.
manifest (GraphSnapshotManifest) – Snapshot manifest used for the pointer.

Returns:

None.

Return type:

None

biblicus.graph.extraction.write_graph_snapshot_manifest(*, snapshot_dir, manifest)[source]

Persist a graph snapshot manifest to a snapshot directory.

Parameters:

snapshot_dir (Path) – Graph snapshot directory.
manifest (GraphSnapshotManifest) – Snapshot manifest to write.

Returns:

None.

Return type:

None

Graph extraction models for Biblicus.

class biblicus.graph.models.GraphConfigurationManifest(*, configuration_id, extractor_id, name, created_at, configuration=<factory>)[source]

Reproducible configuration for a graph extraction snapshot.

Variables:

configuration_id (str) – Deterministic configuration identifier.
extractor_id (str) – Graph extractor identifier.
name (str) – Human-readable configuration name.
created_at (str) – International Organization for Standardization 8601 timestamp.
configuration (dict[str, Any]) – Extractor-specific configuration values.

Parameters:

configuration_id (str)
extractor_id (str)
name (str)
created_at (str)
configuration (Dict[str, Any])

configuration: Dict[str, Any]

configuration_id: str

created_at: str

extractor_id: str

model_config = {'extra': 'forbid'}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

name: str

class biblicus.graph.models.GraphEdge(*, schema_version=1, edge_id, src, dst, edge_type, weight=1.0, properties=<factory>)[source]

Edge record extracted from a corpus item.

Variables:

edge_id (str) – Deterministic edge identifier.
src (str) – Source node identifier.
dst (str) – Destination node identifier.
edge_type (str) – Edge type identifier.
weight (float) – Edge weight.
properties (dict[str, Any]) – Edge-specific properties.

Parameters:

schema_version (Annotated[int, Ge(ge=1)])
edge_id (Annotated[str, MinLen(min_length=1)])
src (Annotated[str, MinLen(min_length=1)])
dst (Annotated[str, MinLen(min_length=1)])
edge_type (Annotated[str, MinLen(min_length=1)])
weight (float)
properties (Dict[str, Any])

dst: str

edge_id: str

edge_type: str

model_config = {'extra': 'forbid'}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

properties: Dict[str, Any]

src: str

weight: float

class biblicus.graph.models.GraphExtractionItemSummary(*, item_id, node_count=0, edge_count=0, status, error_message=None)[source]

Summary record for a single item in a graph extraction snapshot.

Variables:

item_id (str) – Corpus item identifier.
node_count (int) – Number of nodes written for the item.
edge_count (int) – Number of edges written for the item.
status (str) – Result status.
error_message (str or None) – Optional error message.

Parameters:

item_id (Annotated[str, MinLen(min_length=1)])
node_count (Annotated[int, Ge(ge=0)])
edge_count (Annotated[int, Ge(ge=0)])
status (Annotated[str, MinLen(min_length=1)])
error_message (str | None)

edge_count: int

error_message: str | None

item_id: str

model_config = {'extra': 'forbid'}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

node_count: int

status: str

class biblicus.graph.models.GraphExtractionResult(*, schema_version=1, item_id, nodes=<factory>, edges=<factory>, metadata=<factory>)[source]

Graph extraction output for a single item.

Variables:

item_id (str) – Corpus item identifier.
nodes (list[GraphNode]) – Extracted graph nodes.
edges (list[GraphEdge]) – Extracted graph edges.
metadata (dict[str, Any]) – Extractor metadata.

Parameters:

schema_version (Annotated[int, Ge(ge=1)])
item_id (Annotated[str, MinLen(min_length=1)])
nodes (List[GraphNode])
edges (List[GraphEdge])
metadata (Dict[str, Any])

edges: List[GraphEdge]

item_id: str

metadata: Dict[str, Any]

model_config = {'extra': 'forbid'}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

nodes: List[GraphNode]
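To show how a per-item result relates to the per-item summary record, the sketch below derives summary-shaped counts from a result-shaped dictionary. It uses plain dictionaries with the documented field names rather than the real models; the helper name, the status value, and the sample node and edge data are hypothetical.

```python
from typing import Any, Dict


def summarize_result_sketch(result: Dict[str, Any], status: str) -> Dict[str, Any]:
    """Build a GraphExtractionItemSummary-shaped dict from a result-shaped dict."""
    return {
        "item_id": result["item_id"],
        "node_count": len(result.get("nodes", [])),
        "edge_count": len(result.get("edges", [])),
        "status": status,
        "error_message": None,
    }


# Hypothetical extraction output for one corpus item.
sample_result = {
    "schema_version": 1,
    "item_id": "item-1",
    "nodes": [
        {"node_id": "n1", "node_type": "term", "label": "graph", "properties": {}},
        {"node_id": "n2", "node_type": "term", "label": "corpus", "properties": {}},
    ],
    "edges": [
        {"edge_id": "e1", "src": "n1", "dst": "n2", "edge_type": "related", "weight": 1.0, "properties": {}},
    ],
    "metadata": {},
}

item_summary = summarize_result_sketch(sample_result, status="ok")
```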

class biblicus.graph.models.GraphNode(*, schema_version=1, node_id, node_type, label, properties=<factory>)[source]

Node record extracted from a corpus item.

Variables:

node_id (str) – Deterministic node identifier.
node_type (str) – Node type identifier.
label (str) – Human-readable label.
properties (dict[str, Any]) – Node-specific properties.

Parameters:

schema_version (Annotated[int, Ge(ge=1)])
node_id (Annotated[str, MinLen(min_length=1)])
node_type (Annotated[str, MinLen(min_length=1)])
label (Annotated[str, MinLen(min_length=1)])
properties (Dict[str, Any])

label: str

model_config = {'extra': 'forbid'}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

node_id: str

node_type: str

properties: Dict[str, Any]

class biblicus.graph.models.GraphSchemaModel(*, schema_version=1)[source]

Base model for graph extraction schemas with strict validation.

Variables:: schema_version (int) – Graph schema version.
Parameters:: schema_version (Annotated[int, Ge(ge=1)])

model_config = {'extra': 'forbid'}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

schema_version: int

class biblicus.graph.models.GraphSnapshotListEntry(*, extractor_id, snapshot_id, graph_id, configuration_id, configuration_name, catalog_generated_at, created_at, stats=<factory>)[source]

Summary entry for a graph extraction snapshot stored in a corpus.

Variables:

extractor_id (str) – Graph extractor identifier.
snapshot_id (str) – Graph snapshot identifier.
graph_id (str) – Deterministic graph identifier.
configuration_id (str) – Deterministic configuration identifier.
configuration_name (str) – Human-readable configuration name.
catalog_generated_at (str) – Catalog timestamp used for the snapshot.
created_at (str) – International Organization for Standardization 8601 timestamp for snapshot creation.
stats (dict[str, Any]) – Snapshot statistics.

Parameters:

extractor_id (Annotated[str, MinLen(min_length=1)])
snapshot_id (Annotated[str, MinLen(min_length=1)])
graph_id (Annotated[str, MinLen(min_length=1)])
configuration_id (Annotated[str, MinLen(min_length=1)])
configuration_name (Annotated[str, MinLen(min_length=1)])
catalog_generated_at (Annotated[str, MinLen(min_length=1)])
created_at (Annotated[str, MinLen(min_length=1)])
stats (Dict[str, object])

catalog_generated_at: str

configuration_id: str

configuration_name: str

created_at: str

extractor_id: str

graph_id: str

model_config = {'extra': 'forbid'}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

snapshot_id: str

stats: Dict[str, object]

class biblicus.graph.models.GraphSnapshotManifest(*, snapshot_id, graph_id, configuration, corpus_uri, catalog_generated_at, extraction_snapshot, created_at, stats=<factory>)[source]

Immutable record describing a graph extraction snapshot.

Variables:

snapshot_id (str) – Unique snapshot identifier.
graph_id (str) – Deterministic graph identifier.
configuration (GraphConfigurationManifest) – Configuration manifest for this snapshot.
corpus_uri (str) – Canonical uniform resource identifier for the corpus root.
catalog_generated_at (str) – Catalog timestamp used for the snapshot.
extraction_snapshot (str) – Extraction snapshot reference.
created_at (str) – International Organization for Standardization 8601 timestamp for snapshot creation.
stats (dict[str, Any]) – Snapshot statistics.

Parameters:

snapshot_id (str)
graph_id (str)
configuration (GraphConfigurationManifest)
corpus_uri (str)
catalog_generated_at (str)
extraction_snapshot (str)
created_at (str)
stats (Dict[str, Any])

catalog_generated_at: str

configuration: GraphConfigurationManifest

corpus_uri: str

created_at: str

extraction_snapshot: str

graph_id: str

model_config = {'extra': 'forbid'}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

snapshot_id: str

stats: Dict[str, Any]

class biblicus.graph.models.GraphSnapshotReference(*, extractor_id, snapshot_id)[source]

Reference to a graph extraction snapshot.

Variables:

extractor_id (str) – Graph extractor identifier.
snapshot_id (str) – Graph snapshot identifier.

Parameters:

extractor_id (Annotated[str, MinLen(min_length=1)])
snapshot_id (Annotated[str, MinLen(min_length=1)])

as_string()[source]

Serialize the reference as a single string.

Returns:: Reference in the form extractor_id:snapshot_id.
Return type:: str

extractor_id: str

model_config = {'extra': 'forbid'}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

snapshot_id: str

biblicus.graph.models.parse_graph_snapshot_reference(value)[source]

Parse a graph snapshot reference in the form extractor_id:snapshot_id.

Parameters:: value (str) – Raw reference string.
Returns:: Parsed graph snapshot reference.
Return type:: GraphSnapshotReference
Raises:: ValueError – If the reference is not well formed.
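The extractor_id:snapshot_id form can be parsed and serialized symmetrically. The following sketch illustrates the round trip with a standard-library dataclass in place of the Pydantic model. Splitting at the first colon is an assumption about how identifiers containing colons would be handled, and the names SnapshotReferenceSketch and parse_reference_sketch are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnapshotReferenceSketch:
    """Stand-in for GraphSnapshotReference."""

    extractor_id: str
    snapshot_id: str

    def as_string(self) -> str:
        """Serialize as extractor_id:snapshot_id."""
        return f"{self.extractor_id}:{self.snapshot_id}"


def parse_reference_sketch(value: str) -> SnapshotReferenceSketch:
    """Parse extractor_id:snapshot_id, splitting at the first colon."""
    extractor_id, separator, snapshot_id = value.partition(":")
    if not separator or not extractor_id or not snapshot_id:
        raise ValueError(f"Expected extractor_id:snapshot_id, got {value!r}")
    return SnapshotReferenceSketch(extractor_id=extractor_id, snapshot_id=snapshot_id)


reference = parse_reference_sketch("example-extractor:snapshot-1")
round_trip = reference.as_string()

# A string without a separator is rejected.
try:
    parse_reference_sketch("missing-separator")
    malformed_rejected = False
except ValueError:
    malformed_rejected = True
```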

Neo4j graph storage helpers for Biblicus.

class biblicus.graph.neo4j.Neo4jSettings(uri, username, password, database, auto_start, container_name, docker_image, http_port, bolt_port)[source]

Configuration values for Neo4j connectivity and lifecycle.

Variables:

uri – Neo4j connection URI.
username – Neo4j username.
password – Neo4j password.
database – Optional Neo4j database name.
auto_start – Whether to auto-start Neo4j via Docker.
container_name – Docker container name for auto-start.
docker_image – Docker image for auto-start.
http_port – HTTP port for Neo4j UI.
bolt_port – Bolt port for Neo4j driver connections.

Parameters:

uri (str)
username (str)
password (str)
database (str | None)
auto_start (bool)
container_name (str)
docker_image (str)
http_port (int)
bolt_port (int)

auto_start: bool

bolt_port: int

container_name: str

database: str | None

docker_image: str

http_port: int

password: str

uri: str

username: str

biblicus.graph.neo4j.create_neo4j_driver(settings)[source]

Create a Neo4j driver, waiting for availability when auto-start is enabled.

Parameters:: settings (Neo4jSettings) – Resolved Neo4j settings.
Returns:: Neo4j driver instance.
Return type:: neo4j.Driver
Raises:: ValueError – If the Neo4j driver dependency is missing.

biblicus.graph.neo4j.ensure_neo4j_running(settings)[source]

Ensure the Neo4j container is running when auto-start is enabled.

Parameters:: settings (Neo4jSettings) – Resolved Neo4j settings.
Returns:: None.
Return type:: None
Raises:: ValueError – If Docker is unavailable or the container cannot be started.

biblicus.graph.neo4j.resolve_neo4j_settings(*, config=None)[source]

Resolve Neo4j settings from environment or user configuration.

Parameters:: config (BiblicusUserConfig or None) – Optional pre-loaded user configuration.
Returns:: Resolved Neo4j settings.
Return type:: Neo4jSettings

biblicus.graph.neo4j.write_graph_records(*, driver, settings, corpus_id, graph_id, extraction_snapshot, item_id, nodes, edges)[source]

Persist graph nodes and edges to Neo4j.

Parameters:

driver (neo4j.Driver) – Neo4j driver instance.
settings (Neo4jSettings) – Resolved Neo4j settings.
corpus_id (str) – Corpus identifier.
graph_id (str) – Graph identifier.
extraction_snapshot (str) – Extraction snapshot reference.
item_id (str) – Corpus item identifier.
nodes (Iterable[biblicus.graph.models.GraphNode]) – Iterable of graph nodes.
edges (Iterable[biblicus.graph.models.GraphEdge]) – Iterable of graph edges.

Returns:

None.

Return type:

None

Graph extractor registry for Biblicus.

biblicus.graph.extractors.available_graph_extractors()[source]

Return the registered graph extractors.

Returns:: Mapping of extractor identifiers to extractor classes.
Return type:: dict[str, Type[GraphExtractor]]

biblicus.graph.extractors.get_graph_extractor(extractor_id)[source]

Instantiate a graph extractor by identifier.

Parameters:: extractor_id (str) – Graph extractor identifier.
Returns:: Graph extractor instance.
Return type:: GraphExtractor
Raises:: KeyError – If the extractor identifier is unknown.
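The registry described above maps extractor identifiers to classes and instantiates one on request. A minimal sketch of that shape, under the assumption that extractors can be constructed without arguments, might look like this. All class names, the identifier example-cooccurrence, and the function names are hypothetical stand-ins rather than Biblicus code.

```python
from typing import Dict, Type


class GraphExtractorSketch:
    """Hypothetical base class standing in for GraphExtractor."""

    extractor_id = "base"


class CooccurrenceSketch(GraphExtractorSketch):
    """Hypothetical concrete extractor."""

    extractor_id = "example-cooccurrence"


_REGISTRY: Dict[str, Type[GraphExtractorSketch]] = {
    CooccurrenceSketch.extractor_id: CooccurrenceSketch,
}


def available_sketch() -> Dict[str, Type[GraphExtractorSketch]]:
    """Return a copy so callers cannot mutate the registry."""
    return dict(_REGISTRY)


def get_sketch(extractor_id: str) -> GraphExtractorSketch:
    """Instantiate the registered class, raising KeyError for unknown identifiers."""
    try:
        extractor_class = _REGISTRY[extractor_id]
    except KeyError:
        raise KeyError(f"Unknown graph extractor: {extractor_id!r}") from None
    return extractor_class()


extractor = get_sketch("example-cooccurrence")

try:
    get_sketch("not-registered")
    unknown_rejected = False
except KeyError:
    unknown_rejected = True
```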
