Biblicus
You have a big collection of files. You need answers about what is in it, and you want to be able to show it to an AI agent to talk about it.
Biblicus is a Python toolkit that turns unstructured data into something you can manage, search, analyze, and reuse. It supports end-to-end pipelines: ingest and extract content from mixed file types, transform text with reusable AI utilities, retrieve evidence into model-friendly context, and evaluate results so improvements are measurable.
Common, believable scenarios look like this:
You inherit a huge archive of emails and need to build an assistant that can answer questions while protecting sensitive details.
You receive a legal discovery dump of scanned PDFs and need to search, summarize, and trace evidence across thousands of pages.
You need to make a policy or rules folder usable inside an AI system even though the source materials are far larger than any model context window.
If you have ever built a one-off script to “just get this data into shape”, Biblicus is the version of that work you can keep: structured, repeatable, and designed for experimentation.
The project emphasizes evidence-first results, explicit retrieval stages, and evaluation so you can defend what was retrieved, from where, and why.
Try It
If you want to decide quickly whether Biblicus is useful, run a tutorial. Each tutorial is a small script that produces concrete output, and each one is covered by behavior specs so we do not publish “works on my machine” examples.
Three good starting points:
Notes to context
Take a handful of short notes and turn them into a token-budgeted context pack you can paste into a model request.
python scripts/use_cases/notes_to_context_pack_demo.py \
--corpus corpora/tutorial_notes_to_context_pack \
--force
Folder search
Ingest a folder of text files, build a deterministic index, and retrieve evidence with provenance.
See Folder Search With Extraction.
python scripts/use_cases/text_folder_search_demo.py \
--corpus corpora/tutorial_text_folder_search \
--force
Sensitive text marking (mock + real API)
Mark sensitive spans in a document using an agentic text utility. Run it in deterministic mock mode first, then switch to a real model when you are ready.
See Mark Sensitive Text for Redaction.
python scripts/use_cases/text_redact_demo.py \
--corpus corpora/tutorial_text_redact \
--force \
--mock
If your data has an inherent sequence (threads, transcripts, recurring phases), Markov analysis adds a sequence model on top of segmented text. See Sequence Graph With Markov Analysis.
Tutorials
Learn the Concepts
If you want the mental model before running anything, start with data collections and extraction, then move into retrieval and analysis. The rest of the documentation expands those foundations into deeper evaluation and tooling.
Biblicus uses a small set of domain terms in its docs. The most important one is corpus (plural: corpora): a managed folder on disk that holds raw items plus lightweight metadata.
Core Building Blocks
These pages define the vocabulary and invariants that keep Biblicus stable across retrievers and configurations. Read them when you want to understand what is a corpus, what is evidence, and what stays the same even when the pipeline changes.
Core Building Blocks
- Corpus design
- What exists today
- Core vocabulary for this document
- Day to day corpus workflows
- Decision points with options and recommendations
- Locked decisions
- Lifecycle hooks and where plugins can attach
- Outcomes and remaining questions
- First behavior driven development slices implemented in version zero
- Reproducibility checklist
- Common pitfalls
- Knowledge base
- Adding a Retrieval Backend
- Retrieval Backends
- Scan Backend
- SQLite Full-Text Search Backend
- TF Vector backend
- Embedding index (in-memory)
- Embedding index (file-backed)
- Available Backends
- Quick Start
- Choosing a Backend
- Reproducibility checklist
- Common pitfalls
- Performance Comparison
- Common Patterns
- Backend Configuration
- Architecture
- Implementing Custom Backends
- See Also
- Context packs
- Context Engine
Extraction and Ingestion
Extraction turns raw files into usable text. Biblicus supports pluggable extractors so you can mix plain text handling with OCR, document parsers, and speech-to-text pipelines.
Extraction and Ingestion
- Text Extraction Pipeline
- What extraction produces
- Reproducibility checklist
- Available Extractors
- How selection chooses text
- Pipeline extractor
- Complementary versus competing extractors
- Example: extract from a corpus
- Example: selection within a pipeline
- Example: PDF with OCR fallback
- Example: VLM for complex documents
- Inspecting and deleting extraction snapshots
- Common pitfalls
- Use extracted text in retrieval
- Evaluate extraction quality
- What extraction is not
- Extraction evaluation
- Text Extractors
Retrieval and Evaluation
Retrieval is how you move from a large corpus to a compact, relevant evidence set. These pages cover baseline retrieval, hybrid strategies, and how to evaluate retrieval quality.
Retrieval and Evaluation
Analysis and Modeling
Analysis tools help you find structure inside large text corpora. Topic modeling provides a first pass at clustering themes. Markov analysis (Hidden Markov Models) adds sequence modeling to detect recurring phases in longer documents or conversations.
Analysis and Modeling
Toolbox
Reusable building blocks support common transformations in information pipelines. They are designed to be simple to invoke while enabling sophisticated behavior under the hood, so you can slot them into ETL-like workflows without building a custom agent every time.
Operations and Demos
Operational docs help you run the system, configure it, and reproduce examples. The demo guides are designed to be runnable end-to-end and serve as acceptance tests.
Operations and Demos
Reference
Reference material, design notes, and the API index live here. Use this section for deeper implementation details or when you want a catalog of features.