Biblicus
========

.. image:: _static/Biblicus-logo.png
   :alt: Biblicus logo
   :align: right
   :width: 216
   :class: docs-logo

You have a big collection of files. You need answers about what is in it, and you want to be able to show it to an AI agent to talk about it.

Biblicus is a Python toolkit that turns unstructured data into something you can manage, search, analyze, and reuse. It supports end-to-end pipelines: ingest and extract content from mixed file types, transform text with reusable AI utilities, retrieve evidence into model-friendly context, and evaluate results so improvements are measurable.

Common, believable scenarios look like this:

- You inherit a huge archive of emails and need to build an assistant that can answer questions while protecting sensitive details.
- You receive a legal discovery dump of scanned PDFs and need to search, summarize, and trace evidence across thousands of pages.
- You need to make a policy or rules folder usable inside an AI system even though the source materials are far larger than any model context window.

If you have ever built a one-off script to “just get this data into shape”, Biblicus is the version of that work you can keep: structured, repeatable, and designed for experimentation. The project emphasizes evidence-first results, explicit retrieval stages, and evaluation, so you can defend what was retrieved, from where, and why.

Try It
------

If you want to decide quickly whether Biblicus is useful, run a tutorial. Each tutorial is a small script that produces concrete output, and each one is covered by behavior specs so we do not publish “works on my machine” examples.

Three good starting points:

**Notes to context**

Take a handful of short notes and turn them into a token-budgeted context pack you can paste into a model request. See :doc:`Notes to Context Pack`.

.. code-block:: bash

   python scripts/use_cases/notes_to_context_pack_demo.py \
       --corpus corpora/tutorial_notes_to_context_pack \
       --force

**Folder search**

Ingest a folder of text files, build a deterministic index, and retrieve evidence with provenance. See :doc:`Folder Search With Extraction`.

.. code-block:: bash

   python scripts/use_cases/text_folder_search_demo.py \
       --corpus corpora/tutorial_text_folder_search \
       --force

**Sensitive text marking (mock + real API)**

Mark sensitive spans in a document using an agentic text utility. Run it in deterministic mock mode first, then switch to a real model when you are ready. See :doc:`Mark Sensitive Text for Redaction`.

.. code-block:: bash

   python scripts/use_cases/text_redact_demo.py \
       --corpus corpora/tutorial_text_redact \
       --force \
       --mock

If your data has an inherent sequence (threads, transcripts, recurring phases), Markov analysis adds a sequence model on top of segmented text. See :doc:`Sequence Graph With Markov Analysis`.

.. toctree::
   :maxdepth: 1
   :caption: Tutorials

   use-cases

Learn the Concepts
------------------

If you want the mental model before running anything, start with data collections and extraction, then move into retrieval and analysis. The rest of the documentation expands those foundations into deeper evaluation and tooling.

Biblicus uses a small set of domain terms in its docs. The most important one is *corpus* (plural: *corpora*): a managed folder on disk that holds raw items plus lightweight metadata.
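If it helps to make that concrete: a corpus is an ordinary directory, so once a tutorial has created one you can look at it with standard shell tools. The command below is illustrative only; the exact on-disk layout is documented in the corpus pages rather than reproduced here.

.. code-block:: bash

   # Illustrative only: inspect the corpus directory used by the folder
   # search tutorial above. The precise file layout is described in the
   # corpus documentation.
   ls -R corpora/tutorial_text_folder_search
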
.. toctree::
   :maxdepth: 1
   :caption: Concepts

   corpus
   extraction
   retrieval
   analysis
   graph-extraction

Core Building Blocks
--------------------

These pages define the vocabulary and invariants that keep Biblicus stable across retrievers and configurations. Read them when you want to understand *what is a corpus*, *what is evidence*, and *what stays the same even when the pipeline changes*.

.. toctree::
   :maxdepth: 2
   :caption: Core Building Blocks

   corpus-design
   corpus
   knowledge-base
   backends
   backends/index
   context-pack
   context-engine

Extraction and Ingestion
------------------------

Extraction turns raw files into usable text. Biblicus supports pluggable extractors so you can mix plain text handling with OCR, document parsers, and speech-to-text pipelines.

.. toctree::
   :maxdepth: 2
   :caption: Extraction and Ingestion

   extraction
   extraction-evaluation
   extractors/index
   stt

Retrieval and Evaluation
------------------------

Retrieval is how you move from a large corpus to a compact, relevant evidence set. These pages cover baseline retrieval, hybrid strategies, and how to evaluate retrieval quality.

.. toctree::
   :maxdepth: 2
   :caption: Retrieval and Evaluation

   retrieval
   retrieval-quality
   retrieval-evaluation
   embedding-retrieval
   chunking

Analysis and Modeling
---------------------

Analysis tools help you find structure inside large text corpora. Topic modeling provides a first pass at clustering themes. Markov analysis (Hidden Markov Models) adds sequence modeling to detect recurring phases in longer documents or conversations.

.. toctree::
   :maxdepth: 2
   :caption: Analysis and Modeling

   profiling
   topic-modeling
   markov-analysis

Toolbox
-------

Reusable building blocks support common transformations in information pipelines. They are designed to be simple to invoke while enabling sophisticated behavior under the hood, so you can slot them into ETL-like workflows without building a custom agent every time.

.. toctree::
   :maxdepth: 2
   :caption: Tools

   utilities
   text-utilities
   entity-removal
   text-extract
   text-slice
   text-annotate
   text-redact
   text-link

Operations and Demos
--------------------

Operational docs help you run the system, configure it, and reproduce examples. The demo guides are designed to be runnable end-to-end and serve as acceptance tests.

.. toctree::
   :maxdepth: 2
   :caption: Operations and Demos

   demos
   user-configuration
   testing

Reference
---------

Reference material, design notes, and the API index live here. Use this section for deeper implementation details or when you want a catalog of features.

.. toctree::
   :maxdepth: 1
   :caption: Reference

   feature-index
   roadmap
   architecture
   api