# Biblicus Architecture

Biblicus sits between raw, unstructured data and the moment you need reliable answers from it.
It is built for teams who receive large, messy corpora and must extract usable signals without
losing provenance or reproducibility. Retrieval-augmented generation is one use case, but the
system is broader than chatbots: it supports any pipeline that needs structured insight from
unstructured data.

At a high level the system does five things:

1. **Ingests** raw content into a corpus with minimal friction.
2. **Extracts** text from diverse media (documents, images, audio).
3. **Transforms** and annotates text with reusable LLM utilities.
4. **Retrieves** evidence through explicit, reproducible stages.
5. **Evaluates** results so improvements are measurable, not anecdotal.

The guiding idea is that every retrieval produces **evidence**: structured outputs with scores
and provenance that can be inspected, audited, and reused. Context packs, summaries, and downstream
generation are all derived from that evidence.

## Core Concepts

- **Corpus**: a named, mutable collection rooted at a path or uniform resource identifier. In
  version zero it is typically a local folder containing raw files plus a `metadata/` directory
  for minimal metadata.
- **Item**: the unit of ingestion in a corpus: raw bytes of any modality, including text, images,
  Portable Document Format documents, audio, and video, plus optional metadata and provenance.
- **Knowledge base backend**: an implementation that can ingest and retrieve from a corpus, such
  as scan, full text search, vector retrieval, or hybrid retrieval, exposed to procedures through
  retrieval primitives.
- **Retrieval configuration**: a named configuration bundle for a backend, such as chunking rules,
  embedding model and version, hybrid weights, reranker choice, and filters. This is what we
  benchmark and compare.
- **Configuration manifest**: a reproducibility record describing the backend and configuration parameters,
  plus any referenced snapshot artifacts and build snapshots.
- **Snapshot artifacts**: optional, persisted representations derived from raw content for a given
  configuration and backend, such as chunks, embeddings, or indexes. Some backends intentionally have
  none and operate on demand.
- **Evidence**: structured retrieval output from backend queries. Evidence includes spans, scores,
  and provenance used by downstream retrieval augmented generation procedures.
- **Pipeline stage / editorial layer**: a structured stage that transforms, filters, extracts, or
  curates content, such as raw, curated, and published, or extract text from Portable Document
  Format documents.

## Design Principles

- **Primitives + derived constructs**: keep the protocol surface small and composable; ship
  higher-level helpers and example procedures on top.
- **Composability definition**: composable means each stage has a small input and output contract,
  so you can connect stages in different orders without rewriting them.
- **Minimal opinion raw store**: raw ingestion should work for a folder of files with optional
  lightweight tagging.
- **Reproducibility by default**: comparisons require manifests (even when there are no persisted
  snapshot artifacts).
- **Mutability is real**: corpora are edited, pruned, and reorganized; re-indexing must be a core
  workflow.
- **Separation of concerns**: retrieval returns evidence; retrieval-augmented generation patterns
  live in Tactus procedures (not inside the knowledge base backend).
- **Deployment flexibility**: same interface across local/offline, brokered external services, and
  hybrid environments.
- **Evidence is the primary output**: every retrieval returns structured evidence; everything else
  is a derived helper.

## The Python Developer Mental Model

If this system is pleasant to use, a Python developer should be able to describe intent with the
core nouns:

- I have a **corpus** at this path or uniform resource identifier.
- I ingest an **item** with optional **metadata**.
- I rebuild the derived **index** after edits.
- I run a **configuration** against the same corpus.
- I query and receive **evidence**.

Anything that does not map cleanly to these nouns is either a derived helper or a backend-specific
implementation detail that should not leak.

## Evidence Lifecycle

Evidence flows through explicit stages and remains inspectable at every step:

1. **Retrieval**: backends return evidence with `stage` labels and scores.
2. **Processing**: optional reranking or filtering updates scores while preserving provenance.
3. **Context shaping**: context packs select and format evidence into model-ready text.
4. **Evaluation**: evaluation datasets compare evidence rankings to expectations.

At each stage, the output remains a structured object, so you can inspect, store, and compare
runs without re-running the entire pipeline.

## Relationship to Agent Frameworks

Biblicus integrates with agent frameworks through explicit tool interfaces. It does not hide
retrieval inside the model. Instead, it provides repeatable pipelines that expose *what* was
retrieved and *why*, so models can use evidence directly and safely.

- **Tools and toolsets**, including the Model Context Protocol, are the primary capability
  boundary.
- **Sandboxing and brokered or secretless execution** are primary deployment modes.
- **Durability and evaluations** are central: invariants via specifications, quality via
  evaluations.

## Where to go next

- Start with **corpus.md** and **extraction.md** to understand how raw content is ingested.
- Move to **retrieval.md** and **retrieval-evaluation.md** to see how evidence is produced and tested.
- Explore **topic-modeling.md** and **markov-analysis.md** if you need higher-level analysis tools.
- See **text-utilities.md** for reusable, AI-assisted text transformations.