SQLite Full-Text Search Backend

Backend ID: sqlite-full-text-search

Category: Retrieval Backends

Overview

The SQLite full-text search backend provides production-ready retrieval using SQLite’s FTS5 extension with the BM25 ranking algorithm. It builds a persistent index with configurable chunking and delivers fast queries even on large corpora.

This backend is ideal for production applications, offering sub-second query times after a one-time indexing step. It supports text chunking with overlap for better context matching and uses BM25 scoring for relevance ranking.

Installation

The backend is included with Biblicus by default:

pip install biblicus

Requirements

  • Python: 3.9+

  • SQLite: with FTS5 enabled (true of the SQLite bundled with most Python builds)

Verify FTS5 Support

import sqlite3

conn = sqlite3.connect(':memory:')
try:
    # Creating an FTS5 virtual table fails if the extension is not compiled in
    conn.execute("CREATE VIRTUAL TABLE test USING fts5(content)")
    print("FTS5 is available")
except sqlite3.OperationalError:
    print("FTS5 is NOT available")
finally:
    conn.close()

If FTS5 is not available, you may need to rebuild Python against a SQLite build compiled with FTS5 enabled.

When to Use

Good Use Cases

  • Production applications: Fast, consistent query performance

  • Medium to large corpora (1,000+ items): Essential for acceptable query times

  • Repeated queries: Persistent index avoids re-scanning

  • Better ranking: the BM25 algorithm provides high-quality relevance scores

  • Local deployment: No external services required

Configuration

Config Schema

from typing import Optional

from pydantic import BaseModel

class SqliteFullTextSearchRecipeConfig(BaseModel):
    chunk_size: int = 800                       # Maximum characters per chunk
    chunk_overlap: int = 200                    # Overlap between consecutive chunks
    snippet_characters: int = 400               # Maximum snippet length
    extraction_snapshot: Optional[str] = None   # Extraction snapshot reference

Configuration Options

Option               Type  Default  Description
chunk_size           int   800      Maximum characters per chunk (must be >= 1)
chunk_overlap        int   200      Overlap characters between chunks (must be < chunk_size)
snippet_characters   int   400      Maximum characters in evidence snippets
extraction_snapshot  str   None     Optional extraction snapshot reference (extractor_id:snapshot_id)

Chunking Strategy

Text is split into overlapping chunks:

Text: "ABCDEFGHIJ"
chunk_size: 6
chunk_overlap: 2

Chunks:
  [ABCDEF]
      [EFGHIJ]

Why chunking?

  • Improves match granularity for large documents

  • Provides better context in snippets

  • Enables ranking at sub-document level

Tuning tips:

  • Larger chunks: more context per match and fewer chunks to index, but coarser ranking

  • Smaller chunks: more precise matching and ranking, at the cost of more chunks to store

  • Overlap: prevents matches from being split across chunk boundaries (see the sketch below)
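
A minimal sketch of the overlap scheme above, assuming simple character-based splitting with a fixed stride (the backend’s actual splitter may handle boundaries differently):

def chunk_text(text, chunk_size=800, chunk_overlap=200):
    """Split text into overlapping chunks, keeping each chunk's start offset."""
    assert chunk_overlap < chunk_size, "chunk_overlap must be smaller than chunk_size"
    stride = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the final chunk reaches the end of the text
    return chunks

chunk_text("ABCDEFGHIJ", chunk_size=6, chunk_overlap=2)
# -> [(0, 'ABCDEF'), (4, 'EFGHIJ')]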

Usage

Command Line

Basic Usage

# Build FTS index (creates SQLite database)
biblicus build my-corpus --backend sqlite-full-text-search

# Query the index
biblicus query my-corpus --query "search terms"

Custom Configuration

# Larger chunks with more overlap
biblicus build my-corpus --backend sqlite-full-text-search \
  --config chunk_size=1200 \
  --config chunk_overlap=300 \
  --config snippet_characters=600

# With extraction snapshot
biblicus build my-corpus --backend sqlite-full-text-search \
  --config extraction_snapshot=pdf-text:abc123

Configuration File

backend_id: sqlite-full-text-search
configuration_name: "Production FTS index"
config:
  chunk_size: 800
  chunk_overlap: 200
  snippet_characters: 400
  extraction_snapshot: null

biblicus build my-corpus --configuration configuration.yml

Python API

from biblicus import Corpus
from biblicus.backends import get_backend
from biblicus.models import QueryBudget

# Load corpus
corpus = Corpus.from_directory("my-corpus")

# Get backend
backend = get_backend("sqlite-full-text-search")

# Build index
run = backend.build_run(
    corpus,
    configuration_name="Production index",
    config={
        "chunk_size": 800,
        "chunk_overlap": 200
    }
)

print(f"Built index: {run.snapshot_id}")
print(f"Stats: {run.stats}")

# Query
result = backend.query(
    corpus,
    run=run,
    query_text="machine learning",
    budget=QueryBudget(max_total_items=10)
)

# Access evidence
for evidence in result.evidence:
    print(f"\n{evidence.item_id} (score: {evidence.score:.2f})")
    print(f"  {evidence.text[:100]}...")

With Extraction Runs

# Extract text from PDFs
extraction_snapshot = corpus.extract_text(
    extractor_id="pdf-text"
)

# Build FTS with extraction
run = backend.build_run(
    corpus,
    configuration_name="FTS with PDF extraction",
    config={
        "extraction_snapshot": f"pdf-text:{extraction_snapshot.snapshot_id}",
        "chunk_size": 1000
    }
)

How It Works

Index Building

  1. Load corpus catalog: Read all item metadata

  2. Create SQLite database: Initialize FTS5 virtual table

  3. Process items: For each text item:

    • Load text content (raw or extracted)

    • Split into overlapping chunks

    • Insert chunks into FTS5 table with metadata

  4. Commit: Write database to disk

  5. Record stats: Count items, chunks, bytes
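
As a rough illustration of steps 2-4, the standard library’s sqlite3 module is enough to create the FTS5 table (using the schema shown under Index Artifacts) and bulk-insert chunk rows. This is a sketch, not the backend’s actual code:

import sqlite3

def build_index(db_path, chunk_rows):
    """chunk_rows: (content, item_id, source_uri, media_type,
    relpath, title, start_offset, end_offset) tuples."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE VIRTUAL TABLE chunks_full_text_search USING fts5("
        "content, item_id UNINDEXED, source_uri UNINDEXED, media_type UNINDEXED, "
        "relpath UNINDEXED, title UNINDEXED, start_offset UNINDEXED, end_offset UNINDEXED)"
    )
    conn.executemany(
        "INSERT INTO chunks_full_text_search VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        chunk_rows,
    )
    conn.commit()  # step 4: persist the index to disk
    conn.close()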

Query Processing

  1. Parse query: SQLite FTS5 query syntax

  2. Execute FTS5 search: Use BM25 ranking

  3. Fetch candidates: Retrieve top N chunks (5x budget)

  4. Extract snippets: Truncate chunks to snippet length

  5. Sort by score: Rank evidence by BM25 score

  6. Apply budget: Select top items according to budget
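
Steps 2-4 map onto a single FTS5 statement. The sketch below assumes the chunks_full_text_search schema documented under Index Artifacts and mirrors the 5x candidate over-fetch described above; it is illustrative, not the backend’s internals:

import sqlite3

def fetch_candidates(db_path, query_text, max_total_items, snippet_characters=400):
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT item_id, content, bm25(chunks_full_text_search) AS score "
        "FROM chunks_full_text_search "
        "WHERE chunks_full_text_search MATCH ? "
        "ORDER BY bm25(chunks_full_text_search) "  # FTS5 bm25() is lower-is-better
        "LIMIT ?",
        (query_text, 5 * max_total_items),  # over-fetch 5x the budget
    ).fetchall()
    conn.close()
    # Truncate each chunk to the snippet length (step 4)
    return [(item_id, text[:snippet_characters], score) for item_id, text, score in rows]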

BM25 Ranking

SQLite FTS5 uses the BM25 algorithm:

  • Term frequency (TF): How often query terms appear in chunk

  • Inverse document frequency (IDF): Rarity of terms in corpus

  • Document length normalization: Adjusts for chunk size

BM25 provides better ranking than simple term frequency by considering term rarity and document length.
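
For intuition, this is the textbook single-term Okapi BM25 contribution; SQLite’s built-in bm25() uses this formulation with k1 = 1.2 and b = 0.75, reported with the sign flipped so that lower values mean better matches:

import math

def bm25_term(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Score contribution of one query term to one chunk."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))      # rare terms weigh more
    norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)  # length normalization
    return idf * tf * (k1 + 1) / norm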

Performance

Build Time

  • O(n): Linear in corpus size

  • ~5-10 seconds for 1,000 items

  • ~50-100 seconds for 10,000 items

  • Depends on item size and chunk settings

Query Time

  • O(log n): Logarithmic with index

  • <100ms for most queries on a 10,000-item corpus (see the timing sketch below)

  • Faster than scan by 100-1000x for large corpora
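
To check these numbers on your own corpus, a rough timing harness around the Python API from the Usage section is enough:

import time

start = time.perf_counter()
result = backend.query(
    corpus,
    run=run,
    query_text="machine learning",
    budget=QueryBudget(max_total_items=10),
)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"{len(result.evidence)} results in {elapsed_ms:.1f} ms")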

Memory Usage

  • Moderate: SQLite index held in memory during queries

  • ~1-5 MB per 1,000 text items

  • Configurable SQLite cache size

Disk Usage

  • ~1-5 MB per 1,000 items: Depends on text density and chunking

  • Stored in retrieval/sqlite-full-text-search/<snapshot_id>/<snapshot_id>.sqlite

Examples

Production Deployment

# Build index once
biblicus init prod-corpus
biblicus ingest prod-corpus documents/
biblicus extract prod-corpus --extractor pdf-text
biblicus build prod-corpus --backend sqlite-full-text-search \
  --config extraction_snapshot=pdf-text:latest

# Fast repeated queries
biblicus query prod-corpus --query "customer retention"
biblicus query prod-corpus --query "revenue model"
biblicus query prod-corpus --query "market analysis"

Tuned for Large Documents

# Larger chunks for academic papers
backend = get_backend("sqlite-full-text-search")

run = backend.build_run(
    corpus,
    configuration_name="Academic papers index",
    config={
        "chunk_size": 1500,      # Larger chunks
        "chunk_overlap": 400,    # More overlap
        "snippet_characters": 800  # Larger snippets
    }
)

Multi-Format Corpus

# Extract from multiple sources
biblicus extract corpus --extractor select-text \
  --config extractors='["pdf-text","markitdown","ocr-rapidocr"]'

# Build unified index
biblicus build corpus --backend sqlite-full-text-search \
  --config extraction_snapshot=select-text:latest

Query with Context

result = backend.query(
    corpus,
    run=run,
    query_text="machine learning applications",
    budget=QueryBudget(
        max_total_items=5,
        max_items_per_source=2
    )
)

# Evidence includes span information
for evidence in result.evidence:
    if evidence.span_start is not None:
        print(f"Match at offset {evidence.span_start}-{evidence.span_end}")
    print(f"  {evidence.text}")

Advanced Configuration

SQLite Query Syntax

FTS5 supports advanced query operators:

# Phrase search
biblicus query corpus --query '"machine learning"'

# Boolean AND
biblicus query corpus --query 'machine AND learning'

# Boolean OR
biblicus query corpus --query 'machine OR artificial'

# Boolean NOT
biblicus query corpus --query 'learning NOT supervised'

# Prefix matching
biblicus query corpus --query 'comput*'

# Column-specific (if indexed)
biblicus query corpus --query 'content:machine'

Rebuilding Indexes

The index must be rebuilt when:

  • Corpus content changes

  • New items ingested

  • Extraction runs updated

# Delete old run
biblicus extract delete corpus --run sqlite-full-text-search:old_id --confirm sqlite-full-text-search:old_id

# Build new run
biblicus build corpus --backend sqlite-full-text-search
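
In Python, a rebuild is simply another build_run call. The sketch below also removes the superseded snapshot directory, using the path layout documented under Index Artifacts; old_snapshot_id is a placeholder:

import shutil
from pathlib import Path

old_snapshot_id = "..."  # placeholder: id of the snapshot being replaced

# Build the replacement index first, so queries never lose coverage
new_run = backend.build_run(
    corpus,
    configuration_name="Rebuilt index",
    config={"chunk_size": 800, "chunk_overlap": 200},
)

# Then reclaim disk from the old snapshot
old_dir = Path("my-corpus/retrieval/sqlite-full-text-search") / old_snapshot_id
if old_dir.exists():
    shutil.rmtree(old_dir)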

Limitations

Query Features

  • No semantic/embedding-based search

  • No wildcard in middle of words (only prefix: word*)

  • No fuzzy matching (typo tolerance)

  • No field weighting (all text treated equally)

Scalability

  • Single-machine only (no distributed search)

  • Index size grows with corpus size

  • Rebuild required for content updates

Ranking

  • BM25 is keyword-based (not semantic)

  • No learning-to-rank or custom scoring

  • Fixed ranking algorithm

When to Upgrade

Consider external search engines when:

  • Corpus exceeds 100,000 items

  • Need distributed/clustered search

  • Require semantic/vector search

  • Need real-time index updates

  • Want advanced ranking (learning-to-rank)

External options: Elasticsearch, OpenSearch, Meilisearch, Typesense

Error Handling

FTS5 Not Available

RuntimeError: SQLite full-text search version five is required but not available

Fix: Rebuild Python against a SQLite build compiled with FTS5, or use pre-built Python binaries that include it.

Missing Extraction Run

FileNotFoundError: Missing extraction snapshot: pdf-text:abc123

Fix: Verify extraction snapshot exists or run extraction first.

Invalid Configuration

ValueError: chunk_overlap must be smaller than chunk_size

Fix: Ensure chunk_overlap < chunk_size.
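
All three failures surface as ordinary Python exceptions, so a build can guard against them explicitly. A sketch, assuming the Python API shown under Usage:

try:
    run = backend.build_run(
        corpus,
        configuration_name="Guarded build",
        config={"chunk_size": 800, "chunk_overlap": 200},
    )
except RuntimeError as exc:       # FTS5 not compiled into SQLite
    print(f"FTS5 unavailable: {exc}")
except FileNotFoundError as exc:  # extraction snapshot missing
    print(f"Run extraction first: {exc}")
except ValueError as exc:         # invalid chunk settings
    print(f"Bad configuration: {exc}")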

Statistics

Build run statistics:

{
  "items": 1000,
  "text_items": 850,
  "chunks": 3200,
  "bytes": 4567890
}

Query result statistics:

{
  "candidates": 42,
  "returned": 10
}

Index Artifacts

The backend creates a SQLite database stored at:

corpus/
  retrieval/
    sqlite-full-text-search/
      <snapshot_id>/
        <snapshot_id>.sqlite  # FTS5 index database

Database schema:

CREATE VIRTUAL TABLE chunks_full_text_search USING fts5(
    content,           -- Chunk text (indexed)
    item_id UNINDEXED, -- Corpus item ID
    source_uri UNINDEXED,
    media_type UNINDEXED,
    relpath UNINDEXED,
    title UNINDEXED,
    start_offset UNINDEXED,
    end_offset UNINDEXED
);
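
Because the artifact is a plain SQLite file, it can be inspected directly for debugging; snapshot_id below is a placeholder:

import sqlite3
from pathlib import Path

snapshot_id = "..."  # placeholder: substitute a real snapshot id
db_path = Path("corpus/retrieval/sqlite-full-text-search") / snapshot_id / f"{snapshot_id}.sqlite"

conn = sqlite3.connect(db_path)
n_chunks = conn.execute("SELECT count(*) FROM chunks_full_text_search").fetchone()[0]
print(f"{n_chunks} chunks indexed")
conn.close()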

See Also