Scan Backend
Backend ID: scan
Category: Retrieval Backends
Overview
The scan backend is a naive full-scan retrieval implementation that searches all text items at query time without building a persistent index. It provides a simple baseline for retrieval evaluation and is suitable for small corpora or development workflows.
The scan backend tokenizes queries into terms, scores items by term frequency, and returns ranked evidence with snippet extraction. It requires no build step, but query time scales linearly with corpus size.
Installation
The scan backend is included by default with Biblicus:
pip install biblicus
No additional dependencies or setup are required.
When to Use
Good Use Cases
Small corpora (< 1000 items): Fast enough without indexing overhead
Development & testing: Immediate results without build step
Baseline comparisons: Simple reference for evaluating other backends
Ad-hoc exploration: Quick searches without commitment to index
Not Recommended For
Large corpora (> 10,000 items): Query time becomes prohibitive
Production applications: No persistent index, slow repeated queries
High-frequency queries: Every query re-scans entire corpus
Configuration
Config Schema
from typing import Optional
from pydantic import BaseModel

class ScanRecipeConfig(BaseModel):
    snippet_characters: int = 400  # Maximum characters in snippets
    extraction_snapshot: Optional[str] = None  # Extraction run reference
Configuration Options
| Option | Type | Default | Description |
|---|---|---|---|
| snippet_characters | int | 400 | Maximum characters to include in evidence snippets |
| extraction_snapshot | str | None | Optional extraction snapshot reference (extractor_id:snapshot_id) |
Usage
Command Line
Basic Usage
# Build scan run (no artifacts created)
biblicus build my-corpus --backend scan
# Query the run
biblicus query my-corpus --query "search terms"
Custom Configuration
# Larger snippets
biblicus build my-corpus --backend scan \
--config snippet_characters=800
# With extraction snapshot
biblicus build my-corpus --backend scan \
--config extraction_snapshot=pdf-text:abc123
Configuration File
backend_id: scan
configuration_name: "Development scan"
config:
  snippet_characters: 400
  extraction_snapshot: null
biblicus build my-corpus --configuration configuration.yml
Python API
from biblicus import Corpus
from biblicus.backends import get_backend
from biblicus.models import QueryBudget
# Load corpus
corpus = Corpus.from_directory("my-corpus")
# Get scan backend
backend = get_backend("scan")
# Build run (no index created)
run = backend.build_run(
    corpus,
    configuration_name="Quick scan",
    config={}
)
# Query
result = backend.query(
    corpus,
    run=run,
    query_text="search terms",
    budget=QueryBudget(max_total_items=10)
)
# Access evidence
for evidence in result.evidence:
    print(f"{evidence.item_id}: {evidence.score}")
    print(f"  {evidence.text[:100]}...")
With Extraction Runs
# Extract text first
extraction_snapshot = corpus.extract_text(
    extractor_id="pdf-text"
)
# Build scan with extraction
run = backend.build_run(
    corpus,
    configuration_name="Scan with extraction",
    config={
        "extraction_snapshot": f"pdf-text:{extraction_snapshot.snapshot_id}"
    }
)
How It Works
Query Processing
Tokenization: Query text is lowercased and split into tokens
Scanning: All corpus items are loaded and scanned sequentially
Scoring: Items are scored by term frequency (count of query tokens in text)
Ranking: Scored items are sorted by score (descending), then by item ID
Snippet Extraction: First match location is found, snippet extracted around it
Budget Application: Top-ranked items are selected according to query budget
Scoring Algorithm
def score_item(item_text, query_tokens):
    """Score = sum of token frequencies in item text."""
    lower_text = item_text.lower()
    return sum(lower_text.count(token) for token in query_tokens)
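Putting the six steps together, here is a self-contained sketch of the scan-and-rank loop in plain Python. It is an illustration of the technique, not the Biblicus implementation; the items dictionary and function name are hypothetical.

```python
def scan_query(items, query_text, max_total_items=10):
    """Naive full-scan retrieval: tokenize, score, rank, apply budget."""
    # Tokenization: lowercase the query and split on whitespace
    tokens = query_text.lower().split()
    scored = []
    for item_id, text in items.items():
        lower_text = text.lower()
        # Scoring: sum of query-token frequencies in the item text
        score = sum(lower_text.count(token) for token in tokens)
        if score > 0:
            scored.append((item_id, score))
    # Ranking: score descending, then item ID to break ties deterministically
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    # Budget application: keep only the top-ranked items
    return scored[:max_total_items]

items = {
    "ml.txt": "Machine learning applications of learning systems",
    "dl.txt": "Deep learning neural networks",
}
print(scan_query(items, "learning"))  # → [('ml.txt', 2), ('dl.txt', 1)]
```

Note that substring counting means a query token also matches inside longer words; that is consistent with the simple term-frequency scoring described above.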
Snippet Extraction
Finds first occurrence of any query token
Centers snippet around match location
Falls back to start of text if no match (shouldn’t happen with non-zero scores)
Truncates snippets to snippet_characters length
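The snippet logic above can be sketched roughly as follows. This is a simplified stand-in under the stated rules (first match, centered window, fallback to text start), not the actual Biblicus code:

```python
def extract_snippet(text, query_tokens, snippet_characters=400):
    """Return a snippet of at most snippet_characters around the first match."""
    lower_text = text.lower()
    # Find the earliest occurrence of any query token
    positions = [lower_text.find(token) for token in query_tokens]
    positions = [p for p in positions if p >= 0]
    center = min(positions) if positions else 0  # fall back to start of text
    # Center the window on the match, clamped to the start of the text
    start = max(0, center - snippet_characters // 2)
    return text[start:start + snippet_characters]

print(extract_snippet("Deep learning neural networks", ["neural"], 12))
# prints "rning neural"
```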
Performance
Build Time
None: No index is created, build is instant
Only validates configuration and counts text items
Query Time
O(n): Linear scan of all corpus items
~1-2 seconds for 1000 items
~10-20 seconds for 10,000 items
Depends on item size and disk I/O
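Because query time is linear, a rough estimate follows directly from the figures above. The per-item constant below is an illustrative midpoint (~1.5 ms/item, matching 1-2 s per 1,000 items), not a measured value:

```python
def estimated_query_seconds(n_items, seconds_per_item=0.0015):
    """Rough linear model: query time grows proportionally with corpus size."""
    return n_items * seconds_per_item

print(estimated_query_seconds(1_000))   # ~1.5 seconds
print(estimated_query_seconds(10_000))  # ~15 seconds
```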
Memory Usage
Low: No persistent index stored
Only active item being processed is in memory
Disk Usage
None: No artifacts created (only snapshot manifest)
Examples
Quick Development Search
# Initialize and populate corpus
biblicus init demo-corpus
echo "Machine learning applications" > ml.txt
echo "Deep learning neural networks" > dl.txt
biblicus ingest demo-corpus ml.txt dl.txt
# Immediate search (no index build)
biblicus build demo-corpus --backend scan
biblicus query demo-corpus --query "learning"
Baseline Comparison
from biblicus import Corpus
from biblicus.backends import get_backend
from biblicus.models import QueryBudget
corpus = Corpus.from_directory("test-corpus")
# Build with both backends
scan_backend = get_backend("scan")
fts_backend = get_backend("sqlite-full-text-search")
scan_run = scan_backend.build_run(corpus, configuration_name="Scan baseline", config={})
fts_run = fts_backend.build_run(corpus, configuration_name="FTS index", config={})
# Compare results
query = "neural networks"
budget = QueryBudget(max_total_items=10)
scan_result = scan_backend.query(corpus, run=scan_run, query_text=query, budget=budget)
fts_result = fts_backend.query(corpus, run=fts_run, query_text=query, budget=budget)
print(f"Scan returned {len(scan_result.evidence)} items")
print(f"FTS returned {len(fts_result.evidence)} items")
Ad-hoc Exploration
# No commitment to index structure
biblicus build corpus --backend scan
biblicus query corpus --query "term1"
biblicus query corpus --query "term2"
biblicus query corpus --query "term3"
# Switch to FTS when ready
biblicus build corpus --backend sqlite-full-text-search
Limitations
Scalability
Linear query time makes it unsuitable for large corpora
No optimization for repeated queries
Every query re-reads items from disk
Ranking Quality
Simple term frequency scoring (no TF-IDF, BM25, or semantic ranking)
No phrase matching or proximity scoring
Single-token matching only
Query Features
No support for boolean operators (AND, OR, NOT)
No wildcard or fuzzy matching
No field-specific queries
When to Upgrade
Consider switching to sqlite-full-text-search when:
Corpus exceeds 1000 items
Query time becomes noticeable (>2 seconds)
You need repeated queries on the same corpus
You want better ranking (BM25 algorithm)
Production deployment requires consistent performance
Error Handling
Missing Extraction Run
If configured extraction snapshot doesn’t exist:
FileNotFoundError: Missing extraction snapshot: pdf-text:abc123
Fix: Verify extraction snapshot ID or run extraction first.
Non-Text Items
Non-text items without extraction snapshot are skipped automatically. No error raised.
Statistics
Build run statistics:
{
  "items": 1000,
  "text_items": 850
}
Query result statistics:
{
  "candidates": 42,
  "returned": 10
}
See Also
Backends Overview - All available backends
backends.md - Backend implementation guide
extraction.md - Text extraction pipeline
Extractor Reference - Text extraction plugins