# Scan Backend

**Backend ID:** `scan`

**Category:** [Retrieval Backends](index.md)

## Overview

The scan backend is a naive full-scan retrieval implementation that searches all text items at query time without building a persistent index. It provides a simple baseline for retrieval evaluation and is suitable for small corpora or development workflows.

The scan backend tokenizes queries into terms, scores items by term frequency, and returns ranked evidence with snippet extraction. It requires no build time but scales linearly with corpus size.

## Installation

The scan backend is included by default with Biblicus:

```bash
pip install biblicus
```

No additional dependencies or setup are required.

## When to Use

### Good Use Cases

- **Small corpora** (< 1000 items): Fast enough without indexing overhead
- **Development & testing**: Immediate results without build step
- **Baseline comparisons**: Simple reference for evaluating other backends
- **Ad-hoc exploration**: Quick searches without commitment to index

### Not Recommended For

- **Large corpora** (> 10,000 items): Query time becomes prohibitive
- **Production applications**: No persistent index, slow repeated queries
- **High-frequency queries**: Every query re-scans entire corpus

## Configuration

### Config Schema

```python
class ScanRecipeConfig(BaseModel):
    snippet_characters: int = 400       # Maximum characters in snippets
    extraction_snapshot: Optional[str] = None  # Extraction run reference
```

### Configuration Options

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `snippet_characters` | int | `400` | Maximum characters to include in evidence snippets |
| `extraction_snapshot` | Optional[str] | `None` | Optional extraction snapshot reference in `extractor_id:snapshot_id` form |
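
The `extractor_id:snapshot_id` form can be split into its two parts with a small helper. The sketch below is illustrative only: `SnapshotReference` and `parse_extraction_snapshot` are hypothetical names, not part of the Biblicus API, and the validation rules are assumptions.

```python
from typing import NamedTuple, Optional


class SnapshotReference(NamedTuple):
    """Hypothetical container for a parsed extraction snapshot reference."""

    extractor_id: str
    snapshot_id: str


def parse_extraction_snapshot(value: Optional[str]) -> Optional[SnapshotReference]:
    """Split 'extractor_id:snapshot_id' into its parts; None means no snapshot."""
    if value is None:
        return None
    # Split on the first colon only, so the snapshot ID may itself contain colons.
    extractor_id, separator, snapshot_id = value.partition(":")
    if not separator or not extractor_id or not snapshot_id:
        raise ValueError(f"Expected 'extractor_id:snapshot_id', got {value!r}")
    return SnapshotReference(extractor_id, snapshot_id)


print(parse_extraction_snapshot("pdf-text:abc123"))
```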

## Usage

### Command Line

#### Basic Usage

```bash
# Build scan run (no artifacts created)
biblicus build my-corpus --backend scan

# Query the run
biblicus query my-corpus --query "search terms"
```

#### Custom Configuration

```bash
# Larger snippets
biblicus build my-corpus --backend scan \
  --config snippet_characters=800

# With extraction snapshot
biblicus build my-corpus --backend scan \
  --config extraction_snapshot=pdf-text:abc123
```

#### Configuration File

```yaml
backend_id: scan
configuration_name: "Development scan"
config:
  snippet_characters: 400
  extraction_snapshot: null
```

```bash
biblicus build my-corpus --configuration configuration.yml
```

### Python API

```python
from biblicus import Corpus
from biblicus.backends import get_backend
from biblicus.models import QueryBudget

# Load corpus
corpus = Corpus.from_directory("my-corpus")

# Get scan backend
backend = get_backend("scan")

# Build run (no index created)
run = backend.build_run(
    corpus,
    configuration_name="Quick scan",
    config={}
)

# Query
result = backend.query(
    corpus,
    run=run,
    query_text="search terms",
    budget=QueryBudget(max_total_items=10)
)

# Access evidence
for evidence in result.evidence:
    print(f"{evidence.item_id}: {evidence.score}")
    print(f"  {evidence.text[:100]}...")
```

### With Extraction Runs

```python
# Reuses `corpus` and `backend` from the previous example.
# Extract text first
extraction_snapshot = corpus.extract_text(
    extractor_id="pdf-text"
)

# Build scan with extraction
run = backend.build_run(
    corpus,
    configuration_name="Scan with extraction",
    config={
        "extraction_snapshot": f"pdf-text:{extraction_snapshot.snapshot_id}"
    }
)
```

## How It Works

### Query Processing

1. **Tokenization**: Query text is lowercased and split into tokens
2. **Scanning**: All corpus items are loaded and scanned sequentially
3. **Scoring**: Items are scored by term frequency (count of query tokens in text)
4. **Ranking**: Scored items are sorted by score (descending), then by item ID
5. **Snippet Extraction**: First match location is found, snippet extracted around it
6. **Budget Application**: Top-ranked items are selected according to query budget
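
The steps above can be sketched end to end over an in-memory mapping of item IDs to text. This is a minimal illustration, not the backend's actual code: `scan_items` is a hypothetical name, whitespace tokenization and the dropping of zero-score items are assumptions, and snippet extraction is omitted for brevity.

```python
from typing import Dict, List, Tuple


def tokenize(query_text: str) -> List[str]:
    """Lowercase the query and split on whitespace (assumed tokenization rule)."""
    return query_text.lower().split()


def scan_items(
    items: Dict[str, str], query_text: str, max_total_items: int
) -> List[Tuple[str, int]]:
    """Return (item_id, score) pairs for the top-ranked matching items."""
    tokens = tokenize(query_text)
    scored = []
    # Scanning: visit every item, since there is no index to narrow the search.
    for item_id, text in items.items():
        lower_text = text.lower()
        # Scoring: sum of occurrence counts of each query token.
        score = sum(lower_text.count(token) for token in tokens)
        if score > 0:  # Assumption: items with no match are not candidates.
            scored.append((item_id, score))
    # Ranking: highest score first, ties broken by item ID.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    # Budget application: keep only the top-ranked items.
    return scored[:max_total_items]


items = {
    "ml": "Machine learning applications",
    "dl": "Deep learning neural networks",
    "db": "Relational databases",
}
print(scan_items(items, "learning networks", max_total_items=10))
# [('dl', 2), ('ml', 1)]
```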

### Scoring Algorithm

```python
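def score_item(item_text: str, query_tokens: list[str]) -> int:
    """Score = sum of token frequencies in item text."""
    # Lowercase the text so matching is case-insensitive; query tokens are
    # already lowercased during tokenization.
    lower_text = item_text.lower()
    # str.count returns the number of non-overlapping substring occurrences.
    return sum(lower_text.count(token) for token in query_tokens)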
```
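
A few worked calls make the behavior of this scoring rule concrete. The function is repeated so the example runs on its own; note that because it relies on `str.count`, a token also matches when it appears inside a longer word. Whether the real backend behaves identically in every edge case is an assumption here.

```python
from typing import List


def score_item(item_text: str, query_tokens: List[str]) -> int:
    """Same scoring rule as above, repeated so this example is self-contained."""
    lower_text = item_text.lower()
    return sum(lower_text.count(token) for token in query_tokens)


# Each query token contributes its own occurrence count.
print(score_item("Deep learning and machine learning", ["learning", "deep"]))  # 3

# Substring matches count too: "learn" occurs inside "learning" and "relearn".
print(score_item("Learning to relearn", ["learn"]))  # 2

# Items that contain none of the tokens score zero.
print(score_item("Relational databases", ["learning"]))  # 0
```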

### Snippet Extraction

- Finds first occurrence of any query token
- Centers snippet around match location
- Falls back to start of text if no match (shouldn't happen with non-zero scores)
- Truncates to `snippet_characters` length
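
The rules above could be implemented roughly as follows. This is a sketch under stated assumptions: `extract_snippet` is a hypothetical name, "centering" is interpreted as starting half a snippet before the match, and the offsets assume that lowercasing does not change the text length (true for ASCII text).

```python
from typing import List


def extract_snippet(
    text: str, query_tokens: List[str], snippet_characters: int = 400
) -> str:
    """Return up to snippet_characters of text around the earliest token match."""
    lower_text = text.lower()
    # Position of the first occurrence of each token; -1 means "not found".
    positions = [
        position
        for position in (lower_text.find(token) for token in query_tokens)
        if position >= 0
    ]
    if not positions:
        # Fallback: no token matched, so return the start of the text.
        return text[:snippet_characters]
    match_start = min(positions)
    # Start half a snippet before the match, clamped to the start of the text.
    start = max(0, match_start - snippet_characters // 2)
    return text[start:start + snippet_characters]


sample = "a" * 300 + "needle" + "b" * 300
snippet = extract_snippet(sample, ["needle"], snippet_characters=100)
print(len(snippet), "needle" in snippet)  # 100 True
```

A match near the end of the text yields a snippet shorter than `snippet_characters`, because the slice is not shifted left to fill the window.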

## Performance

### Build Time

- **None**: No index is created, so the build is effectively instant
- Only validates configuration and counts text items

### Query Time

- **O(n)**: Linear scan of all corpus items
- Roughly 1-2 seconds for 1,000 items
- Roughly 10-20 seconds for 10,000 items
- Actual times depend on item size and disk I/O
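
To see the linear trend on your own machine, a small timing harness over synthetic in-memory text is enough. This sketch measures only the scoring loop, not Biblicus itself: `time_scan` is a hypothetical helper, the item text is made up, and disk I/O is excluded, so absolute numbers will differ from the figures above.

```python
import time
from typing import List


def time_scan(texts: List[str], query_tokens: List[str]) -> float:
    """Return seconds spent scoring every text once (in memory, no disk I/O)."""
    started = time.perf_counter()
    for text in texts:
        lower_text = text.lower()
        # Same term-frequency rule as the scan backend's scoring sketch.
        sum(lower_text.count(token) for token in query_tokens)
    return time.perf_counter() - started


# Synthetic item of roughly 11 KB; doubling the item count should roughly
# double the elapsed time.
base_text = "machine learning and deep learning with neural networks " * 200
for item_count in (1_000, 2_000, 4_000):
    elapsed = time_scan([base_text] * item_count, ["learning", "networks"])
    print(f"{item_count} items: {elapsed:.3f}s")
```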

### Memory Usage

- **Low**: No persistent index stored
- Only the item currently being processed is held in memory

### Disk Usage

- **None**: No artifacts created (only snapshot manifest)

## Examples

### Quick Development Search

```bash
# Initialize and populate corpus
biblicus init demo-corpus
echo "Machine learning applications" > ml.txt
echo "Deep learning neural networks" > dl.txt
biblicus ingest demo-corpus ml.txt dl.txt

# Immediate search (no index build)
biblicus build demo-corpus --backend scan
biblicus query demo-corpus --query "learning"
```

### Baseline Comparison

```python
from biblicus import Corpus
from biblicus.backends import get_backend
from biblicus.models import QueryBudget

corpus = Corpus.from_directory("test-corpus")

# Build with both backends
scan_backend = get_backend("scan")
fts_backend = get_backend("sqlite-full-text-search")

scan_run = scan_backend.build_run(corpus, configuration_name="Scan baseline", config={})
fts_run = fts_backend.build_run(corpus, configuration_name="FTS index", config={})

# Compare results using the same query and budget for both backends
query = "neural networks"
budget = QueryBudget(max_total_items=10)

scan_result = scan_backend.query(corpus, run=scan_run, query_text=query, budget=budget)
fts_result = fts_backend.query(corpus, run=fts_run, query_text=query, budget=budget)

print(f"Scan returned {len(scan_result.evidence)} items")
print(f"FTS returned {len(fts_result.evidence)} items")
```
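
Counting returned items says little about whether the two backends agree. One simple follow-up is to measure how much their top results overlap. The helper below is a sketch, not part of Biblicus: `overlap_at_k` is a hypothetical name, it uses Jaccard similarity over item IDs, and treating two empty result lists as full agreement is an assumption.

```python
from typing import Sequence


def overlap_at_k(first_ids: Sequence[str], second_ids: Sequence[str], k: int) -> float:
    """Jaccard similarity of the top-k item IDs from two ranked result lists."""
    first_top = set(first_ids[:k])
    second_top = set(second_ids[:k])
    union = first_top | second_top
    if not union:
        # Assumption: two empty result lists count as full agreement.
        return 1.0
    return len(first_top & second_top) / len(union)


# Two of the four distinct IDs appear in both lists.
print(overlap_at_k(["a", "b", "c"], ["b", "c", "d"], k=3))  # 0.5
```

With the results from the comparison above, the inputs would be `[e.item_id for e in scan_result.evidence]` and `[e.item_id for e in fts_result.evidence]`.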

### Ad-hoc Exploration

```bash
# No commitment to index structure
biblicus build corpus --backend scan
biblicus query corpus --query "term1"
biblicus query corpus --query "term2"
biblicus query corpus --query "term3"

# Switch to FTS when ready
biblicus build corpus --backend sqlite-full-text-search
```

## Limitations

### Scalability

- Linear query time makes it unsuitable for large corpora
- No optimization for repeated queries
- Every query re-reads items from disk

### Ranking Quality

- Simple term frequency scoring (no TF-IDF, BM25, or semantic ranking)
- No phrase matching or proximity scoring
- Single-token matching only

### Query Features

- No support for boolean operators (AND, OR, NOT)
- No wildcard or fuzzy matching
- No field-specific queries

## When to Upgrade

Consider switching to [sqlite-full-text-search](sqlite-full-text-search.md) when:

- Corpus exceeds 1000 items
- Query time becomes noticeable (>2 seconds)
- You need repeated queries on the same corpus
- You want better ranking (BM25 algorithm)
- Production deployment requires consistent performance

## Error Handling

### Missing Extraction Run

If the configured extraction snapshot does not exist, the build fails with:

```
FileNotFoundError: Missing extraction snapshot: pdf-text:abc123
```

**Fix**: Verify the extraction snapshot ID, or run extraction first.

### Non-Text Items

Non-text items are skipped automatically when no extraction snapshot is configured. No error is raised.

## Statistics

Build run statistics:

```json
{
  "items": 1000,
  "text_items": 850
}
```

Query result statistics:

```json
{
  "candidates": 42,
  "returned": 10
}
```

## Related Backends

- [sqlite-full-text-search](sqlite-full-text-search.md) - Fast indexed search with BM25 ranking

## See Also

- [Backends Overview](index.md) - All available backends
- [backends.md](../backends.md) - Backend implementation guide
- [extraction.md](../extraction.md) - Text extraction pipeline
- [Extractor Reference](../extractors/index.md) - Text extraction plugins