# Metrics Reference

Complete guide to the evaluation metrics used in Biblicus benchmarks.

## Overview

Biblicus uses three categories of metrics to evaluate extraction quality:

1. **Set-Based Metrics**: Measure word-finding ability (position-agnostic)
2. **Order-Aware Metrics**: Measure reading-order preservation (sequence quality)
3. **N-gram Overlap**: Measure local ordering quality (word pair/triple accuracy)

Each metric tells you something different about extraction quality. A good extraction pipeline needs:

- **High F1** for accuracy (finding the right words)
- **Low WER** for reading order (words in the correct sequence)
- **High bigram/trigram overlap** for local ordering (adjacent words correct)

## Set-Based Metrics (Position-Agnostic)

Set-based metrics measure how well the extraction finds words, regardless of their order. These are the primary accuracy metrics.

### F1 Score

**Harmonic mean of precision and recall.**

- **Range:** 0.0 to 1.0 (higher is better)
- **Primary metric for overall accuracy**
- **Balances precision and recall**

**Formula:**

```
F1 = 2 × (Precision × Recall) / (Precision + Recall)
```

**Interpretation:**

- **F1 ≥ 0.75:** Excellent accuracy
- **F1 ≥ 0.65:** Good accuracy
- **F1 ≥ 0.50:** Acceptable for some use cases
- **F1 < 0.50:** Poor accuracy

**Example:**

- Ground truth: "hello world from biblicus"
- Extracted: "hello world form"
- Precision: 2/3 = 0.667 (2 correct out of 3 extracted)
- Recall: 2/4 = 0.500 (2 found out of 4 ground truth)
- F1: 2 × (0.667 × 0.500) / (0.667 + 0.500) = 0.571

**When to prioritize:**

- General-purpose OCR evaluation
- Comparing overall pipeline quality
- Production system benchmarking

---

### Precision

**Percentage of extracted words that are correct.**

- **Range:** 0.0 to 1.0 (higher is better)
- **Measures the false positive rate**
- **High precision = few extra/wrong words**

**Formula:**

```
Precision = TP / (TP + FP)

Where:
  TP = True Positives (correct words found)
  FP = False Positives (incorrect words extracted)
```

**Interpretation:**

- **High precision, low recall:** Conservative extraction (misses text but is rarely wrong)
- **Low precision, high recall:** Aggressive extraction (finds everything but is noisy)
- **Balanced:** Best for most use cases

**Example:**

- Extracted: "hello world form biblicus"
- Ground truth: "hello world from biblicus"
- Correct words: hello, world, biblicus (3 out of 4 extracted)
- Precision: 3/4 = 0.750

**When to prioritize:**

- Noise is expensive (false positives are costly)
- Downstream processing assumes clean text
- Indexing/search where quality matters

---

### Recall

**Percentage of ground truth words that were found.**

- **Range:** 0.0 to 1.0 (higher is better)
- **Measures completeness**
- **High recall = finds most words**

**Formula:**

```
Recall = TP / (TP + FN)

Where:
  TP = True Positives (correct words found)
  FN = False Negatives (ground truth words missed)
```

**Interpretation:**

- **Recall ≥ 0.80:** Excellent completeness
- **Recall ≥ 0.70:** Good completeness
- **Recall ≥ 0.60:** Acceptable for some use cases
- **Recall < 0.60:** Missing too much content

**Example:**

- Ground truth: "hello world from biblicus system"
- Extracted: "hello world biblicus"
- Found words: hello, world, biblicus (3 out of 5)
- Recall: 3/5 = 0.600

**When to prioritize:**

- Missing content is expensive
- Legal/compliance documents (can't skip text)
- Search applications (need everything indexed)
- Maximum extraction scenarios

---

### Character Accuracy

**Character-level correctness metric.**

- **Range:** 0.0 to 1.0 (higher is better)
- **More fine-grained than word-level metrics**
- **Useful for partial word matches**

**When to use:**

- OCR quality assessment at character level
- Evaluating partial word extraction
- Fine-grained accuracy analysis

---
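The set-based metrics can be sketched in a few lines of Python. The snippet below is illustrative only (the actual implementation lives in `src/biblicus/evaluation/ocr_benchmark.py` and may normalize or match words differently); it assumes simple whitespace tokenization and multiset matching, and reproduces the F1 example above:

```python
from collections import Counter

def word_prf(ground_truth: str, extracted: str) -> dict:
    """Toy precision/recall/F1 over bags of words (illustrative, not the Biblicus code)."""
    gt = ground_truth.lower().split()
    ex = extracted.lower().split()
    # True positives: words present in both texts, counted with multiplicity.
    tp = sum((Counter(gt) & Counter(ex)).values())
    precision = tp / len(ex) if ex else 0.0
    recall = tp / len(gt) if gt else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

print(word_prf("hello world from biblicus", "hello world form"))
# {'precision': 0.666..., 'recall': 0.5, 'f1': 0.571...}
```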
## Order-Aware Metrics (Sequence Quality)

Order-aware metrics measure whether words appear in the correct sequence. These are critical for layout-aware OCR evaluation.

### Word Error Rate (WER)

**Edit distance normalized by ground truth length.**

- **Range:** 0.0+ (lower is better, can exceed 1.0)
- **Critical for layout-aware OCR**
- **Counts insertions, deletions, substitutions**

**Formula:**

```
WER = (Insertions + Deletions + Substitutions) / Total_Ground_Truth_Words
```

**Interpretation:**

- **WER ≤ 0.30:** Excellent reading order
- **WER ≤ 0.50:** Good reading order
- **WER ≤ 0.70:** Acceptable for some use cases
- **WER > 1.0:** More errors than words (very poor)

**Example:**

- Ground truth: "hello world from biblicus"
- Extracted: "hello world form biblicus"
- Operations: 1 substitution ("from" → "form") + 0 deletions + 0 insertions = 1
- WER: 1/4 = 0.250

**Why WER can exceed 1.0:** If there are 100 ground truth words but 200 extracted words (100 correct + 100 insertions), WER = 100/100 = 1.0. More insertions push it higher.

**When to prioritize:**

- Multi-column layouts (reading order critical)
- Document understanding (semantic flow)
- Reading-aloud applications
- Content summarization

---

### Sequence Accuracy

**Percentage of words in the correct sequential position.**

- **Range:** 0.0 to 1.0 (higher is better)
- **Strict metric: word must be at the exact position**
- **Very sensitive to small ordering changes**

**Formula:**

```
Sequence Accuracy = Correct_Position_Words / Total_Ground_Truth_Words
```

**Example:**

- Ground truth: ["hello", "world", "from", "biblicus"]
- Extracted: ["hello", "form", "world", "biblicus"]
- Only "hello" (position 0) and "biblicus" (position 3) are at their correct positions
- Sequence Accuracy: 2/4 = 0.500

**Interpretation:**

- Very strict metric
- Useful for exact order preservation
- Often lower than other metrics

**When to use:**

- Exact position matters (tables, forms)
- Structured data extraction
- Column-sensitive applications

---

### LCS Ratio (Longest Common Subsequence)

**Ratio of the longest ordered subsequence to the ground truth length.**

- **Range:** 0.0 to 1.0 (higher is better)
- **More forgiving than sequence accuracy**
- **Measures the longest preserved ordering**

**Formula:**

```
LCS Ratio = Length(LCS(ground_truth, extracted)) / Length(ground_truth)
```

**Example:**

- Ground truth: "the quick brown fox jumps"
- Extracted: "the brown quick fox"
- LCS: "the brown fox" (length 3)
- LCS Ratio: 3/5 = 0.600

**Interpretation:**

- **LCS ≥ 0.80:** Excellent order preservation
- **LCS ≥ 0.65:** Good order preservation
- **LCS ≥ 0.50:** Acceptable ordering
- **LCS < 0.50:** Poor ordering

**When to use:**

- Primary metric for academic papers
- Multi-column document evaluation
- When partial ordering is acceptable

---
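As a rough sketch of how these order-aware scores come out of word sequences, the snippet below computes a word-level edit distance (for WER) and an LCS length with standard dynamic programming. It is illustrative only; Biblicus itself uses `editdistance` when available and falls back to `difflib` (see Tools and Libraries below), so the actual code differs:

```python
def word_edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance over word tokens (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))          # distances from the empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution (or match)
        prev = curr
    return prev[-1]

def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two word lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

ref = "hello world from biblicus".split()
hyp = "hello world form biblicus".split()
print(word_edit_distance(ref, hyp) / len(ref))  # WER = 1/4 = 0.25
print(lcs_length(ref, hyp) / len(ref))          # LCS ratio = 3/4 = 0.75
```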
## N-gram Overlap (Local Ordering)

N-gram metrics measure whether adjacent words (bigrams) or word triples (trigrams) appear in the correct order.

### Bigram Overlap

**Percentage of word pairs in correct order.**

- **Range:** 0.0 to 1.0 (higher is better)
- **Good for detecting column mixing**
- **Measures local ordering quality**

**Formula:**

```
Bigram Overlap = Matching_Bigrams / Total_Bigrams

Where a bigram is a pair of adjacent words: ("word1", "word2")
```

**Example:**

- Ground truth: "hello world from biblicus"
- Ground truth bigrams: ("hello", "world"), ("world", "from"), ("from", "biblicus")
- Extracted: "hello world biblicus from"
- Extracted bigrams: ("hello", "world"), ("world", "biblicus"), ("biblicus", "from")
- Matching: ("hello", "world") only
- Bigram Overlap: 1/3 = 0.333

**Interpretation:**

- **Bigram ≥ 0.70:** Excellent local ordering
- **Bigram ≥ 0.55:** Good local ordering
- **Bigram ≥ 0.40:** Acceptable ordering
- **Bigram < 0.40:** Poor local ordering

**When to prioritize:**

- Layout-aware OCR evaluation
- Multi-column document assessment
- Reading flow quality

---

### Trigram Overlap

**Percentage of word triples in correct order.**

- **Range:** 0.0 to 1.0 (higher is better)
- **More sensitive than bigram overlap**
- **Stricter local ordering requirement**

**Formula:**

```
Trigram Overlap = Matching_Trigrams / Total_Trigrams

Where a trigram is three adjacent words: ("word1", "word2", "word3")
```

**Example:**

- Ground truth: "the quick brown fox"
- Ground truth trigrams: ("the", "quick", "brown"), ("quick", "brown", "fox")
- Extracted: "the brown quick fox"
- Extracted trigrams: ("the", "brown", "quick"), ("brown", "quick", "fox")
- Matching: none
- Trigram Overlap: 0/2 = 0.0

**Interpretation:**

- Usually lower than bigram overlap
- Stricter ordering requirement
- Good for detailed analysis

**When to use:**

- Fine-grained ordering analysis
- Detailed pipeline comparison
- Academic evaluation

---
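A minimal sketch of the n-gram overlap computation, assuming the denominator is the number of ground-truth n-grams (as in the examples above) and that duplicate n-grams are ignored; the Biblicus implementation in `src/biblicus/evaluation/ocr_benchmark.py` may handle these details differently:

```python
def ngram_overlap(ground_truth: str, extracted: str, n: int = 2) -> float:
    """Toy n-gram overlap: fraction of ground-truth n-grams also found in the extraction."""
    gt = ground_truth.lower().split()
    ex = extracted.lower().split()
    # Slide a window of size n over each word sequence to collect n-grams.
    gt_ngrams = {tuple(gt[i:i + n]) for i in range(len(gt) - n + 1)}
    ex_ngrams = {tuple(ex[i:i + n]) for i in range(len(ex) - n + 1)}
    if not gt_ngrams:
        return 0.0
    return len(gt_ngrams & ex_ngrams) / len(gt_ngrams)

# Reproduces the examples above: only ("hello", "world") matches; no trigram matches.
print(ngram_overlap("hello world from biblicus", "hello world biblicus from", n=2))  # 0.333...
print(ngram_overlap("the quick brown fox", "the brown quick fox", n=3))              # 0.0
```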
## Metric Trade-offs

### Precision vs. Recall

Different pipelines optimize for different trade-offs:

**High Precision, Lower Recall (Conservative):**

- Example: Baseline Tesseract (Precision: 0.615, Recall: 0.599)
- Few false positives, but misses some text
- Best for: Clean text applications, noise-sensitive systems

**Lower Precision, High Recall (Aggressive):**

- Example: Heron + Tesseract (Precision: 0.384, Recall: 0.810)
- Finds most text, but includes noise
- Best for: Legal/compliance, maximum extraction

**Balanced:**

- Example: PaddleOCR (Precision: 0.792, Recall: 0.782, F1: 0.787)
- Good accuracy and completeness
- Best for: General-purpose applications

### Accuracy vs. Reading Order

You can have high F1 (finds words) but poor WER (wrong order):

**Good F1, Poor WER:**

- Finds the right words but in the wrong order
- Common in multi-column documents
- Example: Column text mixed together

**Good F1, Good WER:**

- Finds the right words in the right order
- Ideal for most applications
- Example: PaddleOCR with layout detection

**Poor F1, Good WER:**

- Rare but possible: finds few words, but in the correct order
- Usually indicates incomplete extraction

---

## Choosing Metrics for Your Use Case

### Forms (FUNSD-like)

**Primary Metric:** F1 Score

- Measures field extraction accuracy
- Balance of precision and recall matters

**Secondary Metrics:**

- Recall (don't miss fields)
- WER (fields should be in order)

**Target:**

- F1 ≥ 0.75 for production
- Recall ≥ 0.70 minimum

---

### Receipts (Dense Text)

**Primary Metric:** F1 Score

- Entity extraction accuracy is critical
- Dense text needs high precision

**Secondary Metrics:**

- Precision (avoid noise in entities)
- Bigram overlap (local ordering for amounts/dates)

**Target:**

- F1 ≥ 0.80 for production
- Precision ≥ 0.75 minimum

---

### Academic Papers (Multi-Column)

**Primary Metric:** LCS Ratio

- Reading order preservation is critical
- Multi-column layout understanding

**Secondary Metrics:**

- F1 (overall accuracy)
- WER (sequence quality)
- Bigram overlap (column mixing detection)

**Target:**

- LCS ≥ 0.75 for production
- Bigram ≥ 0.60 minimum

---

### Legal/Compliance Documents

**Primary Metric:** Recall

- Cannot miss any content
- Completeness over accuracy

**Secondary Metrics:**

- F1 (but tolerate lower precision)
- WER (reading order matters)

**Target:**

- Recall ≥ 0.90 minimum
- F1 ≥ 0.70 acceptable

---

## Interpreting Benchmark Results

### Example Report

```json
{
  "pipeline_name": "paddleocr",
  "metrics": {
    "set_based": {
      "avg_precision": 0.792,
      "avg_recall": 0.782,
      "avg_f1": 0.787
    },
    "order_aware": {
      "avg_wer": 0.533,
      "avg_sequence_accuracy": 0.031,
      "avg_lcs_ratio": 0.621
    },
    "ngram": {
      "avg_bigram_overlap": 0.521,
      "avg_trigram_overlap": 0.412
    }
  }
}
```

### Interpretation

**Set-Based Metrics:**

- F1: 0.787 → **Excellent** accuracy (above the 0.75 threshold)
- Balanced precision (79.2%) and recall (78.2%)

**Order-Aware Metrics:**

- WER: 0.533 → **Acceptable** reading order (just above the 0.50 "good" threshold; workable for forms)
- LCS: 0.621 → **Acceptable** order preservation (just below the 0.65 "good" threshold)

**N-gram Metrics:**

- Bigram: 0.521 → **Acceptable** local ordering (52% of word pairs correct)

**Overall:** Strong all-around pipeline: high F1 for accuracy, acceptable reading order for forms.

---

### Comparing Two Pipelines

**Scenario: PaddleOCR vs. Heron+Tesseract**

| Metric    | PaddleOCR | Heron+Tesseract | Winner    |
|-----------|-----------|-----------------|-----------|
| F1        | 0.787     | 0.519           | PaddleOCR |
| Recall    | 0.782     | 0.810           | Heron     |
| Precision | 0.792     | 0.384           | PaddleOCR |
| WER       | 0.533     | 0.612           | PaddleOCR |
| Bigram    | 0.521     | 0.561           | Heron     |

**Analysis:**

- **PaddleOCR:** Better overall accuracy (F1), cleaner output (precision), better reading order (WER)
- **Heron+Tesseract:** Finds more text (recall), better local ordering (bigram)

**Choose PaddleOCR if:** You need clean, accurate extraction
**Choose Heron if:** You need maximum extraction and can tolerate noise

---

## Metric Calculation Details

### Set-Based Calculation

Words are normalized before comparison:

- Lowercase
- Remove punctuation
- Trim whitespace

Example:

- Ground truth: "Hello, World!"
- Extracted: "hello world"
- Match: Both normalize to ["hello", "world"]
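A hypothetical sketch of that normalization step (lowercase, strip punctuation, split on whitespace); the real rules live in `src/biblicus/evaluation/ocr_benchmark.py` and may differ, for example in how Unicode punctuation is handled:

```python
import string

def normalize_words(text: str) -> list[str]:
    """Illustrative normalization: lowercase, strip ASCII punctuation, split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    cleaned = (word.translate(table) for word in text.lower().split())
    return [word for word in cleaned if word]  # drop tokens that were pure punctuation

print(normalize_words("Hello, World!"))  # ['hello', 'world']
print(normalize_words("hello world"))    # ['hello', 'world'] -- the two texts now match
```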
### Order-Aware Calculation

Word sequences are compared position-by-position:

- Insertions: Extra words in the extracted text
- Deletions: Missing words from the ground truth
- Substitutions: Wrong words in the extracted text

Example:

- Ground truth: ["the", "quick", "fox"]
- Extracted: ["the", "slow", "fox"]
- Operations: 1 substitution (quick → slow)
- WER: 1/3 = 0.333

### N-gram Calculation

Sliding window over word sequences:

- Bigram: Window size 2
- Trigram: Window size 3

Example bigram calculation:

- Ground truth: ["a", "b", "c"]
- Bigrams: [("a", "b"), ("b", "c")]
- Extracted: ["a", "c", "b"]
- Bigrams: [("a", "c"), ("c", "b")]
- Matching: none
- Bigram overlap: 0/2 = 0.0

---

## Tools and Libraries

Biblicus uses:

- **editdistance** (optional): Fast Levenshtein distance for WER
- **difflib**: Python standard library for sequence matching
- **nltk** (optional): Advanced text normalization

Without editdistance, WER calculation falls back to difflib (slower, but it works).

---

## Next Steps

- **[Run benchmarks](quickstart-benchmarking.md)** to generate metrics on your data
- **[Explore pipelines](pipeline-catalog.md)** to understand trade-offs
- **[View current results](benchmark-results.md)** to see metric examples
- **[OCR Benchmarking Guide](ocr-benchmarking.md)** for practical evaluation

---

## References

- Metrics implementation: `src/biblicus/evaluation/ocr_benchmark.py`
- Benchmark runner: `src/biblicus/evaluation/benchmark_runner.py`
- [Multi-Category Benchmark Framework](document-understanding-benchmark.md)
- [Benchmarking Overview](benchmarking-overview.md)