Metrics Reference

Complete guide to understanding evaluation metrics used in Biblicus benchmarks.

Overview

Biblicus uses three categories of metrics to evaluate extraction quality:

Set-Based Metrics: Measure word-finding ability (position-agnostic)
Order-Aware Metrics: Measure reading order preservation (sequence quality)
N-gram Overlap: Measure local ordering quality (word pair/triple accuracy)

Each metric tells you something different about extraction quality. A good extraction pipeline needs:

High F1 for accuracy (finding the right words)
Low WER for reading order (words in correct sequence)
High bigram/trigram overlap for local ordering (adjacent words correct)

Set-Based Metrics (Position-Agnostic)

Set-based metrics measure how well the extraction finds words, regardless of their order. These are the primary accuracy metrics.

F1 Score

Harmonic mean of precision and recall.

Range: 0.0 to 1.0 (higher is better)
Primary metric for overall accuracy
Balances precision and recall

Formula:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Interpretation:

F1 ≥ 0.75: Excellent accuracy
F1 ≥ 0.65: Good accuracy
F1 ≥ 0.50: Acceptable for some use cases
F1 < 0.50: Poor accuracy

Example:

Ground truth: “hello world from biblicus”
Extracted: “hello world form”
Precision: 2/3 = 0.667 (2 correct out of 3 extracted)
Recall: 2/4 = 0.500 (2 found out of 4 ground truth)
F1: 2 × (0.667 × 0.500) / (0.667 + 0.500) = 0.571
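
The calculation above can be reproduced in a few lines. This is a minimal sketch rather than the Biblicus implementation: `set_based_scores` is a hypothetical helper, and it assumes words are compared as sets of unique, whitespace-split tokens.

```python
def set_based_scores(ground_truth: str, extracted: str) -> dict[str, float]:
    """Position-agnostic precision, recall and F1 over unique words."""
    truth_words = set(ground_truth.split())
    found_words = set(extracted.split())
    true_positives = len(truth_words & found_words)

    # Guard against empty inputs so the ratios never divide by zero.
    precision = true_positives / len(found_words) if found_words else 0.0
    recall = true_positives / len(truth_words) if truth_words else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = set_based_scores("hello world from biblicus", "hello world form")
print({name: round(value, 3) for name, value in scores.items()})
# {'precision': 0.667, 'recall': 0.5, 'f1': 0.571}
```

Because the comparison uses sets, reordering the extracted words does not change any of the three scores.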

When to prioritize:

General-purpose OCR evaluation
Comparing pipeline overall quality
Production system benchmarking

Precision

Percentage of extracted words that are correct.

Range: 0.0 to 1.0 (higher is better)
Penalizes false positives (extra or wrong words)
High precision = few extra/wrong words

Formula:

Precision = TP / (TP + FP)

Where:
  TP = True Positives (correct words found)
  FP = False Positives (incorrect words extracted)

Interpretation:

High Precision, Low Recall: Conservative extraction (misses text but rarely wrong)
Low Precision, High Recall: Aggressive extraction (finds everything but noisy)
Balanced: Best for most use cases

Example:

Extracted: “hello world form biblicus”
Ground truth: “hello world from biblicus”
Correct words: hello, world, biblicus (3 out of 4 extracted)
Precision: 3/4 = 0.750

When to prioritize:

Noise is expensive (false positives costly)
Downstream processing assumes clean text
Indexing/search where quality matters

Recall

Percentage of ground truth words that were found.

Range: 0.0 to 1.0 (higher is better)
Measures completeness
High recall = finds most words

Formula:

Recall = TP / (TP + FN)

Where:
  TP = True Positives (correct words found)
  FN = False Negatives (ground truth words missed)

Interpretation:

Recall ≥ 0.80: Excellent completeness
Recall ≥ 0.70: Good completeness
Recall ≥ 0.60: Acceptable for some use cases
Recall < 0.60: Missing too much content

Example:

Ground truth: “hello world from biblicus system”
Extracted: “hello world biblicus”
Found words: hello, world, biblicus (3 out of 5)
Recall: 3/5 = 0.600

When to prioritize:

Missing content is expensive
Legal/compliance documents (can’t skip text)
Search applications (need everything indexed)
Maximum extraction scenarios

Character Accuracy

Character-level correctness metric.

Range: 0.0 to 1.0 (higher is better)
More fine-grained than word-level metrics
Useful for partial word matches

When to use:

OCR quality assessment at character level
Evaluating partial word extraction
Fine-grained accuracy analysis

Order-Aware Metrics (Sequence Quality)

Order-aware metrics measure whether words appear in the correct sequence. These are critical for layout-aware OCR evaluation.

Word Error Rate (WER)

Edit distance normalized by ground truth length.

Range: 0.0+ (lower is better, can exceed 1.0)
Critical for layout-aware OCR
Counts insertions, deletions, substitutions

Formula:

WER = (Insertions + Deletions + Substitutions) / Total_Ground_Truth_Words

Interpretation:

WER ≤ 0.30: Excellent reading order
WER ≤ 0.50: Good reading order
WER ≤ 0.70: Acceptable for some use cases
WER > 1.0: More errors than words (very poor)

Example:

Ground truth: “hello world from biblicus”
Extracted: “hello form world biblicus”
Operations: 2 substitutions (world→form, from→world) + 0 deletions + 0 insertions = 2
WER: 2/4 = 0.500

Why WER can exceed 1.0: If you have 100 ground truth words but extract 200 words (100 correct + 100 insertions), WER = 100/100 = 1.0. More insertions push it higher.
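
The WER formula can be sketched with a standard word-level Levenshtein distance. The function name `word_error_rate` is hypothetical, and this pure-Python version is an illustration under that assumption, not the code Biblicus ships.

```python
def word_error_rate(ground_truth: list[str], extracted: list[str]) -> float:
    """Word-level Levenshtein distance divided by the ground truth length."""
    if not ground_truth:
        raise ValueError("ground truth must contain at least one word")

    # previous[j] holds the edit distance between the ground truth words
    # processed so far and the first j extracted words.
    previous = list(range(len(extracted) + 1))
    for i, truth_word in enumerate(ground_truth, start=1):
        current = [i]
        for j, found_word in enumerate(extracted, start=1):
            substitution = previous[j - 1] + (truth_word != found_word)
            deletion = previous[j] + 1      # ground truth word missing
            insertion = current[j - 1] + 1  # extra word in the output
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ground_truth)


print(round(word_error_rate(["the", "quick", "fox"], ["the", "slow", "fox"]), 3))
# 0.333
print(word_error_rate(["w"] * 100, ["w"] * 200))  # 1.0 (100 insertions)
print(word_error_rate(["w"] * 100, ["w"] * 300))  # 2.0 (200 insertions)
```

The last two calls show how insertions alone push WER to 1.0 and beyond, since the denominator stays fixed at the ground truth length.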

When to prioritize:

Multi-column layouts (reading order critical)
Document understanding (semantic flow)
Reading aloud applications
Content summarization

Sequence Accuracy

Percentage of words in correct sequential position.

Range: 0.0 to 1.0 (higher is better)
Strict metric: word must be at exact position
Very sensitive to small ordering changes

Formula:

Sequence Accuracy = Correct_Position_Words / Total_Ground_Truth_Words

Example:

Ground truth: [“hello”, “world”, “from”, “biblicus”]
Extracted: [“hello”, “form”, “world”, “biblicus”]
“hello” at position 0 and “biblicus” at position 3 are correct
Sequence Accuracy: 2/4 = 0.500
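
A position-by-position comparison is enough to sketch this metric. The helper name `sequence_accuracy` is hypothetical, and treating positions past the end of the extracted list as wrong is an assumption.

```python
def sequence_accuracy(ground_truth: list[str], extracted: list[str]) -> float:
    """Fraction of ground truth positions holding the same word in the output."""
    if not ground_truth:
        return 0.0
    # zip stops at the shorter list, so positions with no counterpart
    # are never counted as matches.
    matches = sum(truth == found for truth, found in zip(ground_truth, extracted))
    return matches / len(ground_truth)


print(sequence_accuracy(["a", "b", "c", "d"], ["a", "b", "c", "d"]))  # 1.0
print(sequence_accuracy(["a", "b", "c", "d"], ["x", "a", "b", "c"]))  # 0.0
```

The second call shows why the metric is so strict: a single inserted word at the start shifts every later word out of position.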

Interpretation:

Very strict metric
Useful for exact order preservation
Often lower than other metrics

When to use:

Exact position matters (tables, forms)
Structured data extraction
Column-sensitive applications

LCS Ratio (Longest Common Subsequence)

Ratio of the longest common subsequence length to the ground truth length.

Range: 0.0 to 1.0 (higher is better)
More forgiving than sequence accuracy
Measures longest preserved ordering

Formula:

LCS Ratio = Length(LCS(ground_truth, extracted)) / Length(ground_truth)

Example:

Ground truth: “the quick brown fox jumps”
Extracted: “the brown quick fox”
LCS: “the brown fox” (length 3)
LCS Ratio: 3/5 = 0.600
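
The LCS length can be computed with the classic dynamic-programming recurrence. The sketch below assumes word lists as input and uses a hypothetical function name, `lcs_ratio`.

```python
def lcs_ratio(ground_truth: list[str], extracted: list[str]) -> float:
    """Longest common subsequence length divided by the ground truth length."""
    if not ground_truth:
        return 0.0

    # previous[j] is the LCS length of the ground truth words processed
    # so far and the first j extracted words.
    previous = [0] * (len(extracted) + 1)
    for truth_word in ground_truth:
        current = [0]
        for j, found_word in enumerate(extracted, start=1):
            if truth_word == found_word:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1] / len(ground_truth)


truth = "the quick brown fox jumps".split()
found = "the brown quick fox".split()
print(lcs_ratio(truth, found))  # 0.6
```

Unlike sequence accuracy, words only need to keep their relative order here, so a missing or extra word does not invalidate everything after it.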

Interpretation:

LCS ≥ 0.80: Excellent order preservation
LCS ≥ 0.65: Good order preservation
LCS ≥ 0.50: Acceptable ordering
LCS < 0.50: Poor ordering

When to use:

Primary metric for academic papers
Multi-column document evaluation
When partial ordering is acceptable

N-gram Overlap (Local Ordering)

N-gram metrics measure whether adjacent words (bigrams) or word triples (trigrams) appear in the correct order.

Bigram Overlap

Percentage of word pairs in correct order.

Range: 0.0 to 1.0 (higher is better)
Good for detecting column mixing
Measures local ordering quality

Formula:

Bigram Overlap = Matching_Bigrams / Total_Bigrams

Where a bigram is a pair of adjacent words: ("word1", "word2")

Example:

Ground truth: “hello world from biblicus”
Ground truth bigrams: (“hello”, “world”), (“world”, “from”), (“from”, “biblicus”)
Extracted: “hello world biblicus from”
Extracted bigrams: (“hello”, “world”), (“world”, “biblicus”), (“biblicus”, “from”)
Matching: (“hello”, “world”) only
Bigram Overlap: 1/3 = 0.333
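
The same counting can be sketched for any n-gram size. The names `ngrams` and `ngram_overlap` are hypothetical. Two assumptions are baked in: the denominator is the number of ground truth n-grams, and repeated n-grams are matched as a multiset.

```python
from collections import Counter


def ngrams(words: list[str], n: int) -> list[tuple[str, ...]]:
    """Slide a window of size n over the word list."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_overlap(ground_truth: list[str], extracted: list[str], n: int) -> float:
    """Share of ground truth n-grams that also occur in the extracted text."""
    truth_counts = Counter(ngrams(ground_truth, n))
    found_counts = Counter(ngrams(extracted, n))
    total = sum(truth_counts.values())
    if total == 0:
        return 0.0  # ground truth shorter than n words
    # Counter intersection keeps the minimum count of each shared n-gram.
    matching = sum((truth_counts & found_counts).values())
    return matching / total


truth = "hello world from biblicus".split()
found = "hello world biblicus from".split()
print(round(ngram_overlap(truth, found, n=2), 3))  # 0.333
```

Passing `n=3` gives the trigram variant of the same metric.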

Interpretation:

Bigram ≥ 0.70: Excellent local ordering
Bigram ≥ 0.55: Good local ordering
Bigram ≥ 0.40: Acceptable ordering
Bigram < 0.40: Poor local ordering

When to prioritize:

Layout-aware OCR evaluation
Multi-column document assessment
Reading flow quality

Trigram Overlap

Percentage of word triples in correct order.

Range: 0.0 to 1.0 (higher is better)
More sensitive than bigram
Stricter local ordering requirement

Formula:

Trigram Overlap = Matching_Trigrams / Total_Trigrams

Where a trigram is three adjacent words: ("word1", "word2", "word3")

Example:

Ground truth: “the quick brown fox”
Ground truth trigrams: (“the”, “quick”, “brown”), (“quick”, “brown”, “fox”)
Extracted: “the brown quick fox”
Extracted trigrams: (“the”, “brown”, “quick”), (“brown”, “quick”, “fox”)
Matching: none
Trigram Overlap: 0/2 = 0.0

Interpretation:

Usually lower than bigram overlap
More strict ordering requirement
Good for detailed analysis

When to use:

Fine-grained ordering analysis
Detailed pipeline comparison
Academic evaluation

Metric Trade-offs

Precision vs. Recall

Different pipelines optimize for different trade-offs:

High Precision, Lower Recall (Conservative):

Example: Baseline Tesseract (Precision: 0.615, Recall: 0.599)
Few false positives, but misses some text
Best for: Clean text applications, noise-sensitive systems

Lower Precision, High Recall (Aggressive):

Example: Heron + Tesseract (Precision: 0.384, Recall: 0.810)
Finds most text, but includes noise
Best for: Legal/compliance, maximum extraction

Balanced:

Example: PaddleOCR (Precision: 0.792, Recall: 0.782, F1: 0.787)
Good accuracy and completeness
Best for: General-purpose applications

Accuracy vs. Reading Order

You can have high F1 (finds words) but poor WER (wrong order):

Good F1, Poor WER:

Finds the right words but in wrong order
Common in multi-column documents
Example: Column text mixed together

Good F1, Good WER:

Finds the right words in right order
Ideal for most applications
Example: PaddleOCR with layout detection

Poor F1, Good WER:

Rare but possible: finds few words, but in the correct order
Usually indicates incomplete extraction

Choosing Metrics for Your Use Case

Forms (FUNSD-like)

Primary Metric: F1 Score

Measures field extraction accuracy
Balance of precision and recall matters

Secondary Metrics:

Recall (don’t miss fields)
WER (fields should be in order)

Target:

F1 ≥ 0.75 for production
Recall ≥ 0.70 minimum

Receipts (Dense Text)

Primary Metric: F1 Score

Entity extraction accuracy critical
Dense text needs high precision

Secondary Metrics:

Precision (avoid noise in entities)
Bigram (local ordering for amounts/dates)

Target:

F1 ≥ 0.80 for production
Precision ≥ 0.75 minimum

Academic Papers (Multi-Column)

Primary Metric: LCS Ratio

Reading order preservation critical
Multi-column layout understanding

Secondary Metrics:

F1 (overall accuracy)
WER (sequence quality)
Bigram (column mixing detection)

Target:

LCS ≥ 0.75 for production
Bigram ≥ 0.60 minimum

Legal/Compliance Documents

Primary Metric: Recall

Cannot miss any content
Completeness over accuracy

Secondary Metrics:

F1 (but tolerate lower precision)
WER (reading order matters)

Target:

Recall ≥ 0.90 minimum
F1 ≥ 0.70 acceptable
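
The targets in this section can be kept as data and checked automatically. The sketch below is illustrative only: `TARGETS` and `unmet_targets` are hypothetical names, the use-case keys are invented for the example, and the metric keys follow the `avg_*` names in the example report.

```python
# Minimum values per use case, copied from the targets listed in this guide.
TARGETS = {
    "forms": {"avg_f1": 0.75, "avg_recall": 0.70},
    "receipts": {"avg_f1": 0.80, "avg_precision": 0.75},
    "academic_papers": {"avg_lcs_ratio": 0.75, "avg_bigram_overlap": 0.60},
    "legal": {"avg_recall": 0.90, "avg_f1": 0.70},
}


def unmet_targets(use_case: str, metrics: dict[str, float]) -> dict[str, tuple[float, float]]:
    """Return the missed targets as {metric: (actual, required)}."""
    return {
        name: (metrics[name], minimum)
        for name, minimum in TARGETS[use_case].items()
        if metrics[name] < minimum
    }


metrics = {
    "avg_precision": 0.792,
    "avg_recall": 0.782,
    "avg_f1": 0.787,
    "avg_lcs_ratio": 0.621,
    "avg_bigram_overlap": 0.521,
}
print(unmet_targets("forms", metrics))     # {}
print(unmet_targets("receipts", metrics))  # {'avg_f1': (0.787, 0.8)}
```

An empty result means every listed target for that use case is met.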

Interpreting Benchmark Results

Example Report

{
  "pipeline_name": "paddleocr",
  "metrics": {
    "set_based": {
      "avg_precision": 0.792,
      "avg_recall": 0.782,
      "avg_f1": 0.787
    },
    "order_aware": {
      "avg_wer": 0.533,
      "avg_sequence_accuracy": 0.031,
      "avg_lcs_ratio": 0.621
    },
    "ngram": {
      "avg_bigram_overlap": 0.521,
      "avg_trigram_overlap": 0.412
    }
  }
}

Interpretation

Set-Based Metrics:

F1: 0.787 → Excellent accuracy (the harmonic mean of precision and recall, above the 0.75 cutoff)
Balanced precision (79.2%) and recall (78.2%)

Order-Aware Metrics:

WER: 0.533 → Acceptable reading order (just above the 0.50 cutoff for “good”; may be tolerable for forms)
LCS: 0.621 → Acceptable order preservation (just below the 0.65 cutoff for “good”)

N-gram Metrics:

Bigram: 0.521 → Acceptable local ordering (52% of word pairs correct; below the 0.55 cutoff for “good”)

Overall: Strong all-around pipeline. High F1 for accuracy, acceptable reading order for forms.
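
Applying the cutoffs from the interpretation lists in this guide can be automated. The following is a sketch under assumptions: `BANDS`, `rate`, and `rate_report` are hypothetical names, the report is abbreviated, and values outside the listed bands (for example WER between 0.70 and 1.0) are treated as "poor".

```python
import json

# Cutoffs copied from the interpretation lists in this guide, best band first.
# The boolean says whether higher values are better for that metric.
BANDS = {
    "avg_f1": (True, [(0.75, "excellent"), (0.65, "good"), (0.50, "acceptable")]),
    "avg_recall": (True, [(0.80, "excellent"), (0.70, "good"), (0.60, "acceptable")]),
    "avg_wer": (False, [(0.30, "excellent"), (0.50, "good"), (0.70, "acceptable")]),
    "avg_lcs_ratio": (True, [(0.80, "excellent"), (0.65, "good"), (0.50, "acceptable")]),
    "avg_bigram_overlap": (True, [(0.70, "excellent"), (0.55, "good"), (0.40, "acceptable")]),
}


def rate(metric: str, value: float) -> str:
    """Return the first band whose cutoff the value satisfies, else 'poor'."""
    higher_is_better, bands = BANDS[metric]
    for cutoff, label in bands:
        passed = value >= cutoff if higher_is_better else value <= cutoff
        if passed:
            return label
    return "poor"  # assumption: anything outside the listed bands is poor


def rate_report(report_json: str) -> dict[str, str]:
    """Rate every metric in a report that has cutoffs defined in BANDS."""
    report = json.loads(report_json)
    return {
        metric: rate(metric, value)
        for group in report["metrics"].values()
        for metric, value in group.items()
        if metric in BANDS
    }


REPORT = """
{"pipeline_name": "paddleocr",
 "metrics": {"set_based": {"avg_precision": 0.792, "avg_recall": 0.782, "avg_f1": 0.787},
             "order_aware": {"avg_wer": 0.533, "avg_lcs_ratio": 0.621},
             "ngram": {"avg_bigram_overlap": 0.521}}}
"""
for metric, label in rate_report(REPORT).items():
    print(f"{metric}: {label}")
```

Metrics without published cutoffs, such as precision and sequence accuracy, are skipped rather than guessed at.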

Comparing Two Pipelines

Scenario: PaddleOCR vs. Heron+Tesseract

Metric	PaddleOCR	Heron+Tesseract	Winner
F1	0.787	0.519	PaddleOCR
Recall	0.782	0.810	Heron+Tesseract
Precision	0.792	0.384	PaddleOCR
WER (lower is better)	0.533	0.612	PaddleOCR
Bigram	0.521	0.561	Heron+Tesseract

Analysis:

PaddleOCR: Better overall accuracy (F1), cleaner output (precision), better reading order (WER)
Heron+Tesseract: Finds more text (recall), better local ordering (bigram)

Choose PaddleOCR if: You need clean, accurate extraction
Choose Heron if: You need maximum extraction, can tolerate noise
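
A per-metric comparison like the one above is easy to script once you record which metrics are better when lower. This is a small sketch; `RESULTS`, `LOWER_IS_BETTER`, and `winners` are hypothetical names, and the values are copied from the comparison table.

```python
LOWER_IS_BETTER = {"WER"}  # every other metric in the table is higher-is-better

RESULTS = {
    "PaddleOCR": {"F1": 0.787, "Recall": 0.782, "Precision": 0.792, "WER": 0.533, "Bigram": 0.521},
    "Heron+Tesseract": {"F1": 0.519, "Recall": 0.810, "Precision": 0.384, "WER": 0.612, "Bigram": 0.561},
}


def winners(results: dict[str, dict[str, float]]) -> dict[str, str]:
    """Map each metric to the pipeline with the best value for it."""
    metric_names = next(iter(results.values())).keys()
    table = {}
    for metric in metric_names:
        pick = min if metric in LOWER_IS_BETTER else max
        table[metric] = pick(results, key=lambda name: results[name][metric])
    return table


for metric, pipeline in winners(RESULTS).items():
    print(f"{metric}: {pipeline}")
```

Counting wins is only a starting point; which metric matters most depends on the use case, as described in the section on choosing metrics.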

Metric Calculation Details

Set-Based Calculation

Words are normalized before comparison:

Lowercase
Remove punctuation
Trim whitespace

Example:

Ground truth: “Hello, World!”
Extracted: “hello world”
Match: Both normalize to [“hello”, “world”]
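
The three normalization steps can be sketched as follows. The function name `normalize` is hypothetical, and the sketch assumes only ASCII punctuation needs removing; the actual Biblicus rules may differ.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCTUATION = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> list[str]:
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    cleaned = text.lower().translate(_PUNCTUATION)
    # split() with no arguments also discards leading and trailing whitespace.
    return cleaned.split()


print(normalize("Hello, World!"))  # ['hello', 'world']
print(normalize("hello world"))    # ['hello', 'world']
```

Typographic punctuation such as curly quotes is outside `string.punctuation` and would need to be added to the table separately.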

Order-Aware Calculation

Word sequences are aligned using edit distance, and each difference is counted as one of:

Insertions: Extra words in extracted text
Deletions: Missing words from ground truth
Substitutions: Wrong words in extracted text

Example:

Ground truth: [“the”, “quick”, “fox”]
Extracted: [“the”, “slow”, “fox”]
Operations: 1 substitution (quick→slow)
WER: 1/3 = 0.333

N-gram Calculation

Sliding window over word sequences:

Bigram: Window size 2
Trigram: Window size 3

Example bigram calculation:

Ground truth: [“a”, “b”, “c”]
Bigrams: [(“a”, “b”), (“b”, “c”)]
Extracted: [“a”, “c”, “b”]
Bigrams: [(“a”, “c”), (“c”, “b”)]
Matching: none
Bigram overlap: 0/2 = 0.0

Tools and Libraries

Biblicus uses:

editdistance (optional): Fast Levenshtein distance for WER
difflib: Python standard library for sequence matching
nltk (optional): Advanced text normalization

Without editdistance, WER calculation falls back to difflib (slower but works).
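
An optional-dependency fallback of this kind is commonly written with a guarded import. The sketch below is an assumption about how such a fallback could look, not a copy of the Biblicus code; `difflib_edit_count` and `word_edit_count` are hypothetical names. Note that counting edits from `SequenceMatcher` opcodes gives an upper bound on the Levenshtein distance, so the two paths can disagree on some inputs.

```python
from difflib import SequenceMatcher

try:
    import editdistance  # optional, fast Levenshtein implementation
except ImportError:
    editdistance = None


def difflib_edit_count(ground_truth: list[str], extracted: list[str]) -> int:
    """Count word edits from SequenceMatcher opcodes (may overestimate)."""
    matcher = SequenceMatcher(None, ground_truth, extracted, autojunk=False)
    edits = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # A changed block costs as many edits as its longer side:
            # substitutions for the overlap, insertions or deletions for the rest.
            edits += max(i2 - i1, j2 - j1)
    return edits


def word_edit_count(ground_truth: list[str], extracted: list[str]) -> int:
    """Use editdistance when it is installed, otherwise fall back to difflib."""
    if editdistance is not None:
        return editdistance.eval(ground_truth, extracted)
    return difflib_edit_count(ground_truth, extracted)


edits = word_edit_count(["the", "quick", "fox"], ["the", "slow", "fox"])
print(edits / 3)  # WER for this pair
```

Dividing the edit count by the number of ground truth words gives WER, as in the formula earlier in this guide.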

Next Steps

Run benchmarks to generate metrics on your data
Explore pipelines to understand trade-offs
View current results to see metric examples
OCR Benchmarking Guide for practical evaluation

References

Metrics implementation: src/biblicus/evaluation/ocr_benchmark.py
Benchmark runner: src/biblicus/evaluation/benchmark_runner.py
Multi-Category Benchmark Framework
Benchmarking Overview