Metrics Reference
Complete guide to understanding evaluation metrics used in Biblicus benchmarks.
Overview
Biblicus uses three categories of metrics to evaluate extraction quality:
Set-Based Metrics: Measure word-finding ability (position-agnostic)
Order-Aware Metrics: Measure reading order preservation (sequence quality)
N-gram Overlap: Measure local ordering quality (word pair/triple accuracy)
Each metric tells you something different about extraction quality. A good extraction pipeline needs:
High F1 for accuracy (finding the right words)
Low WER for reading order (words in correct sequence)
High bigram/trigram for local ordering (adjacent words correct)
Set-Based Metrics (Position-Agnostic)
Set-based metrics measure how well the extraction finds words, regardless of their order. These are the primary accuracy metrics.
F1 Score
Harmonic mean of precision and recall.
Range: 0.0 to 1.0 (higher is better)
Primary metric for overall accuracy
Balances precision and recall
Formula:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Interpretation:
F1 ≥ 0.75: Excellent accuracy
F1 ≥ 0.65: Good accuracy
F1 ≥ 0.50: Acceptable for some use cases
F1 < 0.50: Poor accuracy
Example:
Ground truth: “hello world from biblicus”
Extracted: “hello world form”
Precision: 2/3 = 0.667 (2 correct out of 3 extracted)
Recall: 2/4 = 0.500 (2 found out of 4 ground truth)
F1: 2 × (0.667 × 0.500) / (0.667 + 0.500) = 0.571
When to prioritize:
General-purpose OCR evaluation
Comparing pipeline overall quality
Production system benchmarking
Precision
Percentage of extracted words that are correct.
Range: 0.0 to 1.0 (higher is better)
Measures false positive rate
High precision = few extra/wrong words
Formula:
Precision = TP / (TP + FP)
Where:
TP = True Positives (correct words found)
FP = False Positives (incorrect words extracted)
Interpretation:
High Precision, Low Recall: Conservative extraction (misses text but rarely wrong)
Low Precision, High Recall: Aggressive extraction (finds everything but noisy)
Balanced: Best for most use cases
Example:
Extracted: “hello world form biblicus”
Ground truth: “hello world from biblicus”
Correct words: hello, world, biblicus (3 out of 4 extracted)
Precision: 3/4 = 0.750
When to prioritize:
Noise is expensive (false positives costly)
Downstream processing assumes clean text
Indexing/search where quality matters
Recall
Percentage of ground truth words that were found.
Range: 0.0 to 1.0 (higher is better)
Measures completeness
High recall = finds most words
Formula:
Recall = TP / (TP + FN)
Where:
TP = True Positives (correct words found)
FN = False Negatives (ground truth words missed)
Interpretation:
Recall ≥ 0.80: Excellent completeness
Recall ≥ 0.70: Good completeness
Recall ≥ 0.60: Acceptable for some use cases
Recall < 0.60: Missing too much content
Example:
Ground truth: “hello world from biblicus system”
Extracted: “hello world biblicus”
Found words: hello, world, biblicus (3 out of 5)
Recall: 3/5 = 0.600
When to prioritize:
Missing content is expensive
Legal/compliance documents (can’t skip text)
Search applications (need everything indexed)
Maximum extraction scenarios
Character Accuracy
Character-level correctness metric.
Range: 0.0 to 1.0 (higher is better)
More fine-grained than word-level metrics
Useful for partial word matches
When to use:
OCR quality assessment at character level
Evaluating partial word extraction
Fine-grained accuracy analysis
Order-Aware Metrics (Sequence Quality)
Order-aware metrics measure whether words appear in the correct sequence. These are critical for layout-aware OCR evaluation.
Word Error Rate (WER)
Edit distance normalized by ground truth length.
Range: 0.0+ (lower is better, can exceed 1.0)
Critical for layout-aware OCR
Counts insertions, deletions, substitutions
Formula:
WER = (Insertions + Deletions + Substitutions) / Total_Ground_Truth_Words
Interpretation:
WER ≤ 0.30: Excellent reading order
WER ≤ 0.50: Good reading order
WER ≤ 0.70: Acceptable for some use cases
WER > 1.0: More errors than words (very poor)
Example:
Ground truth: “hello world from biblicus”
Extracted: “hello form world biblicus”
Operations: 1 substitution (form→from) + 0 deletions + 0 insertions = 1
WER: 1/4 = 0.250
Why WER can exceed 1.0: If you have 100 ground truth words but extract 200 words (100 correct + 100 insertions), WER = 100/100 = 1.0. More insertions push it higher.
When to prioritize:
Multi-column layouts (reading order critical)
Document understanding (semantic flow)
Reading aloud applications
Content summarization
Sequence Accuracy
Percentage of words in correct sequential position.
Range: 0.0 to 1.0 (higher is better)
Strict metric: word must be at exact position
Very sensitive to small ordering changes
Formula:
Sequence Accuracy = Correct_Position_Words / Total_Ground_Truth_Words
Example:
Ground truth: [“hello”, “world”, “from”, “biblicus”]
Extracted: [“hello”, “form”, “world”, “biblicus”]
Only “hello” at position 0 is correct
Sequence Accuracy: 1/4 = 0.250
Interpretation:
Very strict metric
Useful for exact order preservation
Often lower than other metrics
When to use:
Exact position matters (tables, forms)
Structured data extraction
Column-sensitive applications
LCS Ratio (Longest Common Subsequence)
Ratio of longest ordered subsequence to total.
Range: 0.0 to 1.0 (higher is better)
More forgiving than sequence accuracy
Measures longest preserved ordering
Formula:
LCS Ratio = Length(LCS(ground_truth, extracted)) / Length(ground_truth)
Example:
Ground truth: “the quick brown fox jumps”
Extracted: “the brown quick fox”
LCS: “the brown fox” (length 3)
LCS Ratio: 3/5 = 0.600
Interpretation:
LCS ≥ 0.80: Excellent order preservation
LCS ≥ 0.65: Good order preservation
LCS ≥ 0.50: Acceptable ordering
LCS < 0.50: Poor ordering
When to use:
Primary metric for academic papers
Multi-column document evaluation
When partial ordering is acceptable
N-gram Overlap (Local Ordering)
N-gram metrics measure whether adjacent words (bigrams) or word triples (trigrams) appear in the correct order.
Bigram Overlap
Percentage of word pairs in correct order.
Range: 0.0 to 1.0 (higher is better)
Good for detecting column mixing
Measures local ordering quality
Formula:
Bigram Overlap = Matching_Bigrams / Total_Bigrams
Where a bigram is a pair of adjacent words: ("word1", "word2")
Example:
Ground truth: “hello world from biblicus”
Ground truth bigrams: (“hello”, “world”), (“world”, “from”), (“from”, “biblicus”)
Extracted: “hello world biblicus from”
Extracted bigrams: (“hello”, “world”), (“world”, “biblicus”), (“biblicus”, “from”)
Matching: (“hello”, “world”) only
Bigram Overlap: 1/3 = 0.333
Interpretation:
Bigram ≥ 0.70: Excellent local ordering
Bigram ≥ 0.55: Good local ordering
Bigram ≥ 0.40: Acceptable ordering
Bigram < 0.40: Poor local ordering
When to prioritize:
Layout-aware OCR evaluation
Multi-column document assessment
Reading flow quality
Trigram Overlap
Percentage of word triples in correct order.
Range: 0.0 to 1.0 (higher is better)
More sensitive than bigram
Stricter local ordering requirement
Formula:
Trigram Overlap = Matching_Trigrams / Total_Trigrams
Where a trigram is three adjacent words: ("word1", "word2", "word3")
Example:
Ground truth: “the quick brown fox”
Ground truth trigrams: (“the”, “quick”, “brown”), (“quick”, “brown”, “fox”)
Extracted: “the brown quick fox”
Extracted trigrams: (“the”, “brown”, “quick”), (“brown”, “quick”, “fox”)
Matching: none
Trigram Overlap: 0/2 = 0.0
Interpretation:
Usually lower than bigram overlap
More strict ordering requirement
Good for detailed analysis
When to use:
Fine-grained ordering analysis
Detailed pipeline comparison
Academic evaluation
Metric Trade-offs
Precision vs. Recall
Different pipelines optimize for different trade-offs:
High Precision, Lower Recall (Conservative):
Example: Baseline Tesseract (Precision: 0.615, Recall: 0.599)
Few false positives, but misses some text
Best for: Clean text applications, noise-sensitive systems
Lower Precision, High Recall (Aggressive):
Example: Heron + Tesseract (Precision: 0.384, Recall: 0.810)
Finds most text, but includes noise
Best for: Legal/compliance, maximum extraction
Balanced:
Example: PaddleOCR (Precision: 0.792, Recall: 0.782, F1: 0.787)
Good accuracy and completeness
Best for: General-purpose applications
Accuracy vs. Reading Order
You can have high F1 (finds words) but poor WER (wrong order):
Good F1, Poor WER:
Finds the right words but in wrong order
Common in multi-column documents
Example: Column text mixed together
Good F1, Good WER:
Finds the right words in right order
Ideal for most applications
Example: PaddleOCR with layout detection
Poor F1, Good WER:
Rare but possible - finds few words but in correct order
Usually indicates incomplete extraction
Choosing Metrics for Your Use Case
Forms (FUNSD-like)
Primary Metric: F1 Score
Measures field extraction accuracy
Balance of precision and recall matters
Secondary Metrics:
Recall (don’t miss fields)
WER (fields should be in order)
Target:
F1 ≥ 0.75 for production
Recall ≥ 0.70 minimum
Receipts (Dense Text)
Primary Metric: F1 Score
Entity extraction accuracy critical
Dense text needs high precision
Secondary Metrics:
Precision (avoid noise in entities)
Bigram (local ordering for amounts/dates)
Target:
F1 ≥ 0.80 for production
Precision ≥ 0.75 minimum
Academic Papers (Multi-Column)
Primary Metric: LCS Ratio
Reading order preservation critical
Multi-column layout understanding
Secondary Metrics:
F1 (overall accuracy)
WER (sequence quality)
Bigram (column mixing detection)
Target:
LCS ≥ 0.75 for production
Bigram ≥ 0.60 minimum
Legal/Compliance Documents
Primary Metric: Recall
Cannot miss any content
Completeness over accuracy
Secondary Metrics:
F1 (but tolerate lower precision)
WER (reading order matters)
Target:
Recall ≥ 0.90 minimum
F1 ≥ 0.70 acceptable
Interpreting Benchmark Results
Example Report
{
"pipeline_name": "paddleocr",
"metrics": {
"set_based": {
"avg_precision": 0.792,
"avg_recall": 0.782,
"avg_f1": 0.787
},
"order_aware": {
"avg_wer": 0.533,
"avg_sequence_accuracy": 0.031,
"avg_lcs_ratio": 0.621
},
"ngram": {
"avg_bigram_overlap": 0.521,
"avg_trigram_overlap": 0.412
}
}
}
Interpretation
Set-Based Metrics:
F1: 0.787 → Excellent accuracy (finds 78.7% of words correctly)
Balanced precision (79.2%) and recall (78.2%)
Order-Aware Metrics:
WER: 0.533 → Good reading order (53% error rate acceptable for forms)
LCS: 0.621 → Good longest sequence preservation
N-gram Metrics:
Bigram: 0.521 → Good local ordering (52% word pairs correct)
Overall: Strong all-around pipeline. High F1 for accuracy, acceptable reading order for forms.
Comparing Two Pipelines
Scenario: PaddleOCR vs. Heron+Tesseract
Metric |
PaddleOCR |
Heron+Tesseract |
Winner |
|---|---|---|---|
F1 |
0.787 |
0.519 |
PaddleOCR |
Recall |
0.782 |
0.810 |
Heron |
Precision |
0.792 |
0.384 |
PaddleOCR |
WER |
0.533 |
0.612 |
PaddleOCR |
Bigram |
0.521 |
0.561 |
Heron |
Analysis:
PaddleOCR: Better overall accuracy (F1), cleaner output (precision), better reading order (WER)
Heron+Tesseract: Finds more text (recall), better local ordering (bigram)
Choose PaddleOCR if: You need clean, accurate extraction Choose Heron if: You need maximum extraction, can tolerate noise
Metric Calculation Details
Set-Based Calculation
Words are normalized before comparison:
Lowercase
Remove punctuation
Trim whitespace
Example:
Ground truth: “Hello, World!”
Extracted: “hello world”
Match: Both normalize to [“hello”, “world”]
Order-Aware Calculation
Word sequences are compared position-by-position:
Insertions: Extra words in extracted text
Deletions: Missing words from ground truth
Substitutions: Wrong words in extracted text
Example:
Ground truth: [“the”, “quick”, “fox”]
Extracted: [“the”, “slow”, “fox”]
Operations: 1 substitution (quick→slow)
WER: 1/3 = 0.333
N-gram Calculation
Sliding window over word sequences:
Bigram: Window size 2
Trigram: Window size 3
Example bigram calculation:
Ground truth: [“a”, “b”, “c”]
Bigrams: [(“a”, “b”), (“b”, “c”)]
Extracted: [“a”, “c”, “b”]
Bigrams: [(“a”, “c”), (“c”, “b”)]
Matching: none
Bigram overlap: 0/2 = 0.0
Tools and Libraries
Biblicus uses:
editdistance (optional): Fast Levenshtein distance for WER
difflib: Python standard library for sequence matching
nltk (optional): Advanced text normalization
Without editdistance, WER calculation falls back to difflib (slower but works).
Next Steps
Run benchmarks to generate metrics on your data
Explore pipelines to understand trade-offs
View current results to see metric examples
OCR Benchmarking Guide for practical evaluation
References
Metrics implementation:
src/biblicus/evaluation/ocr_benchmark.pyBenchmark runner:
src/biblicus/evaluation/benchmark_runner.py