Metrics Reference

Complete guide to understanding evaluation metrics used in Biblicus benchmarks.

Overview

Biblicus uses three categories of metrics to evaluate extraction quality:

  1. Set-Based Metrics: Measure word-finding ability (position-agnostic)

  2. Order-Aware Metrics: Measure reading order preservation (sequence quality)

  3. N-gram Overlap: Measure local ordering quality (word pair/triple accuracy)

Each metric tells you something different about extraction quality. A good extraction pipeline needs:

  • High F1 for accuracy (finding the right words)

  • Low WER for reading order (words in correct sequence)

  • High bigram/trigram for local ordering (adjacent words correct)

Set-Based Metrics (Position-Agnostic)

Set-based metrics measure how well the extraction finds words, regardless of their order. These are the primary accuracy metrics.

F1 Score

Harmonic mean of precision and recall.

  • Range: 0.0 to 1.0 (higher is better)

  • Primary metric for overall accuracy

  • Balances precision and recall

Formula:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Interpretation:

  • F1 ≥ 0.75: Excellent accuracy

  • F1 ≥ 0.65: Good accuracy

  • F1 ≥ 0.50: Acceptable for some use cases

  • F1 < 0.50: Poor accuracy

Example:

  • Ground truth: “hello world from biblicus”

  • Extracted: “hello world form”

  • Precision: 2/3 = 0.667 (2 correct out of 3 extracted)

  • Recall: 2/4 = 0.500 (2 found out of 4 ground truth)

  • F1: 2 × (0.667 × 0.500) / (0.667 + 0.500) = 0.571
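
The set-based arithmetic above can be sketched in a few lines (a simplified sketch using plain word sets; Biblicus's actual implementation may handle duplicates and normalization differently):

```python
def set_based_metrics(ground_truth: str, extracted: str) -> dict:
    """Compute position-agnostic precision, recall, and F1 over word sets."""
    gt_words = set(ground_truth.split())
    ex_words = set(extracted.split())
    true_positives = len(gt_words & ex_words)  # words both sides agree on
    precision = true_positives / len(ex_words) if ex_words else 0.0
    recall = true_positives / len(gt_words) if gt_words else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

m = set_based_metrics("hello world from biblicus", "hello world form")
# precision = 2/3, recall = 2/4, f1 ≈ 0.571
```

The same function reproduces the precision and recall examples in the sections below.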

When to prioritize:

  • General-purpose OCR evaluation

  • Comparing pipeline overall quality

  • Production system benchmarking


Precision

Percentage of extracted words that are correct.

  • Range: 0.0 to 1.0 (higher is better)

  • Measures false positive rate

  • High precision = few extra/wrong words

Formula:

Precision = TP / (TP + FP)

Where:
  TP = True Positives (correct words found)
  FP = False Positives (incorrect words extracted)

Interpretation:

  • High Precision, Low Recall: Conservative extraction (misses text but rarely wrong)

  • Low Precision, High Recall: Aggressive extraction (finds everything but noisy)

  • Balanced: Best for most use cases

Example:

  • Extracted: “hello world form biblicus”

  • Ground truth: “hello world from biblicus”

  • Correct words: hello, world, biblicus (3 out of 4 extracted)

  • Precision: 3/4 = 0.750

When to prioritize:

  • Noise is expensive (false positives costly)

  • Downstream processing assumes clean text

  • Indexing/search where quality matters


Recall

Percentage of ground truth words that were found.

  • Range: 0.0 to 1.0 (higher is better)

  • Measures completeness

  • High recall = finds most words

Formula:

Recall = TP / (TP + FN)

Where:
  TP = True Positives (correct words found)
  FN = False Negatives (ground truth words missed)

Interpretation:

  • Recall ≥ 0.80: Excellent completeness

  • Recall ≥ 0.70: Good completeness

  • Recall ≥ 0.60: Acceptable for some use cases

  • Recall < 0.60: Missing too much content

Example:

  • Ground truth: “hello world from biblicus system”

  • Extracted: “hello world biblicus”

  • Found words: hello, world, biblicus (3 out of 5)

  • Recall: 3/5 = 0.600

When to prioritize:

  • Missing content is expensive

  • Legal/compliance documents (can’t skip text)

  • Search applications (need everything indexed)

  • Maximum extraction scenarios


Character Accuracy

Character-level correctness metric.

  • Range: 0.0 to 1.0 (higher is better)

  • More fine-grained than word-level metrics

  • Useful for partial word matches

When to use:

  • OCR quality assessment at character level

  • Evaluating partial word extraction

  • Fine-grained accuracy analysis
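
One common definition of character accuracy is edit distance at the character level, normalized by ground-truth length (a sketch; the exact formula Biblicus uses may differ):

```python
def char_accuracy(ground_truth: str, extracted: str) -> float:
    """1 - (character edit distance / ground-truth length), floored at 0."""
    m, n = len(ground_truth), len(extracted)
    # Classic Levenshtein dynamic program over characters.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ground_truth[i - 1] == extracted[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution / match
        prev = curr
    return max(0.0, 1.0 - prev[n] / m) if m else 0.0

char_accuracy("biblicus", "biblicvs")  # 1 wrong char out of 8 -> 0.875
```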


Order-Aware Metrics (Sequence Quality)

Order-aware metrics measure whether words appear in the correct sequence. These are critical for layout-aware OCR evaluation.

Word Error Rate (WER)

Edit distance normalized by ground truth length.

  • Range: 0.0+ (lower is better, can exceed 1.0)

  • Critical for layout-aware OCR

  • Counts insertions, deletions, substitutions

Formula:

WER = (Insertions + Deletions + Substitutions) / Total_Ground_Truth_Words

Interpretation:

  • WER ≤ 0.30: Excellent reading order

  • WER ≤ 0.50: Good reading order

  • WER ≤ 0.70: Acceptable for some use cases

  • WER > 1.0: More errors than words (very poor)

Example:

  • Ground truth: “hello world from biblicus”

  • Extracted: “hello world form biblicus”

  • Operations: 1 substitution (from→form) + 0 deletions + 0 insertions = 1

  • WER: 1/4 = 0.250

Why WER can exceed 1.0: If you have 100 ground truth words but extract 200 words (100 correct + 100 insertions), WER = 100/100 = 1.0. More insertions push it higher.
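
Word-level edit distance is the same dynamic program run over words (a minimal sketch; the editdistance package, when installed, computes this faster):

```python
def word_error_rate(ground_truth: str, extracted: str) -> float:
    """Word-level Levenshtein distance divided by ground-truth length."""
    ref, hyp = ground_truth.split(), extracted.split()
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution / match
        prev = curr
    return prev[-1] / len(ref) if ref else 0.0

word_error_rate("hello world from biblicus", "hello world form biblicus")  # 1/4 = 0.25
word_error_rate("a", "b c d")  # 3 operations against 1 word -> WER 3.0
```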

When to prioritize:

  • Multi-column layouts (reading order critical)

  • Document understanding (semantic flow)

  • Reading aloud applications

  • Content summarization


Sequence Accuracy

Percentage of words in correct sequential position.

  • Range: 0.0 to 1.0 (higher is better)

  • Strict metric: word must be at exact position

  • Very sensitive to small ordering changes

Formula:

Sequence Accuracy = Correct_Position_Words / Total_Ground_Truth_Words

Example:

  • Ground truth: [“hello”, “world”, “from”, “biblicus”]

  • Extracted: [“hello”, “form”, “world”, “biblicus”]

  • “hello” at position 0 and “biblicus” at position 3 are correct

  • Sequence Accuracy: 2/4 = 0.500
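
A sketch of the position-by-position comparison (assuming extra extracted words beyond the ground-truth length are simply ignored):

```python
def sequence_accuracy(ground_truth: str, extracted: str) -> float:
    """Fraction of ground-truth positions holding the exact right word."""
    gt, ex = ground_truth.split(), extracted.split()
    correct = sum(1 for g, e in zip(gt, ex) if g == e)
    return correct / len(gt) if gt else 0.0

sequence_accuracy("a b c d", "a x c d")  # positions 0, 2, 3 match -> 0.75
```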

Interpretation:

  • Very strict metric

  • Useful for exact order preservation

  • Often lower than other metrics

When to use:

  • Exact position matters (tables, forms)

  • Structured data extraction

  • Column-sensitive applications


LCS Ratio (Longest Common Subsequence)

Ratio of longest ordered subsequence to total.

  • Range: 0.0 to 1.0 (higher is better)

  • More forgiving than sequence accuracy

  • Measures longest preserved ordering

Formula:

LCS Ratio = Length(LCS(ground_truth, extracted)) / Length(ground_truth)

Example:

  • Ground truth: “the quick brown fox jumps”

  • Extracted: “the brown quick fox”

  • LCS: “the brown fox” (length 3)

  • LCS Ratio: 3/5 = 0.600
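
The LCS length comes from the classic dynamic program (a sketch; difflib.SequenceMatcher is a close but inexact substitute, since its matching blocks are not guaranteed to form the longest subsequence):

```python
def lcs_ratio(ground_truth: str, extracted: str) -> float:
    """Length of longest common word subsequence over ground-truth length."""
    gt, ex = ground_truth.split(), extracted.split()
    # prev[j] holds LCS length of gt[:i-1] and ex[:j].
    prev = [0] * (len(ex) + 1)
    for i in range(1, len(gt) + 1):
        curr = [0] * (len(ex) + 1)
        for j in range(1, len(ex) + 1):
            if gt[i - 1] == ex[j - 1]:
                curr[j] = prev[j - 1] + 1
            else:
                curr[j] = max(prev[j], curr[j - 1])
        prev = curr
    return prev[-1] / len(gt) if gt else 0.0

lcs_ratio("the quick brown fox jumps", "the brown quick fox")  # LCS length 3 -> 0.6
```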

Interpretation:

  • LCS ≥ 0.80: Excellent order preservation

  • LCS ≥ 0.65: Good order preservation

  • LCS ≥ 0.50: Acceptable ordering

  • LCS < 0.50: Poor ordering

When to use:

  • Primary metric for academic papers

  • Multi-column document evaluation

  • When partial ordering is acceptable


N-gram Overlap (Local Ordering)

N-gram metrics measure whether adjacent words (bigrams) or word triples (trigrams) appear in the correct order.

Bigram Overlap

Percentage of word pairs in correct order.

  • Range: 0.0 to 1.0 (higher is better)

  • Good for detecting column mixing

  • Measures local ordering quality

Formula:

Bigram Overlap = Matching_Bigrams / Total_Bigrams

Where a bigram is a pair of adjacent words: ("word1", "word2")

Example:

  • Ground truth: “hello world from biblicus”

  • Ground truth bigrams: (“hello”, “world”), (“world”, “from”), (“from”, “biblicus”)

  • Extracted: “hello world biblicus from”

  • Extracted bigrams: (“hello”, “world”), (“world”, “biblicus”), (“biblicus”, “from”)

  • Matching: (“hello”, “world”) only

  • Bigram Overlap: 1/3 = 0.333
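
Both bigram and (with n=3) trigram overlap can be sketched as a set intersection over adjacent-word tuples, here normalized by the ground-truth n-gram count (an assumption; Biblicus may normalize differently):

```python
def ngram_overlap(ground_truth: str, extracted: str, n: int = 2) -> float:
    """Fraction of ground-truth n-grams that also appear in the extraction."""
    def ngrams(text: str) -> set:
        words = text.split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    gt_ngrams = ngrams(ground_truth)
    if not gt_ngrams:
        return 0.0
    return len(gt_ngrams & ngrams(extracted)) / len(gt_ngrams)

ngram_overlap("hello world from biblicus", "hello world biblicus from")  # 1/3
ngram_overlap("the quick brown fox", "the brown quick fox", n=3)         # 0/2 = 0.0
```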

Interpretation:

  • Bigram ≥ 0.70: Excellent local ordering

  • Bigram ≥ 0.55: Good local ordering

  • Bigram ≥ 0.40: Acceptable ordering

  • Bigram < 0.40: Poor local ordering

When to prioritize:

  • Layout-aware OCR evaluation

  • Multi-column document assessment

  • Reading flow quality


Trigram Overlap

Percentage of word triples in correct order.

  • Range: 0.0 to 1.0 (higher is better)

  • More sensitive than bigram

  • Stricter local ordering requirement

Formula:

Trigram Overlap = Matching_Trigrams / Total_Trigrams

Where a trigram is three adjacent words: ("word1", "word2", "word3")

Example:

  • Ground truth: “the quick brown fox”

  • Ground truth trigrams: (“the”, “quick”, “brown”), (“quick”, “brown”, “fox”)

  • Extracted: “the brown quick fox”

  • Extracted trigrams: (“the”, “brown”, “quick”), (“brown”, “quick”, “fox”)

  • Matching: none

  • Trigram Overlap: 0/2 = 0.0

Interpretation:

  • Usually lower than bigram overlap

  • More strict ordering requirement

  • Good for detailed analysis

When to use:

  • Fine-grained ordering analysis

  • Detailed pipeline comparison

  • Academic evaluation


Metric Trade-offs

Precision vs. Recall

Different pipelines optimize for different trade-offs:

High Precision, Lower Recall (Conservative):

  • Example: Baseline Tesseract (Precision: 0.615, Recall: 0.599)

  • Few false positives, but misses some text

  • Best for: Clean text applications, noise-sensitive systems

Lower Precision, High Recall (Aggressive):

  • Example: Heron + Tesseract (Precision: 0.384, Recall: 0.810)

  • Finds most text, but includes noise

  • Best for: Legal/compliance, maximum extraction

Balanced:

  • Example: PaddleOCR (Precision: 0.792, Recall: 0.782, F1: 0.787)

  • Good accuracy and completeness

  • Best for: General-purpose applications

Accuracy vs. Reading Order

You can have high F1 (finds words) but poor WER (wrong order):

Good F1, Poor WER:

  • Finds the right words but in wrong order

  • Common in multi-column documents

  • Example: Column text mixed together

Good F1, Good WER:

  • Finds the right words in right order

  • Ideal for most applications

  • Example: PaddleOCR with layout detection

Poor F1, Good WER:

  • Rare but possible - finds few words but in correct order

  • Usually indicates incomplete extraction


Choosing Metrics for Your Use Case

Forms (FUNSD-like)

Primary Metric: F1 Score

  • Measures field extraction accuracy

  • Balance of precision and recall matters

Secondary Metrics:

  • Recall (don’t miss fields)

  • WER (fields should be in order)

Target:

  • F1 ≥ 0.75 for production

  • Recall ≥ 0.70 minimum


Receipts (Dense Text)

Primary Metric: F1 Score

  • Entity extraction accuracy critical

  • Dense text needs high precision

Secondary Metrics:

  • Precision (avoid noise in entities)

  • Bigram (local ordering for amounts/dates)

Target:

  • F1 ≥ 0.80 for production

  • Precision ≥ 0.75 minimum


Academic Papers (Multi-Column)

Primary Metric: LCS Ratio

  • Reading order preservation critical

  • Multi-column layout understanding

Secondary Metrics:

  • F1 (overall accuracy)

  • WER (sequence quality)

  • Bigram (column mixing detection)

Target:

  • LCS ≥ 0.75 for production

  • Bigram ≥ 0.60 minimum



Interpreting Benchmark Results

Example Report

{
  "pipeline_name": "paddleocr",
  "metrics": {
    "set_based": {
      "avg_precision": 0.792,
      "avg_recall": 0.782,
      "avg_f1": 0.787
    },
    "order_aware": {
      "avg_wer": 0.533,
      "avg_sequence_accuracy": 0.031,
      "avg_lcs_ratio": 0.621
    },
    "ngram": {
      "avg_bigram_overlap": 0.521,
      "avg_trigram_overlap": 0.412
    }
  }
}

Interpretation

Set-Based Metrics:

  • F1: 0.787 → Excellent accuracy (above the 0.75 excellent threshold)

  • Balanced precision (79.2%) and recall (78.2%)

Order-Aware Metrics:

  • WER: 0.533 → Acceptable reading order (a 53% error rate is workable for forms)

  • LCS: 0.621 → Good longest sequence preservation

N-gram Metrics:

  • Bigram: 0.521 → Acceptable local ordering (52% of word pairs correct)

Overall: Strong all-around pipeline. High F1 for accuracy, acceptable reading order for forms.


Comparing Two Pipelines

Scenario: PaddleOCR vs. Heron+Tesseract

Metric      PaddleOCR   Heron+Tesseract   Winner
F1          0.787       0.519             PaddleOCR
Recall      0.782       0.810             Heron
Precision   0.792       0.384             PaddleOCR
WER         0.533       0.612             PaddleOCR
Bigram      0.521       0.561             Heron

Analysis:

  • PaddleOCR: Better overall accuracy (F1), cleaner output (precision), better reading order (WER)

  • Heron+Tesseract: Finds more text (recall), better local ordering (bigram)

Choose PaddleOCR if: You need clean, accurate extraction

Choose Heron if: You need maximum extraction and can tolerate noise


Metric Calculation Details

Set-Based Calculation

Words are normalized before comparison:

  • Lowercase

  • Remove punctuation

  • Trim whitespace

Example:

  • Ground truth: “Hello, World!”

  • Extracted: “hello world”

  • Match: Both normalize to [“hello”, “world”]
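
The normalization steps above might look like this (a sketch):

```python
import string

def normalize_words(text: str) -> list:
    """Lowercase, strip punctuation, and trim whitespace before comparison."""
    table = str.maketrans("", "", string.punctuation)
    words = (w.translate(table) for w in text.lower().split())
    return [w for w in words if w]  # drop words that were pure punctuation

normalize_words("Hello, World!")  # ["hello", "world"]
```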

Order-Aware Calculation

Word sequences are compared position-by-position:

  • Insertions: Extra words in extracted text

  • Deletions: Missing words from ground truth

  • Substitutions: Wrong words in extracted text

Example:

  • Ground truth: [“the”, “quick”, “fox”]

  • Extracted: [“the”, “slow”, “fox”]

  • Operations: 1 substitution (quick→slow)

  • WER: 1/3 = 0.333

N-gram Calculation

Sliding window over word sequences:

  • Bigram: Window size 2

  • Trigram: Window size 3

Example bigram calculation:

  • Ground truth: [“a”, “b”, “c”]

  • Bigrams: [(“a”, “b”), (“b”, “c”)]

  • Extracted: [“a”, “c”, “b”]

  • Bigrams: [(“a”, “c”), (“c”, “b”)]

  • Matching: none

  • Bigram overlap: 0/2 = 0.0


Tools and Libraries

Biblicus uses:

  • editdistance (optional): Fast Levenshtein distance for WER

  • difflib: Python standard library for sequence matching

  • nltk (optional): Advanced text normalization

Without editdistance, WER calculation falls back to difflib (slower but works).
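
The optional-dependency pattern might look like this (a sketch of the pattern, not Biblicus's actual code; note the difflib fallback computes an insertion/deletion-only distance, an upper bound on true Levenshtein distance):

```python
try:
    import editdistance

    def word_distance(ref, hyp):
        # Fast C implementation of Levenshtein distance over word lists.
        return editdistance.eval(ref, hyp)
except ImportError:
    import difflib

    def word_distance(ref, hyp):
        # Fallback: matching blocks form a common subsequence; every
        # remaining word must be inserted or deleted.
        matcher = difflib.SequenceMatcher(None, ref, hyp)
        matched = sum(block.size for block in matcher.get_matching_blocks())
        return len(ref) + len(hyp) - 2 * matched

word_distance("a b c".split(), "a b c".split())  # identical sequences -> 0
```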

