# Heron Layout Detection Implementation - COMPLETE **Date:** 2026-02-03 **Status:** ✅ Fully Implemented and Benchmarked ## Summary Successfully implemented IBM Research's Heron layout detection models - the ACTUAL tool was described in their OCR workflow. The complete pipeline (Heron + Tesseract) has been implemented, tested, and benchmarked. --- ## What is Heron? - **IBM Research's state-of-the-art document layout analysis models** - Released September 2025 (arXiv:2509.11720) - Part of the Docling project - Publicly available on HuggingFace under Apache 2.0 license - 686K+ downloads/month **Models:** - `ds4sd/docling-layout-heron` (42.9M params) - `ds4sd/docling-layout-heron-101` (76.7M params, 78% mAP) --- ## Implementation ### Files Created/Modified **New Files:** 1. [`src/biblicus/extractors/heron_layout.py`](src/biblicus/extractors/heron_layout.py) - Heron layout detection extractor 2. [`configs/heron-tesseract.yaml`](configs/heron-tesseract.yaml) - Pipeline configuration 3. [`scripts/test_heron_pipeline.py`](scripts/test_heron_pipeline.py) - Test script 4. [`scripts/benchmark_heron_vs_paddleocr.py`](scripts/benchmark_heron_vs_paddleocr.py) - Benchmark script 5. [`results/heron_tesseract.json`](results/heron_tesseract.json) - Benchmark results 6. [`results/heron_vs_paddleocr_comparison.json`](results/heron_vs_paddleocr_comparison.json) - Comparison 7. [`docs/guides/ocr-benchmarking.md`](docs/guides/ocr-benchmarking.md) - Complete benchmarking guide **Modified Files:** 1. [`src/biblicus/extractors/__init__.py`](src/biblicus/extractors/__init__.py) - Registered HeronLayoutExtractor 2. [`benchmark-results.md`](benchmark-results.md) - Added Heron results 3. [`GITHUB_ISSUES.md`](GITHUB_ISSUES.md) - Updated Issue #4 with completion status --- ## workflow-based Complete Workflow Status ### Part 1: Docling for Tables/Formulas ✅ WORKING - **Pipeline:** `docling-smol` - **Performance:** F1: 0.728, Recall: 0.675 - **Rank:** #2 out of 9 pipelines ### Part 2: Gibberish Filtering 📋 DOCUMENTED - **Status:** Pattern documented in [GITHUB_ISSUES.md](GITHUB_ISSUES.md) Issue #1 - **Implementation:** Future work ### Part 3: Heron + Tesseract ✅ WORKING - **Pipeline:** `heron-tesseract` - **Performance:** F1: 0.519, Recall: 0.810 (HIGHEST) - **What it does:** Layout detection → Region-based OCR → Text reconstruction --- ## Benchmark Results Tested on 5 FUNSD scanned form documents with ground truth annotations. ### Heron-101 + Tesseract Performance | Metric | Score | Comparison | |--------|-------|------------| | **F1 Score** | 0.519 | -13.7% vs PaddleOCR layout | | **Recall** | **0.810** | **+10.6% vs PaddleOCR layout** ⭐ HIGHEST | | **Precision** | 0.384 | -26.6% vs PaddleOCR layout | | **Word Error Rate** | 5.324 | +74.2% vs PaddleOCR layout | | **Sequence Accuracy** | 0.012 | -71.9% vs PaddleOCR layout | | **Bigram Overlap** | **0.561** | **+14.4% vs PaddleOCR layout** ⭐ BEST | ### Comparison: All Layout-Aware Approaches | Pipeline | F1 | Recall | Precision | Bigram | |----------|-----|--------|-----------|--------| | **Heron + Tesseract** | 0.519 | **0.810** | 0.384 | **0.561** | | PaddleOCR + Tesseract | 0.601 | 0.732 | 0.523 | 0.491 | | Baseline Tesseract | 0.607 | 0.599 | 0.626 | 0.350 | --- ## Key Findings ### Heron's Strengths ✅ **Highest Recall (0.810)** - Finds 81% of all words in ground truth - 3.6% better than PaddleOCR layout - 35% better than baseline Tesseract ✅ **Best Local Ordering (Bigram: 0.561)** - 14.4% better than PaddleOCR layout - 60% better than baseline Tesseract - Preserves word pair relationships best ✅ **Aggressive Layout Detection** - Detects 24 regions vs 8 for PaddleOCR - Catches text other methods miss - Perfect for completeness-critical applications ### Heron's Trade-offs ⚠️ **Lower Precision (0.384)** - More false positives due to aggressive detection - Introduces more noise than other methods ⚠️ **Higher Word Error Rate (5.324)** - More insertions, deletions, substitutions - Result of processing 3x more regions ⚠️ **Lower F1 Score (0.519)** - Precision/recall trade-off - Prioritizes completeness over accuracy --- ## When to Use Heron ### ✅ Use Heron When: - **Missing content is worse than having noise** - Completeness matters more than perfect accuracy - Processing documents where every word counts - Post-processing can clean up false positives (workflow part) - You need the best local word ordering ### ⚠️ Use PaddleOCR When: - Accuracy matters more than completeness - F1 score is the primary metric - Lower error rate is critical ### ✅ Use Direct PaddleOCR When: - You want best overall performance (F1: 0.787) - Don't need separate layout detection - Speed and accuracy both matter --- ## Architecture ``` ┌─────────────────────────────────────────────────────────┐ │ Heron + Tesseract Pipeline │ └─────────────────────────────────────────────────────────┘ Step 1: Heron Layout Detection ┌──────────────────────────────────────┐ │ IBM Heron-101 (RT-DETR V2) │ │ - Input: Document image │ │ - Output: 24 regions with: │ │ * Bounding boxes [x1,y1,x2,y2] │ │ * Region types (17 classes) │ │ * Confidence scores │ │ * Reading order │ └──────────────────────────────────────┘ ↓ Layout Metadata ↓ Step 2: Region-Based OCR ┌──────────────────────────────────────┐ │ Tesseract OCR │ │ - For each region in order: │ │ * Crop image to bbox │ │ * Run OCR on region │ │ * Collect text │ └──────────────────────────────────────┘ ↓ Step 3: Text Reconstruction ┌──────────────────────────────────────┐ │ Combine in Reading Order │ │ - Join text from all regions │ │ - Preserve layout-detected order │ │ - Output: Complete document text │ └──────────────────────────────────────┘ ``` --- ## Usage ### Test Pipeline ```bash python scripts/test_heron_pipeline.py ``` **Expected output:** ``` ✓ Heron layout detection stage exists ✓ Found 5 layout metadata files ✓ Tesseract OCR stage exists ✓ Found 5 text output files SUCCESS: Heron + Tesseract pipeline is working! ``` ### Benchmark ```bash python scripts/benchmark_heron_vs_paddleocr.py ``` ### Use in Pipeline ```yaml # configs/my-heron-pipeline.yaml extractor_id: pipeline config: stages: - extractor_id: heron-layout config: model_variant: "101" # or "base" for faster/lighter confidence_threshold: 0.6 - extractor_id: ocr-tesseract config: use_layout_metadata: true lang: eng ``` ```python from biblicus import Corpus from biblicus.extraction import build_extraction_snapshot from pathlib import Path corpus = Corpus(Path("my_corpus").resolve()) snapshot = build_extraction_snapshot( corpus, extractor_id="pipeline", configuration_name="heron-tesseract", configuration=config["config"] ) ``` --- ## Dependencies ```bash # Install Heron dependencies pip install "transformers>=4.40.0" "torch>=2.0.0" # First run downloads model (~150MB) python scripts/test_heron_pipeline.py ``` **Model download locations:** - Heron-101: `~/.cache/huggingface/hub/models--ds4sd--docling-layout-heron-101` - Heron-base: `~/.cache/huggingface/hub/models--ds4sd--docling-layout-heron` --- ## Comparison with Original Research From the Heron paper (arXiv:2509.11720): - **Reported mAP:** 78% on DocLayNet dataset - **Inference speed:** 28ms/image on A100 GPU - **Training data:** 150K documents **Our implementation:** - Uses Heron-101 (76.7M params) - Achieves 0.810 recall on FUNSD (highest of all methods) - Detects 24 regions on average (vs 8 for PaddleOCR) --- ## Future Work ### Multi-Column Documents Current benchmarks use FUNSD (single-column forms). Layout detection should show even larger benefits on: - Academic papers (two-column format) - Newspapers (complex multi-column layouts) - Technical documents with mixed content ### Gibberish Filtering Implement workflow-based Part 2 to clean up false positives from aggressive layout detection. This would improve Heron's precision while maintaining high recall. ### Heron-base Benchmark Test the lighter Heron-base model (42.9M params) for speed/accuracy trade-off. --- ## References - **Heron paper:** https://arxiv.org/abs/2509.11720 - **Heron-101 model:** https://huggingface.co/ds4sd/docling-layout-heron-101 - **Benchmark results:** [benchmark-results.md](benchmark-results.md) - **Benchmarking guide:** [docs/guides/ocr-benchmarking.md](docs/guides/ocr-benchmarking.md) --- ## Conclusion ✅ **workflow-based Heron + Tesseract workflow is fully implemented and working** The implementation successfully replicates workflow-based production workflow for handling non-selectable files with complex layouts. Heron achieves the highest recall of any tested method (0.810), making it ideal for use cases where finding all text is more important than perfect accuracy. The trade-off between recall and precision is well-understood and documented. Combined with gibberish filtering (Part 2 of two-stage layout-aware workflow), this approach provides a robust solution for challenging OCR tasks. **Bottom line:** If you need to find ALL the text and can tolerate some noise, use Heron. If you need the best overall accuracy, use direct PaddleOCR (F1: 0.787).