Optical Character Recognition (OCR)

Traditional OCR extractors recognize text in images and scanned documents.

Overview

OCR extractors use computer vision to recognize text in images, scanned documents, and visual content. They are ideal for:

  • Scanned PDFs without text layers

  • Photographs of documents

  • Screenshots with text

  • Handwritten content (with appropriate models)

Available Extractors

ocr-rapidocr

RapidOCR provides fast ONNX-based text recognition with:

  • Multi-language support

  • Fast inference (CPU-optimized)

  • No GPU required

  • Lightweight deployment

Installation: pip install "biblicus[ocr]" (the quotes keep shells such as zsh from treating the brackets as a glob pattern)

Best for: General-purpose OCR, scanned documents, mixed text/image content

ocr-paddleocr-vl

The PaddleOCR vision-language model provides:

  • Advanced document understanding

  • Layout analysis

  • Table detection

  • Chinese/English/multilingual support

Installation: pip install "biblicus[paddleocr]"

Best for: Complex documents, tables, multi-column layouts, CJK text

OCR vs VLM Document Understanding

When to Use OCR

  • Simple text recognition needs

  • CPU-only environments

  • Fast processing requirements

  • Lightweight deployments

When to Use VLM

For advanced document understanding with layout preservation, use VLM extractors. They provide:

  • Semantic structure understanding

  • Equation and code block recognition

  • Superior table extraction

  • Layout-aware markdown output

Choosing an Extractor

Use Case                         Recommended Extractor   Notes
-------------------------------  ----------------------  ---------------------
English scanned docs             ocr-rapidocr            Fast, lightweight
Chinese/CJK documents            ocr-paddleocr-vl        Excellent CJK support
Tables and complex layouts       docling-granite         VLM approach
Simple screenshots               ocr-rapidocr            Quick results
Academic papers with equations   docling-granite         Equation recognition

Common Patterns

Fallback Chain

Try VLM first, fall back to OCR:

extractor_id: select-text
config:
  extractors:
    - docling-smol
    - ocr-rapidocr
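
The fallback idea can be pictured as a loop that returns the first usable result. The sketch below is a minimal illustration in plain Python, not the Biblicus implementation: `first_usable_text` and the stand-in extractor functions are hypothetical, and it assumes that an extractor signals "nothing found" by returning empty or whitespace-only text.

```python
from typing import Callable, Iterable, Optional

# Hypothetical model: an extractor is a function from a file path to text.
Extractor = Callable[[str], str]


def first_usable_text(path: str, extractors: Iterable[Extractor]) -> Optional[str]:
    """Try extractors in order and return the first non-blank result."""
    for extract in extractors:
        text = extract(path)
        if text and text.strip():
            return text
    return None  # every extractor came back empty


# Stand-ins for docling-smol and ocr-rapidocr (assumed behaviour, illustration only).
def fake_vlm(path: str) -> str:
    return ""  # pretend the VLM produced nothing for this file


def fake_ocr(path: str) -> str:
    return "Invoice 2024-001"


result = first_usable_text("scan.png", [fake_vlm, fake_ocr])
```

Order matters in this pattern: the extractor listed first is preferred whenever it produces any text, so the more capable (and usually slower) extractor goes first.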

Multi-Strategy Selection

Use the longest output from multiple OCR approaches:

extractor_id: select-longest-text
config:
  extractors:
    - ocr-rapidocr
    - ocr-paddleocr-vl
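
As a rough picture of the selection rule, the following sketch picks the candidate with the most text. It is an assumption-laden illustration rather than the library's code: `select_longest` is a hypothetical helper, the sample outputs are made up, and length is measured after stripping surrounding whitespace.

```python
from typing import Dict


def select_longest(candidates: Dict[str, str]) -> str:
    """Return the extractor id whose stripped output is longest."""
    if not candidates:
        raise ValueError("no candidate outputs")
    return max(candidates, key=lambda name: len(candidates[name].strip()))


# Made-up outputs keyed by extractor id, for illustration only.
outputs = {
    "ocr-rapidocr": "Total: 42",
    "ocr-paddleocr-vl": "Total: 42 EUR\nDue: 2024-05-01",
}
winner = select_longest(outputs)
```

Length is only a proxy for quality: a longer output may contain more recovered text, but it may also contain more noise, so this strategy suits cases where missing text is the main failure mode.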

Document Type Routing

Use smart overrides for different document types:

extractor_id: select-smart-override
config:
  default_extractor: ocr-rapidocr
  overrides:
    - media_type_pattern: "image/.*"
      extractor: ocr-rapidocr
    - media_type_pattern: "application/pdf"
      extractor: docling-smol
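
The routing above can be sketched as a pattern lookup with a default. This is a hedged illustration, not the actual `select-smart-override` logic: the `route` function is hypothetical, and it assumes that patterns are regular expressions matched against the whole media type and that the first matching override wins.

```python
import re
from typing import List, Tuple

DEFAULT_EXTRACTOR = "ocr-rapidocr"

# (media_type_pattern, extractor) pairs mirroring the config above.
OVERRIDES: List[Tuple[str, str]] = [
    (r"image/.*", "ocr-rapidocr"),
    (r"application/pdf", "docling-smol"),
]


def route(media_type: str) -> str:
    """Return the extractor for a media type (assumed: first full match wins)."""
    for pattern, extractor in OVERRIDES:
        if re.fullmatch(pattern, media_type):
            return extractor
    return DEFAULT_EXTRACTOR


pdf_choice = route("application/pdf")
png_choice = route("image/png")
other_choice = route("text/plain")
```

Under these assumptions, PDFs go to `docling-smol`, while images and any unmatched media type end up with `ocr-rapidocr`.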

Performance Considerations

RapidOCR

  • Speed: Very fast (CPU-optimized ONNX)

  • Memory: Low (~100MB models)

  • Accuracy: Good for clean scans

  • Hardware: CPU-only

PaddleOCR VL

  • Speed: Moderate (requires Paddle framework)

  • Memory: Higher (~500MB models)

  • Accuracy: Excellent for complex layouts

  • Hardware: CPU or GPU

VLM Alternatives

For the best accuracy on complex documents, consider VLM extractors, which offer:

  • Better layout understanding

  • Semantic structure preservation

  • Superior table and equation handling

See Also