# Optical Character Recognition (OCR) Traditional OCR extractors for text recognition from images and scanned documents. ```{toctree} :maxdepth: 1 :caption: OCR Extractors rapidocr paddleocr-vl ``` ```{toctree} :maxdepth: 1 :caption: Benchmarking /guides/document-understanding-benchmark /guides/ocr-benchmarking /guides/benchmark-results /guides/heron-implementation /guides/layout-aware-ocr-results ``` ## Overview OCR extractors use computer vision to recognize text in images, scanned documents, and visual content. They are ideal for: - Scanned PDFs without text layers - Photographs of documents - Screenshots with text - Handwritten content (with appropriate models) ## Available Extractors ### [ocr-rapidocr](rapidocr.md) RapidOCR provides fast ONNX-based text recognition with: - Multi-language support - Fast inference (CPU-optimized) - No GPU required - Lightweight deployment **Installation**: `pip install biblicus[ocr]` **Best for**: General-purpose OCR, scanned documents, mixed text/image content ### [ocr-paddleocr-vl](paddleocr-vl.md) PaddleOCR vision-language model provides: - Advanced document understanding - Layout analysis - Table detection - Chinese/English/multilingual support **Installation**: `pip install biblicus[paddleocr]` **Best for**: Complex documents, tables, multi-column layouts, CJK text ## OCR vs VLM Document Understanding ### When to Use OCR - Simple text recognition needs - CPU-only environments - Fast processing requirements - Lightweight deployments ### When to Use VLM For advanced document understanding with layout preservation, use [VLM extractors](../vlm-document/index.md): - [docling-smol](../vlm-document/docling-smol.md) - Fast, 256M params - [docling-granite](../vlm-document/docling-granite.md) - High accuracy, 258M params VLM extractors provide: - Semantic structure understanding - Equation and code block recognition - Superior table extraction - Layout-aware markdown output ## Choosing an Extractor | Use Case | Recommended Extractor | Notes | |----------|----------------------|-------| | English scanned docs | [ocr-rapidocr](rapidocr.md) | Fast, lightweight | | Chinese/CJK documents | [ocr-paddleocr-vl](paddleocr-vl.md) | Excellent CJK support | | Tables and complex layouts | [docling-granite](../vlm-document/docling-granite.md) | VLM approach | | Simple screenshots | [ocr-rapidocr](rapidocr.md) | Quick results | | Academic papers with equations | [docling-granite](../vlm-document/docling-granite.md) | Equation recognition | ## Common Patterns ### Fallback Chain Try VLM first, fall back to OCR: ```yaml extractor_id: select-text config: extractors: - docling-smol - ocr-rapidocr ``` ### Multi-Strategy Selection Use longest output from multiple OCR approaches: ```yaml extractor_id: select-longest-text config: extractors: - ocr-rapidocr - ocr-paddleocr-vl ``` ### Document Type Routing Use smart overrides for different document types: ```yaml extractor_id: select-smart-override config: default_extractor: ocr-rapidocr overrides: - media_type_pattern: "image/.*" extractor: ocr-rapidocr - media_type_pattern: "application/pdf" extractor: docling-smol ``` ## Performance Considerations ### RapidOCR - **Speed**: Very fast (CPU-optimized ONNX) - **Memory**: Low (~100MB models) - **Accuracy**: Good for clean scans - **Hardware**: CPU-only ### PaddleOCR VL - **Speed**: Moderate (requires Paddle framework) - **Memory**: Higher (~500MB models) - **Accuracy**: Excellent for complex layouts - **Hardware**: CPU or GPU ### VLM Alternatives For best accuracy with complex documents, consider [VLM extractors](../vlm-document/index.md) which offer: - Better layout understanding - Semantic structure preservation - Superior table and equation handling ## See Also - [Extractors Overview](../index.md) - [VLM Document Understanding](../vlm-document/index.md) - Advanced document processing - [Text & Document Processing](../text-document/index.md) - For PDFs with text layers - [Pipeline Utilities](../pipeline-utilities/index.md) - For combining strategies