RapidOCR Extractor
Extractor ID: ocr-rapidocr
Category: OCR Extractors
Overview
The RapidOCR extractor performs optical character recognition on image files using the RapidOCR library with ONNX Runtime. It provides fast, accurate OCR without requiring external services or GPU acceleration.
It uses optimized models for text detection and recognition, which makes it well suited to image corpora where embedded text needs to be extracted for search or analysis.
Installation
RapidOCR is an optional dependency:
pip install "biblicus[ocr]"
This installs rapidocr-onnxruntime, which includes all necessary models and the ONNX Runtime.
Supported Media Types
image/png - PNG images
image/jpeg - JPEG/JPG images
image/gif - GIF images
image/bmp - BMP images
image/tiff - TIFF images
image/webp - WebP images
Only image media types are processed. Other media types are automatically skipped.
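The skip rule can be pictured as a simple membership check. The sketch below is illustrative only: SUPPORTED_MEDIA_TYPES and is_supported are hypothetical names, not part of the Biblicus API, and the normalization step (lowercasing, dropping parameters) is an assumption.

```python
# Hypothetical helper; the set mirrors the supported media types listed above.
SUPPORTED_MEDIA_TYPES = frozenset({
    "image/png",
    "image/jpeg",
    "image/gif",
    "image/bmp",
    "image/tiff",
    "image/webp",
})


def is_supported(media_type: str) -> bool:
    """Return True if this media type is in the supported list."""
    # Assumption: parameters after ";" are ignored and case is normalized.
    base_type = media_type.split(";", 1)[0].strip().lower()
    return base_type in SUPPORTED_MEDIA_TYPES
```

A check like this can be used to estimate, before a run, which items in a mixed corpus the OCR extractor would pick up.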
Configuration
Config Schema
class RapidOcrExtractorConfig(BaseModel):
    min_confidence: float = 0.5  # Minimum confidence threshold (0.0-1.0)
    joiner: str = "\n"  # String to join recognized lines
Configuration Options
| Option | Type | Default | Description |
|---|---|---|---|
| min_confidence | float | 0.5 | Minimum per-line confidence to include (0.0-1.0) |
| joiner | str | "\n" | String used to join recognized text lines |
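To experiment with the options above without installing the extractor, a standard-library stand-in can mirror the documented defaults. OcrConfigSketch is a hypothetical name, not the real RapidOcrExtractorConfig, and the range check is an assumption about how out-of-range values might be handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrConfigSketch:
    """Stdlib stand-in for the documented options (hypothetical, not the real class)."""

    min_confidence: float = 0.5  # documented default
    joiner: str = "\n"  # documented default

    def __post_init__(self) -> None:
        # Assumption: values outside the documented 0.0-1.0 range are rejected.
        if not 0.0 <= self.min_confidence <= 1.0:
            raise ValueError("min_confidence must be between 0.0 and 1.0")
```

For example, OcrConfigSketch(min_confidence=0.75, joiner=" ") corresponds to the custom settings used in the usage examples.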
Usage
Command Line
Basic Usage
# Extract text from images
biblicus extract my-corpus --extractor ocr-rapidocr
Custom Configuration
# Higher confidence threshold
biblicus extract my-corpus --extractor ocr-rapidocr \
  --config min_confidence=0.75
# Use space as joiner instead of newline
biblicus extract my-corpus --extractor ocr-rapidocr \
  --config joiner=" "
Configuration File
extractor_id: ocr-rapidocr
config:
  min_confidence: 0.6
  joiner: "\n"
biblicus extract my-corpus --configuration configuration.yml
Python API
from biblicus import Corpus
# Load corpus
corpus = Corpus.from_directory("my-corpus")
# Extract with defaults
results = corpus.extract_text(extractor_id="ocr-rapidocr")
# Extract with custom config
results = corpus.extract_text(
    extractor_id="ocr-rapidocr",
    config={
        "min_confidence": 0.7,
        "joiner": " ",
    },
)
In Pipeline
OCR Fallback
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: ocr-rapidocr
    - extractor_id: select-text
Media Type Routing
extractor_id: select-smart-override
config:
  default_extractor: pass-through-text
  overrides:
    - media_type_pattern: "image/.*"
      extractor: ocr-rapidocr
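The routing idea in this configuration can be sketched in a few lines of Python. This is not the select-smart-override implementation: route_extractor is a hypothetical helper, and both the whole-string regex match and the "first matching override wins" rule are assumptions made for illustration.

```python
import re
from typing import List, Tuple


def route_extractor(
    media_type: str,
    default_extractor: str,
    overrides: List[Tuple[str, str]],
) -> str:
    """Pick an extractor ID for a media type (first matching override wins, assumed)."""
    for pattern, extractor in overrides:
        # Assumption: the pattern must match the whole media type string.
        if re.fullmatch(pattern, media_type):
            return extractor
    return default_extractor


overrides = [("image/.*", "ocr-rapidocr")]
print(route_extractor("image/png", "pass-through-text", overrides))   # ocr-rapidocr
print(route_extractor("text/plain", "pass-through-text", overrides))  # pass-through-text
```

Under these assumptions, every image type is sent to OCR while all other items fall through to the default extractor.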
Examples
Screenshot Collection
Extract text from screenshots:
biblicus extract screenshots --extractor ocr-rapidocr
Scanned Documents
Process scanned document images:
biblicus extract scans --extractor ocr-rapidocr \
  --config min_confidence=0.7
Document Photos
Extract text from photos of documents:
from biblicus import Corpus
corpus = Corpus.from_directory("document-photos")
results = corpus.extract_text(
    extractor_id="ocr-rapidocr",
    config={"min_confidence": 0.6},
)
High-Confidence Extraction
Only include very confident results:
biblicus extract images --extractor ocr-rapidocr \
  --config min_confidence=0.9
Confidence Scores
RapidOCR provides per-line confidence scores:
Confidence Range: 0.0 to 1.0
Default Threshold: 0.5 (50%)
Returned Confidence: Average of accepted lines
The extractor:
Recognizes text lines with individual confidence scores
Filters lines below the min_confidence threshold
Returns the average confidence of accepted lines
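The steps above can be sketched as a small function. This is a minimal illustration, not the extractor's actual code: filter_and_join is a hypothetical name, the input is assumed to be a list of (text, confidence) pairs, and treating a line exactly at the threshold as accepted is an assumption.

```python
from typing import List, Optional, Tuple


def filter_and_join(
    lines: List[Tuple[str, float]],
    min_confidence: float = 0.5,
    joiner: str = "\n",
) -> Tuple[str, Optional[float]]:
    """Keep lines at or above the threshold; return joined text and mean confidence."""
    # Assumption: a line exactly at the threshold is kept.
    accepted = [(text, conf) for text, conf in lines if conf >= min_confidence]
    if not accepted:
        # Nothing passed the filter: empty text and no confidence to report.
        return "", None
    text = joiner.join(t for t, _ in accepted)
    average = sum(c for _, c in accepted) / len(accepted)
    return text, average
```

Note that the average is taken over accepted lines only, so raising min_confidence tends to raise the reported confidence while shortening the extracted text.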
Interpreting Confidence
0.9-1.0: Excellent recognition
0.7-0.9: Good recognition
0.5-0.7: Acceptable recognition
0.0-0.5: Poor recognition (filtered by default)
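When reviewing results in bulk, it can help to turn a score into one of the bands listed above. The helper below is a hypothetical convenience, not part of Biblicus; because the listed ranges share their endpoints, it assumes each band includes its lower bound.

```python
def describe_confidence(score: float) -> str:
    """Map a 0.0-1.0 confidence score to one of the bands listed above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    # Assumption: each band includes its lower bound, so 0.9 counts as excellent.
    if score >= 0.9:
        return "excellent"
    if score >= 0.7:
        return "good"
    if score >= 0.5:
        return "acceptable"
    return "poor"
```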
Performance
Speed: Fast (0.5-2 seconds per image)
Memory: Moderate (models loaded once)
Accuracy: Good for clear text, moderate for degraded images
RapidOCR is significantly faster than VLM approaches while maintaining good accuracy for standard OCR tasks.
Error Handling
Missing Dependency
If RapidOCR is not installed:
ExtractionRunFatalError: RapidOCR extractor requires an optional dependency.
Install it with pip install "biblicus[ocr]".
Non-Image Items
Non-image items are silently skipped: the extractor returns None for them.
No Text Recognized
Images without recognizable text produce empty extracted text and are counted in extracted_empty_items.
Per-Item Errors
Processing errors for individual images are recorded but don’t halt extraction.
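The per-item behaviors described in this section (skip on None, count empty results, record errors and continue) follow a common loop pattern. The sketch below shows that pattern in isolation; run_extraction, its arguments, and the shape of the returned report are all hypothetical and do not reflect the actual Biblicus internals.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple


def run_extraction(
    items: Iterable[Tuple[str, bytes]],
    extract: Callable[[bytes], Optional[str]],
) -> Dict[str, object]:
    """Apply `extract` to each item, recording failures instead of raising."""
    texts: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    skipped = 0
    empty = 0
    for item_id, payload in items:
        try:
            text = extract(payload)
        except Exception as exc:
            # Record the failure for this item and move on to the next one.
            errors[item_id] = str(exc)
            continue
        if text is None:
            # None signals an item the extractor does not handle.
            skipped += 1
        elif not text.strip():
            # No text recognized: keep an empty result and count it.
            empty += 1
            texts[item_id] = ""
        else:
            texts[item_id] = text
    return {
        "texts": texts,
        "errors": errors,
        "skipped": skipped,
        "extracted_empty_items": empty,
    }
```

The point of the pattern is that a single corrupt image shows up as one entry in the error record rather than aborting the whole run.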
Use Cases
Screenshot Archives
Extract text from UI screenshots:
biblicus extract screenshots --extractor ocr-rapidocr
Scanned Document Collections
Process scanned paper documents:
biblicus extract scans --extractor ocr-rapidocr
Photo Documentation
Extract text from photos of documents or signs:
biblicus extract photos --extractor ocr-rapidocr
Mixed Media Pipeline
Combine with other extractors:
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: ocr-rapidocr
    - extractor_id: select-text
When to Use RapidOCR vs Alternatives
Use RapidOCR when:
Images contain primarily text
You need fast, local OCR
Text is reasonably clear
No GPU is available
Use PaddleOCR-VL when:
Text is in CJK languages (Chinese, Japanese, Korean)
You need better accuracy for complex layouts
API-based processing is acceptable
Use VLM extractors when:
Images have complex layouts
You need document understanding beyond text
Tables, equations, or diagrams are present
Highest accuracy is critical
Use text extractors when:
Documents have embedded text layers
PDFs are born-digital (not scanned)
You want instant extraction
Best Practices
Tune Confidence Threshold
Test different thresholds on sample images:
# Try different confidence levels
biblicus extract test-images --extractor ocr-rapidocr \
  --config min_confidence=0.7
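If per-line confidence scores from a sample run are at hand, a quick offline sweep shows how much text each candidate threshold would keep before re-running extraction. The helper name and the sample scores below are made up for illustration, and the "kept when confidence >= threshold" rule is an assumption.

```python
from typing import Dict, Iterable, List


def kept_per_threshold(
    confidences: List[float],
    thresholds: Iterable[float],
) -> Dict[float, int]:
    """Count how many lines would survive each candidate threshold."""
    # Assumption: a line is kept when its confidence is >= the threshold.
    return {t: sum(1 for c in confidences if c >= t) for t in thresholds}


# Illustrative scores only, not real OCR output.
sample = [0.95, 0.82, 0.64, 0.41]
for threshold, kept in kept_per_threshold(sample, [0.5, 0.7, 0.9]).items():
    print(f"min_confidence={threshold}: {kept} of {len(sample)} lines kept")
```

A sharp drop in kept lines between two thresholds suggests inspecting the lines in that band by hand before settling on a value.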
Monitor Confidence Scores
Check average confidence in results:
results = corpus.extract_text(extractor_id="ocr-rapidocr")
# Confidence is available in extraction metadata
Use for Clear Text
RapidOCR works best with:
Clear, high-resolution images
Good lighting/contrast
Standard fonts
Horizontal text orientation
Consider Alternatives for:
Very low quality images
Complex multi-column layouts
Mixed text/graphics
Rotated or skewed text
Image Quality Tips
For best OCR results:
Resolution: 300+ DPI preferred
Contrast: High contrast between text and background
Clarity: Sharp focus, not blurry
Orientation: Straight, not skewed
Lighting: Even illumination
See Also
extraction.md - Extraction pipeline concepts