RapidOCR Extractor

Extractor ID: ocr-rapidocr

Overview

The RapidOCR extractor performs optical character recognition on image files using the RapidOCR library with ONNX Runtime. It provides fast, accurate OCR without requiring external services or GPU acceleration.

RapidOCR is built on ONNX Runtime and uses optimized OCR models for efficient text detection and recognition. It’s ideal for processing image corpora where embedded text needs to be extracted for search or analysis.

Installation

RapidOCR is an optional dependency:

pip install "biblicus[ocr]"

This installs rapidocr-onnxruntime which includes all necessary models and the ONNX Runtime.

Supported Media Types

image/png - PNG images
image/jpeg - JPEG/JPG images
image/gif - GIF images
image/bmp - BMP images
image/tiff - TIFF images
image/webp - WebP images

Only image media types are processed. Other media types are automatically skipped.

Configuration

Config Schema

class RapidOcrExtractorConfig(BaseModel):
    min_confidence: float = 0.5  # Minimum confidence threshold (0.0-1.0)
    joiner: str = "\n"            # String to join recognized lines

Configuration Options

Option	Type	Default	Description
`min_confidence`	float	`0.5`	Minimum per-line confidence to include (0.0-1.0)
`joiner`	str	`"\n"`	String used to join recognized text lines

Usage

Command Line

Basic Usage

# Extract text from images
biblicus extract my-corpus --extractor ocr-rapidocr

Custom Configuration

# Higher confidence threshold
biblicus extract my-corpus --extractor ocr-rapidocr \
  --config min_confidence=0.75

# Use space as joiner instead of newline
biblicus extract my-corpus --extractor ocr-rapidocr \
  --config joiner=" "

Configuration File

extractor_id: ocr-rapidocr
config:
  min_confidence: 0.6
  joiner: "\n"

biblicus extract my-corpus --configuration configuration.yml

Python API

from biblicus import Corpus

# Load corpus
corpus = Corpus.from_directory("my-corpus")

# Extract with defaults
results = corpus.extract_text(extractor_id="ocr-rapidocr")

# Extract with custom config
results = corpus.extract_text(
    extractor_id="ocr-rapidocr",
    config={
        "min_confidence": 0.7,
        "joiner": " "
    }
)

In Pipeline

OCR Fallback

extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: ocr-rapidocr
    - extractor_id: select-text

Media Type Routing

extractor_id: select-smart-override
config:
  default_extractor: pass-through-text
  overrides:
    - media_type_pattern: "image/.*"
      extractor: ocr-rapidocr

Examples

Screenshot Collection

Extract text from screenshots:

biblicus extract screenshots --extractor ocr-rapidocr

Scanned Documents

Process scanned document images:

biblicus extract scans --extractor ocr-rapidocr \
  --config min_confidence=0.7

Document Photos

Extract text from photos of documents:

from biblicus import Corpus

corpus = Corpus.from_directory("document-photos")

results = corpus.extract_text(
    extractor_id="ocr-rapidocr",
    config={"min_confidence": 0.6}
)

High-Confidence Extraction

Only include very confident results:

biblicus extract images --extractor ocr-rapidocr \
  --config min_confidence=0.9

Confidence Scores

RapidOCR provides per-line confidence scores:

Confidence Range: 0.0 to 1.0
Default Threshold: 0.5 (50%)
Returned Confidence: Average of accepted lines

The extractor:

Recognizes text lines with individual confidence scores
Filters lines below min_confidence threshold
Returns the average confidence of accepted lines

Interpreting Confidence

0.9-1.0: Excellent recognition
0.7-0.9: Good recognition
0.5-0.7: Acceptable recognition
0.0-0.5: Poor recognition (filtered by default)

Performance

Speed: Fast (0.5-2 seconds per image)
Memory: Moderate (models loaded once)
Accuracy: Good for clear text, moderate for degraded images

RapidOCR is significantly faster than VLM approaches while maintaining good accuracy for standard OCR tasks.

Error Handling

Missing Dependency

If RapidOCR is not installed:

ExtractionRunFatalError: RapidOCR extractor requires an optional dependency.
Install it with pip install "biblicus[ocr]".

Non-Image Items

Non-image items are silently skipped (returns None).

No Text Recognized

Images without recognizable text produce empty extracted text and are counted in extracted_empty_items.

Per-Item Errors

Processing errors for individual images are recorded but don’t halt extraction.

Use Cases

Screenshot Archives

Extract text from UI screenshots:

biblicus extract screenshots --extractor ocr-rapidocr

Scanned Document Collections

Process scanned paper documents:

biblicus extract scans --extractor ocr-rapidocr

Photo Documentation

Extract text from photos of documents or signs:

biblicus extract photos --extractor ocr-rapidocr

Mixed Media Pipeline

Combine with other extractors:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: ocr-rapidocr
    - extractor_id: select-text

When to Use RapidOCR vs Alternatives

Use RapidOCR when:

Images contain primarily text
You need fast, local OCR
Text is reasonably clear
No GPU is required

Use PaddleOCR-VL when:

Text is in CJK languages (Chinese, Japanese, Korean)
You need better accuracy for complex layouts
API-based processing is acceptable

Use VLM extractors when:

Images have complex layouts
You need document understanding beyond text
Tables, equations, or diagrams are present
Highest accuracy is critical

Use text extractors when:

Documents have embedded text layers
PDFs are born-digital (not scanned)
You want instant extraction

Best Practices

Tune Confidence Threshold

Test different thresholds on sample images:

# Try different confidence levels
biblicus extract test-images --extractor ocr-rapidocr \
  --config min_confidence=0.7

Monitor Confidence Scores

Check average confidence in results:

results = corpus.extract_text(extractor_id="ocr-rapidocr")
# Confidence is available in extraction metadata

Use for Clear Text

RapidOCR works best with:

Clear, high-resolution images
Good lighting/contrast
Standard fonts
Horizontal text orientation

Consider Alternatives for:

Very low quality images
Complex multi-column layouts
Mixed text/graphics
Rotated or skewed text

Image Quality Tips

For best OCR results:

Resolution: 300+ DPI preferred
Contrast: High contrast between text and background
Clarity: Sharp focus, not blurry
Orientation: Straight, not skewed
Lighting: Even illumination