# RapidOCR Extractor

**Extractor ID:** `ocr-rapidocr`

**Category:** [OCR Extractors](index.md)

## Overview

The RapidOCR extractor performs optical character recognition on image files using the RapidOCR library, which runs its text detection and recognition models on ONNX Runtime. It runs locally and needs neither external services nor GPU acceleration.

It is suited to image corpora where embedded text must be extracted for search or analysis.

## Installation

RapidOCR is an optional dependency:

```bash
pip install "biblicus[ocr]"
```

This installs `rapidocr-onnxruntime`, which includes all necessary models and the ONNX Runtime.

## Supported Media Types

- `image/png` - PNG images
- `image/jpeg` - JPEG/JPG images
- `image/gif` - GIF images
- `image/bmp` - BMP images
- `image/tiff` - TIFF images
- `image/webp` - WebP images

Only image media types are processed. Other media types are automatically skipped.
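The skip rule can be pictured as a simple membership test. The following is a minimal sketch, not the extractor's actual code: the constant and the helper name `is_supported_image` are hypothetical, and the real implementation may instead test for an `image/` prefix.

```python
# Media types copied from the list above.
# The names below are illustrative and not part of the biblicus API.
SUPPORTED_MEDIA_TYPES = frozenset({
    "image/png",
    "image/jpeg",
    "image/gif",
    "image/bmp",
    "image/tiff",
    "image/webp",
})


def is_supported_image(media_type: str) -> bool:
    """Return True if the media type is one of the listed image types."""
    # Drop any parameters (for example "; charset=...") and normalize case.
    base = media_type.split(";", 1)[0].strip().lower()
    return base in SUPPORTED_MEDIA_TYPES
```

Under this assumption, an item with media type `text/plain` or `application/pdf` would be skipped rather than passed to OCR.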

## Configuration

### Config Schema

```python
class RapidOcrExtractorConfig(BaseModel):
    min_confidence: float = 0.5  # Minimum confidence threshold (0.0-1.0)
    joiner: str = "\n"            # String to join recognized lines
```

### Configuration Options

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `min_confidence` | float | `0.5` | Minimum per-line confidence to include (0.0-1.0) |
| `joiner` | str | `"\n"` | String used to join recognized text lines |

## Usage

### Command Line

#### Basic Usage

```bash
# Extract text from images
biblicus extract my-corpus --extractor ocr-rapidocr
```

#### Custom Configuration

```bash
# Higher confidence threshold
biblicus extract my-corpus --extractor ocr-rapidocr \
  --config min_confidence=0.75

# Use space as joiner instead of newline
biblicus extract my-corpus --extractor ocr-rapidocr \
  --config joiner=" "
```

#### Configuration File

```yaml
extractor_id: ocr-rapidocr
config:
  min_confidence: 0.6
  joiner: "\n"
```

```bash
biblicus extract my-corpus --configuration configuration.yml
```

### Python API

```python
from biblicus import Corpus

# Load corpus
corpus = Corpus.from_directory("my-corpus")

# Extract with defaults
results = corpus.extract_text(extractor_id="ocr-rapidocr")

# Extract with custom config
results = corpus.extract_text(
    extractor_id="ocr-rapidocr",
    config={
        "min_confidence": 0.7,
        "joiner": " "
    }
)
```

### In Pipeline

#### OCR Fallback

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: ocr-rapidocr
    - extractor_id: select-text
```
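In this configuration, `select-text` is listed among the pipeline utilities as a first non-empty selection. A minimal sketch of that idea follows, assuming stage outputs are considered in stage order and that whitespace-only text counts as empty; the function name is hypothetical and this is not the biblicus implementation.

```python
from typing import Optional, Sequence


def select_first_non_empty(stage_outputs: Sequence[Optional[str]]) -> Optional[str]:
    """Return the first stage output that contains non-whitespace text."""
    for text in stage_outputs:
        # None models a stage that skipped the item (for example,
        # pass-through-text on an image, or OCR on a non-image item).
        if text is not None and text.strip():
            return text
    return None


# For an image item: pass-through-text yields nothing, OCR yields text.
image_result = select_first_non_empty([None, "Invoice 123"])

# For a text item: pass-through-text already has content, so OCR is not needed.
text_result = select_first_non_empty(["plain text body", None])
```

With this ordering, OCR output is used only when the earlier stage produced no text, which is what makes it a fallback.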

#### Media Type Routing

```yaml
extractor_id: select-smart-override
config:
  default_extractor: pass-through-text
  overrides:
    - media_type_pattern: "image/.*"
      extractor: ocr-rapidocr
```
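The routing above can be read as "use the first override whose pattern matches the item's media type, otherwise the default". The sketch below illustrates that reading under two assumptions: patterns are Python regular expressions, and they must match the whole media type. The function `choose_extractor` is hypothetical and not part of biblicus.

```python
import re
from typing import Dict, List

# Mirrors the YAML configuration above.
DEFAULT_EXTRACTOR = "pass-through-text"
OVERRIDES: List[Dict[str, str]] = [
    {"media_type_pattern": "image/.*", "extractor": "ocr-rapidocr"},
]


def choose_extractor(
    media_type: str,
    default_extractor: str = DEFAULT_EXTRACTOR,
    overrides: List[Dict[str, str]] = OVERRIDES,
) -> str:
    """Return the extractor ID for a media type (first matching override wins)."""
    for rule in overrides:
        # Assumption: the pattern must match the entire media type string.
        if re.fullmatch(rule["media_type_pattern"], media_type):
            return rule["extractor"]
    return default_extractor
```

Under these assumptions, `image/png` and `image/tiff` route to `ocr-rapidocr`, while `text/markdown` falls through to `pass-through-text`.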

## Examples

### Screenshot Collection

Extract text from screenshots:

```bash
biblicus extract screenshots --extractor ocr-rapidocr
```

### Scanned Documents

Process scanned document images:

```bash
biblicus extract scans --extractor ocr-rapidocr \
  --config min_confidence=0.7
```

### Document Photos

Extract text from photos of documents:

```python
from biblicus import Corpus

corpus = Corpus.from_directory("document-photos")

results = corpus.extract_text(
    extractor_id="ocr-rapidocr",
    config={"min_confidence": 0.6}
)
```

### High-Confidence Extraction

Include only high-confidence lines:

```bash
biblicus extract images --extractor ocr-rapidocr \
  --config min_confidence=0.9
```

## Confidence Scores

RapidOCR provides per-line confidence scores:

- **Confidence Range**: 0.0 to 1.0
- **Default Threshold**: 0.5 (50%)
- **Returned Confidence**: Average of accepted lines

The extractor:
1. Recognizes text lines, each with its own confidence score
2. Discards lines whose confidence is below the `min_confidence` threshold
3. Joins the accepted lines with `joiner` and returns their average confidence
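The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the extractor's source: the function name `filter_and_join` is hypothetical, a line exactly at the threshold is assumed to be kept, and `None` is assumed as the confidence when no line is accepted.

```python
from typing import List, Optional, Tuple


def filter_and_join(
    lines: List[Tuple[str, float]],
    min_confidence: float = 0.5,
    joiner: str = "\n",
) -> Tuple[str, Optional[float]]:
    """Filter recognized lines by confidence, join them, and average the scores."""
    # Step 2: keep only lines at or above the threshold (inclusive is an assumption).
    accepted = [(text, score) for text, score in lines if score >= min_confidence]
    if not accepted:
        # No line survived: empty text, no meaningful average.
        return "", None
    # Step 3: join the accepted text and average the accepted scores only.
    text = joiner.join(line_text for line_text, _ in accepted)
    average = sum(score for _, score in accepted) / len(accepted)
    return text, average


recognized = [("Total: 42.00", 0.9), ("~~|;", 0.3), ("Thank you", 0.7)]
text, confidence = filter_and_join(recognized, min_confidence=0.5)
```

Here the middle line is dropped, and the returned confidence is the mean of 0.9 and 0.7. Note that the average reflects only accepted lines, so raising `min_confidence` tends to raise the reported confidence while shortening the text.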

### Interpreting Confidence

- **0.9-1.0**: Excellent recognition
- **0.7-0.9**: Good recognition
- **0.5-0.7**: Acceptable recognition
- **0.0-0.5**: Poor recognition (filtered by default)

## Performance

- **Speed**: Fast (0.5-2 seconds per image)
- **Memory**: Moderate (models loaded once)
- **Accuracy**: Good for clear text, moderate for degraded images

RapidOCR is significantly faster than VLM approaches while maintaining good accuracy for standard OCR tasks.

## Error Handling

### Missing Dependency

If RapidOCR is not installed:

```
ExtractionRunFatalError: RapidOCR extractor requires an optional dependency.
Install it with pip install "biblicus[ocr]".
```

### Non-Image Items

Non-image items are skipped silently: the extractor returns `None` for them.

### No Text Recognized

Images without recognizable text produce empty extracted text and are counted in `extracted_empty_items`.

### Per-Item Errors

Processing errors for individual images are recorded but don't halt extraction.
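The three per-item outcomes described above (skipped, empty, failed) can be sketched as a single loop. This is a simplified model, assuming a hypothetical `extract_one` callable and a plain dictionary of errors; only the counter name `extracted_empty_items` comes from this page, and the real run bookkeeping in biblicus may differ.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple


def run_extraction(
    items: Iterable[Tuple[str, bytes]],
    extract_one: Callable[[bytes], Optional[str]],
) -> Tuple[Dict[str, str], Dict[str, str], int]:
    """Extract text per item, recording failures instead of stopping the run."""
    texts: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    extracted_empty_items = 0
    for item_id, data in items:
        try:
            text = extract_one(data)
        except Exception as exc:
            # Per-item error: record it and continue with the next item.
            errors[item_id] = f"{type(exc).__name__}: {exc}"
            continue
        if text is None:
            # Skipped item (for example, a non-image media type).
            continue
        if not text:
            # Image processed, but no text was recognized.
            extracted_empty_items += 1
        texts[item_id] = text
    return texts, errors, extracted_empty_items
```

A missing dependency is different in kind: it affects every item, which is why it is raised as a fatal error for the whole run rather than recorded per item.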

## Use Cases

### Screenshot Archives

Extract text from UI screenshots:

```bash
biblicus extract screenshots --extractor ocr-rapidocr
```

### Scanned Document Collections

Process scanned paper documents:

```bash
biblicus extract scans --extractor ocr-rapidocr
```

### Photo Documentation

Extract text from photos of documents or signs:

```bash
biblicus extract photos --extractor ocr-rapidocr
```

### Mixed Media Pipeline

Combine with other extractors:

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: ocr-rapidocr
    - extractor_id: select-text
```

## When to Use RapidOCR vs Alternatives

### Use RapidOCR when:
- Images contain primarily text
- You need fast, local OCR
- Text is reasonably clear
- No GPU is available

### Use PaddleOCR-VL when:
- Text is in CJK languages (Chinese, Japanese, Korean)
- You need better accuracy for complex layouts
- API-based processing is acceptable

### Use VLM extractors when:
- Images have complex layouts
- You need document understanding beyond text
- Tables, equations, or diagrams are present
- Highest accuracy is critical

### Use text extractors when:
- Documents have embedded text layers
- PDFs are born-digital (not scanned)
- You want instant extraction

## Best Practices

### Tune Confidence Threshold

Test different thresholds on sample images:

```bash
# Re-run with several values (for example 0.5, 0.7, 0.9) and compare the extracted text
biblicus extract test-images --extractor ocr-rapidocr \
  --config min_confidence=0.7
```

### Monitor Confidence Scores

Check average confidence in results:

```python
from biblicus import Corpus
corpus = Corpus.from_directory("my-corpus")
results = corpus.extract_text(extractor_id="ocr-rapidocr")
# Confidence is available in extraction metadata
```

### Use for Clear Text

RapidOCR works best with:
- Clear, high-resolution images
- Good lighting/contrast
- Standard fonts
- Horizontal text orientation

### Consider Alternatives for:
- Very low quality images
- Complex multi-column layouts
- Mixed text/graphics
- Rotated or skewed text

## Image Quality Tips

For best OCR results:
- **Resolution**: 300+ DPI preferred
- **Contrast**: High contrast between text and background
- **Clarity**: Sharp focus, not blurry
- **Orientation**: Straight, not skewed
- **Lighting**: Even illumination

## Related Extractors

### Same Category

- [ocr-paddleocr-vl](paddleocr-vl.md) - PaddleOCR-VL with better CJK support

### Alternatives

- [docling-smol](../vlm-document/docling-smol.md) - Fast VLM for complex documents
- [docling-granite](../vlm-document/docling-granite.md) - High-accuracy VLM
- [pdf-text](../text-document/pdf.md) - Fast text extraction from PDFs
- [markitdown](../text-document/markitdown.md) - Office document conversion

### Pipeline Utilities

- [select-text](../pipeline-utilities/select-text.md) - First non-empty selection
- [select-longest-text](../pipeline-utilities/select-longest.md) - Choose longest output
- [select-smart-override](../pipeline-utilities/select-smart-override.md) - Media type routing
- [pipeline](../pipeline-utilities/pipeline.md) - Multi-step extraction

## See Also

- [OCR Extractors Overview](index.md)
- [Extractors Index](../index.md)
- [extraction.md](../../extraction.md) - Extraction pipeline concepts
- [RapidOCR GitHub](https://github.com/RapidAI/RapidOCR)