# PaddleOCR-VL Extractor

**Extractor ID:** `ocr-paddleocr-vl`

**Category:** [OCR Extractors](index.md)

## Overview

The PaddleOCR-VL extractor uses PaddleOCR's vision-language model for optical character recognition. It provides enhanced accuracy for complex layouts and multilingual text, especially Chinese, Japanese, and Korean (CJK) languages.

PaddleOCR-VL combines traditional OCR with vision-language understanding to achieve better results on challenging images. It supports both local inference and API-based processing via HuggingFace Inference API.

## Installation

### Local Inference

For local processing, install the PaddleOCR library:

```bash
pip install "biblicus[paddleocr]"
```

### API-Based Inference

For API-based processing via HuggingFace:

```bash
pip install biblicus
```

No additional dependencies required, but you'll need a HuggingFace API key.

## Supported Media Types

- `image/png` - PNG images
- `image/jpeg` - JPEG/JPG images
- `image/gif` - GIF images
- `image/bmp` - BMP images
- `image/tiff` - TIFF images
- `image/webp` - WebP images

Only image media types are processed. Other media types are automatically skipped.

## Configuration

### Config Schema

```python
class PaddleOcrVlExtractorConfig(BaseModel):
    backend: InferenceBackendConfig = InferenceBackendConfig()
    min_confidence: float = 0.5
    joiner: str = "\n"
    use_angle_cls: bool = True
    lang: str = "en"
```

### Configuration Options

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `backend.mode` | str | `local` | Inference mode: `local` or `api` |
| `backend.api_provider` | str | `huggingface` | API provider (when mode is `api`) |
| `backend.api_key` | str or null | `null` | API key (or use env var) |
| `backend.model_id` | str or null | `null` | Model ID for API inference |
| `min_confidence` | float | `0.5` | Minimum confidence threshold (0.0-1.0) |
| `joiner` | str | `"\n"` | String to join recognized lines |
| `use_angle_cls` | bool | `true` | Use angle classification for rotated text |
| `lang` | str | `"en"` | Language code (`en`, `ch`, `japan`, `korean`, etc.) |

## Usage

### Command Line

#### Basic Usage (Local)

```bash
# Extract with local inference
biblicus extract my-corpus --extractor ocr-paddleocr-vl
```

#### API-Based Inference

```bash
# Extract using HuggingFace API
export HUGGINGFACE_API_KEY="your-key-here"

biblicus extract my-corpus --extractor ocr-paddleocr-vl \
  --config 'backend={"mode":"api","api_provider":"huggingface"}'
```

#### Language Configuration

```bash
# Process Chinese text
biblicus extract my-corpus --extractor ocr-paddleocr-vl \
  --config lang=ch

# Process Japanese text
biblicus extract my-corpus --extractor ocr-paddleocr-vl \
  --config lang=japan
```

#### Configuration File

```yaml
extractor_id: ocr-paddleocr-vl
config:
  backend:
    mode: local
  min_confidence: 0.6
  use_angle_cls: true
  lang: en
```

```bash
biblicus extract my-corpus --configuration configuration.yml
```

### Python API

```python
from biblicus import Corpus

# Load corpus
corpus = Corpus.from_directory("my-corpus")

# Extract with defaults (local inference)
results = corpus.extract_text(extractor_id="ocr-paddleocr-vl")

# Extract with API backend
results = corpus.extract_text(
    extractor_id="ocr-paddleocr-vl",
    config={
        "backend": {
            "mode": "api",
            "api_provider": "huggingface"
        },
        "min_confidence": 0.7
    }
)

# Extract Chinese text
results = corpus.extract_text(
    extractor_id="ocr-paddleocr-vl",
    config={"lang": "ch"}
)
```

### In Pipeline

#### OCR with Fallback

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: ocr-paddleocr-vl
    - extractor_id: select-text
```

#### Multi-Language Processing

```yaml
extractor_id: select-smart-override
config:
  default_extractor: ocr-rapidocr
  overrides:
    - media_type_pattern: "image/.*"
      extractor: ocr-paddleocr-vl
      config:
        lang: ch
```

## Examples

### Chinese Document Processing

Extract Chinese text with high accuracy:

```bash
biblicus extract chinese-docs --extractor ocr-paddleocr-vl \
  --config lang=ch
```

### Rotated Text Handling

Use angle classification for rotated images:

```bash
biblicus extract rotated-images --extractor ocr-paddleocr-vl \
  --config use_angle_cls=true
```

### API-Based Processing

Use HuggingFace API for serverless OCR:

```python
from biblicus import Corpus
import os

os.environ["HUGGINGFACE_API_KEY"] = "your-key"

corpus = Corpus.from_directory("images")

results = corpus.extract_text(
    extractor_id="ocr-paddleocr-vl",
    config={
        "backend": {
            "mode": "api",
            "api_provider": "huggingface"
        }
    }
)
```

### High-Confidence Extraction

Only include very confident results:

```bash
biblicus extract images --extractor ocr-paddleocr-vl \
  --config min_confidence=0.8
```

## Inference Backends

### Local Inference

**Pros:**
- Full control over processing
- No API costs
- Works offline
- Supports all configuration options

**Cons:**
- Requires installing PaddleOCR
- Uses local compute resources
- Slower initial model loading

### API Inference (HuggingFace)

**Pros:**
- No local dependencies
- Serverless/scalable
- No model download required

**Cons:**
- Requires API key
- API rate limits apply
- Network dependency
- Limited configuration options

## Language Support

PaddleOCR-VL supports many languages:

- `en` - English
- `ch` - Chinese (Simplified)
- `chinese_cht` - Chinese (Traditional)
- `japan` - Japanese
- `korean` - Korean
- `latin` - Latin script languages
- `arabic` - Arabic
- `cyrillic` - Cyrillic script languages
- `devanagari` - Devanagari script languages

And many more. See PaddleOCR documentation for the full list.

## Performance

### Local Inference
- **Speed**: Moderate (2-5 seconds per image)
- **Memory**: High (model loaded in memory)
- **Accuracy**: Excellent for CJK, good for Latin

### API Inference
- **Speed**: Variable (depends on API latency)
- **Memory**: Minimal (no local model)
- **Accuracy**: Good (model-dependent)

## Error Handling

### Missing Dependency (Local Mode)

If PaddleOCR is not installed:

```
ExtractionRunFatalError: PaddleOCR-VL extractor (local mode) requires paddleocr.
Install it with pip install "biblicus[paddleocr]".
```

### Missing API Key (API Mode)

If API key is not configured:

```
ExtractionRunFatalError: PaddleOCR-VL extractor (API mode) requires an API key for HUGGINGFACE.
Set HUGGINGFACE_API_KEY environment variable or configure huggingface in user config.
```

### Non-Image Items

Non-image items are silently skipped (returns `None`).

### No Text Recognized

Images without recognizable text produce empty extracted text and are counted in `extracted_empty_items`.

## Use Cases

### Chinese Document Processing

Ideal for Chinese text extraction:

```bash
biblicus extract chinese-docs --extractor ocr-paddleocr-vl \
  --config lang=ch
```

### Japanese Manga/Comics

Extract Japanese text from comics:

```bash
biblicus extract manga --extractor ocr-paddleocr-vl \
  --config lang=japan
```

### Multi-Language Corpora

Process documents in multiple languages:

```python
from biblicus import Corpus

corpus = Corpus.from_directory("multilingual")

results = corpus.extract_text(
    extractor_id="ocr-paddleocr-vl",
    config={"lang": "ch"}  # or detect automatically
)
```

### Rotated Document Photos

Handle photos taken at angles:

```bash
biblicus extract photos --extractor ocr-paddleocr-vl \
  --config use_angle_cls=true
```

## When to Use PaddleOCR-VL vs Alternatives

### Use PaddleOCR-VL when:
- Processing CJK languages (Chinese, Japanese, Korean)
- Text may be rotated or skewed
- You need better accuracy than basic OCR
- Complex layouts require VL understanding

### Use RapidOCR when:
- Processing primarily English text
- Speed is more important than accuracy
- Simple layouts with clear text
- Local inference only

### Use VLM extractors when:
- Documents have complex visual layouts
- You need table/equation understanding
- Highest accuracy is critical
- Multi-modal understanding is needed

## Configuration via User Config

Configure API keys in `~/.biblicus/config.yml`:

```yaml
huggingface:
  api_key: YOUR_KEY_HERE
```

Or use environment variables:

```bash
export HUGGINGFACE_API_KEY="your-key"
```

## Best Practices

### Choose Appropriate Language

Set the `lang` parameter to match your content:

```yaml
config:
  lang: ch  # For Chinese content
```

### Tune Confidence Threshold

Test different thresholds on samples:

```bash
biblicus extract test-images --extractor ocr-paddleocr-vl \
  --config min_confidence=0.7
```

### Use Local for Batch Processing

For large corpora, local inference is more cost-effective:

```yaml
config:
  backend:
    mode: local
```

### Use API for Quick Tests

For small jobs or testing, API mode is convenient:

```yaml
config:
  backend:
    mode: api
    api_provider: huggingface
```

## Related Extractors

### Same Category

- [ocr-rapidocr](rapidocr.md) - Fast traditional OCR

### Alternatives

- [docling-smol](../vlm-document/docling-smol.md) - Fast VLM for documents
- [docling-granite](../vlm-document/docling-granite.md) - High-accuracy VLM
- [markitdown](../text-document/markitdown.md) - Office document conversion

### Pipeline Utilities

- [select-text](../pipeline-utilities/select-text.md) - First non-empty selection
- [select-longest-text](../pipeline-utilities/select-longest.md) - Choose longest output
- [select-smart-override](../pipeline-utilities/select-smart-override.md) - Media type routing
- [pipeline](../pipeline-utilities/pipeline.md) - Multi-step extraction

## See Also

- [OCR Extractors Overview](index.md)
- [Extractors Index](../index.md)
- [extraction.md](../../extraction.md) - Extraction pipeline concepts
- [Inference Backend Configuration](../../user-configuration.md)
- [PaddleOCR GitHub](https://github.com/PaddlePaddle/PaddleOCR)