PaddleOCR-VL Extractor
Extractor ID: ocr-paddleocr-vl
Category: OCR Extractors
Overview
The PaddleOCR-VL extractor uses PaddleOCR’s vision-language model for optical character recognition. It provides enhanced accuracy for complex layouts and multilingual text, especially Chinese, Japanese, and Korean (CJK) languages.
PaddleOCR-VL combines traditional OCR with vision-language understanding to achieve better results on challenging images. It supports both local inference and API-based processing via HuggingFace Inference API.
Installation
Local Inference
For local processing, install the PaddleOCR library:
pip install "biblicus[paddleocr]"
API-Based Inference
For API-based processing via HuggingFace:
pip install biblicus
No additional dependencies are required, but you'll need a HuggingFace API key.
Supported Media Types
image/png - PNG images
image/jpeg - JPEG/JPG images
image/gif - GIF images
image/bmp - BMP images
image/tiff - TIFF images
image/webp - WebP images
Only image media types are processed. Other media types are automatically skipped.
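The media-type gate can be sketched as a simple set-membership check (a hypothetical helper for illustration, not the actual biblicus internals):

```python
# Media types the extractor processes; everything else is skipped.
SUPPORTED_MEDIA_TYPES = {
    "image/png", "image/jpeg", "image/gif",
    "image/bmp", "image/tiff", "image/webp",
}

def should_process(media_type: str) -> bool:
    """Return True only for supported image media types."""
    return media_type in SUPPORTED_MEDIA_TYPES
```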
Configuration
Config Schema
class PaddleOcrVlExtractorConfig(BaseModel):
    backend: InferenceBackendConfig = InferenceBackendConfig()
    min_confidence: float = 0.5
    joiner: str = "\n"
    use_angle_cls: bool = True
    lang: str = "en"
Configuration Options
| Option | Type | Default | Description |
|---|---|---|---|
| backend.mode | str | local | Inference mode: local or api |
| backend.api_provider | str | huggingface | API provider (when mode is api) |
| backend.api_key | str or null | null | API key (or use env var) |
| backend.model_id | str or null | null | Model ID for API inference |
| min_confidence | float | 0.5 | Minimum confidence threshold (0.0-1.0) |
| joiner | str | "\n" | String to join recognized lines |
| use_angle_cls | bool | true | Use angle classification for rotated text |
| lang | str | en | Language code (en, ch, japan, korean, ...) |
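How min_confidence and joiner interact can be sketched as a small post-processing step over (text, confidence) line pairs (a hypothetical helper; the real internals may differ):

```python
def assemble_text(lines, min_confidence=0.5, joiner="\n"):
    """Drop lines below the confidence threshold and join the rest."""
    kept = [text for text, conf in lines if conf >= min_confidence]
    return joiner.join(kept)

# The low-confidence middle line is discarded.
result = assemble_text(
    [("Hello", 0.92), ("???", 0.31), ("world", 0.88)],
    min_confidence=0.5,
)  # -> "Hello\nworld"
```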
Usage
Command Line
Basic Usage (Local)
# Extract with local inference
biblicus extract my-corpus --extractor ocr-paddleocr-vl
API-Based Inference
# Extract using HuggingFace API
export HUGGINGFACE_API_KEY="your-key-here"
biblicus extract my-corpus --extractor ocr-paddleocr-vl \
--config 'backend={"mode":"api","api_provider":"huggingface"}'
Language Configuration
# Process Chinese text
biblicus extract my-corpus --extractor ocr-paddleocr-vl \
--config lang=ch
# Process Japanese text
biblicus extract my-corpus --extractor ocr-paddleocr-vl \
--config lang=japan
Configuration File
extractor_id: ocr-paddleocr-vl
config:
  backend:
    mode: local
  min_confidence: 0.6
  use_angle_cls: true
  lang: en
biblicus extract my-corpus --configuration configuration.yml
Python API
from biblicus import Corpus

# Load corpus
corpus = Corpus.from_directory("my-corpus")

# Extract with defaults (local inference)
results = corpus.extract_text(extractor_id="ocr-paddleocr-vl")

# Extract with API backend
results = corpus.extract_text(
    extractor_id="ocr-paddleocr-vl",
    config={
        "backend": {
            "mode": "api",
            "api_provider": "huggingface",
        },
        "min_confidence": 0.7,
    },
)

# Extract Chinese text
results = corpus.extract_text(
    extractor_id="ocr-paddleocr-vl",
    config={"lang": "ch"},
)
In Pipeline
OCR with Fallback
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: ocr-paddleocr-vl
    - extractor_id: select-text
Multi-Language Processing
extractor_id: select-smart-override
config:
  default_extractor: ocr-rapidocr
  overrides:
    - media_type_pattern: "image/.*"
      extractor: ocr-paddleocr-vl
      config:
        lang: ch
Examples
Chinese Document Processing
Extract Chinese text with high accuracy:
biblicus extract chinese-docs --extractor ocr-paddleocr-vl \
--config lang=ch
Rotated Text Handling
Use angle classification for rotated images:
biblicus extract rotated-images --extractor ocr-paddleocr-vl \
--config use_angle_cls=true
API-Based Processing
Use HuggingFace API for serverless OCR:
from biblicus import Corpus
import os

os.environ["HUGGINGFACE_API_KEY"] = "your-key"

corpus = Corpus.from_directory("images")
results = corpus.extract_text(
    extractor_id="ocr-paddleocr-vl",
    config={
        "backend": {
            "mode": "api",
            "api_provider": "huggingface",
        }
    },
)
High-Confidence Extraction
Only include very confident results:
biblicus extract images --extractor ocr-paddleocr-vl \
--config min_confidence=0.8
Inference Backends
Local Inference
Pros:
Full control over processing
No API costs
Works offline
Supports all configuration options
Cons:
Requires installing PaddleOCR
Uses local compute resources
Slower initial model loading
API Inference (HuggingFace)
Pros:
No local dependencies
Serverless/scalable
No model download required
Cons:
Requires API key
API rate limits apply
Network dependency
Limited configuration options
Language Support
PaddleOCR-VL supports many languages:
en - English
ch - Chinese (Simplified)
chinese_cht - Chinese (Traditional)
japan - Japanese
korean - Korean
latin - Latin script languages
arabic - Arabic
cyrillic - Cyrillic script languages
devanagari - Devanagari script languages
And many more. See PaddleOCR documentation for the full list.
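For corpora described by human-readable language names, a small lookup can map them to the codes above (a hypothetical convenience helper built from the list in this section, not an official or exhaustive mapping):

```python
# Map common language names to PaddleOCR lang codes (subset from the docs above).
LANG_CODES = {
    "english": "en",
    "chinese": "ch",
    "traditional chinese": "chinese_cht",
    "japanese": "japan",
    "korean": "korean",
    "arabic": "arabic",
}

def lang_code(name: str) -> str:
    """Return the PaddleOCR code for a language name, defaulting to English."""
    return LANG_CODES.get(name.lower(), "en")
```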
Performance
Local Inference
Speed: Moderate (2-5 seconds per image)
Memory: High (model loaded in memory)
Accuracy: Excellent for CJK, good for Latin
API Inference
Speed: Variable (depends on API latency)
Memory: Minimal (no local model)
Accuracy: Good (model-dependent)
Error Handling
Missing Dependency (Local Mode)
If PaddleOCR is not installed:
ExtractionRunFatalError: PaddleOCR-VL extractor (local mode) requires paddleocr.
Install it with pip install "biblicus[paddleocr]".
Missing API Key (API Mode)
If API key is not configured:
ExtractionRunFatalError: PaddleOCR-VL extractor (API mode) requires an API key for HUGGINGFACE.
Set HUGGINGFACE_API_KEY environment variable or configure huggingface in user config.
Non-Image Items
Non-image items are silently skipped (the extractor returns None for them).
No Text Recognized
Images without recognizable text produce empty extracted text and are counted in extracted_empty_items.
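The three outcomes above (text, empty text, skipped) can be tallied with a sketch like this (hypothetical result shapes; the real run statistics object may differ):

```python
def tally(results):
    """Count extracted, empty, and skipped items from per-item OCR results."""
    stats = {"extracted": 0, "extracted_empty_items": 0, "skipped": 0}
    for text in results:
        if text is None:        # non-image item, skipped
            stats["skipped"] += 1
        elif text == "":        # image with no recognizable text
            stats["extracted_empty_items"] += 1
        else:
            stats["extracted"] += 1
    return stats
```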
Use Cases
Chinese Document Processing
Ideal for Chinese text extraction:
biblicus extract chinese-docs --extractor ocr-paddleocr-vl \
--config lang=ch
Japanese Manga/Comics
Extract Japanese text from comics:
biblicus extract manga --extractor ocr-paddleocr-vl \
--config lang=japan
Multi-Language Corpora
Process documents in multiple languages:
from biblicus import Corpus

corpus = Corpus.from_directory("multilingual")
results = corpus.extract_text(
    extractor_id="ocr-paddleocr-vl",
    config={"lang": "ch"},  # or detect automatically
)
Rotated Document Photos
Handle photos taken at angles:
biblicus extract photos --extractor ocr-paddleocr-vl \
--config use_angle_cls=true
When to Use PaddleOCR-VL vs Alternatives
Use PaddleOCR-VL when:
Processing CJK languages (Chinese, Japanese, Korean)
Text may be rotated or skewed
You need better accuracy than basic OCR
Complex layouts require VL understanding
Use RapidOCR when:
Processing primarily English text
Speed is more important than accuracy
Simple layouts with clear text
Local inference only
Use VLM extractors when:
Documents have complex visual layouts
You need table/equation understanding
Highest accuracy is critical
Multi-modal understanding is needed
Configuration via User Config
Configure API keys in ~/.biblicus/config.yml:
huggingface:
  api_key: YOUR_KEY_HERE
Or use environment variables:
export HUGGINGFACE_API_KEY="your-key"
Best Practices
Choose Appropriate Language
Set the lang parameter to match your content:
config:
  lang: ch  # For Chinese content
Tune Confidence Threshold
Test different thresholds on samples:
biblicus extract test-images --extractor ocr-paddleocr-vl \
--config min_confidence=0.7
Use Local for Batch Processing
For large corpora, local inference is more cost-effective:
config:
  backend:
    mode: local
Use API for Quick Tests
For small jobs or testing, API mode is convenient:
config:
  backend:
    mode: api
    api_provider: huggingface
See Also
extraction.md - Extraction pipeline concepts