PaddleOCR-VL Extractor

Extractor ID: ocr-paddleocr-vl

Category: OCR Extractors

Overview

The PaddleOCR-VL extractor uses PaddleOCR’s vision-language model for optical character recognition. It provides enhanced accuracy for complex layouts and multilingual text, especially Chinese, Japanese, and Korean (CJK) languages.

PaddleOCR-VL combines traditional OCR with vision-language understanding to achieve better results on challenging images. It supports both local inference and API-based processing via HuggingFace Inference API.

Installation

Local Inference

For local processing, install the PaddleOCR library:

pip install "biblicus[paddleocr]"

API-Based Inference

For API-based processing via HuggingFace:

pip install biblicus

No additional dependencies are required, but you will need a HuggingFace API key.

Supported Media Types

  • image/png - PNG images

  • image/jpeg - JPEG/JPG images

  • image/gif - GIF images

  • image/bmp - BMP images

  • image/tiff - TIFF images

  • image/webp - WebP images

Only image media types are processed. Other media types are automatically skipped.
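The media-type gate above can be sketched as a simple membership check. This is an illustrative sketch, not the extractor's actual code; `SUPPORTED_MEDIA_TYPES` and `should_process` are hypothetical names:

```python
# Hypothetical sketch of how an extractor might gate items by media type,
# skipping anything that is not a supported image format.
SUPPORTED_MEDIA_TYPES = {
    "image/png", "image/jpeg", "image/gif",
    "image/bmp", "image/tiff", "image/webp",
}

def should_process(media_type: str) -> bool:
    """Return True only for supported image media types."""
    return media_type in SUPPORTED_MEDIA_TYPES

print(should_process("image/png"))        # True
print(should_process("application/pdf"))  # False
```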

Configuration

Config Schema

class PaddleOcrVlExtractorConfig(BaseModel):
    backend: InferenceBackendConfig = InferenceBackendConfig()
    min_confidence: float = 0.5
    joiner: str = "\n"
    use_angle_cls: bool = True
    lang: str = "en"

Configuration Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| backend.mode | str | local | Inference mode: local or api |
| backend.api_provider | str | huggingface | API provider (when mode is api) |
| backend.api_key | str or null | null | API key (or use env var) |
| backend.model_id | str or null | null | Model ID for API inference |
| min_confidence | float | 0.5 | Minimum confidence threshold (0.0-1.0) |
| joiner | str | "\n" | String to join recognized lines |
| use_angle_cls | bool | true | Use angle classification for rotated text |
| lang | str | "en" | Language code (en, ch, japan, korean, etc.) |
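The interaction between min_confidence and joiner can be sketched as a post-processing step: OCR lines below the threshold are dropped, and the survivors are joined with the joiner string. This is an illustrative sketch under that assumption, not the extractor's actual implementation:

```python
# Hypothetical sketch: filter OCR lines by confidence, then join the rest.
def assemble_text(lines, min_confidence=0.5, joiner="\n"):
    """lines: iterable of (text, confidence) pairs from OCR output."""
    kept = [text for text, confidence in lines if confidence >= min_confidence]
    return joiner.join(kept)

ocr_lines = [("Invoice #42", 0.97), ("smudged word", 0.31), ("Total: $10", 0.88)]
print(assemble_text(ocr_lines, min_confidence=0.5))
# Invoice #42
# Total: $10
```

Raising min_confidence trades recall for precision: fewer lines survive, but the ones that do are more reliable.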

Usage

Command Line

Basic Usage (Local)

# Extract with local inference
biblicus extract my-corpus --extractor ocr-paddleocr-vl

API-Based Inference

# Extract using HuggingFace API
export HUGGINGFACE_API_KEY="your-key-here"

biblicus extract my-corpus --extractor ocr-paddleocr-vl \
  --config 'backend={"mode":"api","api_provider":"huggingface"}'

Language Configuration

# Process Chinese text
biblicus extract my-corpus --extractor ocr-paddleocr-vl \
  --config lang=ch

# Process Japanese text
biblicus extract my-corpus --extractor ocr-paddleocr-vl \
  --config lang=japan

Configuration File

extractor_id: ocr-paddleocr-vl
config:
  backend:
    mode: local
  min_confidence: 0.6
  use_angle_cls: true
  lang: en
Then run:

biblicus extract my-corpus --configuration configuration.yml

Python API

from biblicus import Corpus

# Load corpus
corpus = Corpus.from_directory("my-corpus")

# Extract with defaults (local inference)
results = corpus.extract_text(extractor_id="ocr-paddleocr-vl")

# Extract with API backend
results = corpus.extract_text(
    extractor_id="ocr-paddleocr-vl",
    config={
        "backend": {
            "mode": "api",
            "api_provider": "huggingface"
        },
        "min_confidence": 0.7
    }
)

# Extract Chinese text
results = corpus.extract_text(
    extractor_id="ocr-paddleocr-vl",
    config={"lang": "ch"}
)

In Pipeline

OCR with Fallback

extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: ocr-paddleocr-vl
    - extractor_id: select-text

Multi-Language Processing

extractor_id: select-smart-override
config:
  default_extractor: ocr-rapidocr
  overrides:
    - media_type_pattern: "image/.*"
      extractor: ocr-paddleocr-vl
      config:
        lang: ch

Examples

Chinese Document Processing

Extract Chinese text with high accuracy:

biblicus extract chinese-docs --extractor ocr-paddleocr-vl \
  --config lang=ch

Rotated Text Handling

Use angle classification for rotated images:

biblicus extract rotated-images --extractor ocr-paddleocr-vl \
  --config use_angle_cls=true

API-Based Processing

Use HuggingFace API for serverless OCR:

from biblicus import Corpus
import os

os.environ["HUGGINGFACE_API_KEY"] = "your-key"

corpus = Corpus.from_directory("images")

results = corpus.extract_text(
    extractor_id="ocr-paddleocr-vl",
    config={
        "backend": {
            "mode": "api",
            "api_provider": "huggingface"
        }
    }
)

High-Confidence Extraction

Only include very confident results:

biblicus extract images --extractor ocr-paddleocr-vl \
  --config min_confidence=0.8

Inference Backends

Local Inference

Pros:

  • Full control over processing

  • No API costs

  • Works offline

  • Supports all configuration options

Cons:

  • Requires installing PaddleOCR

  • Uses local compute resources

  • Slower initial model loading

API Inference (HuggingFace)

Pros:

  • No local dependencies

  • Serverless/scalable

  • No model download required

Cons:

  • Requires API key

  • API rate limits apply

  • Network dependency

  • Limited configuration options
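If you switch between the two backends depending on the environment, the trade-offs above can be encoded in a small helper. This is a hypothetical convenience function (not part of biblicus): it prefers the API when a HuggingFace key is present and falls back to local inference otherwise:

```python
import os

def choose_backend() -> dict:
    """Pick a backend config dict for the PaddleOCR-VL extractor.

    Hypothetical helper: uses the HuggingFace API when a key is
    available in the environment, otherwise runs locally.
    """
    if os.environ.get("HUGGINGFACE_API_KEY"):
        return {"mode": "api", "api_provider": "huggingface"}
    return {"mode": "local"}
```

The returned dict can be passed as the `backend` key of the extractor config shown in the Python API examples above.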

Language Support

PaddleOCR-VL supports many languages:

  • en - English

  • ch - Chinese (Simplified)

  • chinese_cht - Chinese (Traditional)

  • japan - Japanese

  • korean - Korean

  • latin - Latin script languages

  • arabic - Arabic

  • cyrillic - Cyrillic script languages

  • devanagari - Devanagari script languages

And many more. See the PaddleOCR documentation for the full list.

Performance

Local Inference

  • Speed: Moderate (2-5 seconds per image)

  • Memory: High (model loaded in memory)

  • Accuracy: Excellent for CJK, good for Latin

API Inference

  • Speed: Variable (depends on API latency)

  • Memory: Minimal (no local model)

  • Accuracy: Good (model-dependent)

Error Handling

Missing Dependency (Local Mode)

If PaddleOCR is not installed:

ExtractionRunFatalError: PaddleOCR-VL extractor (local mode) requires paddleocr.
Install it with pip install "biblicus[paddleocr]".

Missing API Key (API Mode)

If API key is not configured:

ExtractionRunFatalError: PaddleOCR-VL extractor (API mode) requires an API key for HUGGINGFACE.
Set HUGGINGFACE_API_KEY environment variable or configure huggingface in user config.

Non-Image Items

Non-image items are silently skipped (the extractor returns None for them).

No Text Recognized

Images without recognizable text produce empty extracted text and are counted in extracted_empty_items.

Use Cases

Chinese Document Processing

Ideal for Chinese text extraction:

biblicus extract chinese-docs --extractor ocr-paddleocr-vl \
  --config lang=ch

Japanese Manga/Comics

Extract Japanese text from comics:

biblicus extract manga --extractor ocr-paddleocr-vl \
  --config lang=japan

Multi-Language Corpora

Process documents in multiple languages:

from biblicus import Corpus

corpus = Corpus.from_directory("multilingual")

results = corpus.extract_text(
    extractor_id="ocr-paddleocr-vl",
    config={"lang": "ch"}  # or detect automatically
)

Rotated Document Photos

Handle photos taken at angles:

biblicus extract photos --extractor ocr-paddleocr-vl \
  --config use_angle_cls=true

When to Use PaddleOCR-VL vs Alternatives

Use PaddleOCR-VL when:

  • Processing CJK languages (Chinese, Japanese, Korean)

  • Text may be rotated or skewed

  • You need better accuracy than basic OCR

  • Complex layouts require VL understanding

Use RapidOCR when:

  • Processing primarily English text

  • Speed is more important than accuracy

  • Simple layouts with clear text

  • Local inference only

Use VLM extractors when:

  • Documents have complex visual layouts

  • You need table/equation understanding

  • Highest accuracy is critical

  • Multi-modal understanding is needed

Configuration via User Config

Configure API keys in ~/.biblicus/config.yml:

huggingface:
  api_key: YOUR_KEY_HERE

Or use environment variables:

export HUGGINGFACE_API_KEY="your-key"
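A plausible key-resolution order, consistent with the error message and options above, is: explicit extractor config first, then the environment variable, then the user config file. This is an assumption about precedence, sketched here for illustration; `resolve_api_key` is a hypothetical name, not a biblicus API:

```python
import os

def resolve_api_key(config_value=None, user_config=None):
    """Hypothetical sketch of API-key resolution precedence:
    explicit config > HUGGINGFACE_API_KEY env var > user config file."""
    if config_value:
        return config_value
    env_key = os.environ.get("HUGGINGFACE_API_KEY")
    if env_key:
        return env_key
    return (user_config or {}).get("huggingface", {}).get("api_key")
```

Under this assumed order, a key set directly in the extractor config always wins, which keeps per-run overrides possible even when a machine-wide key exists.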

Best Practices

Choose Appropriate Language

Set the lang parameter to match your content:

config:
  lang: ch  # For Chinese content

Tune Confidence Threshold

Test different thresholds on samples:

biblicus extract test-images --extractor ocr-paddleocr-vl \
  --config min_confidence=0.7

Use Local for Batch Processing

For large corpora, local inference is more cost-effective:

config:
  backend:
    mode: local

Use API for Quick Tests

For small jobs or testing, API mode is convenient:

config:
  backend:
    mode: api
    api_provider: huggingface

See Also