PaddleOCR-VL Extractor
Extractor ID: ocr-paddleocr-vl
Category: OCR Extractors
Overview
The PaddleOCR-VL extractor uses PaddleOCR’s vision-language model for optical character recognition. It provides enhanced accuracy for complex layouts and multilingual text, especially Chinese, Japanese, and Korean (CJK) languages.
PaddleOCR-VL combines traditional OCR with vision-language understanding to achieve better results on challenging images. It supports both local inference and API-based processing via HuggingFace Inference API.
Installation
Local Inference
For local processing, install the PaddleOCR library:
pip install "biblicus[paddleocr]"
API-Based Inference
For API-based processing via HuggingFace:
pip install biblicus
No additional dependencies are required, but you'll need a HuggingFace API key.
Supported Media Types
image/png - PNG images
image/jpeg - JPEG/JPG images
image/gif - GIF images
image/bmp - BMP images
image/tiff - TIFF images
image/webp - WebP images
Only image media types are processed. Other media types are automatically skipped.
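The media-type gate can be sketched as a simple set-membership check (a hypothetical helper for illustration, not the actual biblicus internals):

```python
# Media types the extractor processes; everything else is skipped.
SUPPORTED_MEDIA_TYPES = {
    "image/png", "image/jpeg", "image/gif",
    "image/bmp", "image/tiff", "image/webp",
}

def should_process(media_type: str) -> bool:
    """Return True only for supported image media types."""
    return media_type in SUPPORTED_MEDIA_TYPES
```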
Configuration
Config Schema
class PaddleOcrVlExtractorConfig(BaseModel):
    backend: InferenceBackendConfig = InferenceBackendConfig()
    min_confidence: float = 0.5
    joiner: str = "\n"
    use_angle_cls: bool = True
    lang: str = "en"
Configuration Options
| Option | Type | Default | Description |
|---|---|---|---|
| backend.mode | str | local | Inference mode: local or api |
| backend.api_provider | str | huggingface | API provider (when mode is api) |
| backend.api_key | str or null | null | API key (or use env var) |
| backend.model_id | str or null | null | Model ID for API inference |
| min_confidence | float | 0.5 | Minimum confidence threshold (0.0-1.0) |
| joiner | str | "\n" | String to join recognized lines |
| use_angle_cls | bool | true | Use angle classification for rotated text |
| lang | str | en | Language code (en, ch, japan, korean, ...) |
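How min_confidence and joiner interact can be sketched as a small post-processing step over (text, confidence) line pairs (a hypothetical helper; the real internals may differ):

```python
def assemble_text(lines, min_confidence=0.5, joiner="\n"):
    """Drop lines below the confidence threshold and join the rest."""
    kept = [text for text, conf in lines if conf >= min_confidence]
    return joiner.join(kept)

# The low-confidence middle line is discarded.
result = assemble_text(
    [("Hello", 0.92), ("???", 0.31), ("world", 0.88)],
    min_confidence=0.5,
)  # -> "Hello\nworld"
```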
Usage
Command Line
Basic Usage (Local)
# Extract with local inference
biblicus extract my-corpus --extractor ocr-paddleocr-vl
API-Based Inference
# Extract using HuggingFace API
export HUGGINGFACE_API_KEY="your-key-here"
biblicus extract my-corpus --extractor ocr-paddleocr-vl \
--config 'backend={"mode":"api","api_provider":"huggingface"}'
Language Configuration
# Process Chinese text
biblicus extract my-corpus --extractor ocr-paddleocr-vl \
--config lang=ch
# Process Japanese text
biblicus extract my-corpus --extractor ocr-paddleocr-vl \
--config lang=japan
Configuration File
extractor_id: ocr-paddleocr-vl
config:
  backend:
    mode: local
  min_confidence: 0.6
  use_angle_cls: true
  lang: en
biblicus extract my-corpus --configuration configuration.yml
Python API
from biblicus import Corpus

# Load corpus
corpus = Corpus.from_directory("my-corpus")

# Extract with defaults (local inference)
results = corpus.extract_text(extractor_id="ocr-paddleocr-vl")

# Extract with API backend
results = corpus.extract_text(
    extractor_id="ocr-paddleocr-vl",
    config={
        "backend": {
            "mode": "api",
            "api_provider": "huggingface",
        },
        "min_confidence": 0.7,
    },
)

# Extract Chinese text
results = corpus.extract_text(
    extractor_id="ocr-paddleocr-vl",
    config={"lang": "ch"},
)
In Pipeline
OCR with Fallback
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: ocr-paddleocr-vl
    - extractor_id: select-text
Multi-Language Processing
extractor_id: select-smart-override
config:
  default_extractor: ocr-rapidocr
  overrides:
    - media_type_pattern: "image/.*"
      extractor: ocr-paddleocr-vl
      config:
        lang: ch
Examples
Chinese Document Processing
Extract Chinese text with high accuracy:
biblicus extract chinese-docs --extractor ocr-paddleocr-vl \
--config lang=ch
Rotated Text Handling
Use angle classification for rotated images:
biblicus extract rotated-images --extractor ocr-paddleocr-vl \
--config use_angle_cls=true
API-Based Processing
Use HuggingFace API for serverless OCR:
from biblicus import Corpus
import os

os.environ["HUGGINGFACE_API_KEY"] = "your-key"

corpus = Corpus.from_directory("images")
results = corpus.extract_text(
    extractor_id="ocr-paddleocr-vl",
    config={
        "backend": {
            "mode": "api",
            "api_provider": "huggingface",
        }
    },
)
High-Confidence Extraction
Only include very confident results:
biblicus extract images --extractor ocr-paddleocr-vl \
--config min_confidence=0.8
Inference Backends
Local Inference
Pros:
Full control over processing
No API costs
Works offline
Supports all configuration options
Cons:
Requires installing PaddleOCR
Uses local compute resources
Slower initial model loading
API Inference (HuggingFace)
Pros:
No local dependencies
Serverless/scalable
No model download required
Cons:
Requires API key
API rate limits apply
Network dependency
Limited configuration options
Language Support
PaddleOCR-VL supports many languages:
en - English
ch - Chinese (Simplified)
chinese_cht - Chinese (Traditional)
japan - Japanese
korean - Korean
latin - Latin script languages
arabic - Arabic
cyrillic - Cyrillic script languages
devanagari - Devanagari script languages
And many more. See PaddleOCR documentation for the full list.
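For corpora described by human-readable language names, a small lookup can map them to the codes above (a hypothetical convenience helper built from the list in this section, not an official or exhaustive mapping):

```python
# Map common language names to PaddleOCR lang codes (subset from the docs above).
LANG_CODES = {
    "english": "en",
    "chinese": "ch",
    "traditional chinese": "chinese_cht",
    "japanese": "japan",
    "korean": "korean",
    "arabic": "arabic",
}

def lang_code(name: str) -> str:
    """Return the PaddleOCR code for a language name, defaulting to English."""
    return LANG_CODES.get(name.lower(), "en")
```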
Performance
Local Inference
Speed: Moderate (2-5 seconds per image)
Memory: High (model loaded in memory)
Accuracy: Excellent for CJK, good for Latin
API Inference
Speed: Variable (depends on API latency)
Memory: Minimal (no local model)
Accuracy: Good (model-dependent)
Error Handling
Missing Dependency (Local Mode)
If PaddleOCR is not installed:
ExtractionRunFatalError: PaddleOCR-VL extractor (local mode) requires paddleocr.
Install it with pip install "biblicus[paddleocr]".
Missing API Key (API Mode)
If API key is not configured:
ExtractionRunFatalError: PaddleOCR-VL extractor (API mode) requires an API key for HUGGINGFACE.
Set HUGGINGFACE_API_KEY environment variable or configure huggingface in user config.
Non-Image Items
Non-image items are silently skipped (the extractor returns None for them).
No Text Recognized
Images without recognizable text produce empty extracted text and are counted in extracted_empty_items.
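The three outcomes above (text, empty text, skipped) can be tallied with a sketch like this (hypothetical result shapes; the real run statistics object may differ):

```python
def tally(results):
    """Count extracted, empty, and skipped items from per-item OCR results."""
    stats = {"extracted": 0, "extracted_empty_items": 0, "skipped": 0}
    for text in results:
        if text is None:        # non-image item, skipped
            stats["skipped"] += 1
        elif text == "":        # image with no recognizable text
            stats["extracted_empty_items"] += 1
        else:
            stats["extracted"] += 1
    return stats
```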
Use Cases
Chinese Document Processing
Ideal for Chinese text extraction:
biblicus extract chinese-docs --extractor ocr-paddleocr-vl \
--config lang=ch
Japanese Manga/Comics
Extract Japanese text from comics:
biblicus extract manga --extractor ocr-paddleocr-vl \
--config lang=japan
Multi-Language Corpora
Process documents in multiple languages:
from biblicus import Corpus

corpus = Corpus.from_directory("multilingual")
results = corpus.extract_text(
    extractor_id="ocr-paddleocr-vl",
    config={"lang": "ch"},  # or detect automatically
)
Rotated Document Photos
Handle photos taken at angles:
biblicus extract photos --extractor ocr-paddleocr-vl \
--config use_angle_cls=true
When to Use PaddleOCR-VL vs Alternatives
Use PaddleOCR-VL when:
Processing CJK languages (Chinese, Japanese, Korean)
Text may be rotated or skewed
You need better accuracy than basic OCR
Complex layouts require VL understanding
Use RapidOCR when:
Processing primarily English text
Speed is more important than accuracy
Simple layouts with clear text
Local inference only
Use VLM extractors when:
Documents have complex visual layouts
You need table/equation understanding
Highest accuracy is critical
Multi-modal understanding is needed
Configuration via User Config
Configure API keys in ~/.biblicus/config.yml:
huggingface:
  api_key: YOUR_KEY_HERE
Or use environment variables:
export HUGGINGFACE_API_KEY="your-key"
Best Practices
Choose Appropriate Language
Set the lang parameter to match your content:
config:
  lang: ch  # For Chinese content
Tune Confidence Threshold
Test different thresholds on samples:
biblicus extract test-images --extractor ocr-paddleocr-vl \
--config min_confidence=0.7
Use Local for Batch Processing
For large corpora, local inference is more cost-effective:
config:
  backend:
    mode: local
Use API for Quick Tests
For small jobs or testing, API mode is convenient:
config:
  backend:
    mode: api
    api_provider: huggingface
See Also
extraction.md - Extraction pipeline concepts