PDF Text Extractor
Extractor ID: pdf-text
Category: Text/Document Extractors
Overview
The PDF text extractor uses PyPDF to extract embedded text from PDF documents. It works best with digital PDFs that contain selectable text layers, providing fast extraction without OCR overhead.
This extractor is ideal for text-based PDFs such as reports, papers, and ebooks. For scanned PDFs or images embedded in PDFs, use an OCR extractor or a vision-language model (VLM) extractor instead.
Installation
The PyPDF library is included as a core dependency:
pip install biblicus
No additional dependencies are required.
Supported Media Types
application/pdf - PDF documents
Only PDF items are processed. Other media types are automatically skipped.
Configuration
Config Schema
from typing import Optional
from pydantic import BaseModel
class PortableDocumentFormatTextExtractorConfig(BaseModel):
    max_pages: Optional[int] = None  # Maximum pages to extract (None = no limit)
Configuration Options
| Option | Type | Default | Description |
|---|---|---|---|
| max_pages | int or null | null | Maximum number of pages to process (unlimited if null) |
Usage
Command Line
Basic Usage
# Extract text from PDF documents
biblicus extract my-corpus --extractor pdf-text
Custom Configuration
# Extract only first 10 pages of each PDF
biblicus extract my-corpus --extractor pdf-text \
  --config max_pages=10
Configuration File
extractor_id: pdf-text
config:
  max_pages: 50
biblicus extract my-corpus --configuration configuration.yml
Python API
from biblicus import Corpus
# Load corpus
corpus = Corpus.from_directory("my-corpus")
# Extract with defaults
results = corpus.extract_text(extractor_id="pdf-text")
# Extract with page limit
results = corpus.extract_text(
    extractor_id="pdf-text",
    config={"max_pages": 20},
)
In Pipeline
PDF-First Fallback Chain
extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text      # Try fast text extraction
    - extractor_id: docling-smol  # Fallback to VLM for scanned PDFs
    - extractor_id: select-text
Media Type Routing
extractor_id: select-smart-override
config:
  default_extractor: pass-through-text
  overrides:
    - media_type_pattern: "application/pdf"
      extractor: pdf-text
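The routing idea in the configuration above can be sketched in a few lines of Python. This is an illustration only, not Biblicus code: the function name `route_extractor` is hypothetical, and the document does not say whether `media_type_pattern` is an exact string, a glob, or a regular expression. The sketch assumes glob-style matching, under which a literal value such as "application/pdf" behaves as an exact match, and it assumes the first matching override wins.

```python
from fnmatch import fnmatchcase
from typing import Sequence, Tuple


def route_extractor(
    media_type: str,
    overrides: Sequence[Tuple[str, str]],
    default_extractor: str,
) -> str:
    """Return the extractor for the first override whose pattern matches.

    ASSUMPTION: patterns are treated as case-sensitive globs. The real
    select-smart-override matching rules may differ.
    """
    for pattern, extractor in overrides:
        if fnmatchcase(media_type, pattern):
            return extractor
    # No override matched, so fall back to the configured default.
    return default_extractor


overrides = [("application/pdf", "pdf-text")]
print(route_extractor("application/pdf", overrides, "pass-through-text"))
print(route_extractor("text/plain", overrides, "pass-through-text"))
```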
Examples
Extract Academic Papers
Process a collection of research papers:
biblicus extract papers-corpus --extractor pdf-text
Extract First Pages Only
Useful for abstracts or summaries:
biblicus extract papers-corpus --extractor pdf-text \
  --config max_pages=2
Large Document Corpus
Limit pages for performance on large documents:
from biblicus import Corpus
corpus = Corpus.from_directory("ebooks")
# Extract first 100 pages of each book
results = corpus.extract_text(
    extractor_id="pdf-text",
    config={"max_pages": 100},
)
Hybrid PDF Pipeline
Combine fast text extraction with OCR fallback:
extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text
      config:
        max_pages: null
    - extractor_id: ocr-rapidocr
    - extractor_id: select-longest-text  # Choose best result
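To make the final selection stage concrete, here is a minimal sketch of what "choose the longest text" could mean. It is an assumption drawn from the stage name, not the actual select-longest-text implementation: the helper `select_longest_text` is hypothetical, it ignores empty or whitespace-only candidates, and it breaks ties in favor of the earlier stage.

```python
from typing import Optional, Sequence


def select_longest_text(candidates: Sequence[Optional[str]]) -> Optional[str]:
    """Pick the candidate with the most characters after trimming whitespace.

    None and whitespace-only candidates are ignored. Returns None when no
    stage produced usable text. On a tie, the earliest candidate wins
    because max() keeps the first maximal element it sees.
    """
    usable = [text for text in candidates if text and text.strip()]
    if not usable:
        return None
    return max(usable, key=lambda text: len(text.strip()))


# A scanned PDF: pdf-text yields nothing, the OCR stage yields text.
stage_outputs = ["", "Text recovered by the OCR stage"]
print(select_longest_text(stage_outputs))
```

With this rule, a digital PDF normally resolves to the pdf-text output, while a scanned PDF resolves to whatever the OCR stage recovered.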
Behavior Details
Text Extraction Method
PyPDF extracts embedded text from PDF structure. It does not:
Perform OCR on images
Preserve complex formatting
Extract text from images within PDFs
Maintain table structures
Page Processing
Pages are processed sequentially. Text from each page is joined with newlines.
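The page loop described above can be sketched as follows. This is not the extractor's actual source: `join_page_text` and `extract_pdf_text` are hypothetical helper names, and the handling of `max_pages` (stop after the first N pages, no limit when it is None) is an assumption based on the configuration table. The only library calls used are pypdf's `PdfReader`, its `pages` sequence, and `extract_text()` on each page.

```python
from itertools import islice
from typing import Iterable, Optional


def join_page_text(pages: Iterable, max_pages: Optional[int] = None) -> str:
    """Join the text of each page with newlines, in page order.

    `pages` is any iterable of objects exposing extract_text(), such as
    pypdf's PdfReader.pages. max_pages=None means no limit (assumption).
    """
    limited = pages if max_pages is None else islice(pages, max_pages)
    # Pages without extractable text contribute an empty string;
    # `or ""` also guards against a None return value.
    return "\n".join((page.extract_text() or "") for page in limited)


def extract_pdf_text(path: str, max_pages: Optional[int] = None) -> str:
    """Read a PDF from disk with pypdf and return its embedded text."""
    # Imported lazily so join_page_text stays dependency-free.
    from pypdf import PdfReader

    return join_page_text(PdfReader(path).pages, max_pages)
```

Calling `extract_pdf_text("report.pdf", max_pages=10)` (the file name is a placeholder) would return the text of the first ten pages as one newline-joined string.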
Empty Pages
Pages without extractable text produce empty strings. If all pages are empty, the entire document produces empty extracted text.
Encoding
PyPDF handles PDF text encoding internally. Output is always UTF-8.
Performance
Speed: Fast (5-50 pages/second depending on PDF complexity)
Memory: Moderate (entire PDF loaded into memory)
Accuracy: High for digital PDFs with an embedded text layer; scanned PDFs without a text layer yield no text
This extractor is significantly faster than OCR or VLM approaches but only works with text-based PDFs.
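Throughput varies with PDF complexity, so it is worth measuring on your own documents rather than relying on the range quoted above. The helper below is a hypothetical, generic timing sketch: `pages_per_second` is not part of Biblicus, and `extract_page` stands in for whatever per-page extraction call you want to measure.

```python
import time
from typing import Callable, Sequence


def pages_per_second(extract_page: Callable[[object], object], pages: Sequence) -> float:
    """Time extract_page over every page and return pages per second."""
    if not pages:
        return 0.0
    start = time.perf_counter()
    for page in pages:
        extract_page(page)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading from a coarse timer.
    return len(pages) / elapsed if elapsed > 0 else float("inf")
```

With pypdf, one way to use it would be `pages_per_second(lambda page: page.extract_text(), reader.pages)`, where `reader` is a `PdfReader` you have already opened.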
Error Handling
Non-PDF Items
Non-PDF items are skipped silently; the extractor returns None for them.
Corrupt PDFs
Corrupt or malformed PDFs produce per-item errors, which are recorded in errored_items. Extraction continues with the remaining items.
Password-Protected PDFs
The extractor has no password support, so encrypted PDFs cause per-item failures.
Scanned PDFs
Scanned PDFs (images without text layer) produce empty extracted text and are counted in extracted_empty_items.
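The four cases above can be summarized in a small sketch of a per-item extraction loop. It illustrates the described behavior and is not Biblicus internals: `ExtractionReport`, `run_extraction`, and the `skipped_items` counter are hypothetical, and only the names `errored_items` and `extracted_empty_items` come from this document. Encrypted PDFs are assumed to surface as exceptions, like corrupt files.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Tuple


@dataclass
class ExtractionReport:
    texts: Dict[str, str] = field(default_factory=dict)
    errored_items: Dict[str, str] = field(default_factory=dict)
    extracted_empty_items: int = 0
    skipped_items: int = 0  # hypothetical counter, for illustration only


def run_extraction(
    items: Iterable[Tuple[str, str, bytes]],
    extract: Callable[[bytes], str],
) -> ExtractionReport:
    """Apply `extract` to each (item_id, media_type, payload) tuple."""
    report = ExtractionReport()
    for item_id, media_type, payload in items:
        if media_type != "application/pdf":
            report.skipped_items += 1  # non-PDF items are skipped
            continue
        try:
            text = extract(payload)
        except Exception as error:  # corrupt or encrypted PDF
            # Record the failure and keep going with the next item.
            report.errored_items[item_id] = f"{type(error).__name__}: {error}"
            continue
        if not text.strip():
            report.extracted_empty_items += 1  # likely a scanned PDF
        report.texts[item_id] = text
    return report
```

The key design point is that a failure is recorded against the item that caused it, so one bad file never aborts the run.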
Use Cases
Digital Documents
Ideal for born-digital PDFs:
biblicus extract reports-corpus --extractor pdf-text
Research Papers
Extract academic publications:
biblicus extract arxiv-corpus --extractor pdf-text
Ebooks
Process digital book collections:
biblicus extract ebooks --extractor pdf-text
Mixed PDF Corpus
Handle both digital and scanned PDFs with fallback:
extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text
    - extractor_id: docling-smol  # Handles scanned PDFs
    - extractor_id: select-longest-text
When to Use PDF Text vs Alternatives
Use pdf-text when:
PDFs contain embedded text (digital/born-digital)
Speed is important
You need simple, reliable extraction
PDFs are not scanned/image-based
Use VLM extractors when:
PDFs are scanned or image-based
You need layout understanding
PDFs contain complex tables or equations
Documents have multi-column layouts
Use OCR extractors when:
PDFs are scanned but layout is simple
You need faster processing than VLM
Documents are primarily text without complex structure
Best Practices
Test with Sample PDFs
Always test on representative samples:
# Extract a few PDFs to check quality
biblicus extract test-corpus --extractor pdf-text
Check for Empty Results
Monitor extracted_empty_items in statistics to identify scanned PDFs.
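One way to act on that number is to turn it into a ratio and compare it with a threshold. The sketch below is hypothetical: it takes the counts as plain integers because the exact shape of the statistics output is not described here, and the 10% default threshold is an arbitrary illustration, not a recommended value.

```python
def empty_fraction(extracted_empty_items: int, total_items: int) -> float:
    """Fraction of processed items that produced empty text."""
    if total_items <= 0:
        return 0.0
    return extracted_empty_items / total_items


def needs_fallback(
    extracted_empty_items: int,
    total_items: int,
    threshold: float = 0.1,  # arbitrary example value
) -> bool:
    """True when enough items came back empty to justify an OCR/VLM stage."""
    return empty_fraction(extracted_empty_items, total_items) > threshold


print(needs_fallback(extracted_empty_items=5, total_items=20))
```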
Use Fallback for Mixed Corpora
Combine with OCR/VLM for heterogeneous PDF collections:
extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text
    - extractor_id: ocr-rapidocr
    - extractor_id: select-text
Consider Page Limits
For large documents, use max_pages to control processing time and memory:
extractor_id: pdf-text
config:
  max_pages: 500  # Reasonable limit for most use cases
See Also
extraction.md - Extraction pipeline concepts