PDF Text Extractor

Extractor ID: pdf-text

Overview

The PDF text extractor uses PyPDF to extract embedded text from PDF documents. It works best with digital PDFs that contain selectable text layers, providing fast extraction without OCR overhead.

This extractor is ideal for text-based PDFs like reports, papers, and ebooks. For scanned PDFs or images within PDFs, consider using an OCR extractor or VLM-based approach instead.

Installation

The PyPDF library is included as a core dependency:

pip install biblicus

No additional dependencies are required.

Supported Media Types

application/pdf - PDF documents

Only PDF items are processed. Other media types are automatically skipped.

Configuration

Config Schema

class PortableDocumentFormatTextExtractorConfig(BaseModel):
    max_pages: Optional[int] = None  # Maximum pages to extract

Configuration Options

Option	Type	Default	Description
`max_pages`	int or null	`null`	Maximum number of pages to process (unlimited if null)
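
To make the max_pages semantics concrete, here is a minimal sketch of how a page cap could be applied. The helper name pages_to_process is hypothetical and is not part of biblicus; it only mirrors the documented rule that null (None in Python) means unlimited.

```python
from typing import Optional


def pages_to_process(total_pages: int, max_pages: Optional[int] = None) -> int:
    """Return how many pages would be read under a max_pages cap.

    Hypothetical helper, not a biblicus function. None means no limit,
    matching the documented default.
    """
    if max_pages is None:
        return total_pages
    # Never report more pages than the document has, and never fewer than zero.
    return max(0, min(total_pages, max_pages))


print(pages_to_process(120))      # no cap: every page is read
print(pages_to_process(120, 10))  # capped at the first 10 pages
print(pages_to_process(3, 10))    # document is shorter than the cap
```

A cap larger than the document has no effect, so a single max_pages value can be shared across a corpus of mixed document lengths.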

Usage

Command Line

Basic Usage

# Extract text from PDF documents
biblicus extract my-corpus --extractor pdf-text

Custom Configuration

# Extract only first 10 pages of each PDF
biblicus extract my-corpus --extractor pdf-text \
  --config max_pages=10

Configuration File

extractor_id: pdf-text
config:
  max_pages: 50

biblicus extract my-corpus --configuration configuration.yml

Python API

from biblicus import Corpus

# Load corpus
corpus = Corpus.from_directory("my-corpus")

# Extract with defaults
results = corpus.extract_text(extractor_id="pdf-text")

# Extract with page limit
results = corpus.extract_text(
    extractor_id="pdf-text",
    config={"max_pages": 20}
)

In Pipeline

PDF-First Fallback Chain

extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text       # Try fast text extraction
    - extractor_id: docling-smol   # Fall back to VLM for scanned PDFs
    - extractor_id: select-text

Media Type Routing

extractor_id: select-smart-override
config:
  default_extractor: pass-through-text
  overrides:
    - media_type_pattern: "application/pdf"
      extractor: pdf-text

Examples

Extract Academic Papers

Process a collection of research papers:

biblicus extract papers-corpus --extractor pdf-text

Extract First Pages Only

Useful for abstracts or summaries:

biblicus extract papers-corpus --extractor pdf-text \
  --config max_pages=2

Large Document Corpus

Limit pages for performance on large documents:

from biblicus import Corpus

corpus = Corpus.from_directory("ebooks")

# Extract first 100 pages of each book
results = corpus.extract_text(
    extractor_id="pdf-text",
    config={"max_pages": 100}
)

Hybrid PDF Pipeline

Combine fast text extraction with OCR fallback:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text
      config:
        max_pages: null
    - extractor_id: ocr-rapidocr
    - extractor_id: select-longest-text  # Choose best result
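
As a rough mental model for the final selection stage, the sketch below picks the longest non-empty text among the earlier stage outputs. The function and the dictionary shape are assumptions made for illustration; the actual select-longest-text stage may break ties or handle whitespace differently.

```python
from typing import Dict, Optional


def select_longest_text(stage_outputs: Dict[str, Optional[str]]) -> Optional[str]:
    """Pick the longest non-empty text among stage outputs (assumed behavior).

    Keys are extractor IDs, values are the text each stage produced
    (None or "" when a stage produced nothing).
    """
    candidates = [text for text in stage_outputs.values() if text and text.strip()]
    if not candidates:
        return None
    # Compare by stripped length so padding whitespace does not win.
    return max(candidates, key=lambda text: len(text.strip()))


# A scanned PDF: pdf-text finds nothing, so the OCR output is selected.
print(select_longest_text({"pdf-text": "", "ocr-rapidocr": "Scanned page text"}))
```

Under this model a digital PDF normally keeps its pdf-text output, while a scanned PDF falls through to the OCR result because the pdf-text stage yields an empty string.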

Behavior Details

Text Extraction Method

PyPDF extracts embedded text from PDF structure. It does not:

Perform OCR on images
Preserve complex formatting
Extract text from images within PDFs
Maintain table structures

Page Processing

Pages are processed sequentially. Text from each page is joined with newlines.
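
The sequential read-and-join behavior can be sketched directly against pypdf. This is not the biblicus implementation: the function names are hypothetical, and the sketch assumes pypdf is installed and that page text is joined with a single newline, as described above.

```python
from itertools import islice
from typing import Iterable, Optional


def join_page_texts(page_texts: Iterable[Optional[str]]) -> str:
    """Join per-page text with newlines, treating missing text as empty."""
    return "\n".join(text or "" for text in page_texts)


def extract_pdf_text(path: str, max_pages: Optional[int] = None) -> str:
    """Read pages in order with pypdf and join their text.

    Sketch only: requires pypdf to be installed and is not the biblicus
    implementation.
    """
    from pypdf import PdfReader  # lazy import keeps join_page_texts dependency-free

    reader = PdfReader(path)
    # islice(..., None) yields every page; an integer stops after that many pages.
    pages = islice(reader.pages, max_pages)
    return join_page_texts(page.extract_text() for page in pages)


# A page without extractable text contributes an empty string between newlines.
print(repr(join_page_texts(["Page one", "", "Page three"])))
```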

Empty Pages

Pages without extractable text produce empty strings. If all pages are empty, the entire document produces empty extracted text.

Encoding

PyPDF handles PDF text encoding internally. Output is always UTF-8.

Performance

Speed: Fast (5-50 pages/second depending on PDF complexity)
Memory: Moderate (entire PDF loaded into memory)
Accuracy: High for digital PDFs with an embedded text layer (formatting and table structure are not preserved); scanned PDFs yield no text

This extractor is significantly faster than OCR or VLM approaches but only works with text-based PDFs.

Error Handling

Non-PDF Items

Non-PDF items are silently skipped (the extractor returns None for them).

Corrupt PDFs

Corrupt or malformed PDFs cause per-item errors recorded in errored_items but don’t halt extraction.

Password-Protected PDFs

The extractor has no password support, so encrypted PDFs cause per-item failures.

Scanned PDFs

Scanned PDFs (images without text layer) produce empty extracted text and are counted in extracted_empty_items.
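
The four cases above can be summarized as a small loop that never lets one bad item stop the run. The sketch is illustrative rather than the biblicus code: run_extraction and the extract callback are hypothetical, and only errored_items and extracted_empty_items are counter names taken from this page; skipped_items and extracted_nonempty_items are assumed names.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def run_extraction(
    items: Iterable[Tuple[str, str]],
    extract: Callable[[str], str],
) -> Tuple[Dict[str, int], List[Tuple[str, str]]]:
    """Tally outcomes per item; items are (item_id, media_type) pairs."""
    stats = {
        "extracted_nonempty_items": 0,  # assumed counter name
        "extracted_empty_items": 0,
        "skipped_items": 0,             # assumed counter name
        "errored_items": 0,
    }
    errors: List[Tuple[str, str]] = []
    for item_id, media_type in items:
        if media_type != "application/pdf":
            stats["skipped_items"] += 1  # non-PDF items are skipped silently
            continue
        try:
            text = extract(item_id)
        except Exception as exc:  # corrupt or encrypted PDF: record and move on
            stats["errored_items"] += 1
            errors.append((item_id, str(exc)))
            continue
        if text.strip():
            stats["extracted_nonempty_items"] += 1
        else:
            stats["extracted_empty_items"] += 1  # e.g. a scanned PDF
    return stats, errors


def fake_extract(item_id: str) -> str:
    """Stand-in extractor used only for this demonstration."""
    if item_id == "corrupt.pdf":
        raise ValueError("malformed PDF")
    return {"report.pdf": "Quarterly results", "scan.pdf": ""}[item_id]


items = [
    ("report.pdf", "application/pdf"),
    ("scan.pdf", "application/pdf"),
    ("corrupt.pdf", "application/pdf"),
    ("notes.txt", "text/plain"),
]
stats, errors = run_extraction(items, fake_extract)
print(stats)
print(errors)
```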

Use Cases

Digital Documents

Ideal for born-digital PDFs:

biblicus extract reports-corpus --extractor pdf-text

Research Papers

Extract academic publications:

biblicus extract arxiv-corpus --extractor pdf-text

Ebooks

Process digital book collections:

biblicus extract ebooks --extractor pdf-text

Mixed PDF Corpus

Handle both digital and scanned PDFs with fallback:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text
    - extractor_id: docling-smol   # Handles scanned PDFs
    - extractor_id: select-longest-text

When to Use PDF Text vs Alternatives

Use pdf-text when:

PDFs contain embedded text (digital/born-digital)
Speed is important
You need simple, reliable extraction
PDFs are not scanned/image-based

Use VLM extractors when:

PDFs are scanned or image-based
You need layout understanding
PDFs contain complex tables or equations
Documents have multi-column layouts

Use OCR extractors when:

PDFs are scanned but layout is simple
You need faster processing than VLM
Documents are primarily text without complex structure
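
The guidance above can be condensed into a small decision helper. This is a simplification for illustration: the function is hypothetical, and the return values "vlm" and "ocr" are family labels, not biblicus extractor IDs.

```python
def recommend_extractor_family(has_text_layer: bool, complex_layout: bool = False) -> str:
    """Map two document traits to an extractor family.

    Hypothetical helper encoding the lists above: complex layouts (tables,
    equations, multiple columns) favor a VLM; simple scanned pages favor OCR.
    """
    if complex_layout:
        return "vlm"
    if has_text_layer:
        return "pdf-text"
    return "ocr"


print(recommend_extractor_family(has_text_layer=True))                        # digital report
print(recommend_extractor_family(has_text_layer=False))                       # simple scan
print(recommend_extractor_family(has_text_layer=False, complex_layout=True))  # scanned tables
```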

Best Practices

Test with Sample PDFs

Always test on representative samples:

# Extract a few PDFs to check quality
biblicus extract test-corpus --extractor pdf-text

Check for Empty Results

Monitor extracted_empty_items in statistics to identify scanned PDFs.
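
One way to act on that counter is to compute the share of PDFs that produced no text. The sketch below assumes the statistics are available as a plain mapping; extracted_nonempty_items is an assumed counter name, and the real statistics object may be shaped differently.

```python
from typing import Mapping


def empty_share(stats: Mapping[str, int]) -> float:
    """Return the fraction of extracted items that produced empty text."""
    empty = stats.get("extracted_empty_items", 0)
    nonempty = stats.get("extracted_nonempty_items", 0)  # assumed counter name
    total = empty + nonempty
    return empty / total if total else 0.0


stats = {"extracted_empty_items": 12, "extracted_nonempty_items": 36}
share = empty_share(stats)
print(f"{share:.0%} of extracted PDFs produced no text")
```

A high share suggests the corpus contains many scanned PDFs and would benefit from an OCR or VLM fallback stage.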

Use Fallback for Mixed Corpora

Combine with OCR/VLM for heterogeneous PDF collections:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text
    - extractor_id: ocr-rapidocr
    - extractor_id: select-text

Consider Page Limits

For large documents, use max_pages to control processing time and memory:

extractor_id: pdf-text
config:
  max_pages: 500  # Reasonable limit for most use cases