PDF Text Extractor

Extractor ID: pdf-text

Category: Text/Document Extractors

Overview

The PDF text extractor uses PyPDF to extract embedded text from PDF documents. It works best with digital PDFs that contain selectable text layers, providing fast extraction without OCR overhead.

This extractor is ideal for text-based PDFs like reports, papers, and ebooks. For scanned PDFs or images within PDFs, consider using an OCR extractor or VLM-based approach instead.

Installation

The PyPDF library is included as a core dependency:

pip install biblicus

No additional dependencies are required.

Supported Media Types

  • application/pdf - PDF documents

Only PDF items are processed. Other media types are automatically skipped.
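The skip rule can be pictured as a plain media-type check. The sketch below is illustrative only: should_process is a hypothetical helper, not part of the biblicus API, and the case and parameter normalisation it performs is an assumption that this page does not confirm.

```python
from typing import Optional

PDF_MEDIA_TYPE = "application/pdf"


def should_process(media_type: Optional[str]) -> bool:
    """Return True only for items whose base media type is application/pdf."""
    if media_type is None:
        return False
    # Assumption: ignore parameters such as "; charset=binary" and letter case.
    base_type = media_type.split(";", 1)[0].strip().lower()
    return base_type == PDF_MEDIA_TYPE
```

Anything for which this check is false would be skipped rather than reported as an error.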

Configuration

Config Schema

from typing import Optional
from pydantic import BaseModel

class PortableDocumentFormatTextExtractorConfig(BaseModel):
    max_pages: Optional[int] = None  # Maximum pages to extract (None = no limit)

Configuration Options

Option      Type          Default   Description
---------   -----------   -------   ------------------------------------------------------
max_pages   int or null   null      Maximum number of pages to process (unlimited if null)

Usage

Command Line

Basic Usage

# Extract text from PDF documents
biblicus extract my-corpus --extractor pdf-text

Custom Configuration

# Extract only first 10 pages of each PDF
biblicus extract my-corpus --extractor pdf-text \
  --config max_pages=10

Configuration File

# configuration.yml
extractor_id: pdf-text
config:
  max_pages: 50

# Run extraction with the configuration file
biblicus extract my-corpus --configuration configuration.yml

Python API

from biblicus import Corpus

# Load corpus
corpus = Corpus.from_directory("my-corpus")

# Extract with defaults
results = corpus.extract_text(extractor_id="pdf-text")

# Extract with page limit
results = corpus.extract_text(
    extractor_id="pdf-text",
    config={"max_pages": 20}
)

In Pipeline

PDF-First Fallback Chain

extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text       # Try fast text extraction
    - extractor_id: docling-smol   # Fallback to VLM for scanned PDFs
    - extractor_id: select-text

Media Type Routing

extractor_id: select-smart-override
config:
  default_extractor: pass-through-text
  overrides:
    - media_type_pattern: "application/pdf"
      extractor: pdf-text

Examples

Extract Academic Papers

Process a collection of research papers:

biblicus extract papers-corpus --extractor pdf-text

Extract First Pages Only

Useful for abstracts or summaries:

biblicus extract papers-corpus --extractor pdf-text \
  --config max_pages=2

Large Document Corpus

Limit pages for performance on large documents:

from biblicus import Corpus

corpus = Corpus.from_directory("ebooks")

# Extract first 100 pages of each book
results = corpus.extract_text(
    extractor_id="pdf-text",
    config={"max_pages": 100}
)

Hybrid PDF Pipeline

Combine fast text extraction with OCR fallback:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text
      config:
        max_pages: null
    - extractor_id: ocr-rapidocr
    - extractor_id: select-longest-text  # Choose best result
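To make the final stage of this pipeline concrete, here is a minimal sketch of a "longest text wins" selection. It assumes, based only on the stage name, that select-longest-text keeps the longest non-empty output from the earlier stages; the function and the stage_outputs dictionary are hypothetical and do not reflect the actual biblicus implementation.

```python
from typing import Dict, Optional


def select_longest_text(stage_outputs: Dict[str, Optional[str]]) -> Optional[str]:
    """Pick the longest non-blank text among the outputs of earlier stages."""
    # Ignore stages that produced nothing or only whitespace.
    candidates = [text for text in stage_outputs.values() if text and text.strip()]
    if not candidates:
        return None
    # max() keeps the first candidate when several share the same length.
    return max(candidates, key=len)
```

Under that assumption, a scanned PDF yields empty text from pdf-text and non-empty text from the OCR stage, so the OCR result is chosen.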

Behavior Details

Text Extraction Method

PyPDF extracts embedded text from PDF structure. It does not:

  • Perform OCR, so text inside embedded images is not extracted

  • Preserve complex formatting

  • Maintain table structures

Page Processing

Pages are processed sequentially. Text from each page is joined with newlines.
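The sequential join, together with the max_pages limit, can be sketched in a few lines. This is a simplified stand-in that works on a list of already-extracted page strings rather than calling PyPDF; join_page_texts is a hypothetical name, and stripping the final result is an assumption.

```python
from typing import List, Optional


def join_page_texts(page_texts: List[Optional[str]], max_pages: Optional[int] = None) -> str:
    """Join per-page text in document order with one newline between pages."""
    # A None limit means "all pages"; otherwise keep only the first max_pages.
    selected = page_texts if max_pages is None else page_texts[:max_pages]
    # Pages with no extractable text contribute an empty string.
    normalised = [text or "" for text in selected]
    # Assumption: the result is stripped, so all-empty input yields "".
    return "\n".join(normalised).strip()
```

For example, join_page_texts(["a", "b", "c"], max_pages=2) returns "a\nb".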

Empty Pages

Pages without extractable text produce empty strings. If all pages are empty, the entire document produces empty extracted text.

Encoding

PyPDF handles PDF text encoding internally. Output is always UTF-8.

Performance

  • Speed: Fast (5-50 pages/second depending on PDF complexity)

  • Memory: Moderate (entire PDF loaded into memory)

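  • Accuracy: High for digital PDFs with an embedded text layer (formatting and table structure are not preserved); scanned PDFs without a text layer yield no text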

This extractor is significantly faster than OCR or VLM approaches but only works with text-based PDFs.
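The throughput range above translates into a rough time budget. The arithmetic below is a back-of-the-envelope sketch: the corpus size (1,000 PDFs of 20 pages each) is an invented example, and estimate_seconds is a hypothetical helper.

```python
def estimate_seconds(total_pages: int, pages_per_second: float) -> float:
    """Estimate wall-clock seconds for a given page count and throughput."""
    if pages_per_second <= 0:
        raise ValueError("pages_per_second must be positive")
    return total_pages / pages_per_second


# Hypothetical corpus: 1,000 PDFs with 20 pages each.
total_pages = 1000 * 20

slow_estimate = estimate_seconds(total_pages, 5)   # 4000.0 seconds at 5 pages/second
fast_estimate = estimate_seconds(total_pages, 50)  # 400.0 seconds at 50 pages/second
```

So the same corpus could take anywhere from about 7 minutes to a little over an hour, depending on PDF complexity.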

Error Handling

Non-PDF Items

Non-PDF items are silently skipped: the extractor returns None for them and records no error.

Corrupt PDFs

Corrupt or malformed PDFs cause per-item errors recorded in errored_items but don’t halt extraction.

Password-Protected PDFs

The extractor has no password support, so encrypted PDFs cause per-item failures.

Scanned PDFs

Scanned PDFs (images without text layer) produce empty extracted text and are counted in extracted_empty_items.
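Taken together, the cases in this section amount to a per-item classification. The sketch below illustrates that logic with a stand-in extractor. The counter names errored_items and extracted_empty_items come from this page; skipped_items and extracted_nonempty_items, along with both functions, are hypothetical.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple


def summarise_run(items: Iterable[Tuple[str, bytes]], extract: Callable[[bytes], str]) -> Counter:
    """Classify each (media_type, payload) item the way this section describes."""
    counts: Counter = Counter()
    for media_type, payload in items:
        if media_type != "application/pdf":
            counts["skipped_items"] += 1  # non-PDF items are skipped silently
            continue
        try:
            text = extract(payload)
        except Exception:
            # Corrupt or encrypted PDFs fail per item without halting the run.
            counts["errored_items"] += 1
            continue
        if text.strip():
            counts["extracted_nonempty_items"] += 1
        else:
            counts["extracted_empty_items"] += 1  # e.g. scanned PDFs
    return counts


def fake_extract(payload: bytes) -> str:
    """Stand-in extractor: fails on b"corrupt", otherwise decodes the payload."""
    if payload == b"corrupt":
        raise ValueError("malformed PDF")
    return payload.decode("utf-8")


items = [
    ("application/pdf", b"hello"),
    ("application/pdf", b""),
    ("application/pdf", b"corrupt"),
    ("text/plain", b"note"),
]
summary = summarise_run(items, fake_extract)
```

Each of the four sample items lands in a different bucket, and the failing item does not stop the loop.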

Use Cases

Digital Documents

Ideal for born-digital PDFs:

biblicus extract reports-corpus --extractor pdf-text

Research Papers

Extract academic publications:

biblicus extract arxiv-corpus --extractor pdf-text

Ebooks

Process digital book collections:

biblicus extract ebooks --extractor pdf-text

Mixed PDF Corpus

Handle both digital and scanned PDFs with fallback:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text
    - extractor_id: docling-smol   # Handles scanned PDFs
    - extractor_id: select-longest-text

When to Use PDF Text vs Alternatives

Use pdf-text when:

  • PDFs contain embedded text (digital/born-digital)

  • Speed is important

  • You need simple, reliable extraction

  • PDFs are not scanned/image-based

Use VLM extractors when:

  • PDFs are scanned or image-based

  • You need layout understanding

  • PDFs contain complex tables or equations

  • Documents have multi-column layouts

Use OCR extractors when:

  • PDFs are scanned but layout is simple

  • You need faster processing than VLM

  • Documents are primarily text without complex structure
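The three lists above can be condensed into a small decision function. This is only a sketch of the guidance, not a biblicus feature: choose_extractor_family is a hypothetical helper, and "vlm" and "ocr" are labels for extractor families rather than extractor IDs.

```python
def choose_extractor_family(has_text_layer: bool, complex_layout: bool) -> str:
    """Map two document traits onto the guidance in the lists above."""
    if complex_layout:
        # Tables, equations, and multi-column layouts favour a VLM extractor.
        return "vlm"
    if has_text_layer:
        # Born-digital PDFs with simple layout: fast embedded-text extraction.
        return "pdf-text"
    # Scanned PDFs with simple layout: OCR is faster than a VLM.
    return "ocr"
```

For a mixed corpus where these traits vary per document, a fallback pipeline avoids having to decide up front.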

Best Practices

Test with Sample PDFs

Always test on representative samples:

# Extract a few PDFs to check quality
biblicus extract test-corpus --extractor pdf-text

Check for Empty Results

Monitor extracted_empty_items in statistics to identify scanned PDFs.
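One way to act on that counter is to compare it with the number of PDF items processed. The sketch below assumes both numbers are available as plain integers; the helper names and the 10% threshold are hypothetical, not biblicus defaults.

```python
def empty_ratio(extracted_empty_items: int, total_pdf_items: int) -> float:
    """Fraction of processed PDFs that produced no text."""
    if total_pdf_items == 0:
        return 0.0
    return extracted_empty_items / total_pdf_items


def needs_fallback(extracted_empty_items: int, total_pdf_items: int, threshold: float = 0.1) -> bool:
    """Suggest an OCR or VLM fallback when many PDFs come back empty."""
    return empty_ratio(extracted_empty_items, total_pdf_items) > threshold
```

For instance, 30 empty results out of 100 PDFs gives a ratio of 0.3, which suggests that a sizeable share of the corpus is scanned.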

Use Fallback for Mixed Corpora

Combine with OCR/VLM for heterogeneous PDF collections:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pdf-text
    - extractor_id: ocr-rapidocr
    - extractor_id: select-text

Consider Page Limits

For large documents, use max_pages to control processing time and memory:

extractor_id: pdf-text
config:
  max_pages: 500  # Reasonable limit for most use cases

See Also