# PDF Text Extractor **Extractor ID:** `pdf-text` **Category:** [Text/Document Extractors](index.md) ## Overview The PDF text extractor uses PyPDF to extract embedded text from PDF documents. It works best with digital PDFs that contain selectable text layers, providing fast extraction without OCR overhead. This extractor is ideal for text-based PDFs like reports, papers, and ebooks. For scanned PDFs or images within PDFs, consider using an OCR extractor or VLM-based approach instead. ## Installation The PyPDF library is included as a core dependency: ```bash pip install biblicus ``` No additional dependencies required. ## Supported Media Types - `application/pdf` - PDF documents Only PDF items are processed. Other media types are automatically skipped. ## Configuration ### Config Schema ```python class PortableDocumentFormatTextExtractorConfig(BaseModel): max_pages: Optional[int] = None # Maximum pages to extract ``` ### Configuration Options | Option | Type | Default | Description | |--------|------|---------|-------------| | `max_pages` | int or null | `null` | Maximum number of pages to process (unlimited if null) | ## Usage ### Command Line #### Basic Usage ```bash # Extract text from PDF documents biblicus extract my-corpus --extractor pdf-text ``` #### Custom Configuration ```bash # Extract only first 10 pages of each PDF biblicus extract my-corpus --extractor pdf-text \ --config max_pages=10 ``` #### Configuration File ```yaml extractor_id: pdf-text config: max_pages: 50 ``` ```bash biblicus extract my-corpus --configuration configuration.yml ``` ### Python API ```python from biblicus import Corpus # Load corpus corpus = Corpus.from_directory("my-corpus") # Extract with defaults results = corpus.extract_text(extractor_id="pdf-text") # Extract with page limit results = corpus.extract_text( extractor_id="pdf-text", config={"max_pages": 20} ) ``` ### In Pipeline #### PDF-First Fallback Chain ```yaml extractor_id: pipeline config: stages: - extractor_id: pdf-text # Try fast text extraction - extractor_id: docling-smol # Fallback to VLM for scanned PDFs - extractor_id: select-text ``` #### Media Type Routing ```yaml extractor_id: select-smart-override config: default_extractor: pass-through-text overrides: - media_type_pattern: "application/pdf" extractor: pdf-text ``` ## Examples ### Extract Academic Papers Process a collection of research papers: ```bash biblicus extract papers-corpus --extractor pdf-text ``` ### Extract First Pages Only Useful for abstracts or summaries: ```bash biblicus extract papers-corpus --extractor pdf-text \ --config max_pages=2 ``` ### Large Document Corpus Limit pages for performance on large documents: ```python from biblicus import Corpus corpus = Corpus.from_directory("ebooks") # Extract first 100 pages of each book results = corpus.extract_text( extractor_id="pdf-text", config={"max_pages": 100} ) ``` ### Hybrid PDF Pipeline Combine fast text extraction with OCR fallback: ```yaml extractor_id: pipeline config: stages: - extractor_id: pdf-text config: max_pages: null - extractor_id: ocr-rapidocr - extractor_id: select-longest-text # Choose best result ``` ## Behavior Details ### Text Extraction Method PyPDF extracts embedded text from PDF structure. It does not: - Perform OCR on images - Preserve complex formatting - Extract text from images within PDFs - Maintain table structures ### Page Processing Pages are processed sequentially. Text from each page is joined with newlines. ### Empty Pages Pages without extractable text produce empty strings. If all pages are empty, the entire document produces empty extracted text. ### Encoding PyPDF handles PDF text encoding internally. Output is always UTF-8. ## Performance - **Speed**: Fast (5-50 pages/second depending on PDF complexity) - **Memory**: Moderate (entire PDF loaded into memory) - **Accuracy**: 100% for digital PDFs, 0% for scanned PDFs This extractor is significantly faster than OCR or VLM approaches but only works with text-based PDFs. ## Error Handling ### Non-PDF Items Non-PDF items are silently skipped (returns `None`). ### Corrupt PDFs Corrupt or malformed PDFs cause per-item errors recorded in `errored_items` but don't halt extraction. ### Password-Protected PDFs Encrypted PDFs without password support cause per-item failures. ### Scanned PDFs Scanned PDFs (images without text layer) produce empty extracted text and are counted in `extracted_empty_items`. ## Use Cases ### Digital Documents Ideal for born-digital PDFs: ```bash biblicus extract reports-corpus --extractor pdf-text ``` ### Research Papers Extract academic publications: ```bash biblicus extract arxiv-corpus --extractor pdf-text ``` ### Ebooks Process digital book collections: ```bash biblicus extract ebooks --extractor pdf-text ``` ### Mixed PDF Corpus Handle both digital and scanned PDFs with fallback: ```yaml extractor_id: pipeline config: stages: - extractor_id: pdf-text - extractor_id: docling-smol # Handles scanned PDFs - extractor_id: select-longest-text ``` ## When to Use PDF Text vs Alternatives ### Use pdf-text when: - PDFs contain embedded text (digital/born-digital) - Speed is important - You need simple, reliable extraction - PDFs are not scanned/image-based ### Use VLM extractors when: - PDFs are scanned or image-based - You need layout understanding - PDFs contain complex tables or equations - Documents have multi-column layouts ### Use OCR extractors when: - PDFs are scanned but layout is simple - You need faster processing than VLM - Documents are primarily text without complex structure ## Best Practices ### Test with Sample PDFs Always test on representative samples: ```bash # Extract a few PDFs to check quality biblicus extract test-corpus --extractor pdf-text ``` ### Check for Empty Results Monitor `extracted_empty_items` in statistics to identify scanned PDFs. ### Use Fallback for Mixed Corpora Combine with OCR/VLM for heterogeneous PDF collections: ```yaml extractor_id: pipeline config: stages: - extractor_id: pdf-text - extractor_id: ocr-rapidocr - extractor_id: select-text ``` ### Consider Page Limits For large documents, use `max_pages` to control processing time and memory: ```yaml extractor_id: pdf-text config: max_pages: 500 # Reasonable limit for most use cases ``` ## Related Extractors ### Same Category - [pass-through-text](pass-through.md) - Direct text file reading - [metadata-text](metadata.md) - Metadata-based text - [markitdown](markitdown.md) - Office document conversion - [unstructured](unstructured.md) - Universal document parser ### Alternatives for Scanned PDFs - [ocr-rapidocr](../ocr/rapidocr.md) - Fast OCR for scanned PDFs - [ocr-paddleocr-vl](../ocr/paddleocr-vl.md) - VL-enhanced OCR - [docling-smol](../vlm-document/docling-smol.md) - VLM for complex PDFs - [docling-granite](../vlm-document/docling-granite.md) - High-accuracy VLM ### Pipeline Utilities - [select-text](../pipeline-utilities/select-text.md) - First non-empty selection - [select-longest-text](../pipeline-utilities/select-longest.md) - Choose longest output - [select-smart-override](../pipeline-utilities/select-smart-override.md) - Intelligent routing - [pipeline](../pipeline-utilities/pipeline.md) - Multi-step extraction ## See Also - [Text/Document Extractors Overview](index.md) - [Extractors Index](../index.md) - [extraction.md](../../extraction.md) - Extraction pipeline concepts - [PyPDF Documentation](https://pypdf.readthedocs.io/)