# Unstructured Extractor **Extractor ID:** `unstructured` **Category:** [Text/Document Extractors](index.md) ## Overview The Unstructured extractor uses the Unstructured.io library to parse a wide variety of document formats. It's designed as a universal, last-resort extractor with broad format coverage when specialized extractors aren't suitable. Unstructured provides robust handling of diverse document types including Office documents, PDFs, HTML, emails, and many others. Like MarkItDown, it automatically skips text items to preserve canonical text handling. ## Installation Unstructured is an optional dependency: ```bash pip install "biblicus[unstructured]" ``` ### System Requirements Unstructured may require additional system libraries depending on document types: ```bash # Ubuntu/Debian sudo apt-get install libmagic-dev poppler-utils tesseract-ocr # macOS brew install libmagic poppler tesseract ``` ## Supported Media Types Unstructured supports an extensive range of formats: ### Office Documents - `application/vnd.openxmlformats-officedocument.wordprocessingml.document` - DOCX - `application/vnd.openxmlformats-officedocument.spreadsheetml.sheet` - XLSX - `application/vnd.openxmlformats-officedocument.presentationml.presentation` - PPTX - `application/msword` - DOC - `application/vnd.ms-excel` - XLS - `application/vnd.ms-powerpoint` - PPT ### Documents - `application/pdf` - PDF files - `text/html` - HTML documents - `application/xml` - XML documents - `text/csv` - CSV files ### Email - `message/rfc822` - EML files - `application/vnd.ms-outlook` - MSG files ### Images - `image/png`, `image/jpeg`, `image/tiff` - Other image formats (with OCR support) ### Rich Text - `application/rtf` - RTF documents - `text/rtf` - Rich text format ### And Many More Unstructured's auto-partitioning attempts to handle virtually any document format. The extractor automatically skips text items (`text/plain`, `text/markdown`) to avoid interfering with the pass-through extractor. ## Configuration ### Config Schema ```python class UnstructuredExtractorConfig(BaseModel): # Version zero provides no configuration options pass ``` ### Configuration Options This extractor currently accepts no configuration. Optional extensions may expose Unstructured library options. ## Usage ### Command Line #### Basic Usage ```bash # Extract from diverse document formats biblicus extract my-corpus --extractor unstructured ``` #### Configuration File ```yaml extractor_id: unstructured config: {} ``` ```bash biblicus extract my-corpus --configuration configuration.yml ``` ### Python API ```python from biblicus import Corpus # Load corpus corpus = Corpus.from_directory("my-corpus") # Extract with Unstructured results = corpus.extract_text(extractor_id="unstructured") ``` ### In Pipeline #### Universal Fallback ```yaml extractor_id: pipeline config: stages: - extractor_id: pass-through-text - extractor_id: pdf-text - extractor_id: unstructured # Catch-all for remaining formats - extractor_id: select-text ``` #### Last Resort Extraction ```yaml extractor_id: pipeline config: stages: - extractor_id: markitdown - extractor_id: unstructured - extractor_id: select-longest-text ``` ## Examples ### Mixed Format Archive Process a heterogeneous document collection: ```bash biblicus extract archive --extractor unstructured ``` ### Email Corpus Extract text from email archives: ```bash biblicus extract emails --extractor unstructured ``` ### Legacy Document Migration Handle old file formats: ```python from biblicus import Corpus corpus = Corpus.from_directory("legacy-files") results = corpus.extract_text(extractor_id="unstructured") ``` ### Comprehensive Pipeline Maximum format coverage with multiple extractors: ```yaml extractor_id: pipeline config: stages: - extractor_id: pass-through-text - extractor_id: pdf-text - extractor_id: markitdown - extractor_id: unstructured - extractor_id: select-longest-text ``` ## Output Format Unstructured produces plain text by extracting element text: ### Element Processing 1. Documents are partitioned into elements (paragraphs, tables, etc.) 2. Text is extracted from each element 3. Elements are joined with newlines 4. Empty elements are filtered out ### Example Output Input (DOCX with mixed content): ``` Heading 1 This is a paragraph with some text. • Bullet point 1 • Bullet point 2 Table content... ``` Output (Plain Text): ``` Heading 1 This is a paragraph with some text. Bullet point 1 Bullet point 2 Table content... ``` ## Performance - **Speed**: Moderate to slow (2-30 seconds per document) - **Memory**: Moderate to high (depends on document complexity) - **Format Coverage**: Excellent (broadest coverage) Slower than specialized extractors but handles virtually any format. ## Error Handling ### Missing Dependency If Unstructured is not installed: ``` ExtractionRunFatalError: Unstructured extractor requires an optional dependency. Install it with pip install "biblicus[unstructured]". ``` ### Missing System Libraries If required system libraries are missing, you may see errors related to specific document types. Install required dependencies per the installation section. ### Text Items Text items are silently skipped (returns `None`) to preserve pass-through extractor behavior. ### Unsupported Formats Files that Unstructured cannot process produce empty extracted text and are counted in `extracted_empty_items`. ### Per-Item Errors Processing errors for individual items are recorded but don't halt extraction. ## Use Cases ### Universal Document Processing Handle any document type: ```bash biblicus extract everything --extractor unstructured ``` ### Email Archives Extract text from email collections: ```bash biblicus extract email-archive --extractor unstructured ``` ### Legacy Format Migration Process old or uncommon file formats: ```bash biblicus extract old-docs --extractor unstructured ``` ### Fallback Extractor Use as last resort in pipelines: ```yaml extractor_id: pipeline config: stages: - extractor_id: pass-through-text - extractor_id: markitdown - extractor_id: unstructured - extractor_id: select-text ``` ## When to Use Unstructured vs Alternatives ### Use Unstructured when: - Format coverage is most important - You need to handle diverse, unknown formats - Specialized extractors don't support your formats - You want a universal fallback ### Use MarkItDown when: - Processing primarily Office documents - Python 3.10+ is available - You want Markdown-formatted output - Speed is more important ### Use specialized extractors when: - You know your document formats - Speed is critical - You need format-specific features ### Use VLM extractors when: - Documents have complex visual layouts - You need deep document understanding - Accuracy is more important than speed ## Best Practices ### Test Format Support Always test on representative samples: ```bash biblicus extract test-corpus --extractor unstructured ``` ### Install System Dependencies Ensure required system libraries are installed for full format support. ### Use as Fallback Position Unstructured as a catch-all in pipelines: ```yaml extractor_id: pipeline config: stages: - extractor_id: specialized-extractor - extractor_id: unstructured # Fallback - extractor_id: select-text ``` ### Monitor Performance Track extraction time for large corpora: ```python import time start = time.time() results = corpus.extract_text(extractor_id="unstructured") elapsed = time.time() - start print(f"Extraction took {elapsed:.2f} seconds") ``` ### Handle Empty Results Check statistics for unsupported formats: ```python print(f"Empty items: {results.stats.extracted_empty_items}") print(f"Errored items: {results.stats.errored_items}") ``` ## Comparison with Other Extractors | Feature | Unstructured | MarkItDown | PDF-Text | VLM | |---------|-------------|------------|----------|-----| | Format Coverage | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐ | ⭐⭐⭐⭐ | | Speed | ⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐ | | Accuracy | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | | Setup Complexity | ⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐ | ## Related Extractors ### Same Category - [pass-through-text](pass-through.md) - Direct text file reading - [metadata-text](metadata.md) - Metadata-based text - [pdf-text](pdf.md) - Fast PDF text extraction - [markitdown](markitdown.md) - Office document conversion ### Alternatives - [markitdown](markitdown.md) - Better for Office documents - [docling-smol](../vlm-document/docling-smol.md) - VLM for visual understanding - [docling-granite](../vlm-document/docling-granite.md) - High-accuracy VLM - [ocr-rapidocr](../ocr/rapidocr.md) - Fast OCR for images ### Pipeline Utilities - [select-text](../pipeline-utilities/select-text.md) - First non-empty selection - [select-longest-text](../pipeline-utilities/select-longest.md) - Choose longest output - [select-smart-override](../pipeline-utilities/select-smart-override.md) - Media type routing - [pipeline](../pipeline-utilities/pipeline.md) - Multi-step extraction ## See Also - [Text/Document Extractors Overview](index.md) - [Extractors Index](../index.md) - [extraction.md](../../extraction.md) - Extraction pipeline concepts - [Unstructured.io Documentation](https://unstructured-io.github.io/unstructured/)