Unstructured Extractor
Extractor ID: unstructured
Category: Text/Document Extractors
Overview
The Unstructured extractor uses the Unstructured.io library to parse a wide variety of document formats. It is designed as a universal, last-resort extractor that provides broad format coverage when specialized extractors aren’t suitable.
Unstructured provides robust handling of diverse document types including Office documents, PDFs, HTML, emails, and many others. Like MarkItDown, it automatically skips text items to preserve canonical text handling.
Installation
Unstructured is an optional dependency:
pip install "biblicus[unstructured]"
System Requirements
Unstructured may require additional system libraries depending on document types:
# Ubuntu/Debian
sudo apt-get install libmagic-dev poppler-utils tesseract-ocr
# macOS
brew install libmagic poppler tesseract
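Before a long extraction run, it can help to confirm that the command-line tools from these packages are on PATH. The following is a minimal sketch and not part of Biblicus: the helper name `find_missing_tools` is hypothetical, and the executable names (`tesseract` from tesseract-ocr, `pdftoppm` from poppler) are assumed to be the ones those packages normally install. libmagic is a shared library rather than an executable, so this check does not cover it.

```python
import shutil

# Executables assumed to be installed by the tesseract and poppler packages.
DEFAULT_TOOLS = ("tesseract", "pdftoppm")


def find_missing_tools(tools=DEFAULT_TOOLS):
    """Return the names of the tools that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All checked tools were found.")
```

An empty result only means the executables were found; it does not guarantee that every document type will parse.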
Supported Media Types
Unstructured supports an extensive range of formats:
Office Documents
application/vnd.openxmlformats-officedocument.wordprocessingml.document - DOCX
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet - XLSX
application/vnd.openxmlformats-officedocument.presentationml.presentation - PPTX
application/msword - DOC
application/vnd.ms-excel - XLS
application/vnd.ms-powerpoint - PPT
Documents
application/pdf - PDF files
text/html - HTML documents
application/xml - XML documents
text/csv - CSV files
Email
message/rfc822 - EML files
application/vnd.ms-outlook - MSG files
Images
image/png, image/jpeg, image/tiff
Other image formats (with OCR support)
Rich Text
application/rtf - RTF documents
text/rtf - Rich text format
And Many More
Unstructured’s auto-partitioning attempts to handle virtually any document format.
The extractor automatically skips text items (text/plain, text/markdown) to avoid interfering with the pass-through extractor.
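The skip rule can be pictured as a simple media-type check. This is a sketch under stated assumptions, not the extractor's actual code: the helper name `is_text_item` is hypothetical, the two media types come from the sentence above, and stripping parameters such as `charset` is an assumption about how media types are normalized.

```python
# Media types left to the pass-through extractor, as listed above.
SKIPPED_MEDIA_TYPES = {"text/plain", "text/markdown"}


def is_text_item(media_type: str) -> bool:
    """Return True when an item should be skipped as a text item."""
    # Assumption: drop parameters such as "; charset=utf-8" and ignore case.
    base_type = media_type.split(";", 1)[0].strip().lower()
    return base_type in SKIPPED_MEDIA_TYPES
```

Note that other `text/*` types such as `text/html` and `text/csv` are not skipped, since they appear in the supported list.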
Configuration
Config Schema
class UnstructuredExtractorConfig(BaseModel):
    # Version zero provides no configuration options
    pass
Configuration Options
This extractor currently accepts no configuration. Later versions may expose Unstructured library options.
Usage
Command Line
Basic Usage
# Extract from diverse document formats
biblicus extract my-corpus --extractor unstructured
Configuration File
extractor_id: unstructured
config: {}
biblicus extract my-corpus --configuration configuration.yml
Python API
from biblicus import Corpus
# Load corpus
corpus = Corpus.from_directory("my-corpus")
# Extract with Unstructured
results = corpus.extract_text(extractor_id="unstructured")
In Pipeline
Universal Fallback
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: unstructured  # Catch-all for remaining formats
    - extractor_id: select-text
Last Resort Extraction
extractor_id: pipeline
config:
  stages:
    - extractor_id: markitdown
    - extractor_id: unstructured
    - extractor_id: select-longest-text
Examples
Mixed Format Archive
Process a heterogeneous document collection:
biblicus extract archive --extractor unstructured
Email Corpus
Extract text from email archives:
biblicus extract emails --extractor unstructured
Legacy Document Migration
Handle old file formats:
from biblicus import Corpus
corpus = Corpus.from_directory("legacy-files")
results = corpus.extract_text(extractor_id="unstructured")
Comprehensive Pipeline
Maximum format coverage with multiple extractors:
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: markitdown
    - extractor_id: unstructured
    - extractor_id: select-longest-text
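To make the role of the final stage concrete, here is a hedged sketch of what a "select the longest text" step could look like. The behavior is inferred only from the stage name, and the function `select_longest_text` below is a hypothetical illustration rather than the Biblicus implementation. It assumes each earlier stage contributes either a string or None for a given item.

```python
from typing import Optional, Sequence


def select_longest_text(candidates: Sequence[Optional[str]]) -> Optional[str]:
    """Pick the longest non-empty candidate, or None when all are empty."""
    usable = [text for text in candidates if text and text.strip()]
    if not usable:
        return None
    # max() keeps the first of equally long candidates, so earlier stages win ties.
    return max(usable, key=len)
```

Under this assumption, adding Unstructured to the pipeline can only help an item when the other stages produce nothing or produce shorter text.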
Output Format
Unstructured produces plain text by extracting element text:
Element Processing
Documents are partitioned into elements (paragraphs, tables, etc.)
Text is extracted from each element
Elements are joined with newlines
Empty elements are filtered out
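The steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the extractor's source: `Element` is a hypothetical stand-in for the element objects the Unstructured library returns (for example from `unstructured.partition.auto.partition`), and treating whitespace-only text as empty is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned document element."""
    text: str


def join_element_text(elements: Iterable[Element]) -> str:
    """Join the non-empty text of each element with newlines."""
    lines = []
    for element in elements:
        # Assumption: whitespace-only text counts as empty.
        text = (element.text or "").strip()
        if text:
            lines.append(text)
    return "\n".join(lines)


elements = [
    Element("Heading 1"),
    Element(""),
    Element("This is a paragraph with some text."),
    Element("   "),
]
print(join_element_text(elements))
```

The two empty elements are dropped, so the printed result contains only the heading and the paragraph, one per line.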
Example Output
Input (DOCX with mixed content):
Heading 1
This is a paragraph with some text.
• Bullet point 1
• Bullet point 2
Table content...
Output (Plain Text):
Heading 1
This is a paragraph with some text.
Bullet point 1
Bullet point 2
Table content...
Performance
Speed: Moderate to slow (2-30 seconds per document)
Memory: Moderate to high (depends on document complexity)
Format Coverage: Excellent (broadest coverage)
Unstructured is slower than specialized extractors but handles virtually any format.
Error Handling
Missing Dependency
If Unstructured is not installed:
ExtractionRunFatalError: Unstructured extractor requires an optional dependency.
Install it with pip install "biblicus[unstructured]".
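A check of this kind is commonly written as an import guard. The sketch below shows one way to do it and is not taken from the Biblicus source: the function `require_optional_dependency` is hypothetical, and `ExtractionRunFatalError` is redefined locally as a stand-in so the example runs on its own.

```python
import importlib.util


class ExtractionRunFatalError(RuntimeError):
    """Local stand-in for the Biblicus error of the same name."""


def require_optional_dependency(
    module_name: str, extractor_name: str, install_command: str
) -> None:
    """Raise a fatal error when a top-level optional module is not installed."""
    if importlib.util.find_spec(module_name) is None:
        raise ExtractionRunFatalError(
            f"{extractor_name} extractor requires an optional dependency. "
            f"Install it with {install_command}."
        )


if __name__ == "__main__":
    try:
        require_optional_dependency(
            "unstructured", "Unstructured", 'pip install "biblicus[unstructured]"'
        )
        print("Unstructured is available.")
    except ExtractionRunFatalError as error:
        print(error)
```

Failing before any item is processed keeps the error at the run level instead of repeating it for every item.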
Missing System Libraries
If required system libraries are missing, you may see errors related to specific document types. Install the missing libraries as described under System Requirements.
Text Items
Text items are silently skipped (the extractor returns None for them) to preserve pass-through extractor behavior.
Unsupported Formats
Files that Unstructured cannot process produce empty extracted text and are counted in extracted_empty_items.
Per-Item Errors
Processing errors for individual items are recorded but don’t halt extraction.
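The three per-item outcomes described in this section (skipped, empty, errored) can be illustrated with a small loop. This is a hypothetical sketch, not the Biblicus run loop: `RunStats`, `run_extraction`, and the `skipped_items` and `extracted_nonempty_items` counters are invented for illustration, while `extracted_empty_items` and `errored_items` reuse the statistic names that appear in this document.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Optional


@dataclass
class RunStats:
    """Hypothetical per-run counters."""
    extracted_nonempty_items: int = 0
    extracted_empty_items: int = 0
    skipped_items: int = 0
    errored_items: int = 0
    errors: Dict[str, str] = field(default_factory=dict)


def run_extraction(
    item_ids: Iterable[str], extract: Callable[[str], Optional[str]]
) -> RunStats:
    """Apply extract() to every item, recording failures without stopping."""
    stats = RunStats()
    for item_id in item_ids:
        try:
            text = extract(item_id)
        except Exception as error:
            # Record the failure and move on to the next item.
            stats.errored_items += 1
            stats.errors[item_id] = str(error)
            continue
        if text is None:
            stats.skipped_items += 1  # e.g. a text item
        elif text.strip():
            stats.extracted_nonempty_items += 1
        else:
            stats.extracted_empty_items += 1  # e.g. an unsupported format
    return stats
```

Catching the exception inside the loop is what lets one corrupt file fail without discarding the results for the rest of the corpus.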
Use Cases
Universal Document Processing
Handle any document type:
biblicus extract everything --extractor unstructured
Email Archives
Extract text from email collections:
biblicus extract email-archive --extractor unstructured
Legacy Format Migration
Process old or uncommon file formats:
biblicus extract old-docs --extractor unstructured
Fallback Extractor
Use as last resort in pipelines:
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: markitdown
    - extractor_id: unstructured
    - extractor_id: select-text
When to Use Unstructured vs Alternatives
Use Unstructured when:
Format coverage is most important
You need to handle diverse, unknown formats
Specialized extractors don’t support your formats
You want a universal fallback
Use MarkItDown when:
Processing primarily Office documents
Python 3.10+ is available
You want Markdown-formatted output
Speed is more important
Use specialized extractors when:
You know your document formats
Speed is critical
You need format-specific features
Use VLM extractors when:
Documents have complex visual layouts
You need deep document understanding
Accuracy is more important than speed
Best Practices
Test Format Support
Always test on representative samples:
biblicus extract test-corpus --extractor unstructured
Install System Dependencies
Ensure required system libraries are installed for full format support.
Use as Fallback
Position Unstructured as a catch-all in pipelines:
extractor_id: pipeline
config:
  stages:
    - extractor_id: specialized-extractor
    - extractor_id: unstructured  # Fallback
    - extractor_id: select-text
Monitor Performance
Track extraction time for large corpora:
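import time
from biblicus import Corpus
corpus = Corpus.from_directory("my-corpus")
# perf_counter is a monotonic clock, which suits elapsed-time measurement
start = time.perf_counter()
results = corpus.extract_text(extractor_id="unstructured")
elapsed = time.perf_counter() - start
print(f"Extraction took {elapsed:.2f} seconds")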
Handle Empty Results
Check statistics for unsupported formats:
print(f"Empty items: {results.stats.extracted_empty_items}")
print(f"Errored items: {results.stats.errored_items}")
Comparison with Other Extractors
| Feature | Unstructured | MarkItDown | PDF-Text | VLM |
|---|---|---|---|---|
| Format Coverage | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐ | ⭐⭐⭐⭐ |
| Speed | ⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐ |
| Accuracy | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Setup Complexity | ⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐ |
See Also
extraction.md - Extraction pipeline concepts