Unstructured Extractor

Extractor ID: unstructured

Overview

The Unstructured extractor uses the Unstructured.io library to parse a wide variety of document formats. It’s designed as a universal, last-resort extractor with broad format coverage when specialized extractors aren’t suitable.

Unstructured provides robust handling of diverse document types including Office documents, PDFs, HTML, emails, and many others. Like MarkItDown, it automatically skips text items to preserve canonical text handling.

Installation

Unstructured is an optional dependency:

pip install "biblicus[unstructured]"

System Requirements

Unstructured may require additional system libraries depending on document types:

# Ubuntu/Debian
sudo apt-get install libmagic-dev poppler-utils tesseract-ocr

# macOS
brew install libmagic poppler tesseract

Supported Media Types

Unstructured supports an extensive range of formats:

Office Documents

application/vnd.openxmlformats-officedocument.wordprocessingml.document - DOCX
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet - XLSX
application/vnd.openxmlformats-officedocument.presentationml.presentation - PPTX
application/msword - DOC
application/vnd.ms-excel - XLS
application/vnd.ms-powerpoint - PPT

Documents

application/pdf - PDF files
text/html - HTML documents
application/xml - XML documents
text/csv - CSV files

Email

message/rfc822 - EML files
application/vnd.ms-outlook - MSG files

Images

image/png, image/jpeg, image/tiff
Other image formats (with OCR support)

Rich Text

application/rtf - RTF documents
text/rtf - Rich text format

And Many More

Unstructured’s auto-partitioning attempts to handle virtually any document format.

The extractor automatically skips text items (text/plain, text/markdown) to avoid interfering with the pass-through extractor.

Configuration

Config Schema

class UnstructuredExtractorConfig(BaseModel):
    # Version zero provides no configuration options
    pass

Configuration Options

This extractor currently accepts no configuration. Optional extensions may expose Unstructured library options.

Usage

Command Line

Basic Usage

# Extract from diverse document formats
biblicus extract my-corpus --extractor unstructured

Configuration File

extractor_id: unstructured
config: {}

biblicus extract my-corpus --configuration configuration.yml

Python API

from biblicus import Corpus

# Load corpus
corpus = Corpus.from_directory("my-corpus")

# Extract with Unstructured
results = corpus.extract_text(extractor_id="unstructured")

In Pipeline

Universal Fallback

extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: unstructured  # Catch-all for remaining formats
    - extractor_id: select-text

Last Resort Extraction

extractor_id: pipeline
config:
  stages:
    - extractor_id: markitdown
    - extractor_id: unstructured
    - extractor_id: select-longest-text

Examples

Mixed Format Archive

Process a heterogeneous document collection:

biblicus extract archive --extractor unstructured

Email Corpus

Extract text from email archives:

biblicus extract emails --extractor unstructured

Legacy Document Migration

Handle old file formats:

from biblicus import Corpus

corpus = Corpus.from_directory("legacy-files")
results = corpus.extract_text(extractor_id="unstructured")

Comprehensive Pipeline

Maximum format coverage with multiple extractors:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: pdf-text
    - extractor_id: markitdown
    - extractor_id: unstructured
    - extractor_id: select-longest-text

Output Format

Unstructured produces plain text by extracting element text:

Element Processing

Documents are partitioned into elements (paragraphs, tables, etc.)
Text is extracted from each element
Elements are joined with newlines
Empty elements are filtered out

Example Output

Input (DOCX with mixed content):

Heading 1

This is a paragraph with some text.

• Bullet point 1
• Bullet point 2

Table content...

Output (Plain Text):

Heading 1
This is a paragraph with some text.
Bullet point 1
Bullet point 2
Table content...

Performance

Speed: Moderate to slow (2-30 seconds per document)
Memory: Moderate to high (depends on document complexity)
Format Coverage: Excellent (broadest coverage)

Slower than specialized extractors but handles virtually any format.

Error Handling

Missing Dependency

If Unstructured is not installed:

ExtractionRunFatalError: Unstructured extractor requires an optional dependency.
Install it with pip install "biblicus[unstructured]".

Missing System Libraries

If required system libraries are missing, you may see errors related to specific document types. Install required dependencies per the installation section.

Text Items

Text items are silently skipped (returns None) to preserve pass-through extractor behavior.

Unsupported Formats

Files that Unstructured cannot process produce empty extracted text and are counted in extracted_empty_items.

Per-Item Errors

Processing errors for individual items are recorded but don’t halt extraction.

Use Cases

Universal Document Processing

Handle any document type:

biblicus extract everything --extractor unstructured

Email Archives

Extract text from email collections:

biblicus extract email-archive --extractor unstructured

Legacy Format Migration

Process old or uncommon file formats:

biblicus extract old-docs --extractor unstructured

Fallback Extractor

Use as last resort in pipelines:

extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: markitdown
    - extractor_id: unstructured
    - extractor_id: select-text

When to Use Unstructured vs Alternatives

Use Unstructured when:

Format coverage is most important
You need to handle diverse, unknown formats
Specialized extractors don’t support your formats
You want a universal fallback

Use MarkItDown when:

Processing primarily Office documents
Python 3.10+ is available
You want Markdown-formatted output
Speed is more important

Use specialized extractors when:

You know your document formats
Speed is critical
You need format-specific features

Use VLM extractors when:

Documents have complex visual layouts
You need deep document understanding
Accuracy is more important than speed

Best Practices

Test Format Support

Always test on representative samples:

biblicus extract test-corpus --extractor unstructured

Install System Dependencies

Ensure required system libraries are installed for full format support.

Use as Fallback

Position Unstructured as a catch-all in pipelines:

extractor_id: pipeline
config:
  stages:
    - extractor_id: specialized-extractor
    - extractor_id: unstructured  # Fallback
    - extractor_id: select-text

Monitor Performance

Track extraction time for large corpora:

import time

start = time.time()
results = corpus.extract_text(extractor_id="unstructured")
elapsed = time.time() - start
print(f"Extraction took {elapsed:.2f} seconds")

Handle Empty Results

Check statistics for unsupported formats:

print(f"Empty items: {results.stats.extracted_empty_items}")
print(f"Errored items: {results.stats.errored_items}")

Comparison with Other Extractors

Feature	Unstructured	MarkItDown	PDF-Text	VLM
Format Coverage	⭐⭐⭐⭐⭐	⭐⭐⭐⭐	⭐	⭐⭐⭐⭐
Speed	⭐⭐	⭐⭐⭐	⭐⭐⭐⭐⭐	⭐
Accuracy	⭐⭐⭐	⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐⭐⭐
Setup Complexity	⭐⭐	⭐⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐