MarkItDown Extractor
Extractor ID: markitdown
Category: Text/Document Extractors
Overview
The MarkItDown extractor uses Microsoft’s MarkItDown library to convert various document formats into Markdown-formatted text. It provides broad format coverage for Office documents, PDFs, images, and other file types.
MarkItDown is designed to produce clean, readable Markdown output from diverse sources. The extractor automatically skips text items so that the pass-through extractor remains the canonical handler for plain text.
Installation
MarkItDown is an optional dependency that requires Python 3.10 or higher:
pip install "biblicus[markitdown]"
Python Version Requirement
Minimum: Python 3.10
Recommended: Python 3.11 or higher
If you’re using Python 3.9 or earlier, use alternative extractors like unstructured.
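The version requirement can also be checked at runtime before an extractor is chosen. The following is a minimal sketch, not part of biblicus: `choose_extractor_id` is a hypothetical helper, and the only names taken from this page are the extractor IDs `markitdown` and `unstructured`.

```python
import sys

# Minimum interpreter version stated above for the MarkItDown extractor.
MARKITDOWN_MIN_VERSION = (3, 10)


def choose_extractor_id(version_info=None):
    """Return "markitdown" on Python 3.10+, otherwise "unstructured".

    Hypothetical helper for illustration; biblicus does not provide it.
    """
    if version_info is None:
        version_info = sys.version_info
    major_minor = tuple(version_info[:2])
    return "markitdown" if major_minor >= MARKITDOWN_MIN_VERSION else "unstructured"


if __name__ == "__main__":
    print(choose_extractor_id())
```

The returned ID could then be passed wherever an extractor ID is expected, for example as the `--extractor` argument.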
Supported Media Types
MarkItDown supports a wide range of formats:
Office Documents
application/vnd.openxmlformats-officedocument.wordprocessingml.document - DOCX
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet - XLSX
application/vnd.openxmlformats-officedocument.presentationml.presentation - PPTX
application/msword - DOC (legacy)
application/vnd.ms-excel - XLS (legacy)
application/vnd.ms-powerpoint - PPT (legacy)
Documents
application/pdf - PDF files
text/html - HTML documents
application/xhtml+xml - XHTML documents
Images
image/png, image/jpeg, image/gif
image/bmp, image/tiff, image/webp
Audio/Video
Various audio and video formats (converts metadata)
Archives
application/zip- ZIP archives (lists contents)
The extractor automatically skips text items (text/plain, text/markdown) to avoid interfering with the pass-through extractor.
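The skip rule can be pictured as a simple media-type check. This sketch is an assumption about the behavior described above, not the extractor's actual code: `is_text_item` is a hypothetical name, and ignoring parameters such as `charset` when comparing media types is a simplification.

```python
# Media types this page names as text items that the extractor skips.
TEXT_MEDIA_TYPES = {"text/plain", "text/markdown"}


def is_text_item(media_type: str) -> bool:
    """Return True when a media type is one of the skipped text types.

    Parameters such as "; charset=utf-8" are ignored (an assumption).
    """
    base_type = media_type.split(";", 1)[0].strip().lower()
    return base_type in TEXT_MEDIA_TYPES


if __name__ == "__main__":
    for media_type in ("text/plain; charset=utf-8", "text/html", "application/pdf"):
        action = "skip" if is_text_item(media_type) else "convert"
        print(f"{media_type}: {action}")
```

Note that `text/html` is not treated as a text item here, since it appears in the supported list above.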
Configuration
Config Schema
class MarkItDownExtractorConfig(BaseModel):
    enable_plugins: bool = False  # Enable MarkItDown plugin system
Configuration Options
| Option | Type | Default | Description |
|---|---|---|---|
| enable_plugins | bool | False | Enable MarkItDown’s plugin system for extended format support |
Usage
Command Line
Basic Usage
# Convert Office documents to Markdown
biblicus extract my-corpus --extractor markitdown
Custom Configuration
# Enable plugins for extended format support
biblicus extract my-corpus --extractor markitdown \
  --config enable_plugins=true
Configuration File
extractor_id: markitdown
config:
  enable_plugins: false
biblicus extract my-corpus --configuration configuration.yml
Python API
from biblicus import Corpus
# Load corpus
corpus = Corpus.from_directory("my-corpus")
# Extract with defaults
results = corpus.extract_text(extractor_id="markitdown")
# Extract with plugins enabled
results = corpus.extract_text(
    extractor_id="markitdown",
    config={"enable_plugins": True},
)
In Pipeline
Office Document Pipeline
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text  # Handle text files
    - extractor_id: markitdown  # Convert Office docs
    - extractor_id: select-text
Media Type Routing
extractor_id: select-smart-override
config:
  default_extractor: pass-through-text
  overrides:
    - media_type_pattern: "application/vnd.openxmlformats.*"
      extractor: markitdown
    - media_type_pattern: "application/pdf"
      extractor: pdf-text
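To make the routing rule concrete, the sketch below resolves a media type against the overrides from the configuration above. It assumes patterns are regular expressions matched against the whole media type and that the first matching override wins; neither detail is confirmed by this page, and `resolve_extractor` is a hypothetical function, not a biblicus API.

```python
import re

DEFAULT_EXTRACTOR = "pass-through-text"

# (pattern, extractor) pairs copied from the configuration above.
OVERRIDES = [
    (r"application/vnd.openxmlformats.*", "markitdown"),
    (r"application/pdf", "pdf-text"),
]


def resolve_extractor(media_type: str) -> str:
    """Return the extractor for a media type (first matching override wins)."""
    for pattern, extractor in OVERRIDES:
        if re.fullmatch(pattern, media_type):
            return extractor
    return DEFAULT_EXTRACTOR


if __name__ == "__main__":
    docx = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    for media_type in (docx, "application/pdf", "application/msword", "text/plain"):
        print(f"{media_type} -> {resolve_extractor(media_type)}")
```

Under these assumptions, legacy types such as `application/msword` are not matched by the first pattern and fall through to the default extractor, so a separate override would be needed to route them to markitdown.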
Examples
Office Document Collection
Convert DOCX, XLSX, PPTX files to Markdown:
biblicus extract office-docs --extractor markitdown
Mixed Format Corpus
Handle text, Office, and PDF documents:
from biblicus import Corpus
corpus = Corpus.from_directory("mixed-docs")
results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "pass-through-text"},
            {"extractor_id": "markitdown"},
            {"extractor_id": "select-text"},
        ]
    },
)
PowerPoint Presentations
Extract text from presentation decks:
biblicus extract presentations --extractor markitdown
Excel Spreadsheets
Convert spreadsheet data to Markdown tables:
biblicus extract spreadsheets --extractor markitdown
Output Format
MarkItDown produces Markdown-formatted text that preserves document structure:
Document Elements
Headings: Converted to Markdown headers (#, ##, etc.)
Lists: Preserved as Markdown lists
Tables: Converted to Markdown tables
Links: Preserved as Markdown links
Bold/Italic: Converted to Markdown emphasis
Example Output
Input (DOCX):
Title: Project Report
Subtitle: Q4 2024
Key Findings:
- Revenue increased 25%
- User growth exceeded targets
Output (Markdown):
# Project Report
## Q4 2024
Key Findings:
- Revenue increased 25%
- User growth exceeded targets
Performance
Speed: Moderate (1-10 seconds per document)
Memory: Moderate (depends on document size)
Format Coverage: Excellent (Office, PDF, images, archives)
Faster than VLM approaches but slower than simple text extraction.
Error Handling
Missing Dependency
If MarkItDown is not installed:
ExtractionRunFatalError: MarkItDown extractor requires an optional dependency.
Install it with pip install "biblicus[markitdown]".
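A caller can detect the missing dependency up front instead of waiting for the fatal error. This sketch uses only the standard library; it assumes the optional dependency is importable under the module name `markitdown`, and `module_available` is a hypothetical helper.

```python
import importlib.util


def module_available(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if module_available("markitdown"):
        print("markitdown is installed")
    else:
        print('markitdown is missing; run: pip install "biblicus[markitdown]"')
```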
Python Version Mismatch
If Python version is below 3.10:
ExtractionRunFatalError: MarkItDown requires Python 3.10 or higher.
Upgrade your interpreter or use a compatible extractor.
Text Items
Text items are silently skipped (the extractor returns None for them) to preserve pass-through extractor behavior.
Unsupported Formats
Files that MarkItDown cannot process produce empty extracted text and are counted in extracted_empty_items.
Per-Item Errors
Processing errors for individual items are recorded but don’t halt extraction.
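The behavior described in this section (skip text items, count empty results, record errors without halting) can be sketched as a loop. Everything here is hypothetical: `run_extraction`, the `convert` callback, `fake_convert`, and the returned dictionary are illustrative stand-ins, and only the counter name `extracted_empty_items` comes from this page.

```python
from typing import Callable, Dict, List, Optional, Tuple


def run_extraction(
    items: List[Tuple[str, str]],
    convert: Callable[[str], Optional[str]],
) -> Dict[str, object]:
    """Apply convert to each (item_id, media_type) pair without halting on errors.

    convert returns text, or None when the item is skipped.
    """
    texts: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    empty_count = 0
    for item_id, media_type in items:
        try:
            text = convert(media_type)
        except Exception as error:  # record the failure and keep going
            errors[item_id] = str(error)
            continue
        if text is None:
            continue  # skipped item (for example, plain text)
        if not text.strip():
            empty_count += 1
        texts[item_id] = text
    return {"texts": texts, "errors": errors, "extracted_empty_items": empty_count}


def fake_convert(media_type: str) -> Optional[str]:
    """Stand-in converter used only for this demonstration."""
    if media_type == "text/plain":
        return None
    if media_type == "application/x-unknown":
        return ""
    if media_type == "application/x-broken":
        raise ValueError("cannot parse")
    return "# Converted"


if __name__ == "__main__":
    demo_items = [
        ("a", "text/plain"),
        ("b", "application/x-unknown"),
        ("c", "application/x-broken"),
        ("d", "application/pdf"),
    ]
    print(run_extraction(demo_items, fake_convert))
```

The point of the design is that one bad file affects only its own entry: the loop records the error and moves on to the next item.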
Use Cases
Office Document Archives
Convert corporate document collections:
biblicus extract corporate-docs --extractor markitdown
Documentation Processing
Handle mixed documentation formats:
biblicus extract documentation --extractor markitdown
Report Extraction
Extract text from formatted reports:
biblicus extract quarterly-reports --extractor markitdown
Knowledge Base Migration
Convert legacy documents to Markdown:
from biblicus import Corpus
corpus = Corpus.from_directory("legacy-kb")
results = corpus.extract_text(extractor_id="markitdown")
When to Use MarkItDown vs Alternatives
Use MarkItDown when:
Processing Office documents (DOCX, XLSX, PPTX)
You want Markdown-formatted output
Python 3.10+ is available
Simple, reliable conversion is needed
Use Unstructured when:
Python 3.9 or earlier is required
Broader format coverage is needed
You need advanced document parsing
Use VLM extractors when:
Documents have complex layouts
Visual understanding is important
Accuracy is more critical than speed
Use PDF-specific extractors when:
Processing only PDFs
Speed is critical
PDFs are text-based (not scanned)
Best Practices
Test Conversion Quality
Always test on representative samples:
biblicus extract test-corpus --extractor markitdown
Monitor Empty Results
Check extraction statistics for unsupported formats:
print(f"Empty items: {results.stats.extracted_empty_items}")
Use in Pipelines
Combine with other extractors for robustness:
extractor_id: pipeline
config:
  stages:
    - extractor_id: markitdown
    - extractor_id: unstructured  # Fallback
    - extractor_id: select-longest-text
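The fallback pipeline above relies on a final selection stage. As a rough illustration of what a "longest text" selection could look like, the sketch below picks the longest non-blank candidate. This is an assumption inferred from the stage name `select-longest-text`, and `select_longest_text` is a hypothetical function rather than the biblicus implementation.

```python
from typing import Iterable, Optional


def select_longest_text(candidates: Iterable[Optional[str]]) -> Optional[str]:
    """Return the longest non-blank candidate, or None if there is none.

    Ties keep the earliest candidate, because max() returns the first maximum.
    """
    usable = [text for text in candidates if text and text.strip()]
    return max(usable, key=len) if usable else None


if __name__ == "__main__":
    stage_outputs = [None, "# Short", "# Longer heading\n\nBody text"]
    print(select_longest_text(stage_outputs))
```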
Check Python Version
Verify Python 3.10+ before deployment:
python --version
See Also
extraction.md - Extraction pipeline concepts