# MarkItDown Extractor

**Extractor ID:** `markitdown`

**Category:** [Text/Document Extractors](index.md)

## Overview

The MarkItDown extractor uses Microsoft's MarkItDown library to convert various document formats into Markdown-formatted text. It provides broad format coverage for Office documents, PDFs, images, and other file types.

MarkItDown is designed to produce clean, readable Markdown output from diverse sources. It automatically skips text items to preserve the role of the pass-through extractor for canonical text handling.

## Installation

MarkItDown is an optional dependency that requires Python 3.10 or higher:

```bash
pip install "biblicus[markitdown]"
```

### Python Version Requirement

- **Minimum**: Python 3.10
- **Recommended**: Python 3.11 or higher

If you're using Python 3.9 or earlier, use alternative extractors like `unstructured`.

## Supported Media Types

MarkItDown supports a wide range of formats:

### Office Documents

- `application/vnd.openxmlformats-officedocument.wordprocessingml.document` - DOCX
- `application/vnd.openxmlformats-officedocument.spreadsheetml.sheet` - XLSX
- `application/vnd.openxmlformats-officedocument.presentationml.presentation` - PPTX
- `application/msword` - DOC (legacy)
- `application/vnd.ms-excel` - XLS (legacy)
- `application/vnd.ms-powerpoint` - PPT (legacy)

### Documents

- `application/pdf` - PDF files
- `text/html` - HTML documents
- `application/xhtml+xml` - XHTML documents

### Images

- `image/png`, `image/jpeg`, `image/gif`
- `image/bmp`, `image/tiff`, `image/webp`

### Audio/Video

- Various audio and video formats (converts metadata)

### Archives

- `application/zip` - ZIP archives (lists contents)

The extractor automatically skips text items (`text/plain`, `text/markdown`) to avoid interfering with the pass-through extractor.

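The skip rule for text items can be illustrated with a minimal sketch. The helper name `is_skipped_text_item` and the constant `SKIPPED_TEXT_MEDIA_TYPES` are hypothetical and not part of the Biblicus API; the sketch assumes that only the two media types named above are treated as text items, which may not match the extractor's actual check.

```python
# Hypothetical illustration only: these names are not part of Biblicus.
# Assumption: exactly the two media types named on this page count as text items.
SKIPPED_TEXT_MEDIA_TYPES = frozenset({"text/plain", "text/markdown"})


def is_skipped_text_item(media_type: str) -> bool:
    """Return True for media types this page lists as skipped text items."""
    # Drop parameters such as "; charset=utf-8" and normalize case before comparing.
    base_type = media_type.split(";", 1)[0].strip().lower()
    return base_type in SKIPPED_TEXT_MEDIA_TYPES


if __name__ == "__main__":
    for media_type in ("text/plain; charset=utf-8", "text/html", "application/pdf"):
        action = "skipped" if is_skipped_text_item(media_type) else "passed to MarkItDown"
        print(f"{media_type}: {action}")
```

Note that `text/html` is not treated as a text item in this sketch, since it appears in the supported formats list rather than in the skip list.
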
## Configuration

### Config Schema

```python
class MarkItDownExtractorConfig(BaseModel):
    enable_plugins: bool = False  # Enable MarkItDown plugin system
```

### Configuration Options

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `enable_plugins` | bool | `false` | Enable MarkItDown's plugin system for extended format support |

## Usage

### Command Line

#### Basic Usage

```bash
# Convert Office documents to Markdown
biblicus extract my-corpus --extractor markitdown
```

#### Custom Configuration

```bash
# Enable plugins for extended format support
biblicus extract my-corpus --extractor markitdown \
  --config enable_plugins=true
```

#### Configuration File

```yaml
extractor_id: markitdown
config:
  enable_plugins: false
```

```bash
biblicus extract my-corpus --configuration configuration.yml
```

### Python API

```python
from biblicus import Corpus

# Load corpus
corpus = Corpus.from_directory("my-corpus")

# Extract with defaults
results = corpus.extract_text(extractor_id="markitdown")

# Extract with plugins enabled
results = corpus.extract_text(
    extractor_id="markitdown",
    config={"enable_plugins": True}
)
```

### In Pipeline

#### Office Document Pipeline

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text  # Handle text files
    - extractor_id: markitdown  # Convert Office docs
    - extractor_id: select-text
```

#### Media Type Routing

```yaml
extractor_id: select-smart-override
config:
  default_extractor: pass-through-text
  overrides:
    - media_type_pattern: "application/vnd.openxmlformats.*"
      extractor: markitdown
    - media_type_pattern: "application/pdf"
      extractor: pdf-text
```

## Examples

### Office Document Collection

Convert DOCX, XLSX, PPTX files to Markdown:

```bash
biblicus extract office-docs --extractor markitdown
```

### Mixed Format Corpus

Handle text, Office, and PDF documents:

```python
from biblicus import Corpus

corpus = Corpus.from_directory("mixed-docs")

results = corpus.extract_text(
    extractor_id="pipeline",
    config={
        "stages": [
            {"extractor_id": "pass-through-text"},
            {"extractor_id": "markitdown"},
            {"extractor_id": "select-text"}
        ]
    }
)
```

### PowerPoint Presentations

Extract text from presentation decks:

```bash
biblicus extract presentations --extractor markitdown
```

### Excel Spreadsheets

Convert spreadsheet data to Markdown tables:

```bash
biblicus extract spreadsheets --extractor markitdown
```

## Output Format

MarkItDown produces Markdown-formatted text that preserves document structure:

### Document Elements

- **Headings**: Converted to Markdown headers (`#`, `##`, etc.)
- **Lists**: Preserved as Markdown lists
- **Tables**: Converted to Markdown tables
- **Links**: Preserved as Markdown links
- **Bold/Italic**: Converted to Markdown emphasis

### Example Output

Input (DOCX):

```
Title: Project Report
Subtitle: Q4 2024

Key Findings:
- Revenue increased 25%
- User growth exceeded targets
```

Output (Markdown):

```markdown
# Project Report

## Q4 2024

Key Findings:
- Revenue increased 25%
- User growth exceeded targets
```

## Performance

- **Speed**: Moderate (1-10 seconds per document)
- **Memory**: Moderate (depends on document size)
- **Format Coverage**: Excellent (Office, PDF, images, archives)

Faster than VLM approaches but slower than simple text extraction.

## Error Handling

### Missing Dependency

If MarkItDown is not installed:

```
ExtractionRunFatalError: MarkItDown extractor requires an optional dependency.
Install it with pip install "biblicus[markitdown]".
```

### Python Version Mismatch

If Python version is below 3.10:

```
ExtractionRunFatalError: MarkItDown requires Python 3.10 or higher.
Upgrade your interpreter or use a compatible extractor.
```

### Text Items

Text items are silently skipped (returns `None`) to preserve pass-through extractor behavior.

### Unsupported Formats

Files that MarkItDown cannot process produce empty extracted text and are counted in `extracted_empty_items`.

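The difference between a skipped text item (`None`) and an unsupported file (empty text) can be made concrete with a small tally. This is a hypothetical sketch, not how Biblicus computes `extracted_empty_items`: the function name `tally_outcomes`, the item-to-text mapping, and the choice to count whitespace-only text as empty are all assumptions made for illustration.

```python
from collections import Counter
from typing import Mapping, Optional


def tally_outcomes(extracted: Mapping[str, Optional[str]]) -> Counter:
    """Count skipped, empty, and non-empty results for a set of items.

    Hypothetical helper: `extracted` maps an item name to its extracted text,
    with None standing in for an item the extractor skipped.
    """
    counts: Counter = Counter()
    for text in extracted.values():
        if text is None:
            counts["skipped"] += 1  # e.g. a text item left to pass-through
        elif not text.strip():
            counts["empty"] += 1  # assumption: whitespace-only counts as empty
        else:
            counts["extracted"] += 1
    return counts


if __name__ == "__main__":
    sample = {"notes.txt": None, "blob.bin": "", "report.docx": "# Project Report"}
    print(dict(tally_outcomes(sample)))
```

A high empty count relative to the number of non-text items is a hint that the corpus contains formats MarkItDown cannot process.
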
### Per-Item Errors

Processing errors for individual items are recorded but don't halt extraction.

## Use Cases

### Office Document Archives

Convert corporate document collections:

```bash
biblicus extract corporate-docs --extractor markitdown
```

### Documentation Processing

Handle mixed documentation formats:

```bash
biblicus extract documentation --extractor markitdown
```

### Report Extraction

Extract text from formatted reports:

```bash
biblicus extract quarterly-reports --extractor markitdown
```

### Knowledge Base Migration

Convert legacy documents to Markdown:

```python
from biblicus import Corpus

corpus = Corpus.from_directory("legacy-kb")
results = corpus.extract_text(extractor_id="markitdown")
```

## When to Use MarkItDown vs Alternatives

### Use MarkItDown when:

- Processing Office documents (DOCX, XLSX, PPTX)
- You want Markdown-formatted output
- Python 3.10+ is available
- Simple, reliable conversion is needed

### Use Unstructured when:

- Python 3.9 or earlier is required
- More format coverage is needed
- You need advanced document parsing

### Use VLM extractors when:

- Documents have complex layouts
- Visual understanding is important
- Accuracy is more critical than speed

### Use PDF-specific extractors when:

- Processing only PDFs
- Speed is critical
- PDFs are text-based (not scanned)

## Best Practices

### Test Conversion Quality

Always test on representative samples:

```bash
biblicus extract test-corpus --extractor markitdown
```

### Monitor Empty Results

Check extraction statistics for unsupported formats:

```python
print(f"Empty items: {results.stats.extracted_empty_items}")
```

### Use in Pipelines

Combine with other extractors for robustness:

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: markitdown
    - extractor_id: unstructured  # Fallback
    - extractor_id: select-longest-text
```

### Check Python Version

Verify Python 3.10+ before deployment:

```bash
python --version
```

## Related Extractors

### Same Category

- [pass-through-text](pass-through.md) - Direct text file reading
- [metadata-text](metadata.md) - Metadata-based text
- [pdf-text](pdf.md) - Fast PDF text extraction
- [unstructured](unstructured.md) - Universal document parser

### Alternatives

- [unstructured](unstructured.md) - More format coverage, Python 3.9 support
- [docling-smol](../vlm-document/docling-smol.md) - VLM for complex documents
- [docling-granite](../vlm-document/docling-granite.md) - High-accuracy VLM

### Pipeline Utilities

- [select-text](../pipeline-utilities/select-text.md) - First non-empty selection
- [select-longest-text](../pipeline-utilities/select-longest.md) - Choose longest output
- [select-smart-override](../pipeline-utilities/select-smart-override.md) - Media type routing
- [pipeline](../pipeline-utilities/pipeline.md) - Multi-step extraction

## See Also

- [Text/Document Extractors Overview](index.md)
- [Extractors Index](../index.md)
- [extraction.md](../../extraction.md) - Extraction pipeline concepts
- [MarkItDown GitHub](https://github.com/microsoft/markitdown)