# SmolDocling-256M Extractor

**Extractor ID:** `docling-smol`

**Category:** [Vision-Language Models (VLM)](index.md)

## Overview

The SmolDocling-256M extractor uses IBM Research's SmolDocling vision-language model for fast, accurate document understanding. It combines visual layout analysis with semantic text extraction to handle complex documents.

SmolDocling-256M is a 256-million parameter VLM optimized for document processing. It achieves 6.15 seconds per page on Apple Silicon with MLX, making it one of the fastest VLM extractors while maintaining excellent accuracy.

## Installation

### Transformers Backend (Cross-Platform)

```bash
pip install "biblicus[docling]"
```

### MLX Backend (Apple Silicon - Recommended)

```bash
pip install "biblicus[docling-mlx]"
```

The MLX backend provides 2-3x faster inference on Apple Silicon (M1/M2/M3/M4) with lower memory usage.

## Supported Media Types

- `application/pdf` - PDF documents (digital and scanned)
- `application/vnd.openxmlformats-officedocument.wordprocessingml.document` - DOCX
- `application/vnd.openxmlformats-officedocument.spreadsheetml.sheet` - XLSX
- `application/vnd.openxmlformats-officedocument.presentationml.presentation` - PPTX
- `text/html` - HTML files
- `application/xhtml+xml` - XHTML files
- `image/png` - PNG images
- `image/jpeg` - JPEG images
- `image/gif` - GIF images
- `image/webp` - WebP images
- `image/tiff` - TIFF images
- `image/bmp` - BMP images

The extractor automatically skips text items (`text/plain`, `text/markdown`) and audio items.
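The media-type rules above can be sketched as a small predicate. This is an illustration only: `SUPPORTED_MEDIA_TYPES`, `SKIPPED_TEXT_TYPES`, and `should_process` are hypothetical names, not part of the Biblicus API, and treating every `audio/*` type as skipped is an assumption drawn from the sentence above.

```python
# Hypothetical sketch: these names are NOT part of the Biblicus API.
# The set mirrors the "Supported Media Types" list in this document.
SUPPORTED_MEDIA_TYPES = frozenset({
    "application/pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "text/html",
    "application/xhtml+xml",
    "image/png",
    "image/jpeg",
    "image/gif",
    "image/webp",
    "image/tiff",
    "image/bmp",
})

# Text items the extractor is documented to skip.
SKIPPED_TEXT_TYPES = frozenset({"text/plain", "text/markdown"})


def should_process(media_type: str) -> bool:
    """Return True if an item with this media type would be handled."""
    # Drop any parameters (e.g. "; charset=utf-8") and normalize case.
    normalized = media_type.split(";", 1)[0].strip().lower()
    # Assumption: "audio items" means any audio/* media type.
    if normalized in SKIPPED_TEXT_TYPES or normalized.startswith("audio/"):
        return False
    return normalized in SUPPORTED_MEDIA_TYPES
```

A check like this can be useful when estimating how many items in a mixed corpus the extractor will actually touch before starting a long run.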

## Configuration

### Config Schema

```python
class DoclingSmolExtractorConfig(BaseModel):
    output_format: str = "markdown"  # markdown, text, or html
    backend: str = "mlx"  # mlx or transformers
```

### Configuration Options

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `output_format` | str | `markdown` | Output format: `markdown`, `text`, or `html` |
| `backend` | str | `mlx` | Inference backend: `mlx` (Apple Silicon) or `transformers` (cross-platform) |

### Output Formats

- **markdown** (default): Preserves document structure with headings, lists, tables, code blocks
- **html**: Produces semantic HTML with proper tagging
- **text**: Simple plain text without formatting

## Usage

### Command Line

#### Basic Usage

```bash
# Extract using SmolDocling with defaults (markdown, MLX)
biblicus extract my-corpus --extractor docling-smol
```

#### Custom Configuration

```bash
# Use Transformers backend with HTML output
biblicus extract my-corpus --extractor docling-smol \
  --config output_format=html \
  --config backend=transformers
```

#### Configuration File

```yaml
extractor_id: docling-smol
config:
  output_format: markdown
  backend: mlx
```

```bash
biblicus extract my-corpus --configuration configuration.yml
```

### Python API

```python
from biblicus import Corpus

# Load corpus
corpus = Corpus.from_directory("my-corpus")

# Extract with defaults
results = corpus.extract_text(extractor_id="docling-smol")

# Extract with custom config
results = corpus.extract_text(
    extractor_id="docling-smol",
    config={
        "output_format": "html",
        "backend": "transformers"
    }
)
```

### In Pipeline

#### Fallback Chain

```yaml
extractor_id: select-text
config:
  extractors:
    - pdf-text  # Try text extraction first
    - docling-smol  # Fall back to VLM
```

#### Media Type Routing

```yaml
extractor_id: select-smart-override
config:
  default_extractor: pdf-text
  overrides:
    - media_type_pattern: "image/.*"
      extractor: docling-smol
```

## Examples

### Academic Papers

Extract
academic papers with equations and code blocks:

```bash
biblicus extract papers-corpus --extractor docling-smol \
  --config output_format=markdown
```

### Office Documents

Process DOCX, XLSX, PPTX files:

```bash
biblicus extract office-corpus --extractor docling-smol \
  --config output_format=html
```

### Scanned Documents

OCR scanned PDFs and images:

```bash
biblicus extract scans-corpus --extractor docling-smol \
  --config backend=mlx
```

### Multi-Format Corpus

Handle mixed document types with automatic routing:

```python
from biblicus import Corpus

corpus = Corpus.from_directory("mixed-corpus")

# Route based on media type
results = corpus.extract_text(
    extractor_id="select-smart-override",
    config={
        "default_extractor": "pass-through-text",
        "overrides": [
            {"media_type_pattern": "application/pdf", "extractor": "docling-smol"},
            {"media_type_pattern": "image/.*", "extractor": "docling-smol"},
            {"media_type_pattern": "application/vnd\\.openxmlformats.*", "extractor": "docling-smol"},
        ]
    }
)
```

## Performance

### Benchmarks

- **Speed**: 6.15 seconds/page (MLX on Apple Silicon M2)
- **Tables F1**: 0.985
- **Code F1**: 0.980
- **Equations F1**: 0.970

### Backend Comparison

| Backend | Platform | Speed | Memory |
|---------|----------|-------|--------|
| MLX | Apple Silicon | 6.15 sec/page | Efficient |
| Transformers | Any (CPU/CUDA) | 15-20 sec/page | Higher |

### When to Use SmolDocling vs Granite

- **SmolDocling-256M**: Faster inference, balanced accuracy, good for large corpus processing
- **[Granite Docling-258M](docling-granite.md)**: Better accuracy (F1: 0.988 code, 0.992 tables), slower

## Error Handling

### Missing Dependency

If the Docling library is not installed:

```
ExtractionRunFatalError: DoclingSmol extractor requires an optional dependency. Install it with pip install "biblicus[docling]".
```

### Missing MLX Support

If MLX backend is configured but not available:

```
ExtractionRunFatalError: DoclingSmol extractor with MLX backend requires MLX support. Install it with pip install "biblicus[docling-mlx]".
```

### Empty Output

Documents that cannot be processed produce empty extracted text and are counted in `extracted_empty_items` statistics.

### Per-Item Errors

Processing errors for individual items are recorded in the extraction snapshot but don't halt the entire extraction. Check `errored_items` in extraction statistics.

## Related Extractors

### Same Category

- [docling-granite](docling-granite.md) - Granite Docling-258M for higher accuracy

### Alternatives

- [ocr-rapidocr](../ocr/rapidocr.md) - Traditional OCR (faster, less accurate)
- [ocr-paddleocr-vl](../ocr/paddleocr-vl.md) - PaddleOCR VL (good for CJK)
- [markitdown](../text-document/markitdown.md) - MarkItDown for Office docs (no VLM)

### Pipeline Utilities

- [select-text](../pipeline-utilities/select-text.md) - Fallback chain
- [select-longest-text](../pipeline-utilities/select-longest.md) - Select best output
- [select-smart-override](../pipeline-utilities/select-smart-override.md) - Media type routing

## See Also

- [VLM Document Understanding Overview](index.md)
- [Extractors Index](../index.md)
- [extraction.md](../../extraction.md) - Extraction pipeline concepts