# Granite Docling-258M Extractor **Extractor ID:** `docling-granite` **Category:** [Vision-Language Models (VLM)](index.md) ## Overview The Granite Docling-258M extractor uses IBM Research's Granite Docling vision-language model for state-of-the-art document understanding. It achieves superior accuracy on technical content including tables, code blocks, and mathematical equations. Granite Docling-258M is a 258-million parameter VLM optimized for high-accuracy document processing. It outperforms SmolDocling on complex document structures with F1 scores of 0.988 for code, 0.992 for tables, and 0.975 for equations. ## Installation ### Transformers Backend (Cross-Platform) ```bash pip install "biblicus[docling]" ``` ### MLX Backend (Apple Silicon - Recommended) ```bash pip install "biblicus[docling-mlx]" ``` The MLX backend provides 2-3x faster inference on Apple Silicon (M1/M2/M3/M4) with lower memory usage. ## Supported Media Types - `application/pdf` - PDF documents (digital and scanned) - `application/vnd.openxmlformats-officedocument.wordprocessingml.document` - DOCX - `application/vnd.openxmlformats-officedocument.spreadsheetml.sheet` - XLSX - `application/vnd.openxmlformats-officedocument.presentationml.presentation` - PPTX - `text/html` - HTML files - `application/xhtml+xml` - XHTML files - `image/png` - PNG images - `image/jpeg` - JPEG images - `image/gif` - GIF images - `image/webp` - WebP images - `image/tiff` - TIFF images - `image/bmp` - BMP images The extractor automatically skips text items (`text/plain`, `text/markdown`) and audio items. ## Configuration ### Config Schema ```python class DoclingGraniteExtractorConfig(BaseModel): output_format: str = "markdown" # markdown, text, or html backend: str = "mlx" # mlx or transformers ``` ### Configuration Options | Option | Type | Default | Description | |--------|------|---------|-------------| | `output_format` | str | `markdown` | Output format: `markdown`, `text`, or `html` | | `backend` | str | `mlx` | Inference backend: `mlx` (Apple Silicon) or `transformers` (cross-platform) | ### Output Formats - **markdown** (default): Preserves document structure with headings, lists, tables, code blocks - **html**: Produces semantic HTML with proper tagging - **text**: Simple plain text without formatting ## Usage ### Command Line #### Basic Usage ```bash # Extract using Granite Docling with defaults (markdown, MLX) biblicus extract my-corpus --extractor docling-granite ``` #### Custom Configuration ```bash # Use Transformers backend with HTML output biblicus extract my-corpus --extractor docling-granite \ --config output_format=html \ --config backend=transformers ``` #### Configuration File ```yaml extractor_id: docling-granite config: output_format: markdown backend: mlx ``` ```bash biblicus extract my-corpus --configuration configuration.yml ``` ### Python API ```python from biblicus import Corpus # Load corpus corpus = Corpus.from_directory("my-corpus") # Extract with defaults results = corpus.extract_text(extractor_id="docling-granite") # Extract with custom config results = corpus.extract_text( extractor_id="docling-granite", config={ "output_format": "html", "backend": "transformers" } ) ``` ### In Pipeline #### Fallback Chain ```yaml extractor_id: select-text config: extractors: - docling-granite # Highest accuracy - docling-smol # Faster fallback - ocr-rapidocr # Traditional OCR ``` #### High-Accuracy Override ```yaml extractor_id: select-smart-override config: default_extractor: docling-smol overrides: - media_type_pattern: "application/pdf" extractor: docling-granite # Use Granite for PDFs ``` ## Examples ### Academic Papers with Equations Extract research papers with mathematical equations: ```bash biblicus extract papers-corpus --extractor docling-granite \ --config output_format=markdown ``` ### Technical Documentation Process documentation with code blocks: ```bash biblicus extract docs-corpus --extractor docling-granite \ --config output_format=html ``` ### Complex Tables Extract spreadsheets and documents with complex tables: ```bash biblicus extract tables-corpus --extractor docling-granite \ --config backend=mlx ``` ### High-Accuracy Pipeline Prioritize accuracy over speed: ```python from biblicus import Corpus corpus = Corpus.from_directory("important-docs") # Use Granite for maximum accuracy results = corpus.extract_text( extractor_id="docling-granite", config={ "output_format": "markdown", "backend": "mlx" } ) ``` ## Performance ### Benchmarks - **Speed**: ~7 seconds/page (MLX on Apple Silicon, estimated) - **Tables F1**: 0.992 ⭐ - **Code F1**: 0.988 ⭐ - **Equations F1**: 0.975 ⭐ ### Comparison with SmolDocling | Metric | Granite-258M | SmolDocling-256M | |--------|--------------|------------------| | Tables F1 | **0.992** | 0.985 | | Code F1 | **0.988** | 0.980 | | Equations F1 | **0.975** | 0.970 | | Speed | ~7 sec/page | 6.15 sec/page | ### When to Use Granite vs SmolDocling **Use Granite Docling-258M when:** - Accuracy is critical - Processing technical documents (code, equations) - Complex table extraction is needed - Document quality is worth the extra processing time **Use [SmolDocling-256M](docling-smol.md) when:** - Speed is more important than accuracy - Processing large corpus volumes - Documents are relatively simple - Resource constraints exist ## Error Handling ### Missing Dependency If the Docling library is not installed: ``` ExtractionRunFatalError: DoclingGranite extractor requires an optional dependency. Install it with pip install "biblicus[docling]". ``` ### Missing MLX Support If MLX backend is configured but not available: ``` ExtractionRunFatalError: DoclingGranite extractor with MLX backend requires MLX support. Install it with pip install "biblicus[docling-mlx]". ``` ### Empty Output Documents that cannot be processed produce empty extracted text and are counted in `extracted_empty_items` statistics. ### Per-Item Errors Processing errors for individual items are recorded in the extraction snapshot but don't halt the entire extraction. Check `errored_items` in extraction statistics. ## Use Cases ### Research Papers Granite excels at academic papers with: - LaTeX-style equations - Complex bibliography formatting - Multi-column layouts - Figures and captions ### Source Code Documentation Ideal for technical documentation with: - Syntax-highlighted code blocks - API reference tables - Inline code snippets - Function signatures ### Financial Reports Handles business documents with: - Complex financial tables - Merged cells and hierarchies - Mixed text and numeric data - Charts and graphs ### Legal Documents Processes legal content with: - Multi-level numbering - Nested clauses - Citation formatting - Footnotes and references ## Related Extractors ### Same Category - [docling-smol](docling-smol.md) - SmolDocling-256M for faster processing ### Alternatives - [ocr-rapidocr](../ocr/rapidocr.md) - Traditional OCR (faster, less accurate) - [ocr-paddleocr-vl](../ocr/paddleocr-vl.md) - PaddleOCR VL (good for CJK) - [markitdown](../text-document/markitdown.md) - MarkItDown for Office docs (no VLM) ### Pipeline Utilities - [select-text](../pipeline-utilities/select-text.md) - Fallback chain - [select-longest-text](../pipeline-utilities/select-longest.md) - Select best output - [select-smart-override](../pipeline-utilities/select-smart-override.md) - Media type routing ## See Also - [VLM Document Understanding Overview](index.md) - [Extractors Index](../index.md) - [extraction.md](../../extraction.md) - Extraction pipeline concepts