Granite Docling-258M Extractor
Extractor ID: docling-granite
Category: Vision-Language Models (VLM)
Overview
The Granite Docling-258M extractor uses IBM Research’s Granite Docling vision-language model for state-of-the-art document understanding. It achieves superior accuracy on technical content including tables, code blocks, and mathematical equations.
Granite Docling-258M is a 258-million parameter VLM optimized for high-accuracy document processing. It outperforms SmolDocling on complex document structures with F1 scores of 0.988 for code, 0.992 for tables, and 0.975 for equations.
Installation
Transformers Backend (Cross-Platform)
pip install "biblicus[docling]"
MLX Backend (Apple Silicon - Recommended)
pip install "biblicus[docling-mlx]"
The MLX backend provides 2-3x faster inference on Apple Silicon (M1/M2/M3/M4) with lower memory usage.
Supported Media Types
application/pdf- PDF documents (digital and scanned)application/vnd.openxmlformats-officedocument.wordprocessingml.document- DOCXapplication/vnd.openxmlformats-officedocument.spreadsheetml.sheet- XLSXapplication/vnd.openxmlformats-officedocument.presentationml.presentation- PPTXtext/html- HTML filesapplication/xhtml+xml- XHTML filesimage/png- PNG imagesimage/jpeg- JPEG imagesimage/gif- GIF imagesimage/webp- WebP imagesimage/tiff- TIFF imagesimage/bmp- BMP images
The extractor automatically skips text items (text/plain, text/markdown) and audio items.
Configuration
Config Schema
class DoclingGraniteExtractorConfig(BaseModel):
output_format: str = "markdown" # markdown, text, or html
backend: str = "mlx" # mlx or transformers
Configuration Options
Option |
Type |
Default |
Description |
|---|---|---|---|
|
str |
|
Output format: |
|
str |
|
Inference backend: |
Output Formats
markdown (default): Preserves document structure with headings, lists, tables, code blocks
html: Produces semantic HTML with proper tagging
text: Simple plain text without formatting
Usage
Command Line
Basic Usage
# Extract using Granite Docling with defaults (markdown, MLX)
biblicus extract my-corpus --extractor docling-granite
Custom Configuration
# Use Transformers backend with HTML output
biblicus extract my-corpus --extractor docling-granite \
--config output_format=html \
--config backend=transformers
Configuration File
extractor_id: docling-granite
config:
output_format: markdown
backend: mlx
biblicus extract my-corpus --configuration configuration.yml
Python API
from biblicus import Corpus
# Load corpus
corpus = Corpus.from_directory("my-corpus")
# Extract with defaults
results = corpus.extract_text(extractor_id="docling-granite")
# Extract with custom config
results = corpus.extract_text(
extractor_id="docling-granite",
config={
"output_format": "html",
"backend": "transformers"
}
)
In Pipeline
Fallback Chain
extractor_id: select-text
config:
extractors:
- docling-granite # Highest accuracy
- docling-smol # Faster fallback
- ocr-rapidocr # Traditional OCR
High-Accuracy Override
extractor_id: select-smart-override
config:
default_extractor: docling-smol
overrides:
- media_type_pattern: "application/pdf"
extractor: docling-granite # Use Granite for PDFs
Examples
Academic Papers with Equations
Extract research papers with mathematical equations:
biblicus extract papers-corpus --extractor docling-granite \
--config output_format=markdown
Technical Documentation
Process documentation with code blocks:
biblicus extract docs-corpus --extractor docling-granite \
--config output_format=html
Complex Tables
Extract spreadsheets and documents with complex tables:
biblicus extract tables-corpus --extractor docling-granite \
--config backend=mlx
High-Accuracy Pipeline
Prioritize accuracy over speed:
from biblicus import Corpus
corpus = Corpus.from_directory("important-docs")
# Use Granite for maximum accuracy
results = corpus.extract_text(
extractor_id="docling-granite",
config={
"output_format": "markdown",
"backend": "mlx"
}
)
Performance
Benchmarks
Speed: ~7 seconds/page (MLX on Apple Silicon, estimated)
Tables F1: 0.992 ⭐
Code F1: 0.988 ⭐
Equations F1: 0.975 ⭐
Comparison with SmolDocling
Metric |
Granite-258M |
SmolDocling-256M |
|---|---|---|
Tables F1 |
0.992 |
0.985 |
Code F1 |
0.988 |
0.980 |
Equations F1 |
0.975 |
0.970 |
Speed |
~7 sec/page |
6.15 sec/page |
When to Use Granite vs SmolDocling
Use Granite Docling-258M when:
Accuracy is critical
Processing technical documents (code, equations)
Complex table extraction is needed
Document quality is worth the extra processing time
Use SmolDocling-256M when:
Speed is more important than accuracy
Processing large corpus volumes
Documents are relatively simple
Resource constraints exist
Error Handling
Missing Dependency
If the Docling library is not installed:
ExtractionRunFatalError: DoclingGranite extractor requires an optional dependency.
Install it with pip install "biblicus[docling]".
Missing MLX Support
If MLX backend is configured but not available:
ExtractionRunFatalError: DoclingGranite extractor with MLX backend requires MLX support.
Install it with pip install "biblicus[docling-mlx]".
Empty Output
Documents that cannot be processed produce empty extracted text and are counted in extracted_empty_items statistics.
Per-Item Errors
Processing errors for individual items are recorded in the extraction snapshot but don’t halt the entire extraction. Check errored_items in extraction statistics.
Use Cases
Research Papers
Granite excels at academic papers with:
LaTeX-style equations
Complex bibliography formatting
Multi-column layouts
Figures and captions
Source Code Documentation
Ideal for technical documentation with:
Syntax-highlighted code blocks
API reference tables
Inline code snippets
Function signatures
Financial Reports
Handles business documents with:
Complex financial tables
Merged cells and hierarchies
Mixed text and numeric data
Charts and graphs
Legal Documents
Processes legal content with:
Multi-level numbering
Nested clauses
Citation formatting
Footnotes and references
See Also
extraction.md - Extraction pipeline concepts