SmolDocling-256M Extractor
Extractor ID: docling-smol
Category: Vision-Language Models (VLM)
Overview
The SmolDocling-256M extractor uses IBM Research’s SmolDocling vision-language model for fast, accurate document understanding. It combines visual layout analysis with semantic text extraction to handle complex documents.
SmolDocling-256M is a 256-million-parameter VLM optimized for document processing. With the MLX backend on Apple Silicon (M2) it processes a page in about 6.15 seconds, which makes it one of the fastest VLM extractors, while keeping F1 scores of 0.97 or higher on tables, code, and equations (see Performance).
Installation
Transformers Backend (Cross-Platform)
pip install "biblicus[docling]"
MLX Backend (Apple Silicon - Recommended)
pip install "biblicus[docling-mlx]"
The MLX backend provides 2-3x faster inference on Apple Silicon (M1/M2/M3/M4) with lower memory usage.
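A minimal sketch of choosing a backend from the host platform, assuming you want MLX only on Apple Silicon and Transformers everywhere else. The helper name pick_backend is hypothetical and not part of biblicus; it only returns a string you could pass as the backend option.

```python
import platform
from typing import Optional


def pick_backend(system: Optional[str] = None, machine: Optional[str] = None) -> str:
    """Return "mlx" on Apple Silicon, otherwise "transformers".

    Hypothetical helper: biblicus itself does not provide this function.
    """
    system = system or platform.system()
    machine = machine or platform.machine()
    # Apple Silicon Macs report system "Darwin" and machine "arm64".
    if system == "Darwin" and machine == "arm64":
        return "mlx"
    return "transformers"
```

The returned value can then be used as the backend setting, for example in the config dictionary shown under Python API.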
Supported Media Types
application/pdf - PDF documents (digital and scanned)
application/vnd.openxmlformats-officedocument.wordprocessingml.document - DOCX
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet - XLSX
application/vnd.openxmlformats-officedocument.presentationml.presentation - PPTX
text/html - HTML files
application/xhtml+xml - XHTML files
image/png - PNG images
image/jpeg - JPEG images
image/gif - GIF images
image/webp - WebP images
image/tiff - TIFF images
image/bmp - BMP images
The extractor automatically skips text items (text/plain, text/markdown) and audio items.
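The rules above can be sketched as a small pre-check that you might run before extraction. This is an illustration built from the list in this section, not the extractor's own code; the function name classify_media_type and the "unsupported" label are assumptions.

```python
SUPPORTED_MEDIA_TYPES = frozenset({
    "application/pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "text/html",
    "application/xhtml+xml",
    "image/png",
    "image/jpeg",
    "image/gif",
    "image/webp",
    "image/tiff",
    "image/bmp",
})

SKIPPED_TEXT_TYPES = frozenset({"text/plain", "text/markdown"})


def classify_media_type(media_type: str) -> str:
    """Return "extract", "skip", or "unsupported" for a media type string.

    Hypothetical helper; "unsupported" is a label chosen for this sketch.
    """
    # Drop parameters such as "; charset=utf-8" and normalize case.
    base = media_type.split(";", 1)[0].strip().lower()
    if base in SKIPPED_TEXT_TYPES or base.startswith("audio/"):
        return "skip"
    if base in SUPPORTED_MEDIA_TYPES:
        return "extract"
    return "unsupported"
```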
Configuration
Config Schema
class DoclingSmolExtractorConfig(BaseModel):
    output_format: str = "markdown"  # markdown, text, or html
    backend: str = "mlx"  # mlx or transformers
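The schema declares both fields as plain strings, so it does not say whether unknown values are rejected. If you want to catch typos before starting a run, a standalone check along these lines is one option. It is a sketch that uses only the standard library; SmolConfigSketch is a hypothetical stand-in, not the real DoclingSmolExtractorConfig.

```python
from dataclasses import dataclass

ALLOWED_OUTPUT_FORMATS = ("markdown", "text", "html")
ALLOWED_BACKENDS = ("mlx", "transformers")


@dataclass(frozen=True)
class SmolConfigSketch:
    """Hypothetical mirror of the documented options and defaults."""

    output_format: str = "markdown"
    backend: str = "mlx"

    def __post_init__(self) -> None:
        # Reject values outside the sets named in the schema comments.
        if self.output_format not in ALLOWED_OUTPUT_FORMATS:
            raise ValueError(f"unknown output_format: {self.output_format!r}")
        if self.backend not in ALLOWED_BACKENDS:
            raise ValueError(f"unknown backend: {self.backend!r}")
```

Constructing SmolConfigSketch(backend="cuda") raises ValueError, while SmolConfigSketch() yields the documented defaults.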
Configuration Options
| Option | Type | Default | Description |
|---|---|---|---|
| output_format | str | markdown | Output format: markdown, text, or html |
| backend | str | mlx | Inference backend: mlx or transformers |
Output Formats
markdown (default): Preserves document structure with headings, lists, tables, code blocks
html: Produces semantic HTML with proper tagging
text: Simple plain text without formatting
Usage
Command Line
Basic Usage
# Extract using SmolDocling with defaults (markdown, MLX)
biblicus extract my-corpus --extractor docling-smol
Custom Configuration
# Use Transformers backend with HTML output
biblicus extract my-corpus --extractor docling-smol \
  --config output_format=html \
  --config backend=transformers
Configuration File
extractor_id: docling-smol
config:
  output_format: markdown
  backend: mlx
biblicus extract my-corpus --configuration configuration.yml
Python API
from biblicus import Corpus
# Load corpus
corpus = Corpus.from_directory("my-corpus")
# Extract with defaults
results = corpus.extract_text(extractor_id="docling-smol")
# Extract with custom config
results = corpus.extract_text(
    extractor_id="docling-smol",
    config={
        "output_format": "html",
        "backend": "transformers"
    }
)
In Pipeline
Fallback Chain
extractor_id: select-text
config:
  extractors:
    - pdf-text      # Try text extraction first
    - docling-smol  # Fall back to VLM
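The fallback idea in this configuration can be sketched in plain Python. This assumes the chain takes the first extractor that returns non-empty text; the actual selection rule used by select-text is not stated here, and the stub extractors below are hypothetical stand-ins.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical extractor signature for this sketch: raw bytes in, text out.
Extractor = Callable[[bytes], str]


def first_usable_text(
    data: bytes, extractors: Sequence[Tuple[str, Extractor]]
) -> Tuple[Optional[str], str]:
    """Run extractors in order and return (name, text) for the first non-empty result."""
    for name, extract in extractors:
        text = extract(data)
        if text.strip():
            return name, text
    # Every extractor produced empty text.
    return None, ""


def stub_pdf_text(data: bytes) -> str:
    # Stand-in for a text-layer extractor that finds nothing in a scanned PDF.
    return ""


def stub_vlm(data: bytes) -> str:
    # Stand-in for a VLM extractor that recovers text from the page image.
    return "# Scanned title"


chain = [("pdf-text", stub_pdf_text), ("docling-smol", stub_vlm)]
winner, text = first_usable_text(b"%PDF-", chain)
```

With these stubs, winner is "docling-smol" because the first extractor returned an empty string.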
Media Type Routing
extractor_id: select-smart-override
config:
  default_extractor: pdf-text
  overrides:
    - media_type_pattern: "image/.*"
      extractor: docling-smol
Examples
Academic Papers
Extract academic papers with equations and code blocks:
biblicus extract papers-corpus --extractor docling-smol \
  --config output_format=markdown
Office Documents
Process DOCX, XLSX, PPTX files:
biblicus extract office-corpus --extractor docling-smol \
  --config output_format=html
Scanned Documents
OCR scanned PDFs and images:
biblicus extract scans-corpus --extractor docling-smol \
  --config backend=mlx
Multi-Format Corpus
Handle mixed document types with automatic routing:
from biblicus import Corpus
corpus = Corpus.from_directory("mixed-corpus")
# Route based on media type
results = corpus.extract_text(
    extractor_id="select-smart-override",
    config={
        "default_extractor": "pass-through-text",
        "overrides": [
            {"media_type_pattern": "application/pdf", "extractor": "docling-smol"},
            {"media_type_pattern": "image/.*", "extractor": "docling-smol"},
            {"media_type_pattern": "application/vnd\\.openxmlformats.*", "extractor": "docling-smol"},
        ]
    }
)
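To see which extractor a routing configuration like the one above would pick for a given media type, a standalone sketch can help. It assumes that patterns are regular expressions matched against the whole media type and that the first matching override wins; neither rule is confirmed by this page, and resolve_extractor is a hypothetical helper.

```python
import re
from typing import Any, Dict

ROUTING_CONFIG: Dict[str, Any] = {
    "default_extractor": "pass-through-text",
    "overrides": [
        {"media_type_pattern": "application/pdf", "extractor": "docling-smol"},
        {"media_type_pattern": "image/.*", "extractor": "docling-smol"},
        {"media_type_pattern": "application/vnd\\.openxmlformats.*", "extractor": "docling-smol"},
    ],
}


def resolve_extractor(media_type: str, config: Dict[str, Any]) -> str:
    """Return the extractor for media_type (assumed rule: first full match wins)."""
    for rule in config.get("overrides", []):
        if re.fullmatch(rule["media_type_pattern"], media_type):
            return rule["extractor"]
    # No override matched, so fall through to the default.
    return config["default_extractor"]
```

Under these assumptions, image/png and the DOCX media type resolve to docling-smol, while text/plain falls through to pass-through-text.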
Performance
Benchmarks
Speed: 6.15 seconds/page (MLX on Apple Silicon M2)
Tables F1: 0.985
Code F1: 0.980
Equations F1: 0.970
Backend Comparison
| Backend | Platform | Speed | Memory |
|---|---|---|---|
| MLX | Apple Silicon | 6.15 sec/page | Efficient |
| Transformers | Any (CPU/CUDA) | 15-20 sec/page | Higher |
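For capacity planning, the per-page figures translate into total run time with simple arithmetic. The sketch below assumes pages are processed one after another at a constant rate; the 1000-page corpus is a made-up example size.

```python
def estimate_hours(pages: int, seconds_per_page: float) -> float:
    """Sequential wall-clock estimate in hours for a given page count."""
    return pages * seconds_per_page / 3600.0


PAGES = 1000  # hypothetical corpus size

mlx_hours = estimate_hours(PAGES, 6.15)
transformers_low = estimate_hours(PAGES, 15.0)
transformers_high = estimate_hours(PAGES, 20.0)

print(f"MLX: {mlx_hours:.2f} h")
print(f"Transformers: {transformers_low:.2f}-{transformers_high:.2f} h")
print(f"Speedup: {15.0 / 6.15:.1f}x-{20.0 / 6.15:.1f}x")
```

For 1000 pages this gives roughly 1.7 hours with MLX versus about 4.2 to 5.6 hours with Transformers, a speedup of about 2.4x to 3.3x.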
When to Use SmolDocling vs Granite
SmolDocling-256M: Faster inference with slightly lower accuracy (F1: 0.980 code, 0.985 tables); a good fit for processing large corpora
Granite Docling-258M: Slightly higher accuracy (F1: 0.988 code, 0.992 tables), but slower inference
Error Handling
Missing Dependency
If the Docling library is not installed:
ExtractionRunFatalError: DoclingSmol extractor requires an optional dependency.
Install it with pip install "biblicus[docling]".
Missing MLX Support
If the MLX backend is configured but MLX is not available:
ExtractionRunFatalError: DoclingSmol extractor with MLX backend requires MLX support.
Install it with pip install "biblicus[docling-mlx]".
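Both fatal errors can be anticipated with a preflight check before a long run. The sketch below uses importlib.util.find_spec from the standard library; the module names "docling" and "mlx" are assumptions about what the extras install, and missing_extra is a hypothetical helper rather than anything biblicus exposes.

```python
from importlib.util import find_spec
from typing import Callable, Optional


def module_available(name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return find_spec(name) is not None


def missing_extra(
    backend: str, is_available: Callable[[str], bool] = module_available
) -> Optional[str]:
    """Return the pip extra to install, or None if nothing appears to be missing.

    Module names are assumptions for this sketch, not confirmed by biblicus.
    """
    if not is_available("docling"):
        return "biblicus[docling]"
    if backend == "mlx" and not is_available("mlx"):
        return "biblicus[docling-mlx]"
    return None
```

Calling missing_extra("mlx") on a machine without MLX would return "biblicus[docling-mlx]" under these assumptions, which matches the install hint in the error message.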
Empty Output
Documents that cannot be processed produce empty extracted text and are counted in extracted_empty_items statistics.
Per-Item Errors
Processing errors for individual items are recorded in the extraction snapshot but don’t halt the entire extraction. Check errored_items in extraction statistics.
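A short post-run check can surface both conditions. This sketch assumes the extraction statistics are available as a dictionary with integer counts under extracted_empty_items and errored_items; the real shape of the statistics object may differ, and run_warnings is a hypothetical helper.

```python
from typing import Dict, List


def run_warnings(stats: Dict[str, int]) -> List[str]:
    """Build human-readable warnings from assumed integer counters."""
    warnings: List[str] = []
    empty = stats.get("extracted_empty_items", 0)
    errored = stats.get("errored_items", 0)
    if empty:
        warnings.append(f"{empty} item(s) produced empty text")
    if errored:
        warnings.append(f"{errored} item(s) failed; see the extraction snapshot")
    return warnings
```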
See Also
extraction.md - Extraction pipeline concepts