Vision-Language Models (VLM) for Document Understanding
Advanced VLM-based document understanding with layout analysis, semantic structure preservation, and intelligent content extraction.
VLM Extractors
Overview
Vision-Language Model extractors use multimodal AI to understand documents holistically, combining visual layout analysis with semantic text extraction. They excel at:
Complex document layouts (multi-column, mixed content)
Mathematical equations and chemical formulas
Code blocks and syntax preservation
Tables with complex structure
Diagrams and figure captions
Academic papers and technical documentation
Available Extractors
docling-smol
SmolDocling-256M provides fast, efficient document understanding:
Model: SmolDocling-256M (256M parameters)
Speed: 6.15 seconds/page (MLX on Apple Silicon)
Formats: PDF, DOCX, XLSX, PPTX, HTML, images
Backends: MLX (Apple Silicon) or Transformers (cross-platform)
Output: Markdown, HTML, or plain text
Installation:
Transformers:
pip install "biblicus[docling]"
MLX (Apple Silicon):
pip install "biblicus[docling-mlx]"
Best for: General document processing, fast inference, balanced accuracy
docling-granite
Granite Docling-258M provides the higher accuracy of the two extractors:
Model: Granite Docling-258M (258M parameters)
Accuracy: Superior F1 scores (0.988 code, 0.992 tables, 0.975 equations)
Formats: PDF, DOCX, XLSX, PPTX, HTML, images
Backends: MLX (Apple Silicon) or Transformers (cross-platform)
Output: Markdown, HTML, or plain text
Installation:
Transformers:
pip install "biblicus[docling]"
MLX (Apple Silicon):
pip install "biblicus[docling-mlx]"
Best for: High-accuracy extraction, tables, equations, code blocks
VLM vs Traditional OCR
Use VLM When:
Complex layouts: Multi-column papers, mixed content
Structured data: Tables with merged cells, complex hierarchy
Technical content: Equations, code blocks, chemical formulas
Semantic understanding: Need layout-aware markdown output
Quality priority: Accuracy more important than speed
Use Traditional OCR When:
Simple text recognition: Plain scanned documents
CPU-only constraints: No GPU/MLX acceleration available
Speed priority: Need fastest possible processing
Lightweight deployment: Minimal dependencies required
See OCR Extractors for traditional OCR options.
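The two checklists above can be condensed into a small decision helper. This is only a sketch: the function `choose_extractor_family` and its flags are hypothetical names, not part of biblicus, and the ordering of the checks is one possible reading of the criteria listed in this section.

```python
def choose_extractor_family(
    complex_layout: bool = False,
    technical_content: bool = False,
    needs_structured_output: bool = False,
    has_accelerator: bool = True,
    speed_critical: bool = False,
) -> str:
    """Return "vlm" or "ocr" based on the criteria listed in this section.

    Hypothetical helper for illustration; not a biblicus API.
    """
    # No GPU/MLX acceleration, or speed as the top priority: prefer traditional OCR.
    if not has_accelerator or speed_critical:
        return "ocr"
    # Complex layouts, technical content, or layout-aware output favor a VLM.
    if complex_layout or technical_content or needs_structured_output:
        return "vlm"
    # Plain scanned text: traditional OCR is sufficient.
    return "ocr"


print(choose_extractor_family(complex_layout=True))   # vlm
print(choose_extractor_family(speed_critical=True))   # ocr
```

Putting the hardware and speed checks first reflects the assumption that those constraints are hard limits, while layout and content type are preferences.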
Choosing a VLM Extractor
| Use Case | Recommended | Notes |
|---|---|---|
| General documents | docling-smol | Fast, balanced accuracy |
| Academic papers | docling-granite | Better equation/code recognition |
| Business documents | docling-granite | Good table extraction |
| Technical documentation | docling-granite | Superior code block handling |
| Large corpus processing | docling-smol | Faster inference |
| Maximum accuracy | docling-granite | Best overall metrics |
Performance Comparison
SmolDocling-256M
Parameters: 256M
Speed: 6.15 sec/page (MLX)
Tables F1: 0.985
Code F1: 0.980
Equations F1: 0.970
Granite Docling-258M
Parameters: 258M
Speed: ~7 sec/page (MLX, estimated)
Tables F1: 0.992
Code F1: 0.988
Equations F1: 0.975
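To see what the per-page speeds mean for a whole corpus, the figures above can be turned into a rough time estimate. This sketch assumes strictly sequential processing on the MLX backend and uses only the two speeds quoted in this section (the Granite figure is itself an estimate); the function name `estimate_hours` is illustrative.

```python
# Seconds per page on MLX, taken from the comparison above.
SECONDS_PER_PAGE = {
    "docling-smol": 6.15,    # reported
    "docling-granite": 7.0,  # estimated in the source
}


def estimate_hours(extractor_id: str, pages: int) -> float:
    """Estimate sequential processing time in hours for `pages` pages."""
    return SECONDS_PER_PAGE[extractor_id] * pages / 3600


for extractor_id in SECONDS_PER_PAGE:
    hours = estimate_hours(extractor_id, 10_000)
    print(f"{extractor_id}: {hours:.1f} h for 10,000 pages")
```

For a 10,000-page corpus this works out to roughly 17.1 hours for SmolDocling and 19.4 hours for Granite, so the accuracy gain costs a little over two hours at that scale.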
Backend Options
MLX Backend (Apple Silicon)
Platform: macOS with Apple Silicon (M1/M2/M3/M4)
Performance: 2-3x faster than Transformers
Memory: Efficient unified memory usage
Installation:
pip install "biblicus[docling-mlx]"
Transformers Backend (Cross-Platform)
Platform: Any platform (CPU, CUDA, ROCm)
Performance: Slower but widely compatible
Memory: Standard PyTorch memory requirements
Installation:
pip install "biblicus[docling]"
Output Formats
All VLM extractors support multiple output formats:
Markdown (Default)
Preserves document structure with headings, lists, tables, code blocks:
extractor_id: docling-smol
config:
  output_format: markdown  # default
HTML
Produces semantic HTML with proper tagging:
extractor_id: docling-smol
config:
  output_format: html
Plain Text
Simple text output without formatting:
extractor_id: docling-smol
config:
  output_format: text
Common Patterns
Fallback to OCR
Try VLM first, fall back to traditional OCR:
extractor_id: select-text
config:
  extractors:
    - docling-granite
    - ocr-rapidocr
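The fallback idea can be illustrated in plain Python. This is a conceptual sketch, not the biblicus implementation: it assumes the selector tries extractors in order and keeps the first non-empty result, and the names `first_usable_text`, `fake_vlm`, and `fake_ocr` are hypothetical stand-ins.

```python
from typing import Callable, Iterable, Optional

# An extractor is modeled as a callable from raw bytes to optional text.
Extractor = Callable[[bytes], Optional[str]]


def first_usable_text(
    data: bytes, extractors: Iterable[tuple[str, Extractor]]
) -> tuple[str, str]:
    """Run extractors in order; return (name, text) for the first non-empty result."""
    for name, extract in extractors:
        try:
            text = extract(data)
        except Exception:
            # Treat a failing extractor like an empty result and try the next one.
            continue
        if text and text.strip():
            return name, text
    raise RuntimeError("no extractor produced usable text")


# Stand-in extractors for demonstration only.
def fake_vlm(data: bytes) -> Optional[str]:
    return None  # simulate the VLM producing nothing


def fake_ocr(data: bytes) -> Optional[str]:
    return "recognized text"


winner, text = first_usable_text(
    b"raw document bytes",
    [("docling-granite", fake_vlm), ("ocr-rapidocr", fake_ocr)],
)
print(winner)  # ocr-rapidocr
```

Listing the VLM first means the cheaper OCR path runs only when the higher-quality extractor yields nothing.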
Speed vs Accuracy Trade-off
Use SmolDocling for speed, Granite for accuracy:
extractor_id: select-smart-override
config:
  default_extractor: docling-smol
  overrides:
    - media_type_pattern: "application/pdf"
      extractor: docling-granite  # Higher accuracy for important PDFs
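The override rule reads as "use the default unless the media type matches a pattern". A minimal sketch of that lookup follows; it assumes glob-style matching and first-match-wins ordering, neither of which is confirmed by this page, and `pick_extractor` is a hypothetical name rather than a biblicus function.

```python
from fnmatch import fnmatch


def pick_extractor(media_type: str, default_extractor: str, overrides: list) -> str:
    """Return the extractor of the first matching override, else the default."""
    for rule in overrides:
        # Assumption: patterns are glob-style, so "image/*" would match "image/png".
        if fnmatch(media_type, rule["media_type_pattern"]):
            return rule["extractor"]
    return default_extractor


config = {
    "default_extractor": "docling-smol",
    "overrides": [
        {"media_type_pattern": "application/pdf", "extractor": "docling-granite"},
    ],
}

print(pick_extractor("application/pdf", **config))  # docling-granite
print(pick_extractor("image/png", **config))        # docling-smol
```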
Backend Selection
Choose backend based on platform:
# MLX for Apple Silicon (fast)
extractor_id: docling-smol
config:
  backend: mlx

# Transformers for other platforms (compatible)
extractor_id: docling-smol
config:
  backend: transformers
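When the same configuration has to work on several machines, the backend value can be derived from the host platform instead of being hard-coded. The sketch below uses only the standard library; the helper names `default_backend` and `extractor_config` are hypothetical, and the returned dictionary simply mirrors the YAML shape shown above.

```python
import platform


def default_backend() -> str:
    """Return "mlx" on macOS with Apple Silicon, otherwise "transformers"."""
    is_apple_silicon = (
        platform.system() == "Darwin" and platform.machine() == "arm64"
    )
    return "mlx" if is_apple_silicon else "transformers"


def extractor_config(extractor_id: str) -> dict:
    """Build a config mapping with the backend chosen for this machine."""
    return {
        "extractor_id": extractor_id,
        "config": {"backend": default_backend()},
    }


print(extractor_config("docling-smol"))
```

Note that a Python interpreter running under Rosetta reports `x86_64`, so this check would fall back to `transformers` there.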
Installation Guide
Apple Silicon (Recommended)
# Install with MLX backend for best performance
pip install "biblicus[docling-mlx]"
Other Platforms
# Install with Transformers backend
pip install "biblicus[docling]"
Both Extractors
# Install all Docling dependencies
pip install "biblicus[docling-mlx]"  # Includes base docling extras
Supported Document Types
PDF: Scanned and digital PDFs
DOCX: Microsoft Word documents
XLSX: Excel spreadsheets
PPTX: PowerPoint presentations
HTML: Web pages
Images: PNG, JPEG, SVG, and other image formats
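Before sending a directory of files to an extractor, it can help to pre-filter by file type. The following sketch checks file suffixes against the formats named in the list above; the suffix set and the function `is_supported` are illustrative assumptions, and the "other image formats" mentioned above are deliberately not enumerated.

```python
from pathlib import Path

# Suffixes for the formats named above; other image formats are not listed here.
SUPPORTED_SUFFIXES = {
    ".pdf", ".docx", ".xlsx", ".pptx", ".html",
    ".png", ".jpg", ".jpeg", ".svg",
}


def is_supported(path: str) -> bool:
    """Return True if the file suffix is one of the formats named above."""
    return Path(path).suffix.lower() in SUPPORTED_SUFFIXES


candidates = ["report.PDF", "slides.pptx", "notes.txt", "scan.jpeg"]
print([name for name in candidates if is_supported(name)])
```

A suffix check is only a cheap first pass; it says nothing about whether the file content is actually valid for that format.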
See Also
docling-smol - SmolDocling-256M extractor details
docling-granite - Granite Docling-258M extractor details
OCR Extractors - Traditional OCR alternatives
Pipeline Utilities - Combining extraction strategies