Vision-Language Models (VLM) for Document Understanding

Advanced VLM-based document understanding with layout analysis, semantic structure preservation, and intelligent content extraction.

VLM Extractors

Overview

Vision-Language Model extractors use multimodal AI to understand documents holistically, combining visual layout analysis with semantic text extraction. They excel at:

Complex document layouts (multi-column, mixed content)
Mathematical equations and chemical formulas
Code blocks and syntax preservation
Tables with complex structure
Diagrams and figure captions
Academic papers and technical documentation

Available Extractors

docling-smol 

SmolDocling-256M provides fast, efficient document understanding:

Model: SmolDocling-256M (256M parameters)
Speed: 6.15 seconds/page (MLX on Apple Silicon)
Formats: PDF, DOCX, XLSX, PPTX, HTML, images
Backends: MLX (Apple Silicon) or Transformers (cross-platform)
Output: Markdown, HTML, or plain text

Installation:

Transformers: pip install biblicus[docling]
MLX (Apple Silicon): pip install biblicus[docling-mlx]

Best for: General document processing, fast inference, balanced accuracy

docling-granite 

Granite Docling-258M provides state-of-the-art accuracy:

Model: Granite Docling-258M (258M parameters)
Accuracy: Superior F1 scores (0.988 code, 0.992 tables, 0.975 equations)
Formats: PDF, DOCX, XLSX, PPTX, HTML, images
Backends: MLX (Apple Silicon) or Transformers (cross-platform)
Output: Markdown, HTML, or plain text

Installation:

Transformers: pip install biblicus[docling]
MLX (Apple Silicon): pip install biblicus[docling-mlx]

Best for: High-accuracy extraction, tables, equations, code blocks

VLM vs Traditional OCR

Use VLM When:

Complex layouts: Multi-column papers, mixed content
Structured data: Tables with merged cells, complex hierarchy
Technical content: Equations, code blocks, chemical formulas
Semantic understanding: Need layout-aware markdown output
Quality priority: Accuracy more important than speed

Use Traditional OCR When:

Simple text recognition: Plain scanned documents
CPU-only constraints: No GPU/MLX acceleration available
Speed priority: Need fastest possible processing
Lightweight deployment: Minimal dependencies required

See OCR Extractors for traditional OCR options.

Choosing a VLM Extractor

Use Case	Recommended	Notes
General documents	docling-smol	Fast, balanced accuracy
Academic papers	docling-granite	Better equation/code recognition
Business documents	docling-smol	Good table extraction
Technical documentation	docling-granite	Superior code block handling
Large corpus processing	docling-smol	Faster inference
Maximum accuracy	docling-granite	Best overall metrics

Performance Comparison

SmolDocling-256M

Parameters: 256M
Speed: 6.15 sec/page (MLX)
Tables F1: 0.985
Code F1: 0.980
Equations F1: 0.970

Granite Docling-258M

Parameters: 258M
Speed: ~7 sec/page (MLX, estimated)
Tables F1: 0.992
Code F1: 0.988
Equations F1: 0.975

Backend Options

MLX Backend (Apple Silicon)

Platform: macOS with Apple Silicon (M1/M2/M3/M4)
Performance: 2-3x faster than Transformers
Memory: Efficient unified memory usage
Installation: pip install biblicus[docling-mlx]

Transformers Backend (Cross-Platform)

Platform: Any platform (CPU, CUDA, ROCm)
Performance: Slower but widely compatible
Memory: Standard PyTorch memory requirements
Installation: pip install biblicus[docling]

Output Formats

All VLM extractors support multiple output formats:

Markdown (Default)

Preserves document structure with headings, lists, tables, code blocks:

extractor_id: docling-smol
config:
  output_format: markdown  # default

HTML

Produces semantic HTML with proper tagging:

extractor_id: docling-smol
config:
  output_format: html

Plain Text

Simple text output without formatting:

extractor_id: docling-smol
config:
  output_format: text

Common Patterns

Fallback to OCR

Try VLM first, fall back to traditional OCR:

extractor_id: select-text
config:
  extractors:
    - docling-granite
    - ocr-rapidocr

Speed vs Accuracy Trade-off

Use SmolDocling for speed, Granite for accuracy:

extractor_id: select-smart-override
config:
  default_extractor: docling-smol
  overrides:
    - media_type_pattern: "application/pdf"
      extractor: docling-granite  # Higher accuracy for important PDFs

Backend Selection

Choose backend based on platform:

# MLX for Apple Silicon (fast)
extractor_id: docling-smol
config:
  backend: mlx

# Transformers for other platforms (compatible)
extractor_id: docling-smol
config:
  backend: transformers

Installation Guide

Apple Silicon (Recommended)

# Install with MLX backend for best performance
pip install biblicus[docling-mlx]

Other Platforms

# Install with Transformers backend
pip install biblicus[docling]

Both Extractors

# Install all Docling dependencies
pip install biblicus[docling-mlx]  # Includes base docling extras

Supported Document Types

PDF: Scanned and digital PDFs
DOCX: Microsoft Word documents
XLSX: Excel spreadsheets
PPTX: PowerPoint presentations
HTML: Web pages
Images: PNG, JPEG, SVG, and other image formats