SmolDocling-256M Extractor

Extractor ID: docling-smol

Category: Vision-Language Models (VLM)

Overview

The SmolDocling-256M extractor uses IBM Research’s SmolDocling vision-language model for fast, accurate document understanding. It combines visual layout analysis with semantic text extraction to handle complex documents.

SmolDocling-256M is a 256-million-parameter VLM optimized for document processing. With the MLX backend on Apple Silicon (M2) it processes a page in 6.15 seconds, which makes it one of the fastest VLM extractors, and it reports F1 scores of 0.97 or higher on tables, code, and equations (see Performance).

Installation

Transformers Backend (Cross-Platform)

pip install "biblicus[docling]"

MLX Backend (Apple Silicon)

pip install "biblicus[docling-mlx]"

Supported Media Types

  • application/pdf - PDF documents (digital and scanned)

  • application/vnd.openxmlformats-officedocument.wordprocessingml.document - DOCX

  • application/vnd.openxmlformats-officedocument.spreadsheetml.sheet - XLSX

  • application/vnd.openxmlformats-officedocument.presentationml.presentation - PPTX

  • text/html - HTML files

  • application/xhtml+xml - XHTML files

  • image/png - PNG images

  • image/jpeg - JPEG images

  • image/gif - GIF images

  • image/webp - WebP images

  • image/tiff - TIFF images

  • image/bmp - BMP images

The extractor automatically skips text items (text/plain, text/markdown) and audio items.
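The support rules above can be written as a small pre-flight check. This is a minimal sketch and not biblicus code: SUPPORTED_MEDIA_TYPES and is_supported are hypothetical names, and the set simply mirrors the list above.

```python
# Hypothetical helper; the set mirrors the media types listed above.
SUPPORTED_MEDIA_TYPES = {
    "application/pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "text/html",
    "application/xhtml+xml",
    "image/png",
    "image/jpeg",
    "image/gif",
    "image/webp",
    "image/tiff",
    "image/bmp",
}


def is_supported(media_type: str) -> bool:
    """Return True if the media type appears in the documented list."""
    # Normalize case and drop parameters such as "; charset=utf-8".
    base = media_type.split(";", 1)[0].strip().lower()
    return base in SUPPORTED_MEDIA_TYPES
```

Text and audio types such as text/plain or audio/mpeg fall outside the set, which matches the skipping behavior described above.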

Configuration

Config Schema

from pydantic import BaseModel


class DoclingSmolExtractorConfig(BaseModel):
    output_format: str = "markdown"  # markdown, text, or html
    backend: str = "mlx"             # mlx or transformers

Configuration Options

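| Option        | Type | Default  | Description                                                             |
|---------------|------|----------|-------------------------------------------------------------------------|
| output_format | str  | markdown | Output format: markdown, text, or html                                  |
| backend       | str  | mlx      | Inference backend: mlx (Apple Silicon) or transformers (cross-platform) |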

Output Formats

  • markdown (default): Preserves document structure with headings, lists, tables, code blocks

  • html: Produces semantic HTML with proper tagging

  • text: Simple plain text without formatting
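Because both options are typed as plain strings in the schema, a typo such as output_format=htm may only surface at extraction time. The sketch below shows one way to check the values up front. It is a standard-library stand-in written for illustration: SmolConfigSketch and the two tuples are hypothetical names, not part of biblicus.

```python
from dataclasses import dataclass

# Allowed values as documented in the options above.
ALLOWED_OUTPUT_FORMATS = ("markdown", "text", "html")
ALLOWED_BACKENDS = ("mlx", "transformers")


@dataclass(frozen=True)
class SmolConfigSketch:
    """Hypothetical stand-in for DoclingSmolExtractorConfig with value checks."""

    output_format: str = "markdown"
    backend: str = "mlx"

    def __post_init__(self) -> None:
        # Reject anything outside the documented values.
        if self.output_format not in ALLOWED_OUTPUT_FORMATS:
            raise ValueError(f"unsupported output_format: {self.output_format!r}")
        if self.backend not in ALLOWED_BACKENDS:
            raise ValueError(f"unsupported backend: {self.backend!r}")
```

Calling SmolConfigSketch(output_format="html", backend="transformers") succeeds, while SmolConfigSketch(backend="cuda") raises ValueError.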

Usage

Command Line

Basic Usage

# Extract using SmolDocling with defaults (markdown, MLX)
biblicus extract my-corpus --extractor docling-smol

Custom Configuration

# Use Transformers backend with HTML output
biblicus extract my-corpus --extractor docling-smol \
  --config output_format=html \
  --config backend=transformers

Configuration File

extractor_id: docling-smol
config:
  output_format: markdown
  backend: mlx

Save the file as configuration.yml, then run:

biblicus extract my-corpus --configuration configuration.yml

Python API

from biblicus import Corpus

# Load corpus
corpus = Corpus.from_directory("my-corpus")

# Extract with defaults
results = corpus.extract_text(extractor_id="docling-smol")

# Extract with custom config
results = corpus.extract_text(
    extractor_id="docling-smol",
    config={
        "output_format": "html",
        "backend": "transformers"
    }
)

In Pipeline

Fallback Chain

extractor_id: select-text
config:
  extractors:
    - pdf-text         # Try text extraction first
    - docling-smol     # Fall back to VLM
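One way to read the chain above is "use the first extractor that yields usable text". The sketch below illustrates that reading with plain callables. It is an assumption about how select-text behaves, not its implementation, and first_non_empty and the stub extractors are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

Extractor = Callable[[bytes], str]


def first_non_empty(
    data: bytes, chain: Sequence[Tuple[str, Extractor]]
) -> Optional[Tuple[str, str]]:
    """Return (extractor name, text) from the first extractor with non-blank output."""
    for name, extract in chain:
        text = extract(data)
        if text.strip():
            return name, text
    return None  # every extractor in the chain produced blank output


# Stub extractors for illustration only.
def stub_pdf_text(data: bytes) -> str:
    return ""  # e.g. a scanned PDF with no embedded text layer


def stub_vlm(data: bytes) -> str:
    return "# Recognized heading"


chain = [("pdf-text", stub_pdf_text), ("docling-smol", stub_vlm)]
result = first_non_empty(b"%PDF-", chain)
```

Ordering the cheap text extractor first means the slower VLM only runs on items where the cheap path comes back empty.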

Media Type Routing

extractor_id: select-smart-override
config:
  default_extractor: pdf-text
  overrides:
    - media_type_pattern: "image/.*"
      extractor: docling-smol
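Routing by media type comes down to testing each item's media type against the override patterns. The sketch below assumes the patterns are regular expressions that must match the whole media type and that the first matching override wins. Both points are assumptions about select-smart-override, and pick_extractor is a hypothetical name.

```python
import re
from typing import Dict, List


def pick_extractor(
    media_type: str, default_extractor: str, overrides: List[Dict[str, str]]
) -> str:
    """Return the extractor of the first override whose pattern matches."""
    for rule in overrides:
        # fullmatch: the pattern has to cover the entire media type string.
        if re.fullmatch(rule["media_type_pattern"], media_type):
            return rule["extractor"]
    return default_extractor


overrides = [{"media_type_pattern": "image/.*", "extractor": "docling-smol"}]
```

With these overrides, image/png resolves to docling-smol and application/pdf falls through to the default extractor.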

Examples

Academic Papers

Extract academic papers with equations and code blocks:

biblicus extract papers-corpus --extractor docling-smol \
  --config output_format=markdown

Office Documents

Process DOCX, XLSX, PPTX files:

biblicus extract office-corpus --extractor docling-smol \
  --config output_format=html

Scanned Documents

OCR scanned PDFs and images:

biblicus extract scans-corpus --extractor docling-smol \
  --config backend=mlx

Multi-Format Corpus

Handle mixed document types with automatic routing:

from biblicus import Corpus

corpus = Corpus.from_directory("mixed-corpus")

# Route based on media type
results = corpus.extract_text(
    extractor_id="select-smart-override",
    config={
        "default_extractor": "pass-through-text",
        "overrides": [
            {"media_type_pattern": "application/pdf", "extractor": "docling-smol"},
            {"media_type_pattern": "image/.*", "extractor": "docling-smol"},
            {"media_type_pattern": "application/vnd\\.openxmlformats.*", "extractor": "docling-smol"},
        ]
    }
)

Performance

Benchmarks

  • Speed: 6.15 seconds/page (MLX on Apple Silicon M2)

  • Tables F1: 0.985

  • Code F1: 0.980

  • Equations F1: 0.970

Backend Comparison

| Backend      | Platform       | Speed          | Memory    |
|--------------|----------------|----------------|-----------|
| MLX          | Apple Silicon  | 6.15 sec/page  | Efficient |
| Transformers | Any (CPU/CUDA) | 15-20 sec/page | Higher    |
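The per-page figures translate directly into corpus-level estimates. The sketch below is a back-of-the-envelope calculation that assumes pages are processed one at a time at the listed speeds; the 1,000-page corpus size is an arbitrary illustration.

```python
def estimate_hours(pages: int, seconds_per_page: float) -> float:
    """Estimate wall-clock hours for processing `pages` pages sequentially."""
    return pages * seconds_per_page / 3600


pages = 1000  # arbitrary corpus size for illustration
mlx_hours = estimate_hours(pages, 6.15)
transformers_low = estimate_hours(pages, 15)
transformers_high = estimate_hours(pages, 20)

print(f"MLX: {mlx_hours:.2f} h")  # MLX: 1.71 h
print(f"Transformers: {transformers_low:.2f}-{transformers_high:.2f} h")  # Transformers: 4.17-5.56 h
```

At these rates the Transformers backend takes roughly 2.4 to 3.3 times as long as MLX for the same corpus.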

When to Use SmolDocling vs Granite

  • SmolDocling-256M: Faster inference, balanced accuracy, good for large corpus processing

  • Granite Docling-258M: Better accuracy (F1: 0.988 code, 0.992 tables), slower

Error Handling

Missing Dependency

If the Docling library is not installed:

ExtractionRunFatalError: DoclingSmol extractor requires an optional dependency.
Install it with pip install "biblicus[docling]".

Missing MLX Support

If MLX backend is configured but not available:

ExtractionRunFatalError: DoclingSmol extractor with MLX backend requires MLX support.
Install it with pip install "biblicus[docling-mlx]".
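Since mlx is the default backend, a script that runs on mixed hardware may want to pick the backend before starting a run rather than hit this error. The sketch below is one possible check. It assumes that an importable top-level mlx package on an arm64 macOS machine is a good enough proxy for MLX support; biblicus may test for something else, and choose_backend is a hypothetical helper.

```python
import importlib.util
import platform


def choose_backend() -> str:
    """Pick "mlx" on Apple Silicon when the mlx package is importable."""
    on_apple_silicon = (
        platform.system() == "Darwin" and platform.machine() == "arm64"
    )
    # Assumption: the presence of the "mlx" package signals MLX support.
    has_mlx = importlib.util.find_spec("mlx") is not None
    return "mlx" if on_apple_silicon and has_mlx else "transformers"


# Config dict in the shape used by the Python API examples above.
config = {"output_format": "markdown", "backend": choose_backend()}
```

The resulting dict can be passed as the config argument shown in the Python API section.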

Empty Output

Documents that cannot be processed produce empty extracted text and are counted in extracted_empty_items statistics.

Per-Item Errors

Processing errors for individual items are recorded in the extraction snapshot but don’t halt the entire extraction. Check errored_items in extraction statistics.
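A run can finish without a fatal error and still leave items empty or errored, so it is worth checking both counters afterwards. The sketch below assumes the statistics are available as a mapping keyed by extracted_empty_items and errored_items. How biblicus actually exposes them is not shown here, and needs_review is a hypothetical helper.

```python
from typing import Mapping


def needs_review(stats: Mapping[str, int]) -> bool:
    """Return True if any item came back empty or errored."""
    empty = stats.get("extracted_empty_items", 0)
    errored = stats.get("errored_items", 0)
    return empty > 0 or errored > 0


# Made-up numbers for illustration, not real extraction results.
example_stats = {"extracted_empty_items": 2, "errored_items": 0}
```

Here needs_review(example_stats) is True because two items produced no text.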

See Also