Granite Docling-258M Extractor

Extractor ID: docling-granite

Category: Vision-Language Models (VLM)

Overview

The Granite Docling-258M extractor uses IBM Research’s Granite Docling vision-language model for document understanding. It is aimed at technical content, including tables, code blocks, and mathematical equations.

Granite Docling-258M is a 258-million-parameter VLM optimized for high-accuracy document processing. It scores higher than SmolDocling-256M on complex document structures, with F1 scores of 0.988 for code, 0.992 for tables, and 0.975 for equations (see the comparison under Performance).

Installation

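Transformers Backend (Cross-Platform)

pip install "biblicus[docling]"

MLX Backend (Apple Silicon)

The default backend is mlx, which needs the MLX extra:

pip install "biblicus[docling-mlx]"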

Supported Media Types

  • application/pdf - PDF documents (digital and scanned)

  • application/vnd.openxmlformats-officedocument.wordprocessingml.document - DOCX

  • application/vnd.openxmlformats-officedocument.spreadsheetml.sheet - XLSX

  • application/vnd.openxmlformats-officedocument.presentationml.presentation - PPTX

  • text/html - HTML files

  • application/xhtml+xml - XHTML files

  • image/png - PNG images

  • image/jpeg - JPEG images

  • image/gif - GIF images

  • image/webp - WebP images

  • image/tiff - TIFF images

  • image/bmp - BMP images

The extractor automatically skips text items (text/plain, text/markdown) and audio items.
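The skip rule above can be sketched as a small predicate. This is an illustration only: `SUPPORTED_MEDIA_TYPES` and `should_extract` are hypothetical names, not part of the Biblicus API, and the set simply mirrors the media types listed in this section.

```python
# Hypothetical helper mirroring the documented media types; not Biblicus API.
SUPPORTED_MEDIA_TYPES = frozenset({
    "application/pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "text/html",
    "application/xhtml+xml",
    "image/png",
    "image/jpeg",
    "image/gif",
    "image/webp",
    "image/tiff",
    "image/bmp",
})


def should_extract(media_type: str) -> bool:
    """Return True if the media type appears in the documented supported list."""
    # Drop parameters such as "; charset=utf-8" and normalize case before comparing.
    normalized = media_type.split(";", 1)[0].strip().lower()
    return normalized in SUPPORTED_MEDIA_TYPES
```

With this sketch, `text/plain`, `text/markdown`, and any `audio/*` type return False, while `text/html` is still accepted because it is listed explicitly.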

Configuration

Config Schema

from pydantic import BaseModel  # assumed: BaseModel is pydantic's


class DoclingGraniteExtractorConfig(BaseModel):
    output_format: str = "markdown"  # markdown, text, or html
    backend: str = "mlx"  # mlx or transformers

Configuration Options

| Option | Type | Default | Description |
|---|---|---|---|
| output_format | str | markdown | Output format: markdown, text, or html |
| backend | str | mlx | Inference backend: mlx (Apple Silicon) or transformers (cross-platform) |
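The schema types both options as plain strings, so this page does not say whether unknown values are rejected. As a minimal sketch of checking a configuration before starting a long run, the hypothetical `GraniteConfigSketch` below (standard library only, not the real config class) accepts only the documented values.

```python
from dataclasses import dataclass

# Values taken from the options documented above.
ALLOWED_OUTPUT_FORMATS = ("markdown", "text", "html")
ALLOWED_BACKENDS = ("mlx", "transformers")


@dataclass(frozen=True)
class GraniteConfigSketch:
    """Hypothetical stand-in for the extractor config, with explicit value checks."""

    output_format: str = "markdown"
    backend: str = "mlx"

    def __post_init__(self) -> None:
        # Fail early with a readable message instead of partway through extraction.
        if self.output_format not in ALLOWED_OUTPUT_FORMATS:
            raise ValueError(
                f"output_format must be one of {ALLOWED_OUTPUT_FORMATS}, got {self.output_format!r}"
            )
        if self.backend not in ALLOWED_BACKENDS:
            raise ValueError(
                f"backend must be one of {ALLOWED_BACKENDS}, got {self.backend!r}"
            )
```

Calling `GraniteConfigSketch()` yields the documented defaults, and `GraniteConfigSketch(backend="cuda")` raises `ValueError`.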

Output Formats

  • markdown (default): Preserves document structure with headings, lists, tables, code blocks

  • html: Produces semantic HTML with proper tagging

  • text: Simple plain text without formatting

Usage

Command Line

Basic Usage

# Extract using Granite Docling with defaults (markdown, MLX)
biblicus extract my-corpus --extractor docling-granite

Custom Configuration

# Use Transformers backend with HTML output
biblicus extract my-corpus --extractor docling-granite \
  --config output_format=html \
  --config backend=transformers
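When the same command is launched from a script, the repeated `--config key=value` flags can be generated from a dictionary. The sketch below uses only the flags shown above; `build_extract_command` is a hypothetical helper, not something Biblicus ships.

```python
import shlex


def build_extract_command(corpus: str, extractor_id: str, config: dict[str, str]) -> list[str]:
    """Build the argv list for `biblicus extract`, one --config flag per option."""
    argv = ["biblicus", "extract", corpus, "--extractor", extractor_id]
    for key, value in config.items():
        argv += ["--config", f"{key}={value}"]
    return argv


command = build_extract_command(
    "my-corpus",
    "docling-granite",
    {"output_format": "html", "backend": "transformers"},
)
# shlex.join renders the list as a shell-safe string for logging or copy-paste.
print(shlex.join(command))
```

The resulting list can be passed directly to `subprocess.run(command, check=True)` without going through a shell.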

Configuration File

extractor_id: docling-granite
config:
  output_format: markdown
  backend: mlx

# Run the extraction with the configuration file above
biblicus extract my-corpus --configuration configuration.yml

Python API

from biblicus import Corpus

# Load corpus
corpus = Corpus.from_directory("my-corpus")

# Extract with defaults
results = corpus.extract_text(extractor_id="docling-granite")

# Extract with custom config
results = corpus.extract_text(
    extractor_id="docling-granite",
    config={
        "output_format": "html",
        "backend": "transformers"
    }
)

In Pipeline

Fallback Chain

extractor_id: select-text
config:
  extractors:
    - docling-granite  # Highest accuracy
    - docling-smol     # Faster fallback
    - ocr-rapidocr     # Traditional OCR

High-Accuracy Override

extractor_id: select-smart-override
config:
  default_extractor: docling-smol
  overrides:
    - media_type_pattern: "application/pdf"
      extractor: docling-granite  # Use Granite for PDFs
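How `media_type_pattern` is matched (exact string, glob, or regular expression) is not stated here. The sketch below assumes glob-style matching with first-match-wins, purely to illustrate the override idea; `choose_extractor` is a hypothetical name.

```python
from fnmatch import fnmatchcase


def choose_extractor(
    media_type: str, default_extractor: str, overrides: list[dict[str, str]]
) -> str:
    """Return the extractor of the first matching override, else the default."""
    for rule in overrides:
        # Assumption: patterns are shell-style globs, so "image/*" would match any image.
        if fnmatchcase(media_type, rule["media_type_pattern"]):
            return rule["extractor"]
    return default_extractor


overrides = [{"media_type_pattern": "application/pdf", "extractor": "docling-granite"}]
pdf_choice = choose_extractor("application/pdf", "docling-smol", overrides)
png_choice = choose_extractor("image/png", "docling-smol", overrides)
```

Under these assumptions PDFs go to docling-granite and every other media type stays on docling-smol.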

Examples

Academic Papers with Equations

Extract research papers with mathematical equations:

biblicus extract papers-corpus --extractor docling-granite \
  --config output_format=markdown

Technical Documentation

Process documentation with code blocks:

biblicus extract docs-corpus --extractor docling-granite \
  --config output_format=html

Complex Tables

Extract spreadsheets and documents with complex tables:

biblicus extract tables-corpus --extractor docling-granite \
  --config backend=mlx

High-Accuracy Pipeline

Prioritize accuracy over speed:

from biblicus import Corpus

corpus = Corpus.from_directory("important-docs")

# Use Granite for maximum accuracy
results = corpus.extract_text(
    extractor_id="docling-granite",
    config={
        "output_format": "markdown",
        "backend": "mlx"
    }
)

Performance

Benchmarks

  • Speed: ~7 seconds/page (MLX on Apple Silicon, estimated)

  • Tables F1: 0.992 ⭐

  • Code F1: 0.988 ⭐

  • Equations F1: 0.975 ⭐

Comparison with SmolDocling

| Metric | Granite-258M | SmolDocling-256M |
|---|---|---|
| Tables F1 | 0.992 | 0.985 |
| Code F1 | 0.988 | 0.980 |
| Equations F1 | 0.975 | 0.970 |
| Speed | ~7 sec/page | 6.15 sec/page |
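To put the speed row in perspective, the per-page figures can be scaled to a whole corpus. The arithmetic below uses only the numbers from the table; the 1,000-page corpus size is an arbitrary example, and the Granite figure is itself an estimate.

```python
# Per-page timings from the comparison table (Granite's is an estimate).
GRANITE_SECONDS_PER_PAGE = 7.0
SMOL_SECONDS_PER_PAGE = 6.15


def estimated_minutes(pages: int, seconds_per_page: float) -> float:
    """Total sequential processing time in minutes for a given page count."""
    return pages * seconds_per_page / 60


pages = 1000  # example corpus size, chosen for illustration
granite_minutes = estimated_minutes(pages, GRANITE_SECONDS_PER_PAGE)
smol_minutes = estimated_minutes(pages, SMOL_SECONDS_PER_PAGE)
extra_minutes = granite_minutes - smol_minutes

print(f"Granite: {granite_minutes:.1f} min, SmolDocling: {smol_minutes:.1f} min")
print(f"Extra time for Granite: {extra_minutes:.1f} min")
```

For 1,000 pages this comes to roughly 117 minutes against 102.5 minutes, so Granite costs about 14 extra minutes in exchange for F1 gains of 0.005 to 0.008.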

When to Use Granite vs SmolDocling

Use Granite Docling-258M when:

  • Accuracy is critical

  • Processing technical documents (code, equations)

  • Complex table extraction is needed

  • Output quality justifies the extra processing time

Use SmolDocling-256M when:

  • Speed is more important than accuracy

  • Processing large corpora

  • Documents are relatively simple

  • Resource constraints exist

Error Handling

Missing Dependency

If the Docling library is not installed, extraction fails with:

ExtractionRunFatalError: DoclingGranite extractor requires an optional dependency.
Install it with pip install "biblicus[docling]".

Missing MLX Support

If the MLX backend is configured but MLX is not available, extraction fails with:

ExtractionRunFatalError: DoclingGranite extractor with MLX backend requires MLX support.
Install it with pip install "biblicus[docling-mlx]".
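One way to avoid this error in scripts that run on mixed hardware is to choose the backend before calling the extractor. The sketch below is a heuristic, not Biblicus behavior: `pick_backend` is a hypothetical helper, and it assumes that "MLX support" means running on Apple Silicon with an importable `mlx` package.

```python
import importlib.util
import platform
import sys


def pick_backend() -> str:
    """Return "mlx" on Apple Silicon when the mlx package is importable, else "transformers"."""
    on_apple_silicon = sys.platform == "darwin" and platform.machine() == "arm64"
    # find_spec returns None when the top-level package is not installed.
    mlx_installed = importlib.util.find_spec("mlx") is not None
    return "mlx" if on_apple_silicon and mlx_installed else "transformers"


config = {"output_format": "markdown", "backend": pick_backend()}
```

The resulting `config` dictionary has the same shape as the one passed to `corpus.extract_text` in the Python API examples.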

Empty Output

Documents that cannot be processed produce empty extracted text and are counted in extracted_empty_items statistics.

Per-Item Errors

Processing errors for individual items are recorded in the extraction snapshot but don’t halt the entire extraction. Check errored_items in extraction statistics.
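How the statistics are exposed in Python is not shown on this page. Assuming they can be read as a mapping keyed by the counter names above, a post-run check might look like this sketch. `summarize_extraction` and the plain dictionary are hypothetical stand-ins.

```python
def summarize_extraction(stats: dict[str, int]) -> list[str]:
    """Return human-readable warnings for errored and empty items, if any."""
    warnings = []
    errored = stats.get("errored_items", 0)
    empty = stats.get("extracted_empty_items", 0)
    if errored:
        warnings.append(f"{errored} item(s) failed; see the extraction snapshot for details")
    if empty:
        warnings.append(f"{empty} item(s) produced empty text")
    return warnings


# Example counters, made up for illustration.
example_stats = {"errored_items": 2, "extracted_empty_items": 1}
for line in summarize_extraction(example_stats):
    print(line)
```

An empty list means every item produced text, which makes the check easy to wire into a CI step or a batch job.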

Use Cases

Research Papers

Granite excels at academic papers with:

  • LaTeX-style equations

  • Complex bibliography formatting

  • Multi-column layouts

  • Figures and captions

Source Code Documentation

Ideal for technical documentation with:

  • Syntax-highlighted code blocks

  • API reference tables

  • Inline code snippets

  • Function signatures

Financial Reports

Handles business documents with:

  • Complex financial tables

  • Merged cells and hierarchies

  • Mixed text and numeric data

  • Charts and graphs

See Also