Speech-to-Text (STT)

Audio transcription extractors for converting spoken content into text.

Overview

Speech-to-text extractors transcribe audio from video and audio files. They are ideal for:

  • Podcast transcription

  • Lecture and presentation recordings

  • Interview transcripts

  • Video content with narration

  • Audio messages and recordings

The raw audio bytes remain unchanged in the corpus; only transcribed text is stored in extraction results.

Available Extractors

stt-openai

OpenAI Whisper API for audio transcription:

  • Model: Whisper-1 (OpenAI hosted)

  • Accuracy: Excellent general-purpose accuracy

  • Languages: 50+ languages supported

  • Features: Automatic language detection, translation

  • Formats: MP3, MP4, MPEG, MPGA, M4A, WAV, WEBM

Installation: pip install biblicus[openai]

Best for: General transcription, multi-language content, OpenAI ecosystem integration

stt-deepgram

Deepgram Nova-3 for fast, accurate transcription:

  • Model: Nova-3 (default), Nova-2, other Deepgram models

  • Accuracy: Lower word error rate than Whisper

  • Features: Smart formatting, speaker diarization, filler word filtering

  • Languages: 30+ languages supported

  • Formats: Most audio formats

Installation: pip install biblicus[deepgram]

Best for: High-accuracy transcription, speaker diarization, professional content

stt-aldea

Aldea Speech-to-Text API for audio transcription:

  • API: REST pre-recorded audio (POST /v1/listen)

  • Response: Deepgram-compatible (channels, alternatives, transcript)

  • Features: Language hint, speaker diarization, word timestamps

  • Formats: MP3, AAC, FLAC, WAV, OGG, WebM, Opus, M4A

Installation: pip install biblicus[aldea]

Best for: Aldea-hosted transcription, Deepgram-compatible workflows

deepgram-transform

Render Deepgram structured metadata into text:

  • Source: transcript, utterances, or words

  • Filters: channel and speaker selection

  • Labels: optional channel/speaker prefixes

Best for: Diarized filtering, channel selection, and structured transcript rendering
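As an illustration of what "rendering structured metadata into text" can look like, here is a minimal Python sketch that turns diarized utterances into labeled lines. It is not the deepgram-transform implementation: the function name `render_utterances` and its parameters are hypothetical, and the sample payload assumes the usual Deepgram layout of `results.utterances` entries with `speaker`, `channel`, and `transcript` fields.

```python
from typing import Iterable, Optional


def render_utterances(
    response: dict,
    speakers: Optional[Iterable[int]] = None,
    label_speakers: bool = True,
) -> str:
    """Render diarized utterances as text, one line per utterance.

    Hypothetical helper: mirrors the source/filter/label options described
    above, but is not taken from the biblicus code base.
    """
    wanted = set(speakers) if speakers is not None else None
    lines = []
    for utt in response.get("results", {}).get("utterances", []):
        speaker = utt.get("speaker")
        # Speaker filter: skip utterances from speakers that were not requested.
        if wanted is not None and speaker not in wanted:
            continue
        text = utt.get("transcript", "").strip()
        if not text:
            continue
        # Optional label: prefix each line with the speaker index.
        show_label = label_speakers and speaker is not None
        prefix = f"Speaker {speaker}: " if show_label else ""
        lines.append(prefix + text)
    return "\n".join(lines)


# Assumed Deepgram-style payload, trimmed to the fields used here.
sample = {
    "results": {
        "utterances": [
            {"speaker": 0, "channel": 0, "transcript": "Welcome to the show."},
            {"speaker": 1, "channel": 0, "transcript": "Thanks for having me."},
            {"speaker": 0, "channel": 0, "transcript": "Let's get started."},
        ]
    }
}

print(render_utterances(sample, speakers=[0]))
```

Channel filtering would follow the same pattern, comparing each utterance's `channel` value against a requested set.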

Choosing an Extractor

| Use Case | Recommended | Notes |
|---|---|---|
| General transcription | stt-deepgram | Better accuracy, formatting |
| Multi-language content | stt-openai | More languages supported |
| Speaker identification | stt-deepgram | Has diarization feature |
| Translation to English | stt-openai | Built-in translation |
| Cost-sensitive | stt-deepgram | Competitive pricing |
| OpenAI workflow | stt-openai | Single API key |
| Aldea / Deepgram-shaped API | stt-aldea | Aldea-hosted, same response shape as Deepgram |

Performance Comparison

OpenAI Whisper

  • Accuracy: Excellent (WER ~5-10%)

  • Speed: Moderate

  • Languages: 50+

  • Max file size: 25 MB

  • Pricing: $0.006/minute

Deepgram Nova-3

  • Accuracy: Superior (WER ~3-7%)

  • Speed: Fast (real-time capable)

  • Languages: 30+

  • Max file size: No limit

  • Pricing: Competitive (volume discounts)
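The two OpenAI figures above (25 MB maximum file size, $0.006 per minute) are enough for a quick pre-flight check before submitting a file. The sketch below is only arithmetic on those listed numbers: the helper names are hypothetical, and treating "25 MB" as 25 × 1024 × 1024 bytes is an assumption.

```python
# Figures copied from the comparison above; 1 MB = 1024 * 1024 bytes is assumed.
OPENAI_MAX_BYTES = 25 * 1024 * 1024
OPENAI_USD_PER_MINUTE = 0.006


def fits_openai_limit(size_bytes: int) -> bool:
    """Return True if a file is within the listed 25 MB limit."""
    return size_bytes <= OPENAI_MAX_BYTES


def estimate_openai_cost(duration_seconds: float) -> float:
    """Estimate the transcription cost in USD at the listed per-minute rate."""
    return round(duration_seconds / 60 * OPENAI_USD_PER_MINUTE, 4)


# A one-hour recording stored as a 30 MB file.
print(fits_openai_limit(30 * 1024 * 1024))
print(estimate_openai_cost(3600))
```

A file that fails the size check would need to be compressed, split, or routed to an extractor without that limit.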

Common Patterns

Fallback Chain

Try Deepgram first, fall back to OpenAI:

extractor_id: select-text
config:
  extractors:
    - stt-deepgram
    - stt-openai
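The idea behind an ordered extractor list can be sketched in Python as "return the first transcriber that yields usable text". This is an illustration of the pattern only: how select-text itself treats errors or empty output is not specified here, and `first_usable_text` and the stub transcribers are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

Transcriber = Callable[[bytes], str]


def first_usable_text(
    audio: bytes,
    chain: Sequence[Tuple[str, Transcriber]],
) -> Optional[Tuple[str, str]]:
    """Try each (name, transcriber) in order; return the first non-empty result."""
    for name, transcribe in chain:
        try:
            text = transcribe(audio)
        except Exception:
            # Assumption: a failing extractor is skipped, not fatal.
            continue
        if text.strip():
            return name, text
    return None


# Stub transcribers standing in for real API calls.
def failing_deepgram(audio: bytes) -> str:
    raise RuntimeError("service unavailable")


def working_openai(audio: bytes) -> str:
    return "hello world"


result = first_usable_text(
    b"fake-audio",
    [("stt-deepgram", failing_deepgram), ("stt-openai", working_openai)],
)
print(result)
```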

Language-Specific Routing

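Route by media type using overrides. The example below sends video files to stt-openai and audio files (and anything else) to stt-deepgram: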

extractor_id: select-smart-override
config:
  default_extractor: stt-deepgram
  overrides:
    - media_type_pattern: "audio/.*"
      extractor: stt-deepgram
    - media_type_pattern: "video/.*"
      extractor: stt-openai
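The override list above reads as "first matching pattern wins, otherwise use the default". A minimal Python sketch of that lookup follows. The function `pick_extractor` is hypothetical, and whether biblicus matches patterns with a full match or a search is an assumption (a full match is used here).

```python
import re
from typing import Sequence, Tuple


def pick_extractor(
    media_type: str,
    default: str,
    overrides: Sequence[Tuple[str, str]],
) -> str:
    """Return the extractor of the first override whose pattern matches."""
    for pattern, extractor in overrides:
        # Assumption: the whole media type must match the pattern.
        if re.fullmatch(pattern, media_type):
            return extractor
    return default


# Same rules as the YAML example above.
overrides = [
    (r"audio/.*", "stt-deepgram"),
    (r"video/.*", "stt-openai"),
]

for media_type in ("audio/mpeg", "video/mp4", "application/pdf"):
    print(media_type, "->", pick_extractor(media_type, "stt-deepgram", overrides))
```

Because the default is already stt-deepgram, the `audio/.*` rule is redundant in this particular configuration; it mainly documents the intent.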

Speaker Diarization

Use Deepgram with diarization enabled:

extractor_id: stt-deepgram
config:
  diarize: true
  smart_format: true

Authentication

The stt-openai, stt-deepgram, and stt-aldea extractors each require an API key for their provider:

Environment Variables

export OPENAI_API_KEY="your-openai-key"
export DEEPGRAM_API_KEY="your-deepgram-key"
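Before a long run it can help to confirm that the relevant variables are actually set. The Python sketch below checks the two variable names shown above. The helper `missing_keys` is hypothetical, and a missing variable is not necessarily an error if the key is supplied through the configuration file instead.

```python
import os
from typing import Iterable, List, Mapping, Optional

# Variable names taken from the export lines above.
REQUIRED_KEYS = {
    "stt-openai": "OPENAI_API_KEY",
    "stt-deepgram": "DEEPGRAM_API_KEY",
}


def missing_keys(
    extractors: Iterable[str],
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the environment variables that are unset or empty."""
    env = os.environ if env is None else env
    return [
        REQUIRED_KEYS[name]
        for name in extractors
        if name in REQUIRED_KEYS and not env.get(REQUIRED_KEYS[name])
    ]


# Check the real environment for both extractors.
print(missing_keys(["stt-openai", "stt-deepgram"]))
```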

Configuration File

Add to ~/.biblicus/config.yml:

openai:
  api_key: YOUR_OPENAI_KEY

deepgram:
  api_key: YOUR_DEEPGRAM_KEY

aldea:
  api_key: org_YOUR_ALDEA_KEY

Supported Audio Formats

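Between them, the STT extractors support the common audio formats below. Exact support varies by extractor, so check the Formats line in each extractor's entry above: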

  • MP3

  • MP4 (audio track)

  • MPEG

  • MPGA

  • M4A

  • WAV

  • WEBM

  • OGG

  • FLAC
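Since format support differs per extractor, a small lookup by file extension can show which extractors list a given format. The Python sketch below uses only the Formats lines given earlier for stt-openai and stt-aldea. stt-deepgram is left out because its entry says only "most audio formats", and the helper `extractors_for` is hypothetical.

```python
from pathlib import Path
from typing import List

# Extensions copied from the Formats lines of each extractor entry.
FORMATS = {
    "stt-openai": {"mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"},
    "stt-aldea": {"mp3", "aac", "flac", "wav", "ogg", "webm", "opus", "m4a"},
}


def extractors_for(filename: str) -> List[str]:
    """Return the extractors whose listed formats include the file's extension."""
    ext = Path(filename).suffix.lower().lstrip(".")
    return sorted(name for name, exts in FORMATS.items() if ext in exts)


for name in ("episode.mp3", "talk.FLAC", "notes.txt"):
    print(name, "->", extractors_for(name))
```

An extension check is only a first filter, because the container's extension does not guarantee the codec inside it.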

See Also