Speech-to-Text (STT)
Audio transcription extractors for converting spoken content into text.
Speech-to-Text Extractors
Overview
Speech-to-text extractors transcribe audio from video and audio files. They are ideal for:
Podcast transcription
Lecture and presentation recordings
Interview transcripts
Video content with narration
Audio messages and recordings
The raw audio bytes remain unchanged in the corpus; only transcribed text is stored in extraction results.
Available Extractors
stt-openai
OpenAI Whisper API for audio transcription:
Model: Whisper-1 (OpenAI hosted)
Accuracy: Excellent general-purpose accuracy
Languages: 50+ languages supported
Features: Automatic language detection, translation
Formats: MP3, MP4, MPEG, MPGA, M4A, WAV, WEBM
Installation: pip install biblicus[openai]
Best for: General transcription, multi-language content, OpenAI ecosystem integration
stt-deepgram
Deepgram Nova-3 for fast, accurate transcription:
Model: Nova-3 (default), Nova-2, other Deepgram models
Accuracy: Lower word error rate than Whisper (WER ~3-7% vs. ~5-10%; see Performance Comparison)
Features: Smart formatting, speaker diarization, filler word filtering
Languages: 30+ languages supported
Formats: Most audio formats
Installation: pip install biblicus[deepgram]
Best for: High-accuracy transcription, speaker diarization, professional content
stt-aldea
Aldea Speech-to-Text API for audio transcription:
API: REST pre-recorded audio (POST /v1/listen)
Response: Deepgram-compatible (channels, alternatives, transcript)
Features: Language hint, speaker diarization, word timestamps
Formats: MP3, AAC, FLAC, WAV, OGG, WebM, Opus, M4A
Installation: pip install biblicus[aldea]
Best for: Aldea-hosted transcription, Deepgram-compatible workflows
deepgram-transform
Render Deepgram structured metadata into text:
Source: transcript, utterances, or words
Filters: channel and speaker selection
Labels: optional channel/speaker prefixes
Best for: Diarized filtering, channel selection, and structured transcript rendering
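To make the rendering idea concrete, here is a minimal sketch of how diarized words in a Deepgram-shaped response could be turned into speaker-labeled lines. It is not the deepgram-transform implementation: the function name `render_speaker_lines` and the `Speaker N:` label format are hypothetical, and the field names (`results`, `channels`, `alternatives`, `words`, `speaker`, `punctuated_word`) assume Deepgram's pre-recorded response shape with diarization enabled.

```python
from itertools import groupby


def render_speaker_lines(response: dict, channel: int = 0) -> str:
    """Group consecutive words by speaker and render one labeled line per turn.

    Hypothetical helper for illustration; assumes a Deepgram-shaped response.
    """
    alternative = response["results"]["channels"][channel]["alternatives"][0]
    words = alternative.get("words", [])
    lines = []
    # groupby only merges adjacent items, so each speaker turn becomes one line.
    for speaker, group in groupby(words, key=lambda w: w.get("speaker", 0)):
        text = " ".join(w.get("punctuated_word", w["word"]) for w in group)
        lines.append(f"Speaker {speaker}: {text}")
    return "\n".join(lines)


# Hand-written sample data, not real API output.
sample = {
    "results": {
        "channels": [
            {
                "alternatives": [
                    {
                        "transcript": "Hello there. Hi.",
                        "words": [
                            {"word": "hello", "punctuated_word": "Hello", "speaker": 0},
                            {"word": "there", "punctuated_word": "there.", "speaker": 0},
                            {"word": "hi", "punctuated_word": "Hi.", "speaker": 1},
                        ],
                    }
                ]
            }
        ]
    }
}

print(render_speaker_lines(sample))
# Speaker 0: Hello there.
# Speaker 1: Hi.
```

Selecting a different `channel` index corresponds loosely to the channel filter described above; speaker filtering could be added by skipping groups whose speaker is not wanted.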
Choosing an Extractor
| Use Case | Recommended | Notes |
|---|---|---|
| General transcription | stt-deepgram | Better accuracy, formatting |
| Multi-language content | stt-openai | More languages supported |
| Speaker identification | stt-deepgram | Has diarization feature |
| Translation to English | stt-openai | Built-in translation |
| Cost-sensitive | stt-deepgram | Competitive pricing |
| OpenAI workflow | stt-openai | Single API key |
| Aldea / Deepgram-shaped API | stt-aldea | Aldea-hosted, same response shape as Deepgram |
Performance Comparison
OpenAI Whisper
Accuracy: Excellent (WER ~5-10%)
Speed: Moderate
Languages: 50+
Max file size: 25 MB
Pricing: $0.006/minute
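The limits listed for OpenAI Whisper lend themselves to a quick pre-flight check before a file is sent for transcription. The sketch below is illustrative only: the helper names are hypothetical, "25 MB" is assumed to mean 25 MiB, the accepted suffixes are taken from the stt-openai format list on this page, and the cost estimate simply multiplies duration by the listed per-minute price.

```python
from pathlib import Path

# Values taken from this page; the MiB interpretation is an assumption.
WHISPER_MAX_BYTES = 25 * 1024 * 1024
WHISPER_SUFFIXES = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}
WHISPER_USD_PER_MINUTE = 0.006


def whisper_preflight(filename: str, size_bytes: int) -> list[str]:
    """Return a list of reasons the file may be rejected (empty if none found)."""
    problems = []
    suffix = Path(filename).suffix.lower()
    if suffix not in WHISPER_SUFFIXES:
        problems.append(f"unsupported format: {suffix or '(none)'}")
    if size_bytes > WHISPER_MAX_BYTES:
        problems.append(f"file too large: {size_bytes} bytes")
    return problems


def estimate_whisper_cost(duration_seconds: float) -> float:
    """Estimate cost in USD from audio duration at the listed per-minute rate."""
    return round(duration_seconds / 60 * WHISPER_USD_PER_MINUTE, 4)


print(whisper_preflight("lecture.mp3", 10 * 1024 * 1024))  # []
print(estimate_whisper_cost(3600))  # 0.36
```

A file that fails this check is a natural candidate for stt-deepgram instead, or for splitting into smaller segments before transcription.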
Deepgram Nova-3
Accuracy: Superior (WER ~3-7%)
Speed: Fast (real-time capable)
Languages: 30+
Max file size: No limit
Pricing: Competitive (volume discounts)
Common Patterns
Fallback Chain
Try Deepgram first, fall back to OpenAI:
extractor_id: select-text
config:
  extractors:
    - stt-deepgram
    - stt-openai
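The fallback behavior can be pictured as "try each extractor in order and keep the first usable text". The following sketch models that idea with plain callables; it assumes that a failed or empty transcription causes the next extractor to be tried, which may differ from how select-text actually decides. The function `first_usable_text` and the stand-in transcribers are hypothetical and make no network calls.

```python
from typing import Callable, Iterable, Optional

Transcriber = Callable[[bytes], Optional[str]]


def first_usable_text(
    audio: bytes, transcribers: Iterable[tuple[str, Transcriber]]
) -> tuple[Optional[str], Optional[str]]:
    """Return (extractor_name, text) for the first non-empty transcription."""
    for name, transcribe in transcribers:
        try:
            text = transcribe(audio)
        except Exception:
            # Treat any failure as "no result" and move on to the next one.
            continue
        if text and text.strip():
            return name, text
    return None, None


# Stand-ins for real extractors, used only to demonstrate the ordering.
def failing_deepgram(audio: bytes) -> Optional[str]:
    raise RuntimeError("simulated outage")


def working_openai(audio: bytes) -> Optional[str]:
    return "hello world"


chain = [("stt-deepgram", failing_deepgram), ("stt-openai", working_openai)]
print(first_usable_text(b"fake-audio", chain))
# ('stt-openai', 'hello world')
```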
Language-Specific Routing
Route by media type using overrides (this example matches on media type, not on detected language):
extractor_id: select-smart-override
config:
  default_extractor: stt-deepgram
  overrides:
    - media_type_pattern: "audio/.*"
      extractor: stt-deepgram
    - media_type_pattern: "video/.*"
      extractor: stt-openai
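As a rough illustration of how such an override list could be evaluated, the sketch below matches a media type against each pattern in order and falls back to the default. Two details are assumptions rather than documented behavior: that the first matching override wins, and that patterns must match the whole media type string. The function name `route_extractor` is hypothetical.

```python
import re


def route_extractor(media_type: str, default_extractor: str, overrides: list[dict]) -> str:
    """Pick an extractor for a media type; first matching override wins (assumed)."""
    for override in overrides:
        # fullmatch: the pattern must cover the entire media type (assumed).
        if re.fullmatch(override["media_type_pattern"], media_type):
            return override["extractor"]
    return default_extractor


# Mirrors the override list from the configuration above.
overrides = [
    {"media_type_pattern": "audio/.*", "extractor": "stt-deepgram"},
    {"media_type_pattern": "video/.*", "extractor": "stt-openai"},
]

print(route_extractor("audio/mpeg", "stt-deepgram", overrides))  # stt-deepgram
print(route_extractor("video/mp4", "stt-deepgram", overrides))   # stt-openai
print(route_extractor("text/plain", "stt-deepgram", overrides))  # stt-deepgram
```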
Speaker Diarization
Use Deepgram with diarization enabled:
extractor_id: stt-deepgram
config:
  diarize: true
  smart_format: true
Authentication
Each API-backed extractor requires an API key:
Environment Variables
export OPENAI_API_KEY="your-openai-key"
export DEEPGRAM_API_KEY="your-deepgram-key"
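Before starting a long extraction run, it can help to confirm that the needed variables are actually set. The sketch below checks only the two environment variables named above; the helper `missing_api_keys` and the extractor-to-variable mapping are illustrative, and the check deliberately ignores keys supplied through a configuration file.

```python
import os
from typing import Iterable, Mapping

# Mapping built from the variable names shown above.
REQUIRED_ENV = {
    "stt-openai": "OPENAI_API_KEY",
    "stt-deepgram": "DEEPGRAM_API_KEY",
}


def missing_api_keys(
    extractors: Iterable[str], env: Mapping[str, str] = os.environ
) -> list[str]:
    """Return environment variable names that are unset or empty."""
    missing = []
    for extractor in extractors:
        variable = REQUIRED_ENV.get(extractor)
        if variable and not env.get(variable):
            missing.append(variable)
    return missing


# Passing an explicit mapping keeps the example independent of the real environment.
example_env = {"OPENAI_API_KEY": "your-openai-key"}
print(missing_api_keys(["stt-deepgram", "stt-openai"], example_env))
# ['DEEPGRAM_API_KEY']
```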
Configuration File
Add to ~/.biblicus/config.yml:
openai:
  api_key: YOUR_OPENAI_KEY
deepgram:
  api_key: YOUR_DEEPGRAM_KEY
aldea:
  api_key: org_YOUR_ALDEA_KEY
Supported Audio Formats
The extractors support common audio formats (exact support varies by extractor; see the Formats entry for each one above):
MP3
MP4 (audio track)
MPEG
MPGA
M4A
WAV
WEBM
OGG
FLAC
See Also
stt-openai - OpenAI Whisper extractor details
stt-deepgram - Deepgram Nova-3 extractor details
Pipeline Utilities - Combining extraction strategies