STT Provider Benchmarking Guide
Comprehensive guide for benchmarking Speech-to-Text providers using Biblicus with real-world results.
New to benchmarking? Start with the Benchmarking Overview for a platform introduction.
Overview
Biblicus provides a robust framework for evaluating and comparing Speech-to-Text providers using standardized metrics on labeled audio datasets. This guide offers practical instructions alongside real benchmark results.
Key Features:
6 integrated STT providers (AWS, Aldea, Deepgram, OpenAI, Faster-Whisper, GPT-4o Audio)
Multiple evaluation metrics (WER, CER, precision, recall, F1)
500+ LibriSpeech samples ready for testing
Automated ground truth comparison
JSON export for detailed analysis
Benchmark Results Summary
120-Sample LibriSpeech test-clean Evaluation
Comprehensive benchmark results from 120 professionally recorded audiobook samples:
| Provider | WER | CER | F1 Score | Speed | Cost | Verdict |
|---|---|---|---|---|---|---|
| AWS Transcribe | 3.57% | 1.01% | 0.963 | ~12s/file | $$ | 🥇 Best accuracy, slower |
| Aldea | 3.60% | 1.27% | 0.962 | ~1.5s/file | $ | 🥈 Excellent balance |
| Deepgram Nova-3 | 3.76% | 1.15% | 0.962 | ~1s/file | $$ | 🥉 Fastest, great accuracy |
| OpenAI Whisper | 4.30% | 1.32% | 0.964 | ~1.5s/file | $ | ✅ Good, widely available |
| Faster-Whisper | 4.33% | 1.29% | 0.964 | Local | Free | ✅ Best for offline/free |
| GPT-4o Audio | 45.11% | 31.75% | 0.847 | ~1.3s/file | $$$ | ❌ Not suitable for STT |
Key Findings:
Top-tier providers (AWS, Aldea, Deepgram) achieve ~3.6-3.8% WER on clean speech
Differences between the top three are not statistically significant at this sample size
Deepgram offers best speed-accuracy tradeoff for production use
Faster-Whisper matches OpenAI Whisper accuracy at zero cost
GPT-4o Audio is over 10x worse - use specialized STT models instead
Recommendation by Use Case:
Production (balanced): Deepgram Nova-3 - fastest with top-tier accuracy
Production (max accuracy): AWS Transcribe - slightly better WER, slower
Cost-sensitive: Faster-Whisper large-v3 - local, free, matches paid APIs
General purpose: OpenAI Whisper - widely available, good performance
Avoid: GPT-4o Audio - not optimized for transcription
Supported STT Providers
Currently Integrated (6 providers)
1. AWS Transcribe
Extractor ID: stt-aws-transcribe
Accuracy: ⭐ 3.57% WER (Best)
Speed: Slow (~12s per file due to S3 upload)
Cost: $$ (~$1.50 per 100 files)
Requirements: AWS credentials, S3 bucket
Best for: Maximum accuracy requirements
2. Aldea
Extractor ID: stt-aldea
Accuracy: ⭐ 3.60% WER
Speed: Fast (~1.5s per file)
Cost: $ (~$0.50 per 100 files)
Requirements: Aldea API key
Best for: Excellent balance of speed, cost, accuracy
3. Deepgram Nova-3
Extractor ID: stt-deepgram
Accuracy: ⭐ 3.76% WER
Speed: Fastest (~1s per file)
Cost: $$ (~$0.80 per 100 files)
Requirements: Deepgram API key
Best for: Production workloads requiring speed
4. OpenAI Whisper (API)
Extractor ID: stt-openai
Accuracy: ✅ 4.30% WER
Speed: Fast (~1.5s per file)
Cost: $ (~$0.60 per 100 files)
Requirements: OpenAI API key
Best for: General purpose, widely available
5. Faster-Whisper (Local)
Extractor ID: stt-faster-whisper
Model: large-v3 with CTranslate2
Accuracy: ✅ 4.33% WER
Speed: Slow (local CPU/GPU processing)
Cost: Free (local inference)
Requirements: Local compute, ~3GB model download
Best for: Offline, privacy-sensitive, or cost-sensitive applications
6. GPT-4o Audio (Not Recommended)
Extractor ID: stt-openai-audio
Accuracy: ❌ 45.11% WER (Poor)
Speed: Fast (~1.3s per file)
Cost: $$$ (~$2.00 per 100 files)
Note: Multimodal model not optimized for transcription
Recommendation: Use specialized STT models instead
Quick Start
1. Download Benchmark Dataset
# Download 100 LibriSpeech test-clean samples
python scripts/download_librispeech_samples.py \
--corpus corpora/librispeech_benchmark \
--count 100
# For larger benchmarks (500 samples for statistical significance)
python scripts/download_librispeech_samples.py \
--corpus corpora/librispeech_benchmark \
--count 500 \
--force
# Download challenging test-other subset
python scripts/download_openslr_samples.py \
--corpus corpora/librispeech_test_other \
--dataset SLR12 \
--subset test-other \
--count 100
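2. Run the Benchmark
python scripts/benchmark_all_stt_providers.py \
    --corpus corpora/librispeech_benchmark \
    --output results/stt_benchmark.json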
3. View Results
cat results/stt_benchmark.json | jq '.providers[] | {name, wer: .metrics.wer.avg}'
Benchmark Dataset
LibriSpeech Test-Clean
What is LibriSpeech?
Corpus of ~1000 hours of 16kHz read English speech
Derived from LibriVox audiobook recordings
Test-clean subset: 5.4 hours of high-quality audio
~2600 utterances with ground truth transcriptions
Download:
python scripts/download_librispeech_samples.py \
--corpus corpora/librispeech_benchmark \
--count 100
This will:
Download LibriSpeech test-clean (~346 MB)
Extract FLAC audio files
Parse ground truth transcriptions from .trans.txt files
Create corpus at corpora/librispeech_benchmark/
Store ground truth in metadata/ground_truth/
Dataset structure:
corpora/librispeech_benchmark/
├── metadata/
│ ├── config.json
│ ├── catalog.json
│ └── ground_truth/
│ ├── <audio-id>.txt # Ground truth transcription
│ └── ...
├── <audio-id>.flac # Audio file
└── ...
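With this layout, audio and ground truth pair by filename stem. A short illustrative sketch using plain pathlib (not a Biblicus API):

import json
from pathlib import Path

corpus = Path("corpora/librispeech_benchmark")
truth_dir = corpus / "metadata" / "ground_truth"

# Match each <audio-id>.flac to metadata/ground_truth/<audio-id>.txt by stem.
pairs = {
    flac.stem: (truth_dir / f"{flac.stem}.txt").read_text().strip()
    for flac in sorted(corpus.glob("*.flac"))
    if (truth_dir / f"{flac.stem}.txt").exists()
}
print(f"{len(pairs)} audio files with ground truth transcriptions")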
Using Your Own Dataset
To benchmark on custom audio:
Prepare ground truth files:
corpus_dir/metadata/ground_truth/
├── <audio-id>.txt
└── ...
Ingest audio:
from pathlib import Path

from biblicus import Corpus

corpus = Corpus(Path("my_corpus"))
corpus.ingest_file("audio.flac", tags=["benchmark", "speech"])
Run evaluation:
from biblicus.evaluation.stt_benchmark import STTBenchmark

benchmark = STTBenchmark(corpus)
report = benchmark.evaluate_extraction(
    snapshot_reference="<snapshot-id>",
    ground_truth_dir=corpus.root / "metadata" / "ground_truth",
)
Available STT Providers
The benchmark script targets three API-based STT provider integrations by default:
| Provider | Model | Strengths | API Required |
|---|---|---|---|
| OpenAI Whisper | whisper-1 | General-purpose, multilingual | OPENAI_API_KEY |
| Deepgram Nova-3 | nova-3 | Fast, accurate, feature-rich | DEEPGRAM_API_KEY |
| Aldea | (default) | Custom STT service | ALDEA_API_KEY |
Understanding Metrics
Biblicus STT benchmarks use three categories of metrics:
Word Error Rate (WER)
Primary metric for STT accuracy.
WER = (Substitutions + Deletions + Insertions) / Total Words
Substitutions: Wrong word transcribed
Deletions: Word missed
Insertions: Extra word added
Lower is better (0.0 = perfect, 1.0 = all words wrong)
Interpretation:
WER < 0.05: Excellent (human-level)
WER < 0.10: Very good (production-ready)
WER < 0.20: Good (acceptable for many uses)
WER > 0.30: Poor (needs improvement)
Character Error Rate (CER)
Character-level accuracy metric.
CER = (Character Substitutions + Deletions + Insertions) / Total Characters
More granular than WER
Better for languages without clear word boundaries
Lower is better
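Both formulas are the same Levenshtein edit-distance ratio computed over different units: words for WER, characters for CER. A minimal reference sketch of the formula, independent of Biblicus's actual implementation:

def error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / len(reference), via Levenshtein DP."""
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # i deletions to reach an empty hypothesis
    for j in range(cols):
        d[0][j] = j  # j insertions from an empty reference
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[-1][-1] / len(reference)

ref, hyp = "the cat sat on the mat", "the cat sat on a mat"
print(f"WER: {error_rate(ref.split(), hyp.split()):.3f}")  # 1 sub / 6 words = 0.167
print(f"CER: {error_rate(list(ref), list(hyp)):.3f}")      # same idea at character level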
Word-Level Metrics
Precision, Recall, F1 Score (bag-of-words)
Precision: What % of transcribed words are correct
Recall: What % of actual words were transcribed
F1 Score: Harmonic mean of precision and recall
These ignore word order and focus on vocabulary accuracy.
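These metrics can be sketched with a Counter intersection, which credits repeated words only as many times as they appear in both texts (illustrative only, not necessarily how Biblicus computes them):

from collections import Counter

def word_prf(reference, hypothesis):
    ref, hyp = Counter(reference.split()), Counter(hypothesis.split())
    overlap = sum((ref & hyp).values())  # shared words, with multiplicity
    precision = overlap / max(sum(hyp.values()), 1)  # correct share of transcribed words
    recall = overlap / max(sum(ref.values()), 1)     # share of reference words recovered
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f1

p, r, f1 = word_prf("the cat sat on the mat", "the cat sat on a mat")
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")  # all 0.833: 5 of 6 words match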
For detailed explanations, see the Metrics Reference.
Running Benchmarks
Benchmark All Providers
python scripts/benchmark_all_stt_providers.py \
--corpus corpora/librispeech_benchmark \
--output results/stt_comparison.json
What it does:
Tests the three default API providers (OpenAI, Deepgram, Aldea)
Transcribes all audio files in the corpus
Calculates WER, CER, and word-level metrics
Generates comprehensive comparison report
Output:
results/stt_comparison.json - Full results with all metrics
Console table showing provider comparison
Example output:
================================================================================
COMPREHENSIVE STT PROVIDER COMPARISON
================================================================================
Provider WER CER Precision Recall F1 Status
------------------------------------------------------------------------------------------------
OpenAI Whisper 0.045 0.023 0.982 0.975 0.978 ✓ OK
Deepgram Nova-3 0.052 0.028 0.976 0.968 0.972 ✓ OK
Aldea 0.068 0.035 0.965 0.952 0.958 ✓ OK
------------------------------------------------------------------------------------------------
🏆 Best Word Accuracy (Lowest WER): OpenAI Whisper
WER: 0.045, Median: 0.042
🏆 Best Character Accuracy (Lowest CER): OpenAI Whisper
CER: 0.023, Median: 0.021
🏆 Best Word Finding (F1): OpenAI Whisper
F1: 0.978, Precision: 0.982, Recall: 0.975
Benchmark Specific Providers
# Test only OpenAI and Deepgram
python scripts/benchmark_all_stt_providers.py \
--corpus corpora/librispeech_benchmark \
--providers stt-openai stt-deepgram \
--output results/openai_vs_deepgram.json
Quick Test with Fewer Files
python scripts/benchmark_all_stt_providers.py \
--corpus corpora/librispeech_benchmark \
--limit 5 \
--output results/quick_test.json
Results Analysis
JSON Output Structure
{
"benchmark_timestamp": "2026-02-13T20:00:00",
"corpus_path": "corpora/librispeech_benchmark",
"total_providers": 3,
"successful_providers": 3,
"failed_providers": 0,
"providers": [
{
"name": "OpenAI Whisper",
"snapshot_id": "abc123...",
"success": true,
"metrics": {
"wer": {
"avg": 0.045,
"median": 0.042,
"avg_substitutions": 2.3,
"avg_deletions": 0.8,
"avg_insertions": 0.5
},
"cer": {
"avg": 0.023,
"median": 0.021
},
"word_level": {
"avg_precision": 0.982,
"avg_recall": 0.975,
"avg_f1": 0.978,
"median_f1": 0.980
}
},
"provider_configuration": { ... },
"total_audio_files": 100
}
],
"best_performers": {
"lowest_wer": "OpenAI Whisper",
"lowest_cer": "OpenAI Whisper",
"best_f1": "OpenAI Whisper"
}
}
Analyzing Results
Find best provider:
cat results/stt_comparison.json | jq '.providers | sort_by(.metrics.wer.avg) | .[0] | {name, wer: .metrics.wer.avg}'
Compare WER across providers:
cat results/stt_comparison.json | jq '.providers[] | {name, wer: .metrics.wer.avg, cer: .metrics.cer.avg}' | jq -s 'sort_by(.wer)'
Find audio files with high WER:
import json

with open("results/stt_comparison.json") as f:
    data = json.load(f)

# Per-utterance results are an implementation detail of the report format; at the
# provider level, rank by average WER to see where errors concentrate.
for p in sorted(data["providers"], key=lambda p: p["metrics"]["wer"]["avg"]):
    print(f"{p['name']}: WER {p['metrics']['wer']['avg']:.3f}")
Dependencies
Installing STT Provider Dependencies
Different providers require different dependencies:
OpenAI Whisper:
pip install "biblicus[openai]"
export OPENAI_API_KEY="sk-..."
Deepgram:
pip install "biblicus[deepgram]"
export DEEPGRAM_API_KEY="..."
Aldea:
pip install "biblicus[aldea]"
export ALDEA_API_KEY="..."
All providers:
pip install "biblicus[openai,deepgram,aldea]"
Checking Dependencies
# Test OpenAI
python -c "from openai import OpenAI; print('OpenAI OK')"
# Test Deepgram
python -c "from deepgram import DeepgramClient; print('Deepgram OK')"
# Test Aldea
python -c "import httpx; print('Aldea OK')"
Troubleshooting
Common Issues
Issue: “API key not found”
Solution: Set environment variable or configure in ~/.biblicus/config.yml:
openai:
api_key: "sk-..."
deepgram:
api_key: "..."
aldea:
api_key: "..."
Issue: “Ground truth directory not found”
Solution: Run python scripts/download_librispeech_samples.py first
Issue: “Audio format not supported”
Solution: LibriSpeech uses FLAC format. Ensure audio files are valid FLAC.
Issue: “API rate limit exceeded”
Solution:
- Reduce --limit parameter
- Add delays between API calls
- Use provider with higher rate limits
Issue: “Results don’t match expected WER”
Solution: Check:
- Correct ground truth files loaded
- Audio quality is good
- Provider configuration is appropriate
- Punctuation/formatting settings
Issue: “High API costs”
Solution:
- Start with --limit 5 or --limit 10 for testing
- Use quick.yaml config (20 files) instead of full (2600 files)
- Calculate costs: ~$0.006 per minute of audio (varies by provider)
Benchmark Modes
Use configuration files for different benchmark scales:
Quick (20 audio files, ~2-5 minutes):
# Manually specify count
python scripts/download_librispeech_samples.py --corpus corpora/librispeech_benchmark --count 20
python scripts/benchmark_all_stt_providers.py --corpus corpora/librispeech_benchmark --output results/quick.json
Standard (100 audio files, ~10-20 minutes):
python scripts/download_librispeech_samples.py --corpus corpora/librispeech_benchmark --count 100
python scripts/benchmark_all_stt_providers.py --corpus corpora/librispeech_benchmark --output results/standard.json
Full (all audio files, ~2-4 hours):
# Download full test-clean dataset
python scripts/download_librispeech_samples.py --corpus corpora/librispeech_benchmark --count 2600
python scripts/benchmark_all_stt_providers.py --corpus corpora/librispeech_benchmark --output results/full.json
Cost Estimation
STT benchmarking involves API costs:
| Provider | Cost per Hour | 20 Files (~10 min) | 100 Files (~50 min) | Full (~5.4 hours) |
|---|---|---|---|---|
| OpenAI Whisper | $0.36/hr | ~$0.06 | ~$0.30 | ~$1.94 |
| Deepgram Nova-3 | $0.36/hr | ~$0.06 | ~$0.30 | ~$1.94 |
| Aldea | (varies) | (varies) | (varies) | (varies) |
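The per-run figures follow directly from the hourly rate ($0.36/hr is $0.006 per minute of audio). A quick sketch for estimating your own runs (rates are assumptions; check current provider pricing):

def estimate_cost(audio_minutes, rate_per_hour=0.36):
    """Rough API cost for transcribing a given amount of audio."""
    return audio_minutes / 60 * rate_per_hour

# Reproduces the table above: 20 files ≈ 10 min, 100 files ≈ 50 min, full ≈ 5.4 h.
for label, minutes in [("20 files", 10), ("100 files", 50), ("full", 5.4 * 60)]:
    print(f"{label}: ~${estimate_cost(minutes):.2f}")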
Cost-saving tips:
Start with --limit 5 for development
Use the quick config (20 files) for iteration
Run full benchmarks only for final results
See Also
Benchmarking Documentation:
Benchmarking Overview - Platform introduction
Metrics Reference - Understanding WER, CER, and other metrics
OCR Benchmarking Guide - OCR pipeline benchmarking
STT Documentation:
Speech to Text - STT extraction guide
External References: