OpenAI Whisper Speech-to-Text Extractor

Extractor ID: stt-openai

Category: Speech-to-Text Extractors

Overview

The OpenAI speech-to-text extractor uses OpenAI’s Whisper API to transcribe audio files. It provides high-quality transcription with support for multiple languages, timestamps, and hallucination suppression.

Whisper is a robust, production-ready speech recognition system trained on diverse audio data. The API provides reliable transcription without requiring local model management or GPU resources.

Installation

Install the OpenAI Python client:

pip install "biblicus[openai]"

You’ll also need an OpenAI API key.

Supported Media Types

  • audio/mpeg - MP3 audio

  • audio/mp4 - M4A audio

  • audio/wav - WAV audio

  • audio/webm - WebM audio

  • audio/flac - FLAC audio

  • audio/ogg - OGG audio

  • audio/* - Any audio format supported by OpenAI

Only audio items are processed. Other media types are automatically skipped.
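
The skip logic amounts to a glob match on the item's media type. As a rough sketch (`is_audio_item` is an illustrative helper, not part of the biblicus API):

```python
from fnmatch import fnmatch

# Media types the extractor accepts; "audio/*" covers any audio format.
AUDIO_PATTERNS = ["audio/*"]

def is_audio_item(media_type: str) -> bool:
    """Return True when the item should be transcribed, False to skip it."""
    return any(fnmatch(media_type, pattern) for pattern in AUDIO_PATTERNS)

print(is_audio_item("audio/mpeg"))       # audio item: processed
print(is_audio_item("application/pdf"))  # non-audio item: skipped
```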

Configuration

Config Schema

from typing import Optional

from pydantic import BaseModel


class OpenAiSpeechToTextExtractorConfig(BaseModel):
    model: str = "whisper-1"
    response_format: str = "json"
    language: Optional[str] = None
    prompt: Optional[str] = None
    no_speech_probability_threshold: Optional[float] = None

Configuration Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| model | str | whisper-1 | OpenAI transcription model |
| response_format | str | json | Response format: json, verbose_json, text, srt, vtt |
| language | str or null | null | ISO-639-1 language code hint |
| prompt | str or null | null | Optional prompt to guide transcription style |
| no_speech_probability_threshold | float or null | null | Threshold to suppress hallucinations (requires verbose_json) |

Response Formats

  • json (default): Simple transcript text

  • verbose_json: Includes segments, timestamps, and no-speech probabilities

  • text: Plain text transcript

  • srt: SubRip subtitle format

  • vtt: WebVTT subtitle format
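
The subtitle formats can also be produced client-side from verbose_json output. A sketch assuming each segment carries start, end, and text fields (the shape Whisper's verbose_json segments use):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 3.5 -> 00:00:03,500."""
    millis = int(round(seconds * 1000))
    hours, rem = divmod(millis, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def segments_to_srt(segments: list[dict]) -> str:
    """Render verbose_json-style segments as a SubRip (SRT) document."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

print(segments_to_srt([{"start": 0.0, "end": 2.4, "text": " Hello there."}]))
```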

Usage

Command Line

Basic Usage

# Configure API key
export OPENAI_API_KEY="your-key-here"

# Extract audio transcripts
biblicus extract my-corpus --extractor stt-openai

Custom Configuration

# Transcribe with language hint
biblicus extract my-corpus --extractor stt-openai \
  --config language=es

# Use verbose format with hallucination suppression
biblicus extract my-corpus --extractor stt-openai \
  --config response_format=verbose_json \
  --config no_speech_probability_threshold=0.6

Configuration File

extractor_id: stt-openai
config:
  model: whisper-1
  response_format: json
  language: en

biblicus extract my-corpus --configuration configuration.yml

Python API

from biblicus import Corpus

# Load corpus
corpus = Corpus.from_directory("my-corpus")

# Extract with defaults
results = corpus.extract_text(extractor_id="stt-openai")

# Extract with language hint
results = corpus.extract_text(
    extractor_id="stt-openai",
    config={"language": "es"}
)

# Extract with hallucination suppression
results = corpus.extract_text(
    extractor_id="stt-openai",
    config={
        "response_format": "verbose_json",
        "no_speech_probability_threshold": 0.6
    }
)

In Pipeline

Audio Fallback

extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: stt-openai
    - extractor_id: select-text

Media Type Routing

extractor_id: select-smart-override
config:
  default_extractor: pass-through-text
  overrides:
    - media_type_pattern: "audio/.*"
      extractor: stt-openai
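
The override rule above is a regular-expression match on the media type. A minimal sketch of that routing logic (`pick_extractor` and the constants are illustrative, not biblicus APIs):

```python
import re

# Mirrors the select-smart-override configuration shown above.
DEFAULT_EXTRACTOR = "pass-through-text"
OVERRIDES = [("audio/.*", "stt-openai")]

def pick_extractor(media_type: str) -> str:
    """Return the first override whose pattern matches, else the default."""
    for pattern, extractor in OVERRIDES:
        if re.fullmatch(pattern, media_type):
            return extractor
    return DEFAULT_EXTRACTOR

print(pick_extractor("audio/mpeg"))  # -> stt-openai
print(pick_extractor("text/plain"))  # -> pass-through-text
```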

Examples

Podcast Transcription

Transcribe podcast episodes:

export OPENAI_API_KEY="your-key"
biblicus extract podcasts --extractor stt-openai

Multilingual Audio

Transcribe audio in multiple languages:

# Spanish audio
biblicus extract spanish-audio --extractor stt-openai \
  --config language=es

# French audio
biblicus extract french-audio --extractor stt-openai \
  --config language=fr

Interview Transcription

Transcribe interviews with custom prompt:

from biblicus import Corpus

corpus = Corpus.from_directory("interviews")

results = corpus.extract_text(
    extractor_id="stt-openai",
    config={
        "prompt": "This is an interview with industry experts discussing technology."
    }
)

Hallucination Suppression

Suppress hallucinated transcripts for silent audio:

biblicus extract audio-clips --extractor stt-openai \
  --config response_format=verbose_json \
  --config no_speech_probability_threshold=0.6

API Configuration

Environment Variable

export OPENAI_API_KEY="your-api-key-here"

User Config File

Add to ~/.biblicus/config.yml:

openai:
  api_key: YOUR_API_KEY_HERE

Local Config File

Add to .biblicus/config.yml in your project:

openai:
  api_key: YOUR_API_KEY_HERE

Language Support

Whisper supports 50+ languages including:

  • English (en)

  • Spanish (es)

  • French (fr)

  • German (de)

  • Italian (it)

  • Portuguese (pt)

  • Dutch (nl)

  • Russian (ru)

  • Chinese (zh)

  • Japanese (ja)

  • Korean (ko)

  • Arabic (ar)

And many more. See OpenAI documentation for the full list.

Performance

  • Speed: roughly 10× faster than real time (a 10-minute file transcribes in about 1 minute)

  • Accuracy: Excellent (state-of-the-art for many languages)

  • Cost: Per-minute API pricing (check OpenAI pricing)
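
A back-of-the-envelope helper for batch planning. The per-minute rate is a parameter because pricing changes; the 0.006 USD/min used below is only an example figure, not a quoted price:

```python
def transcription_estimate(audio_minutes: float, usd_per_minute: float,
                           speed_factor: float = 0.1) -> tuple[float, float]:
    """Estimate (processing minutes, cost in USD) for a batch of audio.

    speed_factor 0.1 means processing takes ~10% of the audio's duration.
    """
    return audio_minutes * speed_factor, audio_minutes * usd_per_minute

# Example: 10 hours of audio at an assumed rate -- check OpenAI's pricing page.
minutes, cost = transcription_estimate(600, usd_per_minute=0.006)
print(f"~{minutes:.0f} min processing, ~${cost:.2f}")  # -> ~60 min processing, ~$3.60
```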

Error Handling

Missing Dependency

If the OpenAI client is not installed:

ExtractionRunFatalError: OpenAI speech to text extractor requires an optional dependency.
Install it with pip install "biblicus[openai]".

Missing API Key

If the API key is not configured:

ExtractionRunFatalError: OpenAI speech to text extractor requires an OpenAI API key.
Set OPENAI_API_KEY or configure it in ~/.biblicus/config.yml or ./.biblicus/config.yml under openai.api_key.

Non-Audio Items

Non-audio items are silently skipped (returns None).

API Errors

API errors (rate limits, invalid audio, etc.) are recorded as per-item errors but don’t halt extraction.

Hallucination Suppression

Whisper may generate “hallucinated” transcripts for silent or noise-only audio. Use no_speech_probability_threshold to suppress these:

config:
  response_format: verbose_json
  no_speech_probability_threshold: 0.6

This requires the verbose_json response format, which includes per-segment no-speech probabilities. If any segment's no-speech probability exceeds the threshold, the entire transcript is suppressed (empty output).
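
The suppression rule can be sketched as follows. `no_speech_prob` is the field name Whisper's verbose_json segments use, and the all-or-nothing behavior mirrors the description above; the extractor's actual implementation may differ:

```python
def suppress_hallucinations(segments: list[dict], threshold: float) -> str:
    """Join segment text, or return an empty transcript when any segment's
    no-speech probability exceeds the threshold (all-or-nothing)."""
    if any(seg["no_speech_prob"] > threshold for seg in segments):
        return ""
    return "".join(seg["text"] for seg in segments).strip()

speech = [{"text": "Hello world.", "no_speech_prob": 0.05}]
silence = [{"text": "Thanks for watching!", "no_speech_prob": 0.93}]  # classic hallucination
print(suppress_hallucinations(speech, 0.6))   # -> Hello world.
print(suppress_hallucinations(silence, 0.6))  # -> (empty string)
```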

Prompt Guidance

The optional prompt parameter guides transcription style:

config:
  prompt: "This is a technical podcast about machine learning and AI."

Prompts can:

  • Provide context about the audio

  • Specify terminology or proper nouns

  • Guide formatting preferences

  • Improve accuracy for domain-specific content

Use Cases

Podcast Archives

Transcribe podcast episodes for search:

biblicus extract podcasts --extractor stt-openai

Meeting Recordings

Create searchable meeting transcripts:

biblicus extract meetings --extractor stt-openai

Lecture Capture

Transcribe educational content:

biblicus extract lectures --extractor stt-openai \
  --config language=en

Multilingual Content

Process audio in multiple languages:

from biblicus import Corpus

# Let Whisper auto-detect language
corpus = Corpus.from_directory("multilingual-audio")
results = corpus.extract_text(extractor_id="stt-openai")

When to Use OpenAI vs Deepgram

Use OpenAI Whisper when:

  • You need excellent multilingual support

  • Audio quality varies

  • You want state-of-the-art accuracy

  • Cost is acceptable

Use Deepgram when:

  • You need faster processing

  • Speaker diarization is required

  • Real-time transcription is needed

  • You want a lower word error rate for English

Comparison

| Feature | OpenAI Whisper | Deepgram |
| --- | --- | --- |
| Languages | 50+ | 30+ |
| Speed | Moderate | Fast |
| Accuracy | Excellent | Excellent |
| Diarization | No | Yes |
| Formatting | Basic | Advanced |

Best Practices

Provide Language Hints

When you know the language, specify it:

config:
  language: es  # Spanish

Use Prompts for Context

Guide transcription with relevant context:

config:
  prompt: "Interview with Dr. Smith about quantum computing."

Monitor API Usage

Track API costs and usage:

# Check number of items processed
print(f"Processed items: {results.stats.processed_items}")

Suppress Hallucinations

For mixed content (speech + silence), enable suppression:

config:
  response_format: verbose_json
  no_speech_probability_threshold: 0.6

See Also