# OpenAI Whisper Speech-to-Text Extractor

**Extractor ID:** `stt-openai`

**Category:** [Speech-to-Text Extractors](index.md)

## Overview

The OpenAI speech-to-text extractor uses OpenAI's Whisper API to transcribe audio files. It provides high-quality transcription with support for multiple languages, timestamps, and hallucination suppression.

Whisper is a robust, production-ready speech recognition system trained on diverse audio data. The API provides reliable transcription without requiring local model management or GPU resources.

## Installation

Install Biblicus with the `openai` extra, which pulls in the OpenAI Python client:

```bash
pip install "biblicus[openai]"
```

You'll also need an OpenAI API key; see the API Configuration section below for where to set it.

## Supported Media Types

- `audio/mpeg` - MP3 audio
- `audio/mp4` - M4A audio
- `audio/wav` - WAV audio
- `audio/webm` - WebM audio
- `audio/flac` - FLAC audio
- `audio/ogg` - OGG audio
- `audio/*` - Any audio format supported by OpenAI

Only audio items are processed. Other media types are automatically skipped.

## Configuration

### Config Schema

```python
class OpenAiSpeechToTextExtractorConfig(BaseModel):
    model: str = "whisper-1"
    response_format: str = "json"
    language: Optional[str] = None
    prompt: Optional[str] = None
    no_speech_probability_threshold: Optional[float] = None
```

### Configuration Options

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `model` | str | `whisper-1` | OpenAI transcription model |
| `response_format` | str | `json` | Response format: `json`, `verbose_json`, `text`, `srt`, `vtt` |
| `language` | str or null | `null` | ISO-639-1 language code hint |
| `prompt` | str or null | `null` | Optional prompt to guide transcription style |
| `no_speech_probability_threshold` | float or null | `null` | Threshold to suppress hallucinations (requires `verbose_json`) |
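
The dependency between `no_speech_probability_threshold` and `verbose_json` can be checked before any API call is made. The sketch below is illustrative only: it uses a standard-library dataclass instead of the Pydantic model, and `SttConfigSketch` and `validate_stt_config` are hypothetical names, not part of Biblicus. The 0 to 1 range check is also an assumption, based on the threshold being compared against a probability.

```python
from dataclasses import dataclass
from typing import List, Optional

# Response formats listed in the options table.
ALLOWED_FORMATS = {"json", "verbose_json", "text", "srt", "vtt"}


@dataclass
class SttConfigSketch:
    """Illustrative stand-in for the extractor config (not the real class)."""

    model: str = "whisper-1"
    response_format: str = "json"
    language: Optional[str] = None
    prompt: Optional[str] = None
    no_speech_probability_threshold: Optional[float] = None


def validate_stt_config(config: SttConfigSketch) -> List[str]:
    """Return a list of problems; an empty list means the config looks usable."""
    problems: List[str] = []
    if config.response_format not in ALLOWED_FORMATS:
        problems.append(f"unknown response_format: {config.response_format!r}")
    threshold = config.no_speech_probability_threshold
    if threshold is not None:
        # Per-segment no-speech probabilities only exist in verbose_json output.
        if config.response_format != "verbose_json":
            problems.append(
                "no_speech_probability_threshold requires response_format='verbose_json'"
            )
        # Assumption: the threshold is a probability, so it should lie in [0, 1].
        if not 0.0 <= threshold <= 1.0:
            problems.append("no_speech_probability_threshold must be between 0 and 1")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every misconfiguration at once.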

### Response Formats

- **json** (default): Simple transcript text
- **verbose_json**: Includes segments, timestamps, and no-speech probabilities
- **text**: Plain text transcript
- **srt**: SubRip subtitle format
- **vtt**: WebVTT subtitle format
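
To illustrate how these formats differ in shape, the sketch below pulls the transcript text out of already-parsed responses. The field names (`text`, `segments`, `no_speech_prob`) follow OpenAI's documented `json` and `verbose_json` responses; the sample values are made up, and `transcript_text` is a hypothetical helper. How Biblicus stores each format internally is not covered here.

```python
from typing import Union


def transcript_text(payload: Union[str, dict]) -> str:
    """Extract transcript text from a parsed transcription response.

    `text`, `srt`, and `vtt` responses arrive as plain strings (subtitle
    formats keep their cue numbers and timestamps). `json` and
    `verbose_json` responses are objects with a top-level `text` field.
    """
    if isinstance(payload, str):
        return payload.strip()
    return str(payload.get("text", "")).strip()


# Made-up sample payloads in the two object shapes.
json_response = {"text": "Hello world."}
verbose_response = {
    "language": "english",
    "duration": 2.5,
    "text": "Hello world.",
    "segments": [
        {"start": 0.0, "end": 2.5, "text": " Hello world.", "no_speech_prob": 0.02}
    ],
}
```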

## Usage

### Command Line

#### Basic Usage

```bash
# Configure API key
export OPENAI_API_KEY="your-key-here"

# Extract audio transcripts
biblicus extract my-corpus --extractor stt-openai
```

#### Custom Configuration

```bash
# Transcribe with language hint
biblicus extract my-corpus --extractor stt-openai \
  --config language=es

# Use verbose format with hallucination suppression
biblicus extract my-corpus --extractor stt-openai \
  --config response_format=verbose_json \
  --config no_speech_probability_threshold=0.6
```

#### Configuration File

Save the following as `configuration.yml`, then pass it to `biblicus extract`:

```yaml
extractor_id: stt-openai
config:
  model: whisper-1
  response_format: json
  language: en
```

```bash
biblicus extract my-corpus --configuration configuration.yml
```

### Python API

```python
from biblicus import Corpus

# Load corpus
corpus = Corpus.from_directory("my-corpus")

# Extract with defaults
results = corpus.extract_text(extractor_id="stt-openai")

# Extract with language hint
results = corpus.extract_text(
    extractor_id="stt-openai",
    config={"language": "es"}
)

# Extract with hallucination suppression
results = corpus.extract_text(
    extractor_id="stt-openai",
    config={
        "response_format": "verbose_json",
        "no_speech_probability_threshold": 0.6
    }
)
```

### In Pipeline

#### Audio Fallback

```yaml
extractor_id: pipeline
config:
  stages:
    - extractor_id: pass-through-text
    - extractor_id: stt-openai
    - extractor_id: select-text
```
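
In this pipeline, `select-text` is the stage that picks the first non-empty output from the earlier stages, so an audio item with no pass-through text would fall back to its transcript. A minimal sketch of that selection rule follows; `select_first_non_empty` is a hypothetical stand-in for the real stage, and treating whitespace-only text as empty is an assumption.

```python
from typing import Optional, Sequence


def select_first_non_empty(outputs: Sequence[Optional[str]]) -> Optional[str]:
    """Return the first stage output that contains non-whitespace text, else None."""
    for text in outputs:
        # Assumption: None and whitespace-only strings both count as "empty".
        if text is not None and text.strip():
            return text
    return None
```

For an audio item, the outputs might look like `[None, "transcript"]`, and the transcript is selected; for a text item, the pass-through text comes first and wins.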

#### Media Type Routing

```yaml
extractor_id: select-smart-override
config:
  default_extractor: pass-through-text
  overrides:
    - media_type_pattern: "audio/.*"
      extractor: stt-openai
```
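
The routing above can be read as: try each override pattern against the item's media type and fall back to the default extractor when none match. The sketch below assumes patterns are regular expressions matched against the whole media type and that the first matching override wins; both are assumptions, and `route_extractor` is a hypothetical helper rather than the `select-smart-override` implementation.

```python
import re
from typing import Sequence, Tuple


def route_extractor(
    media_type: str,
    default_extractor: str,
    overrides: Sequence[Tuple[str, str]],
) -> str:
    """Pick an extractor ID for a media type.

    `overrides` is a sequence of (media_type_pattern, extractor_id) pairs.
    """
    for pattern, extractor_id in overrides:
        # Assumption: the pattern must match the full media type string.
        if re.fullmatch(pattern, media_type):
            return extractor_id
    return default_extractor


overrides = [("audio/.*", "stt-openai")]
audio_choice = route_extractor("audio/mpeg", "pass-through-text", overrides)
text_choice = route_extractor("text/plain", "pass-through-text", overrides)
```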

## Examples

### Podcast Transcription

Transcribe podcast episodes:

```bash
export OPENAI_API_KEY="your-key"
biblicus extract podcasts --extractor stt-openai
```

### Multilingual Audio

Transcribe audio in multiple languages:

```bash
# Spanish audio
biblicus extract spanish-audio --extractor stt-openai \
  --config language=es

# French audio
biblicus extract french-audio --extractor stt-openai \
  --config language=fr
```

### Interview Transcription

Transcribe interviews with a custom prompt:

```python
from biblicus import Corpus

corpus = Corpus.from_directory("interviews")

results = corpus.extract_text(
    extractor_id="stt-openai",
    config={
        "prompt": "This is an interview with industry experts discussing technology."
    }
)
```

### Hallucination Suppression

Suppress hallucinated transcripts for silent audio:

```bash
biblicus extract audio-clips --extractor stt-openai \
  --config response_format=verbose_json \
  --config no_speech_probability_threshold=0.6
```

## API Configuration

### Environment Variable

```bash
export OPENAI_API_KEY="your-api-key-here"
```

### User Config File

Add to `~/.biblicus/config.yml`:

```yaml
openai:
  api_key: YOUR_API_KEY_HERE
```

### Local Config File

Add to `.biblicus/config.yml` in your project:

```yaml
openai:
  api_key: YOUR_API_KEY_HERE
```
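
With three possible sources, a lookup order has to be chosen. The sketch below assumes the environment variable wins, followed by the local config and then the user config. That precedence is an assumption for illustration, not something this page specifies, and `resolve_api_key` is a hypothetical helper. It takes already-parsed config mappings so that it needs no YAML library.

```python
from typing import Any, Mapping, Optional


def resolve_api_key(
    environ: Mapping[str, str],
    local_config: Mapping[str, Any],
    user_config: Mapping[str, Any],
) -> Optional[str]:
    """Return the first API key found, or None if no source provides one."""
    # Assumed order: environment variable, then local config, then user config.
    key = environ.get("OPENAI_API_KEY")
    if key:
        return key
    for config in (local_config, user_config):
        section = config.get("openai") or {}
        key = section.get("api_key")
        if key:
            return key
    return None
```

In real use, `environ` would be `os.environ` and the two mappings would come from parsing `./.biblicus/config.yml` and `~/.biblicus/config.yml`.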

## Language Support

Whisper supports 50+ languages including:

- English (`en`)
- Spanish (`es`)
- French (`fr`)
- German (`de`)
- Italian (`it`)
- Portuguese (`pt`)
- Dutch (`nl`)
- Russian (`ru`)
- Chinese (`zh`)
- Japanese (`ja`)
- Korean (`ko`)
- Arabic (`ar`)

And many more. See the OpenAI documentation for the full list.
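
Since the `language` option expects an ISO-639-1 code, a quick shape check can catch mistakes such as three-letter codes or language names before a request is sent. This sketch only checks the two-letter lowercase form; it does not confirm that Whisper supports the language, and `looks_like_iso_639_1` is a hypothetical helper.

```python
import re


def looks_like_iso_639_1(code: str) -> bool:
    """Return True if `code` has the two-letter lowercase shape of an ISO-639-1 code."""
    return re.fullmatch(r"[a-z]{2}", code) is not None
```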

## Performance

- **Speed**: Processing takes roughly 0.1x the audio duration, or about 10x faster than real time (10 minutes of audio in about 1 minute)
- **Accuracy**: Excellent (state-of-the-art for many languages)
- **Cost**: Per-minute API pricing (check OpenAI pricing)
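
For planning a large run, the rough speed figure above and per-minute pricing reduce to simple arithmetic. The sketch below is a back-of-the-envelope estimate: the default `speed_factor` of 0.1 comes from the approximate figure in this section, and `price_per_minute` is a parameter you must fill in from OpenAI's current pricing. Both function names are hypothetical.

```python
def estimate_processing_minutes(audio_minutes: float, speed_factor: float = 0.1) -> float:
    """Estimate wall-clock minutes, assuming time scales linearly with audio length."""
    return audio_minutes * speed_factor


def estimate_cost(audio_minutes: float, price_per_minute: float) -> float:
    """Estimate cost in the currency of `price_per_minute` (look up the real rate)."""
    return audio_minutes * price_per_minute
```

Actual throughput also depends on network conditions, file size, and rate limits, so treat the result as an order-of-magnitude guide.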

## Error Handling

### Missing Dependency

If the OpenAI client is not installed:

```
ExtractionRunFatalError: OpenAI speech to text extractor requires an optional dependency.
Install it with pip install "biblicus[openai]".
```

### Missing API Key

If the API key is not configured:

```
ExtractionRunFatalError: OpenAI speech to text extractor requires an OpenAI API key.
Set OPENAI_API_KEY or configure it in ~/.biblicus/config.yml or ./.biblicus/config.yml under openai.api_key.
```

### Non-Audio Items

Non-audio items are skipped silently: the extractor returns `None` for them and raises no error.

### API Errors

API errors (rate limits, invalid audio, etc.) are recorded as per-item errors but don't halt extraction.
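
The general pattern behind per-item error recording is to catch failures inside the loop and keep going. The sketch below shows that pattern in isolation; `transcribe_all` and its `transcribe` callback are hypothetical, and the way Biblicus actually stores per-item errors is not shown here.

```python
from typing import Callable, Dict, Iterable, Tuple


def transcribe_all(
    item_ids: Iterable[str],
    transcribe: Callable[[str], str],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Transcribe each item, collecting failures instead of stopping the run."""
    texts: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    for item_id in item_ids:
        try:
            texts[item_id] = transcribe(item_id)
        except Exception as exc:  # broad on purpose: one bad item must not halt the run
            errors[item_id] = f"{type(exc).__name__}: {exc}"
    return texts, errors
```

After a run, the `errors` mapping identifies which items to retry, for example after a rate limit clears.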

## Hallucination Suppression

Whisper may generate "hallucinated" transcripts for silent or noise-only audio. Use `no_speech_probability_threshold` to suppress these:

```yaml
config:
  response_format: verbose_json
  no_speech_probability_threshold: 0.6
```

This requires the `verbose_json` format, which includes per-segment no-speech probabilities. If any segment's no-speech probability exceeds the threshold, the entire transcript is suppressed (empty output).
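
The suppression rule described above can be sketched in a few lines. The `no_speech_prob` field name follows OpenAI's `verbose_json` segment objects; `suppress_if_no_speech` is a hypothetical helper written for illustration, not the extractor's actual code, and treating a missing field as 0.0 is an assumption.

```python
from typing import Any, Mapping, Optional, Sequence


def suppress_if_no_speech(
    text: str,
    segments: Sequence[Mapping[str, Any]],
    threshold: Optional[float],
) -> str:
    """Return `text`, or an empty string if any segment looks like non-speech."""
    if threshold is None:
        return text  # suppression disabled
    for segment in segments:
        # Assumption: a segment without the field is treated as confident speech.
        if segment.get("no_speech_prob", 0.0) > threshold:
            return ""
    return text


# Made-up segments: one confident speech segment, one likely-silent segment.
segments = [{"no_speech_prob": 0.02}, {"no_speech_prob": 0.85}]
strict = suppress_if_no_speech("hello", segments, 0.6)
lenient = suppress_if_no_speech("hello", segments, 0.9)
```

Here `strict` is empty because 0.85 exceeds 0.6, while `lenient` keeps the text because no segment exceeds 0.9. Lowering the threshold therefore suppresses more transcripts.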

### Recommended Threshold

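Because a transcript is suppressed when any segment's no-speech probability exceeds the threshold, lower values suppress more:

- **0.5-0.6**: Strict (suppresses most likely hallucinations; keeps only confident speech)
- **0.7-0.8**: Moderate (suppresses obvious hallucinations)
- **0.9+**: Lenient (suppresses only transcripts with a near-certain silent segment)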

## Prompt Guidance

The optional `prompt` parameter guides transcription style:

```yaml
config:
  prompt: "This is a technical podcast about machine learning and AI."
```

Prompts can:
- Provide context about the audio
- Specify terminology or proper nouns
- Guide formatting preferences
- Improve accuracy for domain-specific content
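
One way to combine context with terminology is to assemble the prompt from a short description plus a list of terms. The sketch below is one possible convention; `build_prompt` and the "Terms that may appear" phrasing are hypothetical, not a format that Whisper or Biblicus requires.

```python
from typing import Iterable


def build_prompt(context: str, terms: Iterable[str] = ()) -> str:
    """Join a context sentence and an optional glossary into one prompt string."""
    parts = [context.strip()]
    cleaned = [term.strip() for term in terms if term.strip()]
    if cleaned:
        # Listing correct spellings gives the model a hint for proper nouns.
        parts.append("Terms that may appear: " + ", ".join(cleaned) + ".")
    return " ".join(part for part in parts if part)


prompt = build_prompt(
    "This is a technical podcast about machine learning and AI.",
    ["PyTorch", "LoRA"],
)
```

The resulting string can be passed as the `prompt` config value. Keep prompts short, since the API only considers a limited amount of prompt text.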

## Use Cases

### Podcast Archives

Transcribe podcast episodes for search:

```bash
biblicus extract podcasts --extractor stt-openai
```

### Meeting Recordings

Create searchable meeting transcripts:

```bash
biblicus extract meetings --extractor stt-openai
```

### Lecture Capture

Transcribe educational content:

```bash
biblicus extract lectures --extractor stt-openai \
  --config language=en
```

### Multilingual Content

Process audio in multiple languages:

```python
from biblicus import Corpus

# Let Whisper auto-detect language
corpus = Corpus.from_directory("multilingual-audio")
results = corpus.extract_text(extractor_id="stt-openai")
```

## When to Use OpenAI vs Deepgram

### Use OpenAI Whisper when:
- You need excellent multilingual support
- Audio quality varies
- You want state-of-the-art accuracy
- Cost is acceptable

### Use Deepgram when:
- You need faster processing
- Speaker diarization is required
- Real-time transcription is needed
- You need a lower word error rate for English

### Comparison

| Feature | OpenAI Whisper | Deepgram |
|---------|---------------|----------|
| Languages | 50+ | 30+ |
| Speed | Moderate | Fast |
| Accuracy | Excellent | Excellent |
| Diarization | No | Yes |
| Formatting | Basic | Advanced |

## Best Practices

### Provide Language Hints

When you know the language, specify it:

```yaml
config:
  language: es  # Spanish
```

### Use Prompts for Context

Guide transcription with relevant context:

```yaml
config:
  prompt: "Interview with Dr. Smith about quantum computing."
```

### Monitor API Usage

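The processed item count gives a rough proxy for API usage and cost (`results` comes from an earlier `corpus.extract_text(...)` call):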

```python
# Check number of items processed
print(f"Processed items: {results.stats.processed_items}")
```

### Suppress Hallucinations

For mixed content (speech + silence), enable suppression:

```yaml
config:
  response_format: verbose_json
  no_speech_probability_threshold: 0.6
```

## Related Extractors

### Same Category

- [stt-deepgram](deepgram.md) - Deepgram speech-to-text

### Alternatives

- [stt-deepgram](deepgram.md) - Faster, includes diarization
- [pass-through-text](../text-document/pass-through.md) - Direct text files
- [metadata-text](../text-document/metadata.md) - Metadata-based text

### Pipeline Utilities

- [select-text](../pipeline-utilities/select-text.md) - First non-empty selection
- [select-longest-text](../pipeline-utilities/select-longest.md) - Choose longest output
- [select-smart-override](../pipeline-utilities/select-smart-override.md) - Media type routing
- [pipeline](../pipeline-utilities/pipeline.md) - Multi-step extraction

## See Also

- [Speech-to-Text Extractors Overview](index.md)
- [Extractors Index](../index.md)
- [extraction.md](../../extraction.md) - Extraction pipeline concepts
- [User Configuration](../../user-configuration.md)
- [OpenAI Whisper API Documentation](https://platform.openai.com/docs/guides/speech-to-text)