
Fix: OpenAI Whisper Not Working — FFmpeg Missing, GPU Slow, and Language Detection Errors

FixDevs ·

Quick Answer

How to fix common Whisper errors: FFmpeg not found (audio fails to load), CUDA out of memory on the large model, slow CPU transcription, incorrect language detection, hallucinations during silence, migrating to faster-whisper, and improving timestamp accuracy.

The Error

You install Whisper and try to transcribe — Python crashes immediately:

FileNotFoundError: [Errno 2] No such file or directory: 'ffmpeg'
RuntimeError: Failed to load audio: ffmpeg returned exit code 1

Or you load the large model and get CUDA out of memory:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.95 GiB

Or transcription works but takes ten times the audio's duration on CPU:

# 5 minute audio file → 50 minutes to transcribe
result = model.transcribe("podcast.mp3")

Or Whisper detects the wrong language and produces garbage:

# Japanese audio
result = model.transcribe("japanese.mp3")
print(result['language'])   # 'en' — wrong!
print(result['text'])       # Random English words

Or the model hallucinates text during silence:

[00:23] Hello and welcome to the show
[00:45] Thank you for watching
[01:02] Please subscribe to our channel
# But the actual audio was 30 seconds of silence

Whisper is OpenAI’s open-source transcription model — extremely accurate but resource-hungry. The Python package wraps PyTorch, requires FFmpeg for audio decoding, and has several known issues (hallucinations, language detection on short clips). This guide covers the most common failures.

Why This Happens

Whisper itself is a PyTorch model — but it doesn’t decode audio. Instead, it shells out to ffmpeg to convert any audio format into a 16kHz mono float array. If FFmpeg isn’t installed or isn’t on the PATH, every transcription fails.
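To see why a missing binary is fatal, it helps to know roughly what Whisper asks FFmpeg to do. The sketch below approximates the internal invocation; the flag set is assumed from Whisper's audio loader, and `whisper_ffmpeg_cmd` is this article's own helper name, not part of the package:

```python
def whisper_ffmpeg_cmd(path, sr=16000):
    # Decode any container/codec to raw 16-bit PCM, mono, 16kHz, on stdout.
    # Roughly what whisper.audio.load_audio asks ffmpeg to do.
    return [
        "ffmpeg", "-nostdin",
        "-i", path,
        "-f", "s16le",            # raw signed 16-bit little-endian samples
        "-ac", "1",               # mono
        "-acodec", "pcm_s16le",
        "-ar", str(sr),           # 16kHz sample rate
        "-",                      # write to stdout
    ]

print(" ".join(whisper_ffmpeg_cmd("audio.mp3")))
```

If this subprocess can't be spawned, every `transcribe()` call fails before the model ever runs.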

The model sizes vary dramatically: tiny is 39M parameters (works on CPU), large-v3 is 1.55B parameters (requires 10GB+ VRAM). Loading the wrong size for your hardware is the most common performance issue. Whisper also has known hallucination patterns on silence and low-quality audio — short repetitive phrases like “Thank you for watching” appear because they’re common in the training data.

Fix 1: FFmpeg Not Found

FileNotFoundError: [Errno 2] No such file or directory: 'ffmpeg'

Whisper requires FFmpeg as a separate system binary. pip install openai-whisper doesn’t install it.

macOS:

brew install ffmpeg

Ubuntu/Debian:

sudo apt update && sudo apt install ffmpeg

Windows:

# Via Chocolatey
choco install ffmpeg

# Or via winget
winget install Gyan.FFmpeg

# Or download manually from ffmpeg.org and add to PATH

Verify FFmpeg is on PATH:

ffmpeg -version
# ffmpeg version 6.1 ...
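You can also fail fast from Python before loading any model, using the standard library's `shutil.which`. The `ffmpeg_available` wrapper is our own convenience, not a Whisper API:

```python
import shutil

def ffmpeg_available():
    """True if an ffmpeg binary is discoverable on the current PATH."""
    return shutil.which("ffmpeg") is not None

if not ffmpeg_available():
    print("ffmpeg not found on PATH; install it before calling transcribe()")
```

This gives a clear error message up front instead of a `FileNotFoundError` buried in a stack trace.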

For Docker images, install FFmpeg in the Dockerfile:

FROM python:3.12-slim

RUN apt-get update && apt-get install -y \
    ffmpeg \
    && rm -rf /var/lib/apt/lists/*

RUN pip install openai-whisper

Alternative: bypass FFmpeg with pre-decoded audio:

import whisper
import numpy as np
import soundfile as sf

# Read audio with soundfile (no FFmpeg needed for WAV/FLAC)
audio, sr = sf.read("audio.wav")

# Convert to Whisper's expected format: 16kHz mono float32
if audio.ndim > 1:
    audio = audio.mean(axis=1)   # Stereo to mono first — soundfile returns
                                 # (frames, channels), but librosa expects
                                 # time on the last axis
if sr != 16000:
    import librosa
    audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)
audio = audio.astype(np.float32)

model = whisper.load_model("base")
result = model.transcribe(audio)

Fix 2: Choosing the Right Model Size

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.95 GiB

You’re loading a model that’s too large for your hardware.

Whisper model sizes:

Model     Parameters  VRAM    Speed (rel. to large)  Use case
tiny      39M         ~1GB    32x                    Quick drafts, English-only
base      74M         ~1GB    16x                    Lightweight transcription
small     244M        ~2GB    6x                     Good balance
medium    769M        ~5GB    2x                     High quality
large-v3  1550M       ~10GB   1x                     Best quality
turbo     809M        ~6GB    8x                     Near-large quality at high speed

English-only models (.en suffix) are smaller and faster for English audio:

import whisper

# English-only — faster than the multilingual variant of the same size
model = whisper.load_model("base.en")     # 74M params, English only
model = whisper.load_model("small.en")    # 244M params, English only

Pro Tip: For English transcription, always prefer the .en variants (tiny.en, base.en, small.en, medium.en). They’re trained exclusively on English and produce slightly better results than the multilingual model at the same size — and load faster. The largest models (large-v3, turbo) don’t have .en variants because their multilingual training doesn’t hurt English performance.
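The table above can be encoded as a small helper for scripts that run on varied hardware. The function name and VRAM cutoffs are this article's convention, not part of the whisper package:

```python
def pick_whisper_model(vram_gb=None, english_only=False):
    """Suggest a Whisper model name for the available VRAM (None = CPU-only).

    Cutoffs mirror the VRAM column of the table above; thresholds and the
    helper name are this article's own convention.
    """
    if vram_gb is None or vram_gb < 1:
        size = "tiny"
    elif vram_gb < 2:
        size = "base"
    elif vram_gb < 5:
        size = "small"
    elif vram_gb < 10:
        size = "medium"
    else:
        size = "large-v3"
    if english_only and size != "large-v3":
        size += ".en"           # .en variants exist for tiny through medium
    return size

print(pick_whisper_model(8, english_only=True))   # medium.en
```

Pass the result straight to `whisper.load_model(...)`.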

Force CPU when GPU is too small:

import whisper

# Load on CPU explicitly
model = whisper.load_model("medium", device="cpu")

# Or load on a specific GPU
model = whisper.load_model("large-v3", device="cuda:1")

Use FP16 to halve VRAM (it is already the default on GPU; on CPU, Whisper warns and falls back to FP32):

import whisper

model = whisper.load_model("large-v3")
result = model.transcribe(
    "audio.mp3",
    fp16=True,        # Half precision — saves memory, default on CUDA
)

Fix 3: Slow CPU Transcription — Use faster-whisper

The official openai-whisper package is slow on CPU. The community-built faster-whisper reimplements it on CTranslate2 and is up to 4x faster with lower memory use:

pip install faster-whisper
from faster_whisper import WhisperModel

# Faster CPU and GPU transcription
model = WhisperModel("large-v3", device="cpu", compute_type="int8")

segments, info = model.transcribe("audio.mp3", beam_size=5)

print(f"Language: {info.language} (probability: {info.language_probability})")

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")

compute_type options:

Type          Memory             Speed           Quality
float16       Low GPU memory     Fastest on GPU  Same as FP32
int8_float16  Lower GPU memory   Fast on GPU     Slightly degraded
int8          Lowest CPU memory  Fastest on CPU  Slightly degraded
float32       Highest            Slowest         Reference

int8 quantization typically loses <1% accuracy but runs 2–4x faster on CPU:

model = WhisperModel("large-v3", device="cpu", compute_type="int8")

For GPU, use float16 or int8_float16:

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

Streaming transcription to process long files without loading everything:

from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu", compute_type="int8")

# segments is a generator — yields as it processes
segments, info = model.transcribe("long_audio.mp3", beam_size=1)

for segment in segments:
    # Process each segment as it's transcribed
    print(f"[{segment.start:.2f}s] {segment.text}", flush=True)

For PyTorch GPU memory issues that affect Whisper model loading, see PyTorch not working.

Fix 4: Wrong Language Detected

result = model.transcribe("japanese.mp3")
print(result['language'])   # 'en' — wrong!

Whisper detects the language from the first 30 seconds of audio. If those seconds contain music, silence, or an English-language intro, detection fails.

Force the correct language:

import whisper

model = whisper.load_model("base")

result = model.transcribe(
    "japanese.mp3",
    language="ja",   # Force Japanese
)

Common language codes:

Code  Language
en    English
ja    Japanese
zh    Chinese
es    Spanish
fr    French
de    German
ko    Korean
pt    Portuguese
ru    Russian
ar    Arabic

Detect language from the entire audio (more reliable than the default 30s):

import whisper
import numpy as np

model = whisper.load_model("base")
audio = whisper.load_audio("audio.mp3")

# Sample multiple chunks and detect language for each
chunk_size = 30 * whisper.audio.SAMPLE_RATE   # 30 second chunks
detected_languages = []

for start in range(0, len(audio), chunk_size):
    chunk = audio[start:start + chunk_size]
    if len(chunk) < whisper.audio.SAMPLE_RATE * 5:   # Skip very short chunks
        continue
    chunk = whisper.pad_or_trim(chunk)
    mel = whisper.log_mel_spectrogram(chunk).to(model.device)
    _, probs = model.detect_language(mel)
    lang = max(probs, key=probs.get)
    detected_languages.append(lang)

# Use the most common detected language
from collections import Counter
most_common = Counter(detected_languages).most_common(1)[0][0]
print(f"Most common language: {most_common}")

Fix 5: Hallucinations on Silence and Low-Quality Audio

Whisper is known to hallucinate during silence. Common phrases that appear in long silences:

  • “Thank you for watching”
  • “Subscribe to my channel”
  • “Music”
  • The same phrase repeated

This is a known training data artifact. Mitigation strategies:

Use Voice Activity Detection (VAD) to skip silence:

pip install faster-whisper
from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu", compute_type="int8")

segments, info = model.transcribe(
    "audio.mp3",
    vad_filter=True,                         # Enable VAD
    vad_parameters=dict(
        min_silence_duration_ms=500,         # Skip silences > 500ms
        speech_pad_ms=400,                    # Add 400ms padding around speech
    ),
)

for segment in segments:
    print(segment.text)

VAD pre-filters silence so Whisper never sees it — eliminates most hallucinations.
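If you are on plain openai-whisper and cannot use faster-whisper's VAD, a crude NumPy energy gate can approximate the same idea. This is a sketch, not a real VAD; the frame size and energy floor are illustrative values you would tune per recording:

```python
import numpy as np

def drop_silent_frames(audio, sr=16000, frame_ms=30, energy_floor=1e-4):
    """Crude energy gate: split audio into short frames and drop frames
    whose mean-square energy falls below the floor. A rough stand-in for
    real VAD; the defaults here are illustrative, not tuned."""
    frame = int(sr * frame_ms / 1000)
    n = len(audio) // frame
    frames = audio[: n * frame].reshape(n, frame)
    energy = (frames ** 2).mean(axis=1)
    return frames[energy > energy_floor].reshape(-1)
```

Feed the returned array directly to `model.transcribe(...)`; timestamps will no longer match the original file, so use this only when you need text, not alignment.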

Reduce hallucinations with sampling parameters:

import whisper

model = whisper.load_model("base")

result = model.transcribe(
    "audio.mp3",
    temperature=0,                           # Deterministic (no sampling)
    no_speech_threshold=0.6,                 # No-speech probability above which a segment counts as silence
    logprob_threshold=-1.0,                  # Skip segments below this logprob
    compression_ratio_threshold=2.4,         # Skip segments with high repetition
    condition_on_previous_text=False,        # Don't repeat from prior context
)

Common Mistake: Leaving condition_on_previous_text=True (the default) on long audio. When Whisper hallucinates a phrase like “Thank you for watching”, that text becomes context for the next segment — causing repeated hallucinations to compound. Setting condition_on_previous_text=False breaks this loop and dramatically reduces compounding hallucinations on long audio.

Fix 6: Timestamps and Word-Level Accuracy

The default transcribe() returns segment-level timestamps (typically every 5–10 seconds). For word-level timestamps, enable the option:

import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3", word_timestamps=True)

for segment in result['segments']:
    for word in segment['words']:
        print(f"[{word['start']:.2f}s -> {word['end']:.2f}s] {word['word']}")

With faster-whisper:

from faster_whisper import WhisperModel

model = WhisperModel("base")
segments, _ = model.transcribe("audio.mp3", word_timestamps=True)

for segment in segments:
    for word in segment.words:
        print(f"[{word.start:.2f}s -> {word.end:.2f}s] {word.word}")

Generate SRT subtitles:

import whisper

def format_timestamp(seconds: float) -> str:
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    millis = int((secs % 1) * 1000)
    return f"{int(hours):02d}:{int(minutes):02d}:{int(secs):02d},{millis:03d}"

model = whisper.load_model("base")
result = model.transcribe("video.mp4")

with open("subtitles.srt", "w", encoding="utf-8") as f:
    for i, segment in enumerate(result['segments'], start=1):
        f.write(f"{i}\n")
        f.write(f"{format_timestamp(segment['start'])} --> {format_timestamp(segment['end'])}\n")
        f.write(f"{segment['text'].strip()}\n\n")

Fix 7: Chunking Long Audio Files

Whisper processes audio in 30-second chunks internally, but if you have a 2-hour podcast, memory and time become issues. Chunk explicitly for better control:

import whisper
import math

def transcribe_long(model, audio_path, chunk_minutes=10):
    audio = whisper.load_audio(audio_path)
    sample_rate = whisper.audio.SAMPLE_RATE
    chunk_size = chunk_minutes * 60 * sample_rate

    total_chunks = math.ceil(len(audio) / chunk_size)
    full_text = []
    all_segments = []

    for i in range(total_chunks):
        start = i * chunk_size
        end = min(start + chunk_size, len(audio))
        chunk = audio[start:end]

        print(f"Transcribing chunk {i+1}/{total_chunks}...")
        result = model.transcribe(chunk, language="en", verbose=False)

        # Chunk timestamps are local; shift them to absolute positions
        offset = start / sample_rate
        for segment in result['segments']:
            segment['start'] += offset
            segment['end'] += offset
        all_segments.extend(result['segments'])

        full_text.append(result['text'])

    return ' '.join(full_text), all_segments

model = whisper.load_model("base")
text, segments = transcribe_long(model, "long_podcast.mp3", chunk_minutes=10)

Fix 8: Initial Prompt for Domain-Specific Vocabulary

Whisper’s accuracy on technical terms, names, and acronyms drops without context. Provide an initial prompt to bias the model:

import whisper

model = whisper.load_model("base")

# WRONG — Whisper transcribes "PyTorch" as "pie torch" or "py-torch"
result = model.transcribe("ml_lecture.mp3")

# CORRECT — initial prompt biases vocabulary
result = model.transcribe(
    "ml_lecture.mp3",
    initial_prompt=(
        "PyTorch, TensorFlow, scikit-learn, GPU, CUDA, transformer, "
        "attention mechanism, gradient descent, backpropagation."
    ),
)

The initial prompt:

  • Doesn’t appear in the output
  • Biases model toward similar terminology
  • Can include speaker names, product names, technical jargon
  • Is truncated: only the last ~224 tokens are used, so keep it short and focused
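One way to keep prompts consistent across runs is to build them from a glossary list. The `build_initial_prompt` helper below is our own illustration, not a Whisper API; the character cap is an arbitrary safety margin:

```python
def build_initial_prompt(terms, max_chars=400):
    """Join a domain glossary into an initial_prompt string, stopping
    before max_chars so the prompt stays short. Helper name and cap are
    this article's own convention."""
    prompt = ""
    for term in terms:
        candidate = f"{prompt}{term}, "
        if len(candidate) > max_chars:
            break
        prompt = candidate
    return prompt.rstrip(", ") + "."

print(build_initial_prompt(["PyTorch", "CUDA", "backpropagation"]))
# PyTorch, CUDA, backpropagation.
```

Pass the result as `initial_prompt=` to `transcribe()` and keep the glossary in version control alongside your pipeline.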

For multilingual content with code-switching:

result = model.transcribe(
    "japanese_with_english.mp3",
    language="ja",
    initial_prompt="技術的な内容にはPython、Docker、Kubernetesといった英語の用語が含まれます。",
)

Still Not Working?

OpenAI’s Hosted Whisper API

If running locally is too slow or memory-intensive, use OpenAI’s hosted API:

from openai import OpenAI

client = OpenAI()

with open("audio.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        language="en",
        response_format="verbose_json",   # Includes timestamps and segments
    )

print(transcript.text)

The hosted API has a 25MB file size limit. Split larger files first.
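A quick sanity check before uploading: compute how many pieces a file needs. The helper and constant names are this article's own; the ffmpeg segment command in the comment is one way to do the actual split:

```python
import math

OPENAI_AUDIO_LIMIT = 25 * 1024 * 1024   # 25MB hosted-API cap

def pieces_needed(size_bytes):
    """How many chunks a file must be split into to fit under the cap.
    Call with os.path.getsize(path) for a real file."""
    return max(1, math.ceil(size_bytes / OPENAI_AUDIO_LIMIT))

# To do the actual split, ffmpeg's segment muxer works well, e.g.:
#   ffmpeg -i long.mp3 -f segment -segment_time 600 -c copy chunk_%03d.mp3
```

Splitting on a fixed duration (`-segment_time`) is simpler than splitting on size and keeps each piece well under the limit for typical bitrates.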

For OpenAI API rate limits, retries, and error handling, see OpenAI API not working.

Audio Preprocessing for Best Quality

Loud background noise, low bitrate audio, and clipping all hurt Whisper’s accuracy. Preprocess with FFmpeg:

# Normalize volume, denoise, convert to 16kHz mono
ffmpeg -i input.mp3 \
    -af "highpass=f=200, lowpass=f=3000, afftdn, dynaudnorm" \
    -ar 16000 -ac 1 \
    output.wav

Audio waveforms in Python are NumPy arrays; for array manipulation and analysis issues, see NumPy not working.

Multi-Speaker Diarization

Whisper transcribes what was said but doesn't identify who said it. For speaker labels, combine it with pyannote-audio:

pip install pyannote.audio
from pyannote.audio import Pipeline
import whisper

# Diarization (who spoke when)
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="HF_TOKEN_HERE",
)
diarization = pipeline("audio.wav")

# Transcription
model = whisper.load_model("base")
result = model.transcribe("audio.wav")

# Combine: assign speaker labels to Whisper segments based on time overlap
# (Implementation depends on your overlap-matching logic)
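One possible overlap-matching implementation: label each Whisper segment with the speaker whose diarization turn covers most of it. The `assign_speakers` helper is our own sketch; the `(start, end, speaker)` tuples can be built by iterating pyannote's `diarization.itertracks(yield_label=True)`:

```python
def assign_speakers(whisper_segments, turns):
    """Label each Whisper segment with the diarization speaker whose turn
    overlaps it the most. `turns` is a list of (start, end, speaker)
    tuples; segments are dicts with 'start' and 'end' keys as returned
    by whisper's transcribe()."""
    labeled = []
    for seg in whisper_segments:
        best_speaker, best_overlap = None, 0.0
        for t_start, t_end, speaker in turns:
            overlap = min(seg["end"], t_end) - max(seg["start"], t_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append({**seg, "speaker": best_speaker})
    return labeled
```

Build the turns list with something like `turns = [(t.start, t.end, spk) for t, _, spk in diarization.itertracks(yield_label=True)]`, then call `assign_speakers(result['segments'], turns)`.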

For HuggingFace token authentication required by pyannote, see HuggingFace Transformers not working.


FixDevs

Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.
