
Fix: OpenAI Whisper Not Working — FFmpeg Missing, GPU Slow, and Language Detection Errors

FixDevs ·

Quick Answer

How to fix common Whisper errors: FFmpeg not found (audio fails to load), CUDA out of memory on the large model, slow CPU transcription, incorrect language detection, hallucinations during silence, migrating to faster-whisper, and improving timestamp accuracy.

The Error

You install Whisper and try to transcribe — Python crashes immediately:

FileNotFoundError: [Errno 2] No such file or directory: 'ffmpeg'
RuntimeError: Failed to load audio: ffmpeg returned exit code 1

Or you load the large model and get CUDA out of memory:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.95 GiB

Or transcription works but takes ten times the audio's duration on CPU:

# 5 minute audio file → 50 minutes to transcribe
result = model.transcribe("podcast.mp3")

Or Whisper detects the wrong language and produces garbage:

# Japanese audio
result = model.transcribe("japanese.mp3")
print(result['language'])   # 'en' — wrong!
print(result['text'])       # Random English words

Or the model hallucinates text during silence:

[00:23] Hello and welcome to the show
[00:45] Thank you for watching
[01:02] Please subscribe to our channel
# But the actual audio was 30 seconds of silence

Whisper is OpenAI’s open-source transcription model — extremely accurate but resource-hungry. The Python package wraps PyTorch, requires FFmpeg for audio decoding, and has several known issues (hallucinations, language detection on short clips). This guide covers the most common failures.

Why This Happens

Whisper itself is a PyTorch model — but it doesn’t decode audio. Instead, it shells out to ffmpeg to convert any audio format into a 16kHz mono float array. If FFmpeg isn’t installed or isn’t on the PATH, every transcription fails.
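To see why a missing binary is fatal, it helps to know roughly what Whisper asks FFmpeg to do. The sketch below approximates the internal invocation; the flag set is assumed from Whisper's audio loader, and `whisper_ffmpeg_cmd` is this article's own helper name, not part of the package:

```python
def whisper_ffmpeg_cmd(path, sr=16000):
    # Decode any container/codec to raw 16-bit PCM, mono, 16kHz, on stdout.
    # Roughly what whisper.audio.load_audio asks ffmpeg to do.
    return [
        "ffmpeg", "-nostdin",
        "-i", path,
        "-f", "s16le",            # raw signed 16-bit little-endian samples
        "-ac", "1",               # mono
        "-acodec", "pcm_s16le",
        "-ar", str(sr),           # 16kHz sample rate
        "-",                      # write to stdout
    ]

print(" ".join(whisper_ffmpeg_cmd("audio.mp3")))
```

If this subprocess can't be spawned, every `transcribe()` call fails before the model ever runs.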

The model sizes vary dramatically: tiny is 39M parameters (works on CPU), large-v3 is 1.55B parameters (requires 10GB+ VRAM). Loading the wrong size for your hardware is the most common performance issue. Whisper also has known hallucination patterns on silence and low-quality audio — short repetitive phrases like “Thank you for watching” appear because they’re common in the training data.

Fix 1: FFmpeg Not Found

FileNotFoundError: [Errno 2] No such file or directory: 'ffmpeg'

Whisper requires FFmpeg as a separate system binary. pip install openai-whisper doesn’t install it.

macOS:

brew install ffmpeg

Ubuntu/Debian:

sudo apt update && sudo apt install ffmpeg

Windows:

# Via Chocolatey
choco install ffmpeg

# Or via winget
winget install Gyan.FFmpeg

# Or download manually from ffmpeg.org and add to PATH

Verify FFmpeg is on PATH:

ffmpeg -version
# ffmpeg version 6.1 ...
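You can also fail fast from Python before loading any model, using the standard library's `shutil.which`. The `ffmpeg_available` wrapper is our own convenience, not a Whisper API:

```python
import shutil

def ffmpeg_available():
    """True if an ffmpeg binary is discoverable on the current PATH."""
    return shutil.which("ffmpeg") is not None

if not ffmpeg_available():
    print("ffmpeg not found on PATH; install it before calling transcribe()")
```

This gives a clear error message up front instead of a `FileNotFoundError` buried in a stack trace.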

For Docker images, install FFmpeg in the Dockerfile:

FROM python:3.12-slim

RUN apt-get update && apt-get install -y \
    ffmpeg \
    && rm -rf /var/lib/apt/lists/*

RUN pip install openai-whisper

Alternative: bypass FFmpeg with pre-decoded audio:

import whisper
import numpy as np
import soundfile as sf

# Read audio with soundfile (no FFmpeg needed for WAV/FLAC)
audio, sr = sf.read("audio.wav")

# Convert to Whisper's expected format: 16kHz mono float32
if audio.ndim > 1:
    audio = audio.mean(axis=1)   # Stereo to mono first — soundfile returns
                                 # (frames, channels), but librosa expects
                                 # time on the last axis
if sr != 16000:
    import librosa
    audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)
audio = audio.astype(np.float32)

model = whisper.load_model("base")
result = model.transcribe(audio)

Fix 2: Choosing the Right Model Size

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.95 GiB

You’re loading a model that’s too large for your hardware.

Whisper model sizes:

Model     Parameters  VRAM    Speed (rel. to large)  Use case
tiny      39M         ~1GB    32x                    Quick drafts, English-only
base      74M         ~1GB    16x                    Lightweight transcription
small     244M        ~2GB    6x                     Good balance
medium    769M        ~5GB    2x                     High quality
large-v3  1550M       ~10GB   1x                     Best quality
turbo     809M        ~6GB    8x                     Near-large quality at high speed

English-only models (.en suffix) are smaller and faster for English audio:

import whisper

# English-only — faster than the multilingual variant of the same size
model = whisper.load_model("base.en")     # 74M params, English only
model = whisper.load_model("small.en")    # 244M params, English only

Pro Tip: For English transcription, always prefer the .en variants (tiny.en, base.en, small.en, medium.en). They’re trained exclusively on English and produce slightly better results than the multilingual model at the same size — and load faster. The largest models (large-v3, turbo) don’t have .en variants because their multilingual training doesn’t hurt English performance.
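The table above can be encoded as a small helper for scripts that run on varied hardware. The function name and VRAM cutoffs are this article's convention, not part of the whisper package:

```python
def pick_whisper_model(vram_gb=None, english_only=False):
    """Suggest a Whisper model name for the available VRAM (None = CPU-only).

    Cutoffs mirror the VRAM column of the table above; thresholds and the
    helper name are this article's own convention.
    """
    if vram_gb is None or vram_gb < 1:
        size = "tiny"
    elif vram_gb < 2:
        size = "base"
    elif vram_gb < 5:
        size = "small"
    elif vram_gb < 10:
        size = "medium"
    else:
        size = "large-v3"
    if english_only and size != "large-v3":
        size += ".en"           # .en variants exist for tiny through medium
    return size

print(pick_whisper_model(8, english_only=True))   # medium.en
```

Pass the result straight to `whisper.load_model(...)`.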

Force CPU when GPU is too small:

import whisper

# Load on CPU explicitly
model = whisper.load_model("medium", device="cpu")

# Or load on a specific GPU
model = whisper.load_model("large-v3", device="cuda:1")

Use FP16 to halve VRAM (it is already the default on GPU; on CPU, Whisper warns and falls back to FP32):

import whisper

model = whisper.load_model("large-v3")
result = model.transcribe(
    "audio.mp3",
    fp16=True,        # Half precision — saves memory, default on CUDA
)

Fix 3: Slow CPU Transcription — Use faster-whisper

The official openai-whisper package is slow on CPU. The community-built faster-whisper reimplements it on CTranslate2 and is up to 4x faster with lower memory use:

pip install faster-whisper
from faster_whisper import WhisperModel

# Faster CPU and GPU transcription
model = WhisperModel("large-v3", device="cpu", compute_type="int8")

segments, info = model.transcribe("audio.mp3", beam_size=5)

print(f"Language: {info.language} (probability: {info.language_probability})")

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")

compute_type options:

Type          Memory             Speed           Quality
float16       Low GPU memory     Fastest on GPU  Same as FP32
int8_float16  Lower GPU memory   Fast on GPU     Slightly degraded
int8          Lowest CPU memory  Fastest on CPU  Slightly degraded
float32       Highest            Slowest         Reference

int8 quantization typically loses <1% accuracy but runs 2–4x faster on CPU:

model = WhisperModel("large-v3", device="cpu", compute_type="int8")

For GPU, use float16 or int8_float16:

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

Streaming transcription to process long files without loading everything:

from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu", compute_type="int8")

# segments is a generator — yields as it processes
segments, info = model.transcribe("long_audio.mp3", beam_size=1)

for segment in segments:
    # Process each segment as it's transcribed
    print(f"[{segment.start:.2f}s] {segment.text}", flush=True)

For PyTorch GPU memory issues that affect Whisper model loading, see PyTorch not working.

Fix 4: Wrong Language Detected

result = model.transcribe("japanese.mp3")
print(result['language'])   # 'en' — wrong!

Whisper detects the language from the first 30 seconds of audio. If those seconds contain music, silence, or an English-language intro, detection fails.

Force the correct language:

import whisper

model = whisper.load_model("base")

result = model.transcribe(
    "japanese.mp3",
    language="ja",   # Force Japanese
)

Common language codes:

Code  Language
en    English
ja    Japanese
zh    Chinese
es    Spanish
fr    French
de    German
ko    Korean
pt    Portuguese
ru    Russian
ar    Arabic

Detect language from the entire audio (more reliable than the default 30s):

import whisper
import numpy as np

model = whisper.load_model("base")
audio = whisper.load_audio("audio.mp3")

# Sample multiple chunks and detect language for each
chunk_size = 30 * whisper.audio.SAMPLE_RATE   # 30 second chunks
detected_languages = []

for start in range(0, len(audio), chunk_size):
    chunk = audio[start:start + chunk_size]
    if len(chunk) < whisper.audio.SAMPLE_RATE * 5:   # Skip very short chunks
        continue
    chunk = whisper.pad_or_trim(chunk)
    mel = whisper.log_mel_spectrogram(chunk).to(model.device)
    _, probs = model.detect_language(mel)
    lang = max(probs, key=probs.get)
    detected_languages.append(lang)

# Use the most common detected language
from collections import Counter
most_common = Counter(detected_languages).most_common(1)[0][0]
print(f"Most common language: {most_common}")

Fix 5: Hallucinations on Silence and Low-Quality Audio

Whisper is known to hallucinate during silence. Common phrases that appear in long silences:

  • “Thank you for watching”
  • “Subscribe to my channel”
  • “Music”
  • The same phrase repeated

This is a known training data artifact. Mitigation strategies:

Use Voice Activity Detection (VAD) to skip silence:

pip install faster-whisper
from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu", compute_type="int8")

segments, info = model.transcribe(
    "audio.mp3",
    vad_filter=True,                         # Enable VAD
    vad_parameters=dict(
        min_silence_duration_ms=500,         # Skip silences > 500ms
        speech_pad_ms=400,                    # Add 400ms padding around speech
    ),
)

for segment in segments:
    print(segment.text)

VAD pre-filters silence so Whisper never sees it — eliminates most hallucinations.
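If you are on plain openai-whisper and cannot use faster-whisper's VAD, a crude NumPy energy gate can approximate the same idea. This is a sketch, not a real VAD; the frame size and energy floor are illustrative values you would tune per recording:

```python
import numpy as np

def drop_silent_frames(audio, sr=16000, frame_ms=30, energy_floor=1e-4):
    """Crude energy gate: split audio into short frames and drop frames
    whose mean-square energy falls below the floor. A rough stand-in for
    real VAD; the defaults here are illustrative, not tuned."""
    frame = int(sr * frame_ms / 1000)
    n = len(audio) // frame
    frames = audio[: n * frame].reshape(n, frame)
    energy = (frames ** 2).mean(axis=1)
    return frames[energy > energy_floor].reshape(-1)
```

Feed the returned array directly to `model.transcribe(...)`; timestamps will no longer match the original file, so use this only when you need text, not alignment.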

Reduce hallucinations with sampling parameters:

import whisper

model = whisper.load_model("base")

result = model.transcribe(
    "audio.mp3",
    temperature=0,                           # Deterministic (no sampling)
    no_speech_threshold=0.6,                 # No-speech probability above which a segment counts as silence
    logprob_threshold=-1.0,                  # Skip segments below this logprob
    compression_ratio_threshold=2.4,         # Skip segments with high repetition
    condition_on_previous_text=False,        # Don't repeat from prior context
)

Common Mistake: Leaving condition_on_previous_text=True (the default) on long audio. When Whisper hallucinates a phrase like “Thank you for watching”, that text becomes context for the next segment — causing repeated hallucinations to compound. Setting condition_on_previous_text=False breaks this loop and dramatically reduces compounding hallucinations on long audio.

Fix 6: Timestamps and Word-Level Accuracy

The default transcribe() returns segment-level timestamps (typically every 5–10 seconds). For word-level timestamps, enable the option:

import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3", word_timestamps=True)

for segment in result['segments']:
    for word in segment['words']:
        print(f"[{word['start']:.2f}s -> {word['end']:.2f}s] {word['word']}")

With faster-whisper:

from faster_whisper import WhisperModel

model = WhisperModel("base")
segments, _ = model.transcribe("audio.mp3", word_timestamps=True)

for segment in segments:
    for word in segment.words:
        print(f"[{word.start:.2f}s -> {word.end:.2f}s] {word.word}")

Generate SRT subtitles:

import whisper

def format_timestamp(seconds: float) -> str:
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    millis = int((secs % 1) * 1000)
    return f"{int(hours):02d}:{int(minutes):02d}:{int(secs):02d},{millis:03d}"

model = whisper.load_model("base")
result = model.transcribe("video.mp4")

with open("subtitles.srt", "w", encoding="utf-8") as f:
    for i, segment in enumerate(result['segments'], start=1):
        f.write(f"{i}\n")
        f.write(f"{format_timestamp(segment['start'])} --> {format_timestamp(segment['end'])}\n")
        f.write(f"{segment['text'].strip()}\n\n")

Fix 7: Chunking Long Audio Files

Whisper processes audio in 30-second chunks internally, but if you have a 2-hour podcast, memory and time become issues. Chunk explicitly for better control:

import whisper
import math

def transcribe_long(model, audio_path, chunk_minutes=10):
    audio = whisper.load_audio(audio_path)
    sample_rate = whisper.audio.SAMPLE_RATE
    chunk_size = chunk_minutes * 60 * sample_rate

    total_chunks = math.ceil(len(audio) / chunk_size)
    full_text = []
    all_segments = []

    for i in range(total_chunks):
        start = i * chunk_size
        end = min(start + chunk_size, len(audio))
        chunk = audio[start:end]

        print(f"Transcribing chunk {i+1}/{total_chunks}...")
        result = model.transcribe(chunk, language="en", verbose=False)

        # Chunk timestamps are local; shift them to absolute positions
        offset = start / sample_rate
        for segment in result['segments']:
            segment['start'] += offset
            segment['end'] += offset
        all_segments.extend(result['segments'])

        full_text.append(result['text'])

    return ' '.join(full_text), all_segments

model = whisper.load_model("base")
text, segments = transcribe_long(model, "long_podcast.mp3", chunk_minutes=10)

Fix 8: Initial Prompt for Domain-Specific Vocabulary

Whisper’s accuracy on technical terms, names, and acronyms drops without context. Provide an initial prompt to bias the model:

import whisper

model = whisper.load_model("base")

# WRONG — Whisper transcribes "PyTorch" as "pie torch" or "py-torch"
result = model.transcribe("ml_lecture.mp3")

# CORRECT — initial prompt biases vocabulary
result = model.transcribe(
    "ml_lecture.mp3",
    initial_prompt=(
        "PyTorch, TensorFlow, scikit-learn, GPU, CUDA, transformer, "
        "attention mechanism, gradient descent, backpropagation."
    ),
)

The initial prompt:

  • Doesn’t appear in the output
  • Biases model toward similar terminology
  • Can include speaker names, product names, technical jargon
  • Is truncated: only the last ~224 tokens are used, so keep it short and focused
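One way to keep prompts consistent across runs is to build them from a glossary list. The `build_initial_prompt` helper below is our own illustration, not a Whisper API; the character cap is an arbitrary safety margin:

```python
def build_initial_prompt(terms, max_chars=400):
    """Join a domain glossary into an initial_prompt string, stopping
    before max_chars so the prompt stays short. Helper name and cap are
    this article's own convention."""
    prompt = ""
    for term in terms:
        candidate = f"{prompt}{term}, "
        if len(candidate) > max_chars:
            break
        prompt = candidate
    return prompt.rstrip(", ") + "."

print(build_initial_prompt(["PyTorch", "CUDA", "backpropagation"]))
# PyTorch, CUDA, backpropagation.
```

Pass the result as `initial_prompt=` to `transcribe()` and keep the glossary in version control alongside your pipeline.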

For multilingual content with code-switching:

result = model.transcribe(
    "japanese_with_english.mp3",
    language="ja",
    initial_prompt="技術的な内容にはPython、Docker、Kubernetesといった英語の用語が含まれます。",
)

Still Not Working?

OpenAI’s Hosted Whisper API

If running locally is too slow or memory-intensive, use OpenAI’s hosted API:

from openai import OpenAI

client = OpenAI()

with open("audio.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        language="en",
        response_format="verbose_json",   # Includes timestamps and segments
    )

print(transcript.text)

The hosted API has a 25MB file size limit. Split larger files first.
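A quick sanity check before uploading: compute how many pieces a file needs. The helper and constant names are this article's own; the ffmpeg segment command in the comment is one way to do the actual split:

```python
import math

OPENAI_AUDIO_LIMIT = 25 * 1024 * 1024   # 25MB hosted-API cap

def pieces_needed(size_bytes):
    """How many chunks a file must be split into to fit under the cap.
    Call with os.path.getsize(path) for a real file."""
    return max(1, math.ceil(size_bytes / OPENAI_AUDIO_LIMIT))

# To do the actual split, ffmpeg's segment muxer works well, e.g.:
#   ffmpeg -i long.mp3 -f segment -segment_time 600 -c copy chunk_%03d.mp3
```

Splitting on a fixed duration (`-segment_time`) is simpler than splitting on size and keeps each piece well under the limit for typical bitrates.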

For OpenAI API rate limits, retries, and error handling, see OpenAI API not working.

Audio Preprocessing for Best Quality

Loud background noise, low bitrate audio, and clipping all hurt Whisper’s accuracy. Preprocess with FFmpeg:

# Normalize volume, denoise, convert to 16kHz mono
ffmpeg -i input.mp3 \
    -af "highpass=f=200, lowpass=f=3000, afftdn, dynaudnorm" \
    -ar 16000 -ac 1 \
    output.wav

Audio waveforms in Python are NumPy arrays; for array manipulation and analysis issues, see NumPy not working.

Multi-Speaker Diarization

Whisper transcribes what was said but doesn't identify who said it. For speaker labels, combine it with pyannote-audio:

pip install pyannote.audio
from pyannote.audio import Pipeline
import whisper

# Diarization (who spoke when)
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="HF_TOKEN_HERE",
)
diarization = pipeline("audio.wav")

# Transcription
model = whisper.load_model("base")
result = model.transcribe("audio.wav")

# Combine: assign speaker labels to Whisper segments based on time overlap
# (Implementation depends on your overlap-matching logic)
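One possible overlap-matching implementation: label each Whisper segment with the speaker whose diarization turn covers most of it. The `assign_speakers` helper is our own sketch; the `(start, end, speaker)` tuples can be built by iterating pyannote's `diarization.itertracks(yield_label=True)`:

```python
def assign_speakers(whisper_segments, turns):
    """Label each Whisper segment with the diarization speaker whose turn
    overlaps it the most. `turns` is a list of (start, end, speaker)
    tuples; segments are dicts with 'start' and 'end' keys as returned
    by whisper's transcribe()."""
    labeled = []
    for seg in whisper_segments:
        best_speaker, best_overlap = None, 0.0
        for t_start, t_end, speaker in turns:
            overlap = min(seg["end"], t_end) - max(seg["start"], t_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append({**seg, "speaker": best_speaker})
    return labeled
```

Build the turns list with something like `turns = [(t.start, t.end, spk) for t, _, spk in diarization.itertracks(yield_label=True)]`, then call `assign_speakers(result['segments'], turns)`.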

For HuggingFace token authentication required by pyannote, see HuggingFace Transformers not working.


FixDevs

Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.
