Fix: OpenAI Whisper Not Working — FFmpeg Missing, GPU Slow, and Language Detection Errors
Quick Answer
How to fix Whisper errors — FFmpeg not found, audio file load failures, CUDA out of memory on the large model, slow CPU transcription, language detected incorrectly, hallucinations on silence, faster-whisper migration, and timestamp accuracy.
The Error
You install Whisper and try to transcribe — Python crashes immediately:
FileNotFoundError: [Errno 2] No such file or directory: 'ffmpeg'
RuntimeError: Failed to load audio: ffmpeg returned exit code 1
Or you load the large model and get CUDA out of memory:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.95 GiB
Or transcription works but takes 10x real-time on CPU:
# 5 minute audio file → 50 minutes to transcribe
result = model.transcribe("podcast.mp3")
Or Whisper detects the wrong language and produces garbage:
# Japanese audio
result = model.transcribe("japanese.mp3")
print(result['language']) # 'en' — wrong!
print(result['text']) # Random English words
Or the model hallucinates text during silence:
[00:23] Hello and welcome to the show
[00:45] Thank you for watching
[01:02] Please subscribe to our channel
# But the actual audio was 30 seconds of silence
Whisper is OpenAI’s open-source transcription model — extremely accurate but resource-hungry. The Python package wraps PyTorch, requires FFmpeg for audio decoding, and has several known issues (hallucinations, language detection on short clips). This guide covers the most common failures.
Why This Happens
Whisper itself is a PyTorch model — but it doesn’t decode audio. Instead, it shells out to ffmpeg to convert any audio format into a 16kHz mono float array. If FFmpeg isn’t installed or isn’t on the PATH, every transcription fails.
The model sizes vary dramatically: tiny is 39M parameters (works on CPU), large-v3 is 1.55B parameters (requires 10GB+ VRAM). Loading the wrong size for your hardware is the most common performance issue. Whisper also has known hallucination patterns on silence and low-quality audio — short repetitive phrases like “Thank you for watching” appear because they’re common in the training data.
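Since every transcription call depends on that external binary, a cheap preflight check avoids a confusing traceback later. A minimal sketch using only the standard library (the helper name is ours; the ffmpeg invocation in the comment is roughly what `whisper.load_audio()` runs internally):

```python
import shutil

def ffmpeg_available() -> bool:
    """True if the ffmpeg binary Whisper shells out to is on PATH."""
    return shutil.which("ffmpeg") is not None

# load_audio() runs roughly this command, decoding any input format
# into 16kHz mono 16-bit PCM streamed over stdout:
#   ffmpeg -nostdin -i input.mp3 -f s16le -ac 1 -acodec pcm_s16le -ar 16000 -
if not ffmpeg_available():
    print("ffmpeg not found on PATH: install it before calling transcribe()")
```

Run this once at startup in any service that wraps Whisper, so a missing binary fails fast with a clear message instead of mid-request.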
Fix 1: FFmpeg Not Found
FileNotFoundError: [Errno 2] No such file or directory: 'ffmpeg'
Whisper requires FFmpeg as a separate system binary. pip install openai-whisper doesn’t install it.
macOS:
brew install ffmpeg
Ubuntu/Debian:
sudo apt update && sudo apt install ffmpeg
Windows:
# Via Chocolatey
choco install ffmpeg
# Or via winget
winget install Gyan.FFmpeg
# Or download manually from ffmpeg.org and add to PATH
Verify FFmpeg is on PATH:
ffmpeg -version
# ffmpeg version 6.1 ...
For Docker images, install FFmpeg in the Dockerfile:
FROM python:3.12-slim
RUN apt-get update && apt-get install -y \
    ffmpeg \
    && rm -rf /var/lib/apt/lists/*
RUN pip install openai-whisper
Alternative: bypass FFmpeg with pre-decoded audio:
import whisper
import numpy as np
import soundfile as sf
# Read audio with soundfile (no FFmpeg needed for WAV/FLAC)
audio, sr = sf.read("audio.wav")
# Convert to Whisper's expected format: 16kHz mono float32
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # Stereo to mono (before resampling)
if sr != 16000:
    import librosa
    audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)
audio = audio.astype(np.float32)
model = whisper.load_model("base")
result = model.transcribe(audio)
Fix 2: Choosing the Right Model Size
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.95 GiB
You’re loading a model that’s too large for your hardware.
Whisper model sizes:
| Model | Parameters | VRAM | Speed (rel. to large) | Use case |
|---|---|---|---|---|
| tiny | 39M | ~1GB | 32x | Quick drafts, English-only |
| base | 74M | ~1GB | 16x | Lightweight transcription |
| small | 244M | ~2GB | 6x | Good balance |
| medium | 769M | ~5GB | 2x | High quality |
| large-v3 | 1550M | ~10GB | 1x | Best quality |
| turbo | 809M | ~6GB | 8x | Near-large quality, much faster |
English-only models (.en suffix) are smaller and faster for English audio:
import whisper
# English-only — faster than the multilingual variant of the same size
model = whisper.load_model("base.en") # 74M params, English only
model = whisper.load_model("small.en") # 244M params, English only
Pro Tip: For English transcription, always prefer the .en variants (tiny.en, base.en, small.en, medium.en). They’re trained exclusively on English and produce slightly better results than the multilingual model at the same size — and load faster. The largest models (large-v3, turbo) don’t have .en variants because their multilingual training doesn’t hurt English performance.
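The model size table can be turned into a small helper that picks the largest model fitting your hardware. A sketch: the GB figures are the rough estimates from the table, and `pick_model` is a name of our own, not a Whisper API:

```python
# Rough VRAM requirements in GB, from the model size table.
# Treat as estimates; actual usage varies with precision and audio length.
VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "turbo": 6, "large-v3": 10}

def pick_model(available_vram_gb: float, english_only: bool = False) -> str:
    """Pick the largest model that fits, preferring .en variants for English."""
    # Check from largest to smallest; first fit wins
    for name in ["large-v3", "turbo", "medium", "small", "base", "tiny"]:
        if VRAM_GB[name] <= available_vram_gb:
            # large-v3 and turbo have no .en variant
            if english_only and name in ("medium", "small", "base", "tiny"):
                return name + ".en"
            return name
    return "tiny"  # fall back to the smallest model (CPU territory)

print(pick_model(4))                     # small (medium needs ~5GB)
print(pick_model(8, english_only=True))  # turbo
```

On a CUDA machine you could feed in `torch.cuda.get_device_properties(0).total_memory / 1e9` as the available VRAM.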
Force CPU when GPU is too small:
import whisper
# Load on CPU explicitly
model = whisper.load_model("medium", device="cpu")
# Or load on a specific GPU
model = whisper.load_model("large-v3", device="cuda:1")
Use FP16 to halve VRAM (the default on GPU; on CPU Whisper falls back to FP32 with a warning):
import whisper
model = whisper.load_model("large-v3")
result = model.transcribe(
    "audio.mp3",
    fp16=True,  # Half precision — saves memory, default on CUDA
)
Fix 3: Slow CPU Transcription — Use faster-whisper
The official openai-whisper package is slow on CPU. The community-built faster-whisper is 4–8x faster using CTranslate2:
pip install faster-whisper
from faster_whisper import WhisperModel
# Faster CPU and GPU transcription
model = WhisperModel("large-v3", device="cpu", compute_type="int8")
segments, info = model.transcribe("audio.mp3", beam_size=5)
print(f"Language: {info.language} (probability: {info.language_probability})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
compute_type options:
| Type | Memory | Speed | Quality |
|---|---|---|---|
| float16 | Lowest GPU memory | Fastest GPU | Same as FP32 |
| int8_float16 | Lower GPU | Faster GPU | Slightly degraded |
| int8 | Lowest CPU memory | Fastest CPU | Slightly degraded |
| float32 | Highest | Slowest | Reference |
int8 quantization typically loses <1% accuracy but runs 2–4x faster on CPU:
model = WhisperModel("large-v3", device="cpu", compute_type="int8")
For GPU, use float16 or int8_float16:
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
Streaming transcription to process long files without loading everything:
from faster_whisper import WhisperModel
model = WhisperModel("base", device="cpu", compute_type="int8")
# segments is a generator — yields as it processes
segments, info = model.transcribe("long_audio.mp3", beam_size=1)
for segment in segments:
    # Process each segment as it's transcribed
    print(f"[{segment.start:.2f}s] {segment.text}", flush=True)
For PyTorch GPU memory issues that affect Whisper model loading, see PyTorch not working.
Fix 4: Wrong Language Detected
result = model.transcribe("japanese.mp3")
print(result['language']) # 'en' — wrong!
Whisper detects language from the first 30 seconds of audio. If those seconds contain music, silence, or an English-language intro, detection fails.
Force the correct language:
import whisper
model = whisper.load_model("base")
result = model.transcribe(
    "japanese.mp3",
    language="ja",  # Force Japanese
)
Common language codes:
| Code | Language |
|---|---|
| en | English |
| ja | Japanese |
| zh | Chinese |
| es | Spanish |
| fr | French |
| de | German |
| ko | Korean |
| pt | Portuguese |
| ru | Russian |
| ar | Arabic |
Detect language from the entire audio (more reliable than the default 30s):
import whisper
import numpy as np
model = whisper.load_model("base")
audio = whisper.load_audio("audio.mp3")
# Sample multiple chunks and detect language for each
chunk_size = 30 * whisper.audio.SAMPLE_RATE # 30 second chunks
detected_languages = []
for start in range(0, len(audio), chunk_size):
    chunk = audio[start:start + chunk_size]
    if len(chunk) < whisper.audio.SAMPLE_RATE * 5:  # Skip very short chunks
        continue
    chunk = whisper.pad_or_trim(chunk)
    mel = whisper.log_mel_spectrogram(chunk).to(model.device)
    _, probs = model.detect_language(mel)
    lang = max(probs, key=probs.get)
    detected_languages.append(lang)
# Use the most common detected language
from collections import Counter
most_common = Counter(detected_languages).most_common(1)[0][0]
print(f"Most common language: {most_common}")
Fix 5: Hallucinations on Silence and Low-Quality Audio
Whisper is known to hallucinate during silence. Common phrases that appear in long silences:
- “Thank you for watching”
- “Subscribe to my channel”
- “Music”
- The same phrase repeated
This is a known training data artifact. Mitigation strategies:
Use Voice Activity Detection (VAD) to skip silence:
pip install faster-whisper
from faster_whisper import WhisperModel
model = WhisperModel("base", device="cpu", compute_type="int8")
segments, info = model.transcribe(
    "audio.mp3",
    vad_filter=True,  # Enable VAD
    vad_parameters=dict(
        min_silence_duration_ms=500,  # Skip silences > 500ms
        speech_pad_ms=400,  # Add 400ms padding around speech
    ),
)
for segment in segments:
    print(segment.text)
VAD pre-filters silence so Whisper never sees it — eliminates most hallucinations.
Reduce hallucinations with sampling parameters:
import whisper
model = whisper.load_model("base")
result = model.transcribe(
    "audio.mp3",
    temperature=0,  # Deterministic (no sampling)
    no_speech_threshold=0.6,  # Higher = more aggressive silence detection
    logprob_threshold=-1.0,  # Skip segments below this logprob
    compression_ratio_threshold=2.4,  # Skip segments with high repetition
    condition_on_previous_text=False,  # Don't repeat from prior context
)
Common Mistake: Leaving condition_on_previous_text=True (the default) on long audio. When Whisper hallucinates a phrase like “Thank you for watching”, that text becomes context for the next segment, so hallucinations compound. Setting condition_on_previous_text=False breaks this feedback loop and dramatically reduces hallucinations on long audio.
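When hallucinations still slip through, the repetition signature makes them easy to catch in post-processing. A hypothetical helper (not part of the Whisper API; assumes the segment dicts from result['segments']) that drops a phrase once it repeats more than twice in a row:

```python
def drop_repeated_segments(segments, max_repeats: int = 2):
    """Drop segments whose text repeats more than max_repeats times in a row,
    a common signature of compounding hallucinations."""
    cleaned, run_text, run_count = [], None, 0
    for seg in segments:
        text = seg["text"].strip().lower()
        if text == run_text:
            run_count += 1
        else:
            run_text, run_count = text, 1
        if run_count <= max_repeats:
            cleaned.append(seg)
    return cleaned

segments = [
    {"text": "Hello and welcome"},
    {"text": "Thank you for watching"},
    {"text": "Thank you for watching"},
    {"text": "Thank you for watching"},  # third consecutive repeat: dropped
]
print([s["text"] for s in drop_repeated_segments(segments)])
```

This is a blunt instrument: legitimate repeated speech (chants, refrains) would also be trimmed, so keep max_repeats generous.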
Fix 6: Timestamps and Word-Level Accuracy
The default transcribe() returns segment-level timestamps (typically every 5–10 seconds). For word-level timestamps, enable the option:
import whisper
model = whisper.load_model("base")
result = model.transcribe("audio.mp3", word_timestamps=True)
for segment in result['segments']:
    for word in segment['words']:
        print(f"[{word['start']:.2f}s -> {word['end']:.2f}s] {word['word']}")
With faster-whisper:
from faster_whisper import WhisperModel
model = WhisperModel("base")
segments, _ = model.transcribe("audio.mp3", word_timestamps=True)
for segment in segments:
    for word in segment.words:
        print(f"[{word.start:.2f}s -> {word.end:.2f}s] {word.word}")
Generate SRT subtitles:
import whisper
from datetime import timedelta
def format_timestamp(seconds: float) -> str:
    td = timedelta(seconds=seconds)
    hours, remainder = divmod(td.total_seconds(), 3600)
    minutes, seconds = divmod(remainder, 60)
    millis = int((seconds % 1) * 1000)
    return f"{int(hours):02d}:{int(minutes):02d}:{int(seconds):02d},{millis:03d}"
model = whisper.load_model("base")
result = model.transcribe("video.mp4")
with open("subtitles.srt", "w", encoding="utf-8") as f:
    for i, segment in enumerate(result['segments'], start=1):
        f.write(f"{i}\n")
        f.write(f"{format_timestamp(segment['start'])} --> {format_timestamp(segment['end'])}\n")
        f.write(f"{segment['text'].strip()}\n\n")
Fix 7: Chunking Long Audio Files
Whisper processes audio in 30-second chunks internally, but if you have a 2-hour podcast, memory and time become issues. Chunk explicitly for better control:
import whisper
import math
def transcribe_long(model, audio_path, chunk_minutes=10):
    audio = whisper.load_audio(audio_path)
    sample_rate = whisper.audio.SAMPLE_RATE
    chunk_size = chunk_minutes * 60 * sample_rate
    total_chunks = math.ceil(len(audio) / chunk_size)
    full_text = []
    for i in range(total_chunks):
        start = i * chunk_size
        end = min(start + chunk_size, len(audio))
        chunk = audio[start:end]
        print(f"Transcribing chunk {i+1}/{total_chunks}...")
        result = model.transcribe(chunk, language="en", verbose=False)
        # Adjust timestamps to absolute (chunks have local timestamps)
        offset = start / sample_rate
        for segment in result['segments']:
            segment['start'] += offset
            segment['end'] += offset
        full_text.append(result['text'])
    return ' '.join(full_text)
model = whisper.load_model("base")
text = transcribe_long(model, "long_podcast.mp3", chunk_minutes=10)
Fix 8: Initial Prompt for Domain-Specific Vocabulary
Whisper’s accuracy on technical terms, names, and acronyms drops without context. Provide an initial prompt to bias the model:
import whisper
model = whisper.load_model("base")
# WRONG — Whisper transcribes "PyTorch" as "pie torch" or "py-torch"
result = model.transcribe("ml_lecture.mp3")
# CORRECT — initial prompt biases vocabulary
result = model.transcribe(
    "ml_lecture.mp3",
    initial_prompt=(
        "PyTorch, TensorFlow, scikit-learn, GPU, CUDA, transformer, "
        "attention mechanism, gradient descent, backpropagation."
    ),
)
The initial prompt:
- Doesn’t appear in the output
- Biases model toward similar terminology
- Can include speaker names, product names, technical jargon
- Is truncated to its last 224 tokens, so keep it short and specific
For multilingual content with code-switching:
result = model.transcribe(
    "japanese_with_english.mp3",
    language="ja",
    # "The technical content includes English terms such as Python, Docker, and Kubernetes."
    initial_prompt="技術的な内容にはPython、Docker、Kubernetesといった英語の用語が含まれます。",
)
Still Not Working?
OpenAI’s Hosted Whisper API
If running locally is too slow or memory-intensive, use OpenAI’s hosted API:
from openai import OpenAI
client = OpenAI()
with open("audio.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        language="en",
        response_format="verbose_json",  # Includes timestamps and segments
    )
print(transcript.text)
The hosted API has a 25MB file size limit. Split larger files first.
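For planning the split, a quick back-of-the-envelope helper (a sketch; assumes constant-bitrate audio and applies a 5% safety margin, both our own choices):

```python
MAX_BYTES = 25 * 1024 * 1024  # hosted API file size limit

def max_chunk_seconds(bitrate_kbps: int, safety: float = 0.95) -> int:
    """How many seconds of constant-bitrate audio fit under the 25MB cap."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return int(MAX_BYTES * safety / bytes_per_second)

# A 128 kbps MP3 fits roughly 25 minutes per chunk
print(max_chunk_seconds(128) // 60, "minutes")
```

With a chunk duration in hand, one way to do the actual split is FFmpeg's segment muxer, e.g. `ffmpeg -i long.mp3 -f segment -segment_time 1500 -c copy out%03d.mp3`.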
For OpenAI API rate limits, retries, and error handling, see OpenAI API not working.
Audio Preprocessing for Best Quality
Loud background noise, low bitrate audio, and clipping all hurt Whisper’s accuracy. Preprocess with FFmpeg:
# Normalize volume, denoise, convert to 16kHz mono
ffmpeg -i input.mp3 \
  -af "highpass=f=200, lowpass=f=3000, afftdn, dynaudnorm" \
  -ar 16000 -ac 1 \
  output.wav
For audio waveform display and analysis, the usual NumPy array operations apply — see NumPy not working.
Multi-Speaker Diarization
Whisper transcribes what was said but doesn’t identify who said it. For speaker labels, combine it with pyannote-audio:
pip install pyannote.audio
from pyannote.audio import Pipeline
import whisper
# Diarization (who spoke when)
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="HF_TOKEN_HERE",
)
diarization = pipeline("audio.wav")
# Transcription
model = whisper.load_model("base")
result = model.transcribe("audio.wav")
# Combine: assign speaker labels to Whisper segments based on time overlap
# (Implementation depends on your overlap-matching logic)
For HuggingFace token authentication required by pyannote, see HuggingFace Transformers not working.
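The overlap matching can be sketched like this (a hypothetical helper; it assumes you have flattened pyannote's `diarization.itertracks(yield_label=True)` output into `(start, end, speaker)` tuples):

```python
def assign_speakers(whisper_segments, speaker_turns):
    """Label each Whisper segment with the speaker whose diarization turn
    overlaps it the most. Times are in seconds."""
    labeled = []
    for seg in whisper_segments:
        best_speaker, best_overlap = "unknown", 0.0
        for turn_start, turn_end, speaker in speaker_turns:
            # Length of the time interval shared by segment and turn
            overlap = min(seg["end"], turn_end) - max(seg["start"], turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append({**seg, "speaker": best_speaker})
    return labeled

turns = [(0.0, 5.0, "SPEAKER_00"), (5.0, 9.0, "SPEAKER_01")]
segs = [{"start": 1.0, "end": 4.0, "text": "Hi"},
        {"start": 5.5, "end": 8.0, "text": "Hello"}]
for s in assign_speakers(segs, turns):
    print(s["speaker"], s["text"])
```

Segments that straddle a speaker change get whichever speaker covers more of them; for finer granularity, run the same matching on word-level timestamps instead.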
Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.