Fix: vLLM Not Working — CUDA OOM, Model Loading, and API Server Errors
Quick Answer
How to fix vLLM errors — CUDA out of memory during model load, tokenizer mismatch with HuggingFace, tensor parallel size does not match GPU count, KV cache exceeds memory, OpenAI API compatibility issues, and max_model_len too large.
The Error
You load a 70B model and CUDA runs out of memory before the first token:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB.
GPU 0 has a total capacity of 23.65 GiB of which 1.24 GiB is free.
Or the tokenizer produces garbage because it doesn’t match the model:
Generated text: "<|im_start|><|im_start|>a a a a a a..."
Or you start the OpenAI-compatible server and requests time out:
openai.APIConnectionError: Connection error.
httpx.ConnectTimeout: timed out
Or tensor parallelism fails with a mismatched GPU count:
ValueError: Cannot split tensors evenly. tensor_parallel_size=4 but only 2 GPUs available
Or the max context length error hits at inference time:
ValueError: This model's maximum context length is 4096 tokens.
Your request had 8192 tokens.
vLLM is built around two clever ideas: PagedAttention (efficient KV cache allocation) and continuous batching (new requests join mid-batch). These make it 2–10x faster than naive HuggingFace inference, but they also create failure modes specific to memory allocation and request scheduling that don’t exist in simpler inference engines.
Why This Happens
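vLLM’s speed depends on aggressive memory allocation: after loading the weights, it reserves whatever remains of its memory budget for the KV cache (key-value attention states) ahead of time. This is great for throughput but brittle: if the weights leave too little room for the cache, long prompts fail or startup aborts; if the budget exceeds what is actually free on the GPU, allocation fails with an OOM. The --gpu-memory-utilization flag (default 0.90) sets that budget as a fraction of total GPU memory, covering weights, activations, and KV cache together.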
Tokenizer bugs are common because vLLM uses HuggingFace tokenizers and expects them to produce exactly the same token IDs the model was trained on. Mismatched versions (e.g., Llama 3 tokenizer loaded for a Llama 2 model) produce gibberish.
Diagnostic Timeline
When vLLM fails to start or under-performs, the reflex is “I’m OOM, let me try a smaller model.” That is wrong roughly two-thirds of the time. Walk this timeline first.
Minute 0 — Wrong first instinct. You see CUDA out of memory in the log, swap a 70B model for a 13B one, and accept worse quality. Or you cut max_model_len from 32K to 4K and break user requests. Both are bandages. The real fix is usually a configuration mismatch — tensor_parallel_size set wrong, gpu_memory_utilization set too low so the cache is starved, or an attention backend that does not match your hardware.
Minute 1 — Discriminating evidence. Read the first 30 lines of vLLM’s startup log carefully. vLLM prints Tensor parallel size: N, the resolved attention backend (xFormers, FlashAttention, Triton), the KV cache size in tokens, and the chosen gpu_memory_utilization. Three lines tell you almost everything: if KV cache is 0 tokens, the model ate all available memory; if the attention backend is Triton on an A100, you are running the slow path; if Tensor parallel size: 1 on a 2-GPU box, you forgot the flag.
Minute 2 — Next check. Run nvidia-smi while the server is loading. Watch the per-GPU memory rise. With tensor_parallel_size=N, you should see N GPUs filling at roughly the same rate. If only one GPU loads, vLLM did not pick up the TP setting (typo, or CUDA_VISIBLE_DEVICES mask hiding GPUs). If one GPU loads to 100% and another sits at 0%, you have a topology problem (NVLink missing, or NCCL_P2P_DISABLE=1 set somewhere).
Minute 3 — Actual root cause. Three causes account for almost every vLLM startup failure:
- tensor_parallel_size mismatch. You set --tensor-parallel-size 4 on a 2-GPU node and get a hard error. Or you set --tensor-parallel-size 1 on a 4-GPU node and the model OOMs because it cannot use the other 3. Always match TP to the GPUs you intend to use and confirm it divides the model’s attention head count.
- gpu_memory_utilization too low. The default is 0.90. Setting 0.50 to “leave room” actually starves the KV cache so you fit only one or two concurrent requests, and throughput collapses. The model itself only uses a fixed amount; the rest is cache. Raise utilization toward 0.95 unless you also run other processes on the GPU.
- Attention backend mismatch (Triton vs xFormers vs FlashAttention). On Ampere/Hopper, FlashAttention is fastest but requires a recent vLLM build and the right CUDA/torch versions. If vLLM falls back to Triton silently, throughput halves. Force the backend with --attention-backend FLASH_ATTN (on releases that predate that flag, the VLLM_ATTENTION_BACKEND=FLASH_ATTN environment variable) and watch for the import error that explains why the fast path is unavailable.
If none of these fit, then look at model size and quantization. By then you actually know whether you are out of room or out of configuration.
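The three startup-log checks from Minute 1 can be written down as a small triage helper. This is only a sketch and not part of vLLM: the function name, its parameters, and the message strings are hypothetical, and you fill in the values by hand after reading the log and nvidia-smi.

```python
def triage_startup(kv_cache_tokens, attention_backend, tensor_parallel_size,
                   gpu_count, ampere_or_newer=True):
    """Return a list of findings based on values read off the startup log.

    All arguments are entered by hand; nothing here parses vLLM output.
    """
    findings = []
    if kv_cache_tokens == 0:
        findings.append("KV cache is 0 tokens: the model used all available memory")
    if ampere_or_newer and attention_backend.lower() == "triton":
        findings.append("Triton backend on Ampere/Hopper: likely the slow path")
    if tensor_parallel_size > gpu_count:
        findings.append("tensor_parallel_size exceeds the GPU count")
    elif tensor_parallel_size == 1 and gpu_count > 1:
        findings.append("tensor_parallel_size is 1 on a multi-GPU box: flag missing?")
    return findings


# Example: a 2-GPU node where the tensor-parallel flag was forgotten
for finding in triage_startup(kv_cache_tokens=0, attention_backend="Triton",
                              tensor_parallel_size=1, gpu_count=2):
    print(finding)
```

An empty list means none of the three common causes applies, which is the point where model size and quantization become worth looking at.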
Fix 1: CUDA OOM During Model Loading
torch.cuda.OutOfMemoryError: CUDA out of memory.
Check whether the model actually fits. Approximate memory per parameter:
| Precision | Bytes/param | 7B model | 70B model |
|---|---|---|---|
| FP32 | 4 | 28 GB | 280 GB |
| FP16 / BF16 | 2 | 14 GB | 140 GB |
| INT8 | 1 | 7 GB | 70 GB |
| INT4 (AWQ / GPTQ) | 0.5 | 3.5 GB | 35 GB |
Add ~30% overhead for activations and KV cache. A 70B FP16 model then needs roughly 180 GB of GPU memory (140 GB × 1.3), so it doesn’t fit on a single A100 (80 GB) without quantization or tensor parallelism.
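The table and the 30% rule of thumb translate directly into a quick fit check. The sketch below assumes the flat overhead figure from this section; the function names are made up for illustration, and the result is a rough estimate rather than what vLLM will actually allocate.

```python
# Bytes per parameter, taken from the table above
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def estimate_gpu_memory_gb(params_billions, precision, overhead=0.30):
    """Weights plus a flat overhead fraction, in GB (1 GB = 1e9 bytes)."""
    weights_gb = params_billions * BYTES_PER_PARAM[precision]
    return weights_gb * (1.0 + overhead)


def fits(params_billions, precision, gpu_gb, num_gpus=1):
    """True if the estimate fits in the combined memory of num_gpus GPUs."""
    return estimate_gpu_memory_gb(params_billions, precision) <= gpu_gb * num_gpus


print(round(estimate_gpu_memory_gb(70, "fp16")))  # 70B model in FP16
print(fits(70, "fp16", gpu_gb=80))                # one 80 GB GPU
print(fits(70, "int4", gpu_gb=80))                # 4-bit weights on the same GPU
print(fits(70, "fp16", gpu_gb=80, num_gpus=4))    # tensor parallel over 4 GPUs
```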
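Lower gpu_memory_utilization only when something outside vLLM’s budget needs GPU memory (another process, or allocations vLLM’s startup profiling did not count). The setting caps weights, activations, and KV cache together, so lowering it does not give the weights more room; shrinking max_model_len is what reduces the KV cache requirement:
from vllm import LLM
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    gpu_memory_utilization=0.80,  # Default 0.90; lower only if other allocations need headroom
    max_model_len=4096,           # Smaller context limit means a smaller minimum KV cache
)
Use quantization to fit large models on smaller hardware: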
# 4-bit AWQ — best quality for 4-bit
llm = LLM(model="TheBloke/Llama-2-70B-Chat-AWQ")
# GPTQ — another 4-bit option
llm = LLM(model="TheBloke/Llama-2-70B-Chat-GPTQ")
# FP8 (H100, Ada Lovelace architecture)
llm = LLM(
    model="neuralmagic/Meta-Llama-3-70B-Instruct-FP8",
    quantization="fp8",
)
Use tensor parallelism to split across multiple GPUs:
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=4,  # Split across 4 GPUs
)
tensor_parallel_size cannot exceed the number of GPUs visible to the process, AND the model’s attention head count must be divisible by it (in practice, powers of 2: 1, 2, 4, 8).
Pro Tip: Start with small max_model_len and gpu_memory_utilization=0.85. Once the model loads successfully, increase both until you find the limit. vLLM’s default max_model_len matches the model’s max context — but reducing it frees VRAM for larger batch sizes, which often matters more for throughput than supporting the full context length.
Fix 2: Tokenizer Mismatch — Garbage Output
Prompt: "Hello, how are you?"
Generated: "<|im_start|><|im_start|><|im_start|>a a a a a..."
The tokenizer doesn’t match the model. Common causes:
- Loaded a custom tokenizer.json from one model with weights from another.
- Chat template mismatch — Llama 3 expects different special tokens than Llama 2.
- Tokenizer version drift — the tokenizer was updated after the model was trained.
Use the model’s own tokenizer explicitly:
from vllm import LLM
# vLLM loads the tokenizer from the model repo by default
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
# If you need a different tokenizer (unusual), specify it
llm = LLM(
    model="custom-org/custom-llama-variant",
    tokenizer="meta-llama/Meta-Llama-3-8B-Instruct",  # Use Llama 3 tokenizer
)
Use chat templates instead of raw prompts for instruction-tuned models:
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
tokenizer = llm.get_tokenizer()
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum entanglement briefly."},
]
# Apply the model's chat template; it inserts the special tokens correctly
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
sampling_params = SamplingParams(temperature=0.7, max_tokens=200)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
Common Mistake: Concatenating raw strings like "User: {question}\nAssistant: " for instruction-tuned models. Most modern chat models (Llama 3, Mistral, Qwen) rely on model-specific special tokens; Llama 3, for example, wraps each turn in <|start_header_id|>user<|end_header_id|>. The chat template generates them correctly; raw concatenation doesn’t, and the model’s output quality drops dramatically.
Confirm the tokenizer being loaded by printing llm.get_tokenizer().__class__.__name__ and the special token IDs after construction — silent mismatches between the model’s expected bos_token_id and what the tokenizer returns are the root cause of most “the chat output is gibberish” bugs.
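One way to make that comparison concrete is to diff the special token IDs the model config expects against the ones the tokenizer reports. A minimal sketch: the helper name is hypothetical, and it takes plain dicts so it runs without downloading a model. With real objects you would typically fill the dicts from attributes such as bos_token_id, eos_token_id, and pad_token_id on the tokenizer and on the model config.

```python
def special_token_mismatches(model_ids, tokenizer_ids):
    """Return {name: (model_id, tokenizer_id)} for every ID that differs.

    Names missing on one side are compared against None, so a token the
    tokenizer does not define at all also shows up as a mismatch.
    """
    names = sorted(set(model_ids) | set(tokenizer_ids))
    return {
        name: (model_ids.get(name), tokenizer_ids.get(name))
        for name in names
        if model_ids.get(name) != tokenizer_ids.get(name)
    }


# Illustrative values only, not taken from a real checkpoint
model_ids = {"bos_token_id": 1, "eos_token_id": 2}
tokenizer_ids = {"bos_token_id": 128000, "eos_token_id": 2}
print(special_token_mismatches(model_ids, tokenizer_ids))
```

An empty result does not prove the tokenizer is right, but a non-empty one is a strong hint that the wrong tokenizer or chat template was loaded.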
Fix 3: OpenAI-Compatible API Server
vLLM can run as an OpenAI API drop-in replacement:
# Start the server
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --port 8000 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 8192
Use with the OpenAI Python client:
from openai import OpenAI
client = OpenAI(
    api_key="EMPTY",  # vLLM doesn't require auth by default
    base_url="http://localhost:8000/v1",
)
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 2+2?"},
    ],
    temperature=0.7,
    max_tokens=100,
)
print(response.choices[0].message.content)
Streaming responses:
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Tell me a story."}],
    stream=True,
)
for chunk in stream:
    # Some chunks carry no choices or an empty delta; skip those
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
Connection timeouts usually mean the server is still loading the model:
# Check the server started successfully before sending requests
curl http://localhost:8000/health
# Returns 200 when model is loaded and ready
Set API key for production:
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --api-key your-secret-key
client = OpenAI(
    api_key="your-secret-key",
    base_url="http://vllm-server:8000/v1",
)
For OpenAI API client patterns and streaming, see OpenAI API not working.
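The /health check from this section can also be scripted, so a client waits for the model to finish loading instead of timing out. The sketch below uses only the standard library; the function names, the 10-minute default timeout, and the 5-second interval are arbitrary choices for illustration, not vLLM defaults.

```python
import time
import urllib.error
import urllib.request


def http_probe(url):
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet
        return False


def wait_until_ready(url, timeout_s=600.0, interval_s=5.0,
                     probe=http_probe, sleep=time.sleep):
    """Poll url until probe(url) is True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Usage (needs a running server, so it is left commented out):
# if wait_until_ready("http://localhost:8000/health"):
#     print("server is ready")
```

The probe and sleep arguments are injectable so the loop can be exercised in a test without a server.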
Fix 4: Tensor Parallelism — GPU Count Matters
ValueError: Cannot split tensors evenly.
tensor_parallel_size=4 but only 2 GPUs available
Rules for tensor_parallel_size:
- Must not exceed the number of available GPUs (nvidia-smi shows your count); to use every GPU on a single node, set it equal to that count
- Must divide the model’s attention head count evenly
- Stick to powers of 2 (1, 2, 4, 8) for best compatibility
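These rules are easy to check before launching the server. The helper below is a hypothetical pre-flight check, not a vLLM API: you supply the GPU count and the attention head count yourself (the latter is usually listed as num_attention_heads in the model’s config.json).

```python
def check_tensor_parallel(tp_size, visible_gpus, num_attention_heads):
    """Return a list of problems with a tensor_parallel_size choice."""
    problems = []
    if tp_size < 1:
        problems.append("tensor_parallel_size must be at least 1")
        return problems
    if tp_size > visible_gpus:
        problems.append(
            f"tensor_parallel_size={tp_size} but only {visible_gpus} GPUs visible"
        )
    if num_attention_heads % tp_size != 0:
        problems.append(
            f"{num_attention_heads} attention heads not divisible by {tp_size}"
        )
    if (tp_size & (tp_size - 1)) != 0:
        # Bit trick: powers of 2 have exactly one bit set
        problems.append(f"{tp_size} is not a power of 2")
    return problems


# Example values: 64 attention heads, 2 visible GPUs
print(check_tensor_parallel(tp_size=4, visible_gpus=2, num_attention_heads=64))
print(check_tensor_parallel(tp_size=2, visible_gpus=2, num_attention_heads=64))
```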
# Check GPU count
nvidia-smi -L
# GPU 0: NVIDIA A100-SXM4-80GB
# GPU 1: NVIDIA A100-SXM4-80GB
# → 2 GPUs available
# Set tensor parallel to 2
vllm serve meta-llama/Meta-Llama-3-70B-Instruct --tensor-parallel-size 2
Pipeline parallelism for models that don’t fit even with tensor parallelism:
llm = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct",
    tensor_parallel_size=8,
    pipeline_parallel_size=4,  # 8*4 = 32 GPUs total
)
Per-model GPU assignment for multi-model setups:
# Use specific GPUs, not all visible
CUDA_VISIBLE_DEVICES=0,1 vllm serve model_a --tensor-parallel-size 2 --port 8000
CUDA_VISIBLE_DEVICES=2,3 vllm serve model_b --tensor-parallel-size 2 --port 8001
Fix 5: KV Cache — max_model_len and Batching
ValueError: This model's maximum context length is 4096 tokens.
Your request had 8192 tokens (7936 in prompt; 256 for generation).
vLLM enforces max_model_len as a hard per-request limit on prompt plus generated tokens. Requests longer than this are rejected, even if the underlying model supports longer contexts.
Set max_model_len to your actual needs:
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --max-model-len 8192  # Llama 3 8B's native limit; 32K needs a model trained for long context
The trade-off: larger max_model_len means fewer concurrent requests are guaranteed to fit in GPU memory, because a request that grows to the max length occupies that much KV cache. PagedAttention allocates cache blocks on demand, so shorter requests use less.
# Scenario A: Many short requests
--max-model-len 4096 # Fits more concurrent requests, each up to 4K
# Scenario B: Few long requests
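--max-model-len 32768 # Fewer concurrent, each up to 32K
Estimate worst-case concurrent request capacity (every request at full length):
available_kv_memory = (total_gpu_mem * gpu_memory_utilization) - model_size
kv_per_request = max_model_len * num_layers * kv_width * 2 (K and V) * 2 (FP16 bytes)
max_concurrent = available_kv_memory / kv_per_request
Here kv_width = num_kv_heads * head_dim. It equals hidden_size only for models without grouped-query attention. Llama-3-8B has 32 layers and hidden size 4096, but caches 8 KV heads of dimension 128, so kv_width = 1024. At max_model_len=4096 in FP16:
kv_per_request = 4096 * 32 * 1024 * 2 * 2 = 512 MiB per request
- On 80GB GPU at 0.90 utilization with 16GB model: (72GB - 16GB) / 0.5GB = ~110 concurrent full-length requests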
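The same estimate as a small calculator. This is a back-of-the-envelope sketch with hypothetical function names; it takes the per-token KV width as a parameter because models with grouped-query attention cache num_kv_heads * head_dim values per layer rather than hidden_size, and it ignores activation memory, so real capacity will be somewhat lower.

```python
def kv_bytes_per_token(num_layers, kv_width, dtype_bytes=2):
    """Bytes of KV cache per token: one K and one V vector per layer."""
    return 2 * num_layers * kv_width * dtype_bytes


def max_full_length_requests(total_gpu_gb, gpu_memory_utilization, model_gb,
                             max_model_len, num_layers, kv_width, dtype_bytes=2):
    """Worst-case concurrency if every request grows to max_model_len."""
    available = (total_gpu_gb * gpu_memory_utilization - model_gb) * 1024**3
    per_request = max_model_len * kv_bytes_per_token(num_layers, kv_width, dtype_bytes)
    return int(available // per_request)


# 32 layers, 4096-token requests, FP16 cache
per_request = 4096 * kv_bytes_per_token(num_layers=32, kv_width=4096)
print(per_request / 1024**3, "GiB per request when kv_width equals hidden_size")

# Same GPU budget, full-width KV versus a grouped-query model with kv_width=1024
print(max_full_length_requests(80, 0.90, 16, 4096, 32, kv_width=4096))
print(max_full_length_requests(80, 0.90, 16, 4096, 32, kv_width=1024))
```

Passing the KV width explicitly makes it obvious how much grouped-query attention changes the answer for the same layer count and context length.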
Prefix caching reuses KV cache across requests with shared prefixes (e.g., system prompts):
vllm serve meta-llama/Meta-Llama-3-8B-Instruct --enable-prefix-caching
This gives a dramatic speedup for chatbot/agent workloads where every request starts with the same system prompt, because the shared prefix is computed once and reused.
Fix 6: Sampling Parameters and Output Quality
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
# Commonly tuned sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,        # Higher = more random. 0 = greedy (deterministic)
    top_p=0.95,             # Nucleus sampling: keep the top 95% probability mass
    top_k=50,               # Top-K sampling (disabled if -1)
    max_tokens=256,         # Max new tokens to generate
    frequency_penalty=0.0,  # Penalize tokens in proportion to how often they appeared
    presence_penalty=0.0,   # Penalize any token that has already appeared
    stop=["<|eot_id|>"],    # Stop sequences
    seed=42,                # Reproducibility
)
prompts = ["Explain PagedAttention in one sentence."]  # Example input
outputs = llm.generate(prompts, sampling_params)
Deterministic output (for testing, reproducibility):
SamplingParams(
    temperature=0.0,  # Greedy: always picks the highest probability token
    max_tokens=256,
    seed=42,
)
Output token probabilities for debugging:
sampling_params = SamplingParams(
    logprobs=5,  # Top 5 logprobs per token
    max_tokens=50,
)
outputs = llm.generate(prompts, sampling_params)
completion = outputs[0].outputs[0]
# completion.logprobs holds one entry per generated token
for token, logprob in zip(completion.token_ids, completion.logprobs):
    print(token, logprob)
Fix 7: Model Download and HuggingFace Authentication
OSError: meta-llama/Meta-Llama-3-8B-Instruct is not a local folder and is
not a valid model identifier on Hugging Face
Gated models (Llama, Mistral, etc.) require authentication:
# Set token via environment variable
export HF_TOKEN=hf_your_token_here
vllm serve meta-llama/Meta-Llama-3-8B-Instruct
# Or login via HuggingFace CLI
huggingface-cli login
Pre-download the model to avoid download delays at server start:
# Download to the default cache
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct
# vLLM uses the cached version
vllm serve meta-llama/Meta-Llama-3-8B-Instruct
Custom model path for local or self-hosted models:
vllm serve /path/to/local/model --served-model-name my-model
# Use the served name in API calls
response = client.chat.completions.create(
    model="my-model",  # Matches --served-model-name
    messages=[...],
)
Mirror/proxy for restricted networks:
# Set HF_ENDPOINT to your mirror
export HF_ENDPOINT=https://hf-mirror.com
vllm serve meta-llama/Meta-Llama-3-8B-Instruct
If huggingface-cli login worked but vllm serve still complains the model is gated, your process is probably reading a different HOME directory than your shell. Check HF_HOME in the systemd unit or Docker env to confirm.
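To see what the serving process actually inherits, run a small report inside the same environment (the container, or the user the systemd unit runs as). This is a sketch with a hypothetical function name. It assumes the usual huggingface_hub layout, where the cache root defaults to ~/.cache/huggingface and the saved login token sits in a file named token under that root; it does not account for overrides such as XDG_CACHE_HOME.

```python
import os
from pathlib import Path


def hf_env_report(env=None):
    """Summarize where this process will look for the HF cache and token."""
    env = os.environ if env is None else env
    home = env.get("HOME", str(Path.home()))
    # Assumed default cache root when HF_HOME is unset
    hf_home = env.get("HF_HOME") or str(Path(home) / ".cache" / "huggingface")
    token_file = Path(hf_home) / "token"  # assumed location of the saved login token
    return {
        "HOME": home,
        "HF_HOME": hf_home,
        "HF_TOKEN set": bool(env.get("HF_TOKEN")),
        "token file": str(token_file),
        "token file exists": token_file.is_file(),
    }


for key, value in hf_env_report().items():
    print(f"{key}: {value}")
```

If the report differs between your login shell and the service environment, that difference is the likely reason the login is not being picked up.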
Fix 8: Structured Output and JSON Mode
vLLM supports structured output via the guided_* parameters:
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
# JSON schema constraint
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "email": {"type": "string"},
    },
    "required": ["name", "age"],
}
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=200,
    guided_decoding=GuidedDecodingParams(json=schema),
)
outputs = llm.generate(
    ["Extract user info from: Alice is 30 years old, [email protected]"],
    sampling_params,
)
# Output is constrained to the schema; it can still be cut off at max_tokens
print(outputs[0].outputs[0].text)
Via the OpenAI API:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Generate a user profile as JSON."}],
    extra_body={
        "guided_json": schema,  # Same schema dict as in the previous example
    },
)
Regex-constrained output:
sampling_params = SamplingParams(
    guided_decoding=GuidedDecodingParams(
        regex=r"\d{4}-\d{2}-\d{2}",  # Force a date pattern
    ),
)
Still Not Working?
vLLM vs Other Inference Engines
- vLLM — Best throughput via continuous batching and PagedAttention. Good general-purpose choice.
- Text Generation Inference (TGI) — HuggingFace’s server, similar performance. Better HF ecosystem integration.
- llama.cpp — CPU inference, quantized models. Best for edge/low-resource.
- Ollama — Wrapper over llama.cpp with easy model management. For Ollama-specific patterns, see Ollama not working.
Monitoring and Metrics
vLLM exposes Prometheus metrics:
# The OpenAI-compatible server exposes /metrics by default; no extra flag is needed
vllm serve meta-llama/Meta-Llama-3-8B-Instruct
# Metrics at http://localhost:8000/metrics
Key metrics:
- vllm:num_requests_running (active requests)
- vllm:num_requests_waiting (queued requests)
- vllm:gpu_cache_usage_perc (KV cache utilization)
- vllm:time_to_first_token_seconds (TTFT latency)
- vllm:e2e_request_latency_seconds (end-to-end latency)
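For a quick look without a Prometheus server, the text returned by /metrics can be parsed by hand. The sketch below handles only the simple "name{labels} value" form without timestamps, ignores labels, and sums samples that share a name; the function name is hypothetical and the sample text is hand-written for illustration, not captured from a real server.

```python
def parse_metrics(text, wanted):
    """Extract values for the wanted metric names from Prometheus text format.

    Labels are ignored and samples with the same name are summed.
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name in wanted:
            values[name] = values.get(name, 0.0) + float(value)
    return values


# Hand-written sample in Prometheus exposition format, not real server output
sample = """
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="my-model"} 3.0
vllm:num_requests_waiting{model_name="my-model"} 12.0
"""
print(parse_metrics(sample, {"vllm:num_requests_running", "vllm:num_requests_waiting"}))
```

A waiting count that stays well above the running count is the usual sign that the KV cache, not compute, is the bottleneck.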
Fine-Tuning and LoRA Adapters
Load LoRA adapters at runtime:
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --enable-lora \
  --lora-modules my-lora=/path/to/lora/adapter
response = client.chat.completions.create(
    model="my-lora",  # Use the LoRA adapter
    messages=[...],
)
LoRA adapters must match the base model’s architecture exactly. Loading an adapter trained against Meta-Llama-3-8B onto Meta-Llama-3-8B-Instruct may load without error but produce degraded outputs, because the base weights and the chat-template alignment differ. Train against the variant you serve.
Production Deployment on Kubernetes
vLLM images are available from the official registry:
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          command: ["vllm", "serve"]
          args: ["meta-llama/Meta-Llama-3-8B-Instruct", "--tensor-parallel-size=1"]
          resources:
            limits:
              nvidia.com/gpu: 1
For Kubernetes pod and resource management issues, see Kubernetes OOMKilled.
Throughput Drops After 30 Minutes of Steady Load
You ramp traffic to vLLM, it sustains 200 req/s for half an hour, then collapses to 40 req/s with no errors. Almost always KV cache fragmentation under continuous batching — long-running requests pin cache pages that newly-arriving short requests cannot share. Mitigate with --enable-prefix-caching for shared system prompts, lower max_num_seqs to reduce concurrency, or restart on a schedule. vLLM has improved fragmentation handling in recent releases; upgrade before redesigning your traffic shaping.
Streaming Responses Stall Mid-Generation
The OpenAI-compatible streaming endpoint emits the first chunk fast, then pauses for seconds before the next token. The usual cause is buffering between the server and your code: the Python requests library (unless you iterate a stream=True response) and some HTTP/2 proxies aggregate chunks rather than flushing per token. Use httpx or the official OpenAI Python SDK with streaming enabled, and verify that no nginx proxy in front is buffering the response (set proxy_buffering off for SSE endpoints).
Tokenizer Loads From the Wrong HuggingFace Cache
vLLM caches models under HF_HOME (or ~/.cache/huggingface). On a multi-user GPU server, two users hitting the same model name can race on the lockfile and one ends up with a corrupt download. Symptoms include random KeyError on tokenizer config or “tensor shape mismatch” at load. Use per-user HF_HOME directories to avoid the race; HF_HUB_DISABLE_TELEMETRY=1 only turns off usage reporting and does not affect cache locking. For broader HuggingFace cache and token issues that affect vLLM startup, see HuggingFace Transformers not working.
Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.