Fix: vLLM Not Working — CUDA OOM, Model Loading, and API Server Errors
Quick Answer
How to fix vLLM errors — CUDA out of memory during model load, tokenizer mismatch with HuggingFace, tensor parallel size does not match GPU count, KV cache exceeds memory, OpenAI API compatibility issues, and max_model_len too large.
The Error
You load a 70B model and CUDA runs out of memory before the first token:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB.
GPU 0 has a total capacity of 23.65 GiB of which 1.24 GiB is free.

Or the tokenizer produces garbage because it doesn’t match the model:
Generated text: "<|im_start|><|im_start|>a a a a a a..."

Or you start the OpenAI-compatible server and requests time out:
openai.APIConnectionError: Connection error.
httpx.ConnectTimeout: timed out

Or tensor parallelism fails with a mismatched GPU count:
ValueError: Cannot split tensors evenly. tensor_parallel_size=4 but only 2 GPUs available

Or the max context length error hits at inference time:
ValueError: This model's maximum context length is 4096 tokens.
Your request had 8192 tokens.

vLLM is built around two clever ideas: PagedAttention (efficient KV cache allocation) and continuous batching (new requests join mid-batch). These make it 2–10x faster than naive HuggingFace inference — but they also create failure modes specific to memory allocation and request scheduling that don’t exist in simpler inference engines.
Why This Happens
vLLM’s speed depends on aggressive memory allocation: it reserves as much GPU memory as possible for the KV cache (key-value attention states) ahead of time. This is great for throughput but brittle — if the cache is too small, long prompts fail; if it’s too large, the model itself can’t load. The --gpu-memory-utilization flag (default 0.90) controls this trade-off.
Tokenizer bugs are common because vLLM uses HuggingFace tokenizers and expects them to produce exactly the same token IDs the model was trained on. Mismatched versions (e.g., Llama 3 tokenizer loaded for a Llama 2 model) produce gibberish.
Fix 1: CUDA OOM During Model Loading
torch.cuda.OutOfMemoryError: CUDA out of memory.

Check whether the model actually fits. Approximate memory per parameter:
| Precision | Bytes/param | 7B model | 70B model |
|---|---|---|---|
| FP32 | 4 | 28 GB | 280 GB |
| FP16 / BF16 | 2 | 14 GB | 140 GB |
| INT8 | 1 | 7 GB | 70 GB |
| INT4 (AWQ / GPTQ) | 0.5 | 3.5 GB | 35 GB |
Add ~30% overhead for activations and KV cache. A 70B FP16 model needs at least 180 GB of GPU memory, which doesn’t fit on a single A100 (80 GB) without quantization or tensor parallelism.
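The table and the 30% overhead rule can be turned into a quick back-of-envelope check before you try to load anything. A rough sketch, not an exact accounting; real usage varies with architecture, context length, and batch size:

```python
# Rough VRAM estimate: weights at the table's bytes/param, plus ~30% overhead
# for activations and KV cache.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def estimate_vram_gb(params_billion: float, precision: str, overhead: float = 0.30) -> float:
    """Model weights plus a flat overhead fraction, in GB."""
    weights_gb = params_billion * BYTES_PER_PARAM[precision]
    return weights_gb * (1 + overhead)

print(round(estimate_vram_gb(70, "fp16")))  # 182 -- needs multiple GPUs or quantization
print(round(estimate_vram_gb(70, "int4")))  # 46  -- fits a single A100 80GB
```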
Reduce gpu_memory_utilization to leave more room for the model itself:
from vllm import LLM
llm = LLM(
model="meta-llama/Llama-3-70B-Instruct",
gpu_memory_utilization=0.80, # Default 0.90 — reduce if model barely fits
max_model_len=4096, # Reduce KV cache size
)

Use quantization to fit large models on smaller hardware:
# 4-bit AWQ — best quality for 4-bit
llm = LLM(model="TheBloke/Llama-2-70B-Chat-AWQ")
# GPTQ — another 4-bit option
llm = LLM(model="TheBloke/Llama-2-70B-Chat-GPTQ")
# FP8 (H100, Ada Lovelace architecture)
llm = LLM(
model="neuralmagic/Meta-Llama-3-70B-Instruct-FP8",
quantization="fp8",
)

Use tensor parallelism to split across multiple GPUs:
llm = LLM(
model="meta-llama/Llama-3-70B-Instruct",
tensor_parallel_size=4, # Split across 4 GPUs
)

tensor_parallel_size must equal the number of GPUs you have AND divide evenly into the model’s attention head count (usually powers of 2: 1, 2, 4, 8).
Pro Tip: Start with small max_model_len and gpu_memory_utilization=0.85. Once the model loads successfully, increase both until you find the limit. vLLM’s default max_model_len matches the model’s max context — but reducing it frees VRAM for larger batch sizes, which often matters more for throughput than supporting the full context length.
Fix 2: Tokenizer Mismatch — Garbage Output
Prompt: "Hello, how are you?"
Generated: "<|im_start|><|im_start|><|im_start|>a a a a a..."

The tokenizer doesn’t match the model. Common causes:
- Loaded a custom tokenizer.json from one model with weights from another.
- Chat template mismatch — Llama 3 expects different special tokens than Llama 2.
- Tokenizer version drift — the tokenizer was updated after the model was trained.
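One way to catch a swapped tokenizer before it produces gibberish is to compare the token IDs it emits against those from the model repo's own tokenizer on a few probe strings. A minimal sketch: ToyTok is a hypothetical stand-in so the check runs without downloading anything, but any two HuggingFace tokenizers with encode() work the same way:

```python
# Sketch: detect a mismatched tokenizer by comparing token IDs on probe strings.
# ToyTok is a hypothetical stand-in; with real models, pass two HuggingFace
# tokenizers (e.g. the one bundled with the weights vs. the one you configured).
def same_tokenization(tok_a, tok_b, probes) -> bool:
    """True only if both tokenizers emit identical IDs for every probe."""
    return all(tok_a.encode(p) == tok_b.encode(p) for p in probes)

class ToyTok:
    """Stand-in tokenizer whose 'vocab' is just a character offset."""
    def __init__(self, offset: int):
        self.offset = offset
    def encode(self, text: str) -> list:
        return [ord(c) + self.offset for c in text]

probes = ["Hello, how are you?", "<|im_start|>", "def f(x):"]
print(same_tokenization(ToyTok(0), ToyTok(0), probes))  # True  -- matching vocab
print(same_tokenization(ToyTok(0), ToyTok(1), probes))  # False -- drifted vocab
```

If any probe disagrees, the configured tokenizer is not the one the model was trained with.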
Use the model’s own tokenizer explicitly:
from vllm import LLM
# vLLM loads the tokenizer from the model repo by default
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
# If you need a different tokenizer (unusual), specify it
llm = LLM(
model="custom-org/custom-llama-variant",
tokenizer="meta-llama/Meta-Llama-3-8B-Instruct", # Use Llama 3 tokenizer
)

Use chat templates instead of raw prompts for instruction-tuned models:
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
tokenizer = llm.get_tokenizer()
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain quantum entanglement briefly."},
]
# Apply the model's chat template — handles special tokens correctly
prompt = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
sampling_params = SamplingParams(temperature=0.7, max_tokens=200)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)

Common Mistake: Concatenating raw strings like "User: {question}\nAssistant: " for instruction-tuned models. Most modern chat models (Llama 3, Mistral, Qwen) use specific special tokens like <|start_header_id|>user<|end_header_id|>. The chat template generates them correctly; raw concatenation doesn’t, and the model’s output quality drops dramatically.
For HuggingFace tokenizer patterns and chat templates, see HuggingFace Transformers not working.
Fix 3: OpenAI-Compatible API Server
vLLM can run as an OpenAI API drop-in replacement:
# Start the server
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
--port 8000 \
--gpu-memory-utilization 0.9 \
--max-model-len 8192

Use with the OpenAI Python client:
from openai import OpenAI
client = OpenAI(
api_key="EMPTY", # vLLM doesn't require auth by default
base_url="http://localhost:8000/v1",
)
response = client.chat.completions.create(
model="meta-llama/Meta-Llama-3-8B-Instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is 2+2?"},
],
temperature=0.7,
max_tokens=100,
)
print(response.choices[0].message.content)

Streaming responses:
stream = client.chat.completions.create(
model="meta-llama/Meta-Llama-3-8B-Instruct",
messages=[{"role": "user", "content": "Tell me a story."}],
stream=True,
)
for chunk in stream:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="", flush=True)

Connection timeouts — the server takes time to load the model:
# Check the server started successfully before sending requests
curl http://localhost:8000/health
# Returns 200 when model is loaded and ready

Set API key for production:
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
--api-key your-secret-key

client = OpenAI(
api_key="your-secret-key",
base_url="http://vllm-server:8000/v1",
)

For OpenAI API client patterns and streaming, see OpenAI API not working.
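To avoid the connection timeouts above, a client can poll the /health endpoint until the model finishes loading instead of sending requests blindly. A stdlib-only sketch, assuming the default port; tune the URL, timeout, and interval for your deployment:

```python
# Poll vLLM's /health endpoint until the model finishes loading (stdlib only).
# URL, timeout, and interval below are illustrative defaults.
import time
import urllib.error
import urllib.request

def wait_for_vllm(url: str = "http://localhost:8000/health",
                  timeout_s: float = 600.0, interval_s: float = 2.0) -> bool:
    """Return True once /health answers 200; False if timeout_s elapses first."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet, keep polling
        time.sleep(interval_s)
    return False

# Usage: if wait_for_vllm(): send the first client.chat.completions.create(...)
```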
Fix 4: Tensor Parallelism — GPU Count Matters
ValueError: Cannot split tensors evenly.
tensor_parallel_size=4 but only 2 GPUs available

Rules for tensor_parallel_size:
- Must match the number of available GPUs (nvidia-smi shows your count)
- Must divide the model’s attention head count evenly
- Stick to powers of 2 (1, 2, 4, 8) for best compatibility
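These rules can be checked programmatically. A small sketch, assuming you read num_attention_heads from the model's config.json (32 for Llama-3-8B, 64 for Llama-3-70B):

```python
# Sketch: enumerate valid tensor_parallel_size values from the two rules above.
# num_attention_heads comes from the model's config.json.
def valid_tp_sizes(num_gpus: int, num_attention_heads: int) -> list:
    """tp must not exceed the GPU count and must divide the head count evenly."""
    return [tp for tp in range(1, num_gpus + 1) if num_attention_heads % tp == 0]

print(valid_tp_sizes(num_gpus=2, num_attention_heads=32))  # [1, 2]
print(valid_tp_sizes(num_gpus=8, num_attention_heads=64))  # [1, 2, 4, 8]
```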
# Check GPU count
nvidia-smi -L
# GPU 0: NVIDIA A100-SXM4-80GB
# GPU 1: NVIDIA A100-SXM4-80GB
# → 2 GPUs available
# Set tensor parallel to 2
vllm serve meta-llama/Llama-3-70B-Instruct --tensor-parallel-size 2

Pipeline parallelism for models that don’t fit even with tensor parallelism:
llm = LLM(
model="meta-llama/Llama-3-405B-Instruct",
tensor_parallel_size=8,
pipeline_parallel_size=4, # 8*4 = 32 GPUs total
)

Uneven GPU assignment for multi-model setups:
# Use specific GPUs, not all visible
CUDA_VISIBLE_DEVICES=0,1 vllm serve model_a --tensor-parallel-size 2 --port 8000
CUDA_VISIBLE_DEVICES=2,3 vllm serve model_b --tensor-parallel-size 2 --port 8001

Fix 5: KV Cache — max_model_len and Batching
ValueError: This model's maximum context length is 4096 tokens.
Your request had 8192 tokens (7936 in prompt; 256 for generation).

vLLM pre-allocates the KV cache based on max_model_len. Requests longer than this are rejected, even if the underlying model supports longer contexts.
Set max_model_len to your actual needs:
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
--max-model-len 32768 # Supports up to 32K context

The trade-off: a larger max_model_len lets each request grow longer, so fewer concurrent requests fit in the KV cache at once.
# Scenario A: Many short requests
--max-model-len 4096 # Fits more concurrent requests, each up to 4K
# Scenario B: Few long requests
--max-model-len 32768 # Fewer concurrent, each up to 32K

Estimate concurrent request capacity:
available_kv_memory = (total_gpu_mem * gpu_memory_utilization) - model_size
kv_per_request = max_model_len * num_layers * hidden_size * 2 (K and V) * 2 (FP16 bytes)
max_concurrent = available_kv_memory / kv_per_request

For Llama-3-8B (32 layers, 4096 hidden) at max_model_len=4096 in FP16:
kv_per_request = 4096 * 32 * 4096 * 2 * 2 = 2 GB per request
On an 80 GB GPU with a 16 GB model: 64 GB / 2 GB ≈ 30 concurrent requests
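The same arithmetic as code. This mirrors the simplified formula above (full multi-head attention in FP16, K and V, no fragmentation); note that Llama 3 actually uses grouped-query attention, so its real per-request KV footprint is several times smaller, making this an upper bound on KV size:

```python
# Capacity estimate from the simplified formula: max_model_len * layers *
# hidden * 2 (K and V) * 2 (FP16 bytes) per request.
def max_concurrent_requests(total_gpu_gb: float, gpu_mem_util: float,
                            model_gb: float, max_model_len: int,
                            num_layers: int, hidden_size: int) -> int:
    kv_bytes_per_request = max_model_len * num_layers * hidden_size * 2 * 2
    available_kv_bytes = (total_gpu_gb * gpu_mem_util - model_gb) * 1024**3
    return int(available_kv_bytes // kv_bytes_per_request)

# Llama-3-8B-style numbers from the text: 32 layers, hidden 4096, 4K context
print(max_concurrent_requests(80, 1.0, 16, 4096, 32, 4096))   # 32 (the text's ~30)
print(max_concurrent_requests(80, 0.90, 16, 4096, 32, 4096))  # with the 0.90 default
```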
Prefix caching reuses KV cache across requests with shared prefixes (e.g., system prompts):
vllm serve meta-llama/Meta-Llama-3-8B-Instruct --enable-prefix-caching

This gives a dramatic speedup for chatbot and agent workloads where every request starts with the same system prompt.
Fix 6: Sampling Parameters and Output Quality
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
# Default sampling parameters
sampling_params = SamplingParams(
temperature=0.7, # Higher = more random. 0 = greedy (deterministic)
top_p=0.95, # Nucleus sampling — consider top 95% probability mass
top_k=50, # Top-K sampling (disabled if -1)
max_tokens=256, # Max new tokens to generate
frequency_penalty=0.0, # Penalize frequent tokens
presence_penalty=0.0, # Penalize any repeated token
stop=["<|eot_id|>"], # Stop sequences
seed=42, # Reproducibility
)
outputs = llm.generate(prompts, sampling_params)

Deterministic output (for testing, reproducibility):
sampling_params = SamplingParams(
temperature=0.0, # Greedy — always picks the highest probability token
max_tokens=256,
seed=42,
)

Output token probabilities for debugging:
sampling_params = SamplingParams(
logprobs=5, # Top 5 logprobs per token
max_tokens=50,
)
outputs = llm.generate(prompts, sampling_params)
for token, logprob in zip(outputs[0].outputs[0].token_ids, outputs[0].outputs[0].logprobs):
print(token, logprob)

Fix 7: Model Download and HuggingFace Authentication
OSError: meta-llama/Meta-Llama-3-8B-Instruct is not a local folder and is
not a valid model identifier on Hugging Face

Gated models (Llama, Mistral, etc.) require authentication:
# Set token via environment variable
export HF_TOKEN=hf_your_token_here
vllm serve meta-llama/Meta-Llama-3-8B-Instruct
# Or login via HuggingFace CLI
huggingface-cli login

Pre-download the model to avoid download delays at server start:
# Download to the default cache
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct
# vLLM uses the cached version
vllm serve meta-llama/Meta-Llama-3-8B-Instruct

Custom model path for local or self-hosted models:
vllm serve /path/to/local/model --served-model-name my-model

# Use the served name in API calls
response = client.chat.completions.create(
model="my-model", # Matches --served-model-name
messages=[...],
)

Mirror/proxy for restricted networks:
# Set HF_ENDPOINT to your mirror
export HF_ENDPOINT=https://hf-mirror.com
vllm serve meta-llama/Meta-Llama-3-8B-Instruct

For HuggingFace authentication patterns and HF_TOKEN setup, see HuggingFace Transformers not working.
Fix 8: Structured Output and JSON Mode
vLLM supports structured output via the guided_* parameters:
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
# JSON schema constraint
schema = {
"type": "object",
"properties": {
"name": {"type": "string"},
"age": {"type": "integer"},
"email": {"type": "string"},
},
"required": ["name", "age"],
}
sampling_params = SamplingParams(
temperature=0.7,
max_tokens=200,
guided_decoding=GuidedDecodingParams(json=schema),
)
outputs = llm.generate(
["Extract user info from: Alice is 30 years old, [email protected]"],
sampling_params,
)
# Output is constrained to valid JSON matching the schema
print(outputs[0].outputs[0].text)

Via the OpenAI API:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
model="meta-llama/Meta-Llama-3-8B-Instruct",
messages=[{"role": "user", "content": "Generate a user profile as JSON."}],
extra_body={
"guided_json": schema,
},
)

Regex-constrained output:
sampling_params = SamplingParams(
guided_decoding=GuidedDecodingParams(
regex=r"\d{4}-\d{2}-\d{2}", # Force a date pattern
),
)

Still Not Working?
vLLM vs Other Inference Engines
- vLLM — Best throughput via continuous batching and PagedAttention. Good general-purpose choice.
- Text Generation Inference (TGI) — HuggingFace’s server, similar performance. Better HF ecosystem integration.
- llama.cpp — CPU inference, quantized models. Best for edge/low-resource.
- Ollama — Wrapper over llama.cpp with easy model management. For Ollama-specific patterns, see Ollama not working.
Monitoring and Metrics
vLLM exposes Prometheus metrics:
# The /metrics endpoint is enabled by default
vllm serve meta-llama/Meta-Llama-3-8B-Instruct
# Metrics at http://localhost:8000/metrics

Key metrics:
- vllm:num_requests_running — active requests
- vllm:num_requests_waiting — queued requests
- vllm:gpu_cache_usage_perc — KV cache utilization
- vllm:time_to_first_token_seconds — TTFT latency
- vllm:e2e_request_latency_seconds — end-to-end latency
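In production these are scraped by Prometheus itself, but for a quick look the exposition format is plain text and easy to parse with the stdlib. A minimal sketch; the sample output below is illustrative, not captured from a live server:

```python
# Minimal parser for Prometheus exposition format ("name{labels} value" lines).
def parse_prom_metrics(text: str) -> dict:
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_part, _, value = line.rpartition(" ")
        metrics[name_part.split("{", 1)[0]] = float(value)
    return metrics

sample = """\
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="llama"} 3.0
vllm:gpu_cache_usage_perc{model_name="llama"} 0.42
"""
metrics = parse_prom_metrics(sample)
print(metrics["vllm:num_requests_running"])  # 3.0
print(metrics["vllm:gpu_cache_usage_perc"])  # 0.42
```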
Fine-Tuning and LoRA Adapters
Load LoRA adapters at runtime:
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
--enable-lora \
--lora-modules my-lora=/path/to/lora/adapter

response = client.chat.completions.create(
model="my-lora", # Use the LoRA adapter
messages=[...],
)

For LoRA training patterns with HuggingFace PEFT, see HuggingFace Transformers not working.
Production Deployment on Kubernetes
vLLM images are available from the official registry:
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          command: ["vllm", "serve"]
          args: ["meta-llama/Meta-Llama-3-8B-Instruct", "--tensor-parallel-size=1"]
          resources:
            limits:
              nvidia.com/gpu: 1

For Kubernetes pod and resource management issues, see Kubernetes OOMKilled.
Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.