Fix: vLLM Not Working — CUDA OOM, Model Loading, and API Server Errors

FixDevs

Quick Answer

How to fix vLLM errors — CUDA out of memory during model load, tokenizer mismatch with HuggingFace, tensor parallel size does not match GPU count, KV cache exceeds memory, OpenAI API compatibility issues, and max_model_len too large.

The Error

You load a 70B model and CUDA runs out of memory before the first token:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB.
GPU 0 has a total capacity of 23.65 GiB of which 1.24 GiB is free.

Or the tokenizer produces garbage because it doesn’t match the model:

Generated text: "<|im_start|><|im_start|>a a a a a a..."

Or you start the OpenAI-compatible server and requests time out:

openai.APIConnectionError: Connection error.
httpx.ConnectTimeout: timed out

Or tensor parallelism fails with a mismatched GPU count:

ValueError: Cannot split tensors evenly. tensor_parallel_size=4 but only 2 GPUs available

Or the max context length error hits at inference time:

ValueError: This model's maximum context length is 4096 tokens.
Your request had 8192 tokens.

vLLM is built around two clever ideas: PagedAttention (efficient KV cache allocation) and continuous batching (new requests join mid-batch). These make it 2–10x faster than naive HuggingFace inference — but they also create failure modes specific to memory allocation and request scheduling that don’t exist in simpler inference engines.

Why This Happens

vLLM’s speed depends on aggressive memory allocation: it reserves as much GPU memory as possible for the KV cache (key-value attention states) ahead of time. This is great for throughput but brittle — if the cache is too small, long prompts fail; if it’s too large, the model itself can’t load. The --gpu-memory-utilization flag (default 0.90) controls this trade-off.

Tokenizer bugs are common because vLLM uses HuggingFace tokenizers and expects them to produce exactly the same token IDs the model was trained on. Mismatched versions (e.g., Llama 3 tokenizer loaded for a Llama 2 model) produce gibberish.

Diagnostic Timeline

When vLLM fails to start or under-performs, the reflex is “I’m OOM, let me try a smaller model.” That is wrong roughly two-thirds of the time. Walk this timeline first.

Minute 0 — Wrong first instinct. You see CUDA out of memory in the log, swap a 70B model for a 13B one, and accept worse quality. Or you cut max_model_len from 32K to 4K and break user requests. Both are bandages. The real fix is usually a configuration mismatch — tensor_parallel_size set wrong, gpu_memory_utilization set too low so the cache is starved, or an attention backend that does not match your hardware.

Minute 1 — Discriminating evidence. Read the first 30 lines of vLLM’s startup log carefully. vLLM prints Tensor parallel size: N, the resolved attention backend (xFormers, FlashAttention, Triton), the KV cache size in tokens, and the chosen gpu_memory_utilization. Three lines tell you almost everything: if KV cache is 0 tokens, the model ate all available memory; if the attention backend is Triton on an A100, you are running the slow path; if Tensor parallel size: 1 on a 2-GPU box, you forgot the flag.

Minute 2 — Next check. Run nvidia-smi while the server is loading. Watch the per-GPU memory rise. With tensor_parallel_size=N, you should see N GPUs filling at roughly the same rate. If only one GPU loads, vLLM did not pick up the TP setting (typo, or CUDA_VISIBLE_DEVICES mask hiding GPUs). If one GPU loads to 100% and another sits at 0%, you have a topology problem (NVLink missing, or NCCL_P2P_DISABLE=1 set somewhere).

Minute 3 — Actual root cause. Three causes account for almost every vLLM startup failure:

  1. tensor_parallel_size mismatch. You set --tensor-parallel-size 4 on a 2-GPU node and get a hard error. Or you set --tensor-parallel-size 1 on a 4-GPU node and the model OOMs because it cannot use the other 3. Keep TP at or below the visible GPU count (normally equal to it, so no GPU sits idle) and confirm it divides the model’s attention head count.
  2. gpu_memory_utilization too low. The default is 0.90. Setting 0.50 to “leave room” actually starves the KV cache so you fit only one or two concurrent requests — throughput collapses. The model itself only uses a fixed amount; the rest is cache. Raise utilization toward 0.95 unless you also run other processes on the GPU.
  3. Attention backend mismatch (Triton vs xFormers vs FlashAttention). On Ampere/Hopper, FlashAttention is fastest but requires a recent vLLM build and the right CUDA/torch versions. If vLLM falls back to Triton silently, throughput halves. Force the backend with the VLLM_ATTENTION_BACKEND=FLASH_ATTN environment variable (some newer releases may also accept an --attention-backend flag; check vllm serve --help) and watch for the import error that explains why the fast path is unavailable.

If none of these fit, then look at model size and quantization. By then you actually know whether you are out of room or out of configuration.
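
The three checks from the timeline can be written down as a small triage helper. This is only a sketch: triage_startup and its argument names are hypothetical, and the values are ones you read off the startup log and nvidia-smi by hand, not something vLLM returns through an API.

```python
def triage_startup(kv_cache_tokens: int, attention_backend: str,
                   tensor_parallel_size: int, visible_gpus: int) -> list[str]:
    """Map values read from the vLLM startup log to likely configuration problems."""
    findings = []
    if kv_cache_tokens <= 0:
        findings.append("KV cache is empty: the weights consumed the whole memory budget")
    if attention_backend.lower() == "triton":
        # The article treats Triton as the slow path on Ampere/Hopper GPUs.
        findings.append("Triton attention backend: check why FlashAttention was not selected")
    if tensor_parallel_size > visible_gpus:
        findings.append("tensor_parallel_size exceeds the visible GPU count")
    elif tensor_parallel_size == 1 and visible_gpus > 1:
        findings.append("tensor_parallel_size is 1 on a multi-GPU node: was the flag forgotten?")
    return findings


# Example: a 2-GPU box where the TP flag was never passed and the cache is empty.
for finding in triage_startup(kv_cache_tokens=0, attention_backend="FlashAttention",
                              tensor_parallel_size=1, visible_gpus=2):
    print(finding)
```

An empty list means none of the three configuration causes apply, which is the point at which model size and quantization become the next suspects.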

Fix 1: CUDA OOM During Model Loading

torch.cuda.OutOfMemoryError: CUDA out of memory.

Check whether the model actually fits. Approximate memory per parameter:

Precision            Bytes/param   7B model   70B model
FP32                 4             28 GB      280 GB
FP16 / BF16          2             14 GB      140 GB
INT8                 1             7 GB       70 GB
INT4 (AWQ / GPTQ)    0.5           3.5 GB     35 GB

Add ~30% overhead for activations and KV cache. A 70B FP16 model therefore needs roughly 180 GB of GPU memory, which does not fit on a single A100 (80GB) without quantization or tensor parallelism.
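
The arithmetic behind these numbers is easy to script for other model sizes. The helper below is a rough sketch, not a vLLM API: estimate_gpu_memory_gb is a hypothetical name, and the 30% overhead is the rule of thumb quoted above rather than a measured value.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def estimate_gpu_memory_gb(params_billions: float, precision: str,
                           overhead: float = 0.30) -> tuple[float, float]:
    """Return (weights_gb, weights_plus_overhead_gb) for a model.

    Uses 1 GB = 1e9 bytes, so N billion parameters at B bytes each is N * B GB.
    """
    weights_gb = params_billions * BYTES_PER_PARAM[precision]
    return weights_gb, weights_gb * (1.0 + overhead)


for precision in ("fp16", "int8", "int4"):
    weights, total = estimate_gpu_memory_gb(70, precision)
    print(f"70B {precision}: {weights:.0f} GB weights, ~{total:.0f} GB with overhead")
```

Compare the second number against the combined memory of the GPUs you plan to use; if it does not fit, move down a precision row or add GPUs.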

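Lower gpu_memory_utilization only when something else on the GPU needs the headroom (another process, a desktop session). The value is vLLM’s total budget for weights, activations, and KV cache, so lowering it does not by itself make a model fit. Lowering max_model_len does help, because it shrinks the KV cache the engine must reserve:

from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    gpu_memory_utilization=0.80,   # Default 0.90; lower only if the GPU is shared
    max_model_len=4096,             # Reduce KV cache size
)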

Use quantization to fit large models on smaller hardware:

# 4-bit AWQ — best quality for 4-bit
llm = LLM(model="TheBloke/Llama-2-70B-Chat-AWQ")

# GPTQ — another 4-bit option
llm = LLM(model="TheBloke/Llama-2-70B-Chat-GPTQ")

# FP8 (H100, Ada Lovelace architecture)
llm = LLM(
    model="neuralmagic/Meta-Llama-3-70B-Instruct-FP8",
    quantization="fp8",
)

Use tensor parallelism to split across multiple GPUs:

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=4,   # Split across 4 GPUs
)

tensor_parallel_size cannot exceed the number of visible GPUs (set it equal to the count to use them all) and must divide evenly into the model’s attention head count (usually powers of 2: 1, 2, 4, 8).

Pro Tip: Start with a small max_model_len and gpu_memory_utilization=0.85. Once the model loads successfully, increase both until you find the limit. vLLM’s default max_model_len matches the model’s max context, but reducing it frees VRAM for larger batch sizes, which often matters more for throughput than supporting the full context length.

Fix 2: Tokenizer Mismatch — Garbage Output

Prompt: "Hello, how are you?"
Generated: "<|im_start|><|im_start|><|im_start|>a a a a a..."

The tokenizer doesn’t match the model. Common causes:

  1. Loaded a custom tokenizer.json from one model with weights from another.
  2. Chat template mismatch — Llama 3 expects different special tokens than Llama 2.
  3. Tokenizer version drift — the tokenizer was updated after the model was trained.

Use the model’s own tokenizer explicitly:

from vllm import LLM

# vLLM loads the tokenizer from the model repo by default
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

# If you need a different tokenizer (unusual), specify it
llm = LLM(
    model="custom-org/custom-llama-variant",
    tokenizer="meta-llama/Meta-Llama-3-8B-Instruct",   # Use Llama 3 tokenizer
)

Use chat templates instead of raw prompts for instruction-tuned models:

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
tokenizer = llm.get_tokenizer()

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum entanglement briefly."},
]

# Apply the model's chat template — handles special tokens correctly
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=200)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)

Common Mistake: Concatenating raw strings like "User: {question}\nAssistant: " for instruction-tuned models. Most modern chat models (Llama 3, Mistral, Qwen) use specific special tokens like <|start_header_id|>user<|end_header_id|>. The chat template generates them correctly; raw concatenation doesn’t, and the model’s output quality drops dramatically.
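
To make the difference concrete, here is a hand-written rendering of the Llama 3 Instruct prompt layout next to the naive concatenation. It is an illustration only: render_llama3 is a hypothetical helper that mimics the published Llama 3 format, and real code should keep calling tokenizer.apply_chat_template so the template always comes from the model repo.

```python
def render_llama3(messages: list[dict]) -> str:
    """Hand-rolled approximation of the Llama 3 Instruct chat layout (illustration only)."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    # Open an assistant turn so the model starts generating the reply.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


messages = [{"role": "user", "content": "What is 2+2?"}]

naive = f"User: {messages[0]['content']}\nAssistant: "
templated = render_llama3(messages)

print(repr(naive))
print(repr(templated))
```

The naive string contains none of the header or end-of-turn tokens the model saw during instruction tuning, which is why its completions drift.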

Confirm the tokenizer being loaded by printing llm.get_tokenizer().__class__.__name__ and the special token IDs after construction — silent mismatches between the model’s expected bos_token_id and what the tokenizer returns are the root cause of most “the chat output is gibberish” bugs.

Fix 3: OpenAI-Compatible API Server

vLLM can run as an OpenAI API drop-in replacement:

# Start the server
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
    --port 8000 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 8192

Use with the OpenAI Python client:

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",                  # vLLM doesn't require auth by default
    base_url="http://localhost:8000/v1",
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 2+2?"},
    ],
    temperature=0.7,
    max_tokens=100,
)
print(response.choices[0].message.content)

Streaming responses:

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Tell me a story."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Connection timeouts — the server takes time to load the model:

# Check the server started successfully before sending requests
curl http://localhost:8000/health
# Returns 200 when model is loaded and ready
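
The same readiness check can be done from Python before a client sends its first request. This is a minimal sketch using only the standard library; http_ok and wait_until_ready are hypothetical helper names, and the 600-second default is an arbitrary assumption about how long a large model may take to load.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False  # not listening yet, or still loading the model


def wait_until_ready(probe: Callable[[], bool], timeout_s: float = 600.0,
                     interval_s: float = 2.0) -> bool:
    """Call `probe` repeatedly until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Usage against a server listening on localhost:8000:
#   ready = wait_until_ready(lambda: http_ok("http://localhost:8000/health"))
```

Passing the probe as a callable keeps the retry loop independent of HTTP, so the same loop can wrap any other readiness signal.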

Set API key for production:

vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
    --api-key your-secret-key

client = OpenAI(
    api_key="your-secret-key",
    base_url="http://vllm-server:8000/v1",
)

For OpenAI API client patterns and streaming, see OpenAI API not working.

Fix 4: Tensor Parallelism — GPU Count Matters

ValueError: Cannot split tensors evenly.
tensor_parallel_size=4 but only 2 GPUs available

Rules for tensor_parallel_size:

  1. Must not exceed the number of visible GPUs (nvidia-smi shows your count); set it equal to the count to use them all
  2. Must divide the model’s attention head count evenly
  3. Stick to powers of 2 (1, 2, 4, 8) for best compatibility

# Check GPU count
nvidia-smi -L
# GPU 0: NVIDIA A100-SXM4-80GB
# GPU 1: NVIDIA A100-SXM4-80GB
# → 2 GPUs available

# Set tensor parallel to 2
vllm serve meta-llama/Meta-Llama-3-70B-Instruct --tensor-parallel-size 2
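
The rules can be checked mechanically before launching. The helper below is a sketch, not part of vLLM: usable_tp_sizes is a hypothetical name, and the head count has to be read from num_attention_heads in the model’s config.json.

```python
def usable_tp_sizes(visible_gpus: int, num_attention_heads: int) -> list[int]:
    """Power-of-two tensor-parallel sizes that fit the GPUs and divide the head count."""
    sizes = []
    size = 1
    while size <= visible_gpus:
        if num_attention_heads % size == 0:
            sizes.append(size)
        size *= 2
    return sizes


# Llama 3 70B has 64 attention heads (num_attention_heads in its config.json).
print(usable_tp_sizes(visible_gpus=2, num_attention_heads=64))
print(usable_tp_sizes(visible_gpus=8, num_attention_heads=64))
```

The largest value in the returned list is the one that puts every usable GPU to work.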

Pipeline parallelism for models that don’t fit even with tensor parallelism:

llm = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct",
    tensor_parallel_size=8,
    pipeline_parallel_size=4,   # 8*4 = 32 GPUs total
)

Uneven GPU assignment for multi-model setups:

# Use specific GPUs, not all visible
CUDA_VISIBLE_DEVICES=0,1 vllm serve model_a --tensor-parallel-size 2 --port 8000
CUDA_VISIBLE_DEVICES=2,3 vllm serve model_b --tensor-parallel-size 2 --port 8001

Fix 5: KV Cache — max_model_len and Batching

ValueError: This model's maximum context length is 4096 tokens.
Your request had 8192 tokens (7936 in prompt; 256 for generation).

vLLM pre-allocates the KV cache based on max_model_len. Requests longer than this are rejected, even if the underlying model supports longer contexts.

Set max_model_len to your actual needs:

# Llama 3.1 supports long contexts; the original Llama 3 8B tops out at 8192 tokens
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --max-model-len 32768   # Supports up to 32K context

The trade-off: a larger max_model_len means fewer requests are guaranteed to fit in GPU memory at once, because each request’s KV cache can grow up to the max length.

# Scenario A: Many short requests
--max-model-len 4096   # Fits more concurrent requests, each up to 4K

# Scenario B: Few long requests  
--max-model-len 32768  # Fewer concurrent, each up to 32K

Estimate concurrent request capacity:

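available_kv_memory = (total_gpu_mem * gpu_memory_utilization) - model_size
kv_per_request = max_model_len * num_layers * num_kv_heads * head_dim * 2 (K and V) * 2 (FP16 bytes)
max_concurrent = available_kv_memory / kv_per_request

For Llama-3-8B (32 layers, 8 KV heads of dimension 128) at max_model_len=4096 in FP16:

  • kv_per_request = 4096 * 32 * 8 * 128 * 2 * 2 = 0.5 GB per request
  • On 80GB GPU at 0.90 utilization with 16GB model: (72GB - 16GB) / 0.5GB = ~112 full-length requests

Models without grouped-query attention have num_kv_heads * head_dim equal to hidden_size (4096 here), which makes the per-request cache four times larger: 2 GB, or about 28 requests. Treat both figures as upper bounds, since activations also take part of the budget.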
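
The estimate can be scripted so you can try other context lengths and models. This is a back-of-the-envelope sketch with hypothetical function names; it assumes every request fills max_model_len and ignores activation memory, so the result is an upper bound rather than what vLLM will report.

```python
GIB = 1024 ** 3


def kv_bytes_per_request(max_model_len: int, num_layers: int, num_kv_heads: int,
                         head_dim: int, dtype_bytes: int = 2) -> int:
    """Worst-case KV cache bytes for one request that fills max_model_len tokens."""
    # The factor 2 covers the separate key and value tensors in every layer.
    return max_model_len * num_layers * num_kv_heads * head_dim * 2 * dtype_bytes


def max_concurrent_requests(total_gpu_gib: float, gpu_memory_utilization: float,
                            model_gib: float, kv_bytes: int) -> int:
    """Number of full-length requests that fit in the memory left after the weights."""
    available = (total_gpu_gib * gpu_memory_utilization - model_gib) * GIB
    return max(0, int(available // kv_bytes))


# 32 layers, 4096 tokens. With 32 KV heads of size 128 the KV width equals hidden_size 4096.
full_width = kv_bytes_per_request(4096, 32, num_kv_heads=32, head_dim=128)
# Llama 3 8B uses grouped-query attention with 8 KV heads, so its cache is 4x smaller.
gqa = kv_bytes_per_request(4096, 32, num_kv_heads=8, head_dim=128)

print(full_width / GIB, max_concurrent_requests(80, 0.90, 16, full_width))
print(gqa / GIB, max_concurrent_requests(80, 0.90, 16, gqa))
```

Read num_key_value_heads and num_attention_heads from the model’s config.json; when they differ, the model uses grouped-query attention and the hidden_size shortcut overestimates the cache.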

Prefix caching reuses KV cache across requests with shared prefixes (e.g., system prompts):

vllm serve meta-llama/Meta-Llama-3-8B-Instruct --enable-prefix-caching

Dramatic speedup for chatbot/agent workloads where every request starts with the same system prompt.

Fix 6: Sampling Parameters and Output Quality

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

# Default sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,        # Higher = more random. 0 = greedy (deterministic)
    top_p=0.95,             # Nucleus sampling — consider top 95% probability mass
    top_k=50,               # Top-K sampling (disabled if -1)
    max_tokens=256,         # Max new tokens to generate
    frequency_penalty=0.0,  # Penalize frequent tokens
    presence_penalty=0.0,   # Penalize any repeated token
    stop=["<|eot_id|>"],    # Stop sequences
    seed=42,                # Reproducibility
)

prompts = ["Explain KV caching in one sentence."]
outputs = llm.generate(prompts, sampling_params)

Deterministic output (for testing, reproducibility):

SamplingParams(
    temperature=0.0,     # Greedy — always picks the highest probability token
    max_tokens=256,
    seed=42,
)

Output token probabilities for debugging:

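sampling_params = SamplingParams(
    logprobs=5,   # Top 5 logprobs per token
    max_tokens=50,
)

outputs = llm.generate(prompts, sampling_params)
completion = outputs[0].outputs[0]
# completion.logprobs holds one {token_id: Logprob} dict per generated token
for token_id, candidates in zip(completion.token_ids, completion.logprobs):
    print(token_id, candidates[token_id].logprob)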

Fix 7: Model Download and HuggingFace Authentication

OSError: meta-llama/Meta-Llama-3-8B-Instruct is not a local folder and is
not a valid model identifier on Hugging Face

Gated models (Llama, Mistral, etc.) require authentication:

# Set token via environment variable
export HF_TOKEN=hf_your_token_here
vllm serve meta-llama/Meta-Llama-3-8B-Instruct

# Or login via HuggingFace CLI
huggingface-cli login

Pre-download the model to avoid download delays at server start:

# Download to the default cache
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct

# vLLM uses the cached version
vllm serve meta-llama/Meta-Llama-3-8B-Instruct

Custom model path for local or self-hosted models:

vllm serve /path/to/local/model --served-model-name my-model

# Use the served name in API calls
response = client.chat.completions.create(
    model="my-model",   # Matches --served-model-name
    messages=[...],
)

Mirror/proxy for restricted networks:

# Set HF_ENDPOINT to your mirror
export HF_ENDPOINT=https://hf-mirror.com
vllm serve meta-llama/Meta-Llama-3-8B-Instruct

If huggingface-cli login worked but vllm serve still complains the model is gated, your process is reading a different HOME directory than your shell — check HF_HOME in the systemd unit or Docker env to confirm.

Fix 8: Structured Output and JSON Mode

vLLM supports structured output via the guided_* parameters:

from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

# JSON schema constraint
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "email": {"type": "string"},
    },
    "required": ["name", "age"],
}

sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=200,
    guided_decoding=GuidedDecodingParams(json=schema),
)

outputs = llm.generate(
    ["Extract user info from: Alice is 30 years old, [email protected]"],
    sampling_params,
)
# Guaranteed to output valid JSON matching the schema
print(outputs[0].outputs[0].text)
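
Even with guided decoding it is worth validating the returned text before handing it to downstream code, for example because max_tokens can cut the JSON off mid-object. The check below is a minimal standard-library sketch: check_against_schema is a hypothetical helper that only understands flat schemas like the one above, and sample_output stands in for outputs[0].outputs[0].text.

```python
import json


def check_against_schema(text: str, schema: dict) -> dict:
    """Parse `text` as JSON and check required keys and basic types from `schema`."""
    python_types = {"string": str, "integer": int, "object": dict}
    data = json.loads(text)  # raises json.JSONDecodeError on malformed output
    for key in schema.get("required", []):
        if key not in data:
            raise ValueError(f"missing required field: {key}")
    for key, spec in schema.get("properties", {}).items():
        expected = python_types.get(spec.get("type"))
        if key in data and expected is not None and not isinstance(data[key], expected):
            raise ValueError(f"field {key!r} is not of type {spec['type']}")
    return data


schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "email": {"type": "string"},
    },
    "required": ["name", "age"],
}

sample_output = '{"name": "Alice", "age": 30}'  # stand-in for the model's text
print(check_against_schema(sample_output, schema))
```

For nested schemas or stricter validation, a full JSON Schema validator is the better tool; this sketch only covers the flat case shown here.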

Via the OpenAI API:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Generate a user profile as JSON."}],
    extra_body={
        "guided_json": schema,
    },
)

Regex-constrained output:

sampling_params = SamplingParams(
    guided_decoding=GuidedDecodingParams(
        regex=r"\d{4}-\d{2}-\d{2}",   # Force a date pattern
    ),
)

Still Not Working?

vLLM vs Other Inference Engines

  • vLLM — Best throughput via continuous batching and PagedAttention. Good general-purpose choice.
  • Text Generation Inference (TGI) — HuggingFace’s server, similar performance. Better HF ecosystem integration.
  • llama.cpp — CPU inference, quantized models. Best for edge/low-resource.
  • Ollama — Wrapper over llama.cpp with easy model management. For Ollama-specific patterns, see Ollama not working.

Monitoring and Metrics

vLLM exposes Prometheus metrics:

vllm serve meta-llama/Meta-Llama-3-8B-Instruct
# Metrics are served at http://localhost:8000/metrics by default; no extra flag is needed

Key metrics:

  • vllm:num_requests_running — active requests
  • vllm:num_requests_waiting — queued requests
  • vllm:gpu_cache_usage_perc — KV cache utilization
  • vllm:time_to_first_token_seconds — TTFT latency
  • vllm:e2e_request_latency_seconds — end-to-end latency
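
To watch a few of these without a full Prometheus setup, the /metrics text can be parsed directly. The parser below is a sketch with a hypothetical name; it assumes lines of the form name{labels} value with no trailing timestamp, and it only makes sense for gauges (histograms such as the latency metrics expose _bucket, _sum, and _count series instead). The sample text is made up for illustration.

```python
def parse_prometheus(text: str, wanted: set[str]) -> dict[str, float]:
    """Sum the samples of each wanted metric in Prometheus text exposition format."""
    totals: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name in wanted:
            totals[name] = totals.get(name, 0.0) + float(value)
    return totals


# Made-up sample in the shape of a /metrics response.
sample = """\
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="my-model"} 3.0
vllm:num_requests_waiting{model_name="my-model"} 12.0
vllm:gpu_cache_usage_perc{model_name="my-model"} 0.87
"""

print(parse_prometheus(sample, {"vllm:num_requests_running", "vllm:gpu_cache_usage_perc"}))
```

In practice the text would come from an HTTP GET to the /metrics endpoint; a steadily growing waiting count alongside high cache usage suggests the KV cache is the bottleneck.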

Fine-Tuning and LoRA Adapters

Load LoRA adapters at runtime:

vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
    --enable-lora \
    --lora-modules my-lora=/path/to/lora/adapter

response = client.chat.completions.create(
    model="my-lora",   # Use the LoRA adapter
    messages=[...],
)

LoRA adapters must match the base model’s architecture exactly — loading an adapter trained against Meta-Llama-3-8B onto Meta-Llama-3-8B-Instruct may load without error but produce degraded outputs because the chat-template alignment differs. Train against the variant you serve.

Production Deployment on Kubernetes

vLLM images are available from the official registry:

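apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm            # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm       # must match the selector above
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          command: ["vllm", "serve"]
          args: ["meta-llama/Meta-Llama-3-8B-Instruct", "--tensor-parallel-size=1"]
          resources:
            limits:
              nvidia.com/gpu: 1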

For Kubernetes pod and resource management issues, see Kubernetes OOMKilled.

Throughput Drops After 30 Minutes of Steady Load

You ramp traffic to vLLM, it sustains 200 req/s for half an hour, then collapses to 40 req/s with no errors. The cause is almost always KV cache fragmentation under continuous batching: long-running requests pin cache pages that newly arriving short requests cannot share. Mitigate with --enable-prefix-caching for shared system prompts, lower --max-num-seqs to reduce concurrency, or restart on a schedule. vLLM has improved fragmentation handling in recent releases; upgrade before redesigning your traffic shaping.

Streaming Responses Stall Mid-Generation

The OpenAI-compatible streaming endpoint emits the first chunk fast, then pauses for seconds before the next token. The usual cause is your client buffering — requests and some HTTP/2 proxies aggregate chunks rather than flushing per-token. Use httpx or the official OpenAI Python SDK with streaming enabled, and verify there is no nginx buffering proxy in front (proxy_buffering off for SSE endpoints).
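
One way to see where a stall comes from is to log the gap between chunks on the client. The wrapper below is a sketch with a hypothetical name that works on any iterator; fake_stream is a stand-in for the stream object returned by the OpenAI client.

```python
import time
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def with_gaps(chunks: Iterable[T]) -> Iterator[Tuple[float, T]]:
    """Yield (seconds_since_previous_chunk, chunk) for every item of a stream."""
    previous = time.monotonic()
    for chunk in chunks:
        now = time.monotonic()
        yield now - previous, chunk
        previous = now


def fake_stream() -> Iterator[str]:
    """Stand-in for a streaming response: one token roughly every 10 ms."""
    for token in ["Once", " upon", " a", " time"]:
        time.sleep(0.01)
        yield token


for gap, token in with_gaps(fake_stream()):
    print(f"{gap * 1000:6.1f} ms  {token!r}")
```

A long gap followed by several chunks arriving almost at once is consistent with a buffering layer between you and the server, whereas evenly spaced slow chunks point back at generation speed.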

Tokenizer Loads From the Wrong HuggingFace Cache

vLLM caches models under HF_HOME (or ~/.cache/huggingface). On a multi-user GPU server that shares one cache directory, two users hitting the same model name can race on the lockfile and one ends up with a corrupt download; symptoms include random KeyError on tokenizer config or “tensor shape mismatch” at load. Give each user a separate HF_HOME directory, or pre-download the model once before starting any server, to avoid the race. For broader HuggingFace cache and token issues that affect vLLM startup, see HuggingFace Transformers not working.

FixDevs

Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.
