
Fix: vLLM Not Working — CUDA OOM, Model Loading, and API Server Errors

FixDevs

Quick Answer

This guide covers the most common vLLM failures and their fixes: CUDA out of memory during model load, tokenizer mismatches that produce garbage output, tensor_parallel_size not matching the GPU count, KV cache exhaustion, OpenAI-compatible API server issues, and requests exceeding max_model_len.

The Error

You load a 70B model and CUDA runs out of memory before the first token:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB.
GPU 0 has a total capacity of 23.65 GiB of which 1.24 GiB is free.

Or the tokenizer produces garbage because it doesn’t match the model:

Generated text: "<|im_start|><|im_start|>a a a a a a..."

Or you start the OpenAI-compatible server and requests time out:

openai.APIConnectionError: Connection error.
httpx.ConnectTimeout: timed out

Or tensor parallelism fails with a mismatched GPU count:

ValueError: Cannot split tensors evenly. tensor_parallel_size=4 but only 2 GPUs available

Or the max context length error hits at inference time:

ValueError: This model's maximum context length is 4096 tokens.
Your request had 8192 tokens.

vLLM is built around two clever ideas: PagedAttention (efficient KV cache allocation) and continuous batching (new requests join mid-batch). These make it 2–10x faster than naive HuggingFace inference — but they also create failure modes specific to memory allocation and request scheduling that don’t exist in simpler inference engines.
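To make the scheduling idea concrete, here is a toy Python sketch of continuous batching (illustrative only; vLLM's actual scheduler also handles preemption, KV block allocation, and priority):

```python
# Toy continuous batching: finished sequences leave the batch and waiting
# requests immediately take their slots, instead of the whole batch draining
# before new work starts (as in naive static batching).
from collections import deque

def continuous_batching(requests, batch_size):
    """requests: list of (name, tokens_to_generate). Returns completion order."""
    waiting = deque(requests)
    running = {}          # name -> tokens remaining
    finished = []
    while waiting or running:
        # Fill free slots from the wait queue (requests join mid-batch)
        while waiting and len(running) < batch_size:
            name, tokens = waiting.popleft()
            running[name] = tokens
        # One decode step: every running sequence emits one token
        for name in list(running):
            running[name] -= 1
            if running[name] == 0:
                del running[name]
                finished.append(name)
    return finished

# Short requests finish early and free their slot for queued work
print(continuous_batching([("a", 2), ("b", 5), ("c", 1)], batch_size=2))
# → ['a', 'c', 'b']
```

Note how request "c" starts as soon as "a" finishes, while "b" is still decoding; with static batching, "c" would wait for both.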

Why This Happens

vLLM’s speed depends on aggressive memory allocation: it reserves as much GPU memory as possible for the KV cache (key-value attention states) ahead of time. This is great for throughput but brittle — if the cache is too small, long prompts fail; if it’s too large, the model itself can’t load. The --gpu-memory-utilization flag (default 0.90) controls this trade-off.

Tokenizer bugs are common because vLLM uses HuggingFace tokenizers and expects them to produce exactly the same token IDs the model was trained on. Mismatched versions (e.g., Llama 3 tokenizer loaded for a Llama 2 model) produce gibberish.

Fix 1: CUDA OOM During Model Loading

torch.cuda.OutOfMemoryError: CUDA out of memory.

Check whether the model actually fits. Approximate memory per parameter:

Precision            Bytes/param   7B model   70B model
FP32                 4             28 GB      280 GB
FP16 / BF16          2             14 GB      140 GB
INT8                 1             7 GB       70 GB
INT4 (AWQ / GPTQ)    0.5           3.5 GB     35 GB

Add ~30% overhead for activations and KV cache. A 70B FP16 model needs at least 180 GB of GPU memory — doesn’t fit on a single A100 (80GB) without quantization or tensor parallelism.
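As a quick sanity check before loading, you can estimate VRAM from the table above (a rough sketch; the 30% overhead factor is this article's approximation, not a vLLM constant):

```python
# Rough VRAM estimate: params (in billions) * bytes per param ≈ weight GB,
# plus ~30% overhead for activations and KV cache.
def estimate_vram_gb(num_params_b, bytes_per_param, overhead=0.30):
    weights_gb = num_params_b * bytes_per_param
    return weights_gb * (1 + overhead)

print(round(estimate_vram_gb(70, 2), 1))    # → 182.0  (70B FP16: > one 80GB A100)
print(round(estimate_vram_gb(70, 0.5), 1))  # → 45.5   (70B INT4: fits on one A100)
```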

Reduce gpu_memory_utilization to leave more room for the model itself:

from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3-70B-Instruct",
    gpu_memory_utilization=0.80,   # Default 0.90 — reduce if model barely fits
    max_model_len=4096,             # Reduce KV cache size
)

Use quantization to fit large models on smaller hardware:

# 4-bit AWQ — best quality for 4-bit
llm = LLM(model="TheBloke/Llama-2-70B-Chat-AWQ")

# GPTQ — another 4-bit option
llm = LLM(model="TheBloke/Llama-2-70B-Chat-GPTQ")

# FP8 (H100, Ada Lovelace architecture)
llm = LLM(
    model="neuralmagic/Meta-Llama-3-70B-Instruct-FP8",
    quantization="fp8",
)

Use tensor parallelism to split across multiple GPUs:

llm = LLM(
    model="meta-llama/Llama-3-70B-Instruct",
    tensor_parallel_size=4,   # Split across 4 GPUs
)

tensor_parallel_size must equal the number of GPUs you have AND divide evenly into the model’s attention head count (usually powers of 2: 1, 2, 4, 8).

Pro Tip: Start with small max_model_len and gpu_memory_utilization=0.85. Once the model loads successfully, increase both until you find the limit. vLLM’s default max_model_len matches the model’s max context — but reducing it frees VRAM for larger batch sizes, which often matters more for throughput than supporting the full context length.

Fix 2: Tokenizer Mismatch — Garbage Output

Prompt: "Hello, how are you?"
Generated: "<|im_start|><|im_start|><|im_start|>a a a a a..."

The tokenizer doesn’t match the model. Common causes:

  1. Loaded a custom tokenizer.json from one model with weights from another.
  2. Chat template mismatch — Llama 3 expects different special tokens than Llama 2.
  3. Tokenizer version drift — the tokenizer was updated after the model was trained.
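A quick way to confirm a mismatch is to encode the same probe string with both tokenizers and compare the token IDs. A minimal sketch, assuming HuggingFace-style objects with an encode() method:

```python
# Sketch of a mismatch check. Assumes HuggingFace-style tokenizers exposing
# encode(text) -> list of token IDs; any divergence on a plain-ASCII probe
# string is a strong sign the tokenizer doesn't belong to the model.
def tokenizers_agree(tok_a, tok_b, probe="Hello, how are you?"):
    return tok_a.encode(probe) == tok_b.encode(probe)
```

With transformers, tok_a and tok_b would come from AutoTokenizer.from_pretrained() on the model repo and on the suspect tokenizer, respectively.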

Use the model’s own tokenizer explicitly:

from vllm import LLM

# vLLM loads the tokenizer from the model repo by default
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

# If you need a different tokenizer (unusual), specify it
llm = LLM(
    model="custom-org/custom-llama-variant",
    tokenizer="meta-llama/Meta-Llama-3-8B-Instruct",   # Use Llama 3 tokenizer
)

Use chat templates instead of raw prompts for instruction-tuned models:

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
tokenizer = llm.get_tokenizer()

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum entanglement briefly."},
]

# Apply the model's chat template — handles special tokens correctly
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=200)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)

Common Mistake: Concatenating raw strings like "User: {question}\nAssistant: " for instruction-tuned models. Most modern chat models (Llama 3, Mistral, Qwen) use specific special tokens like <|start_header_id|>user<|end_header_id|>. The chat template generates them correctly; raw concatenation doesn’t, and the model’s output quality drops dramatically.
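For reference, this is roughly what the chat template expands to for Llama 3 style models (an illustrative approximation; always use apply_chat_template rather than hand-building this string):

```python
# Approximate Llama 3 chat format, for illustration only. The tokenizer's own
# template is the source of truth; hand-built strings like this drift out of
# date and miss model-specific details.
def llama3_style_prompt(system, user):
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n" + system + "<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n" + user + "<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

print(llama3_style_prompt("Be brief.", "What is 2+2?"))
```

Compare this with "User: ...\nAssistant: " and it is clear why raw concatenation confuses an instruction-tuned model: none of the special tokens it was trained on are present.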

For HuggingFace tokenizer patterns and chat templates, see HuggingFace Transformers not working.

Fix 3: OpenAI-Compatible API Server

vLLM can run as an OpenAI API drop-in replacement:

# Start the server
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
    --port 8000 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 8192

Use with the OpenAI Python client:

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",                  # vLLM doesn't require auth by default
    base_url="http://localhost:8000/v1",
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 2+2?"},
    ],
    temperature=0.7,
    max_tokens=100,
)
print(response.choices[0].message.content)

Streaming responses:

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Tell me a story."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Connection timeouts — the server takes time to load the model:

# Check the server started successfully before sending requests
curl http://localhost:8000/health
# Returns 200 when model is loaded and ready
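In application code you can poll the health endpoint before sending the first request. A minimal sketch using only the standard library; the URL and timing defaults are illustrative, and the probe parameter exists only to make the helper testable:

```python
import time
import urllib.error
import urllib.request

def wait_until_ready(url="http://localhost:8000/health",
                     timeout_s=600, interval_s=5, probe=None):
    """Poll until the health endpoint returns 200, or give up after timeout_s."""
    def http_probe():
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):
            return False

    check = probe or http_probe
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval_s)
    return False
```

Large models can take minutes to load, so a generous timeout avoids spurious failures on cold starts.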

Set API key for production:

vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
    --api-key your-secret-key

client = OpenAI(
    api_key="your-secret-key",
    base_url="http://vllm-server:8000/v1",
)

For OpenAI API client patterns and streaming, see OpenAI API not working.

Fix 4: Tensor Parallelism — GPU Count Matters

ValueError: Cannot split tensors evenly.
tensor_parallel_size=4 but only 2 GPUs available

Rules for tensor_parallel_size:

  1. Must match the number of available GPUs: nvidia-smi shows your count
  2. Must divide the model’s attention head count evenly
  3. Stick to powers of 2 (1, 2, 4, 8) for best compatibility

# Check GPU count
nvidia-smi -L
# GPU 0: NVIDIA A100-SXM4-80GB
# GPU 1: NVIDIA A100-SXM4-80GB
# → 2 GPUs available

# Set tensor parallel to 2
vllm serve meta-llama/Llama-3-70B-Instruct --tensor-parallel-size 2
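The rules above can be encoded as a quick pre-flight check (num_attention_heads comes from the model's config.json, gpu_count from nvidia-smi -L; the function is a sketch of mine, not a vLLM API):

```python
# Pre-flight validation for tensor parallelism: tp_size must match the GPU
# count for a single vLLM instance, and must divide the attention head count.
def valid_tensor_parallel(tp_size, gpu_count, num_attention_heads):
    return tp_size == gpu_count and num_attention_heads % tp_size == 0

print(valid_tensor_parallel(2, gpu_count=2, num_attention_heads=64))  # → True
print(valid_tensor_parallel(4, gpu_count=2, num_attention_heads=64))  # → False
```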

Pipeline parallelism for models that don’t fit even with tensor parallelism:

llm = LLM(
    model="meta-llama/Llama-3-405B-Instruct",
    tensor_parallel_size=8,
    pipeline_parallel_size=4,   # 8*4 = 32 GPUs total
)

Uneven GPU assignment for multi-model setups:

# Use specific GPUs, not all visible
CUDA_VISIBLE_DEVICES=0,1 vllm serve model_a --tensor-parallel-size 2 --port 8000
CUDA_VISIBLE_DEVICES=2,3 vllm serve model_b --tensor-parallel-size 2 --port 8001

Fix 5: KV Cache — max_model_len and Batching

ValueError: This model's maximum context length is 4096 tokens.
Your request had 8192 tokens (7936 in prompt; 256 for generation).

vLLM pre-allocates the KV cache based on max_model_len. Requests longer than this are rejected, even if the underlying model supports longer contexts.

Set max_model_len to your actual needs:

vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
    --max-model-len 32768   # Supports up to 32K context

The trade-off: larger max_model_len means fewer concurrent requests fit in GPU memory, because each request’s KV cache is sized for the max length.

# Scenario A: Many short requests
--max-model-len 4096   # Fits more concurrent requests, each up to 4K

# Scenario B: Few long requests  
--max-model-len 32768  # Fewer concurrent, each up to 32K

Estimate concurrent request capacity:

available_kv_memory = (total_gpu_mem * gpu_memory_utilization) - model_size
kv_per_request = max_model_len * num_layers * hidden_size * 2 (K and V) * 2 (FP16 bytes)
max_concurrent = available_kv_memory / kv_per_request

For Llama-3-8B (32 layers, 4096 hidden) at max_model_len=4096 in FP16:

  • kv_per_request = 4096 * 32 * 4096 * 2 * 2 = 2 GB per request
  • On an 80GB GPU with a 16GB model: 64 GB / 2 GB = 32 concurrent requests
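The estimate above as a small helper (names and defaults are illustrative; note that GQA models such as Llama 3 cache K/V for fewer heads than this assumes, so treat the result as a conservative lower bound on concurrency):

```python
# Back-of-envelope KV-cache capacity, following the formula above.
GiB = 2**30

def max_concurrent_requests(total_gpu_gib, model_gib, max_model_len,
                            num_layers, hidden_size, bytes_per_val=2,
                            gpu_mem_util=1.0):
    # 2 for K and V, bytes_per_val for precision (2 = FP16)
    kv_per_request = max_model_len * num_layers * hidden_size * 2 * bytes_per_val
    available = (total_gpu_gib * gpu_mem_util - model_gib) * GiB
    return int(available // kv_per_request)

# Llama-3-8B-ish shape: 32 layers, 4096 hidden, 4K context, 80GB GPU
print(max_concurrent_requests(80, 16, 4096, 32, 4096))  # → 32
```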

Prefix caching reuses KV cache across requests with shared prefixes (e.g., system prompts):

vllm serve meta-llama/Meta-Llama-3-8B-Instruct --enable-prefix-caching

Dramatic speedup for chatbot/agent workloads where every request starts with the same system prompt.
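A rough way to reason about the benefit: the fraction of prefill compute skipped is approximately the shared prefix length over the average prompt length (illustrative arithmetic, ignoring cache eviction and block granularity):

```python
# Fraction of prefill work skipped when every request shares the same cached
# system-prompt prefix (rough approximation).
def prefill_savings(shared_prefix_tokens, avg_prompt_tokens):
    return shared_prefix_tokens / avg_prompt_tokens

# A 700-token system prompt on 1000-token prompts skips ~70% of prefill
print(prefill_savings(700, 1000))  # → 0.7
```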

Fix 6: Sampling Parameters and Output Quality

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

# Default sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,        # Higher = more random. 0 = greedy (deterministic)
    top_p=0.95,             # Nucleus sampling — consider top 95% probability mass
    top_k=50,               # Top-K sampling (disabled if -1)
    max_tokens=256,         # Max new tokens to generate
    frequency_penalty=0.0,  # Penalize frequent tokens
    presence_penalty=0.0,   # Penalize any repeated token
    stop=["<|eot_id|>"],    # Stop sequences
    seed=42,                # Reproducibility
)

outputs = llm.generate(prompts, sampling_params)

Deterministic output (for testing, reproducibility):

SamplingParams(
    temperature=0.0,     # Greedy — always picks the highest probability token
    max_tokens=256,
    seed=42,
)

Output token probabilities for debugging:

SamplingParams(
    logprobs=5,   # Top 5 logprobs per token
    max_tokens=50,
)

outputs = llm.generate(prompts, sampling_params)
for token_id, logprob_dict in zip(outputs[0].outputs[0].token_ids,
                                  outputs[0].outputs[0].logprobs):
    # Each entry is a dict mapping candidate token IDs to their logprobs
    print(token_id, logprob_dict)

Fix 7: Model Download and HuggingFace Authentication

OSError: meta-llama/Meta-Llama-3-8B-Instruct is not a local folder and is
not a valid model identifier on Hugging Face

Gated models (Llama, Mistral, etc.) require authentication:

# Set token via environment variable
export HF_TOKEN=hf_your_token_here
vllm serve meta-llama/Meta-Llama-3-8B-Instruct

# Or login via HuggingFace CLI
huggingface-cli login

Pre-download the model to avoid download delays at server start:

# Download to the default cache
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct

# vLLM uses the cached version
vllm serve meta-llama/Meta-Llama-3-8B-Instruct

Custom model path for local or self-hosted models:

vllm serve /path/to/local/model --served-model-name my-model

# Use the served name in API calls
response = client.chat.completions.create(
    model="my-model",   # Matches --served-model-name
    messages=[...],
)

Mirror/proxy for restricted networks:

# Set HF_ENDPOINT to your mirror
export HF_ENDPOINT=https://hf-mirror.com
vllm serve meta-llama/Meta-Llama-3-8B-Instruct

For HuggingFace authentication patterns and HF_TOKEN setup, see HuggingFace Transformers not working.

Fix 8: Structured Output and JSON Mode

vLLM supports structured output via the guided_* parameters:

from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

# JSON schema constraint
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "email": {"type": "string"},
    },
    "required": ["name", "age"],
}

sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=200,
    guided_decoding=GuidedDecodingParams(json=schema),
)

outputs = llm.generate(
    ["Extract user info from: Alice is 30 years old, [email protected]"],
    sampling_params,
)
# Guaranteed to output valid JSON matching the schema
print(outputs[0].outputs[0].text)

Via the OpenAI API:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Generate a user profile as JSON."}],
    extra_body={
        "guided_json": schema,
    },
)

Regex-constrained output:

sampling_params = SamplingParams(
    guided_decoding=GuidedDecodingParams(
        regex=r"\d{4}-\d{2}-\d{2}",   # Force a date pattern
    ),
)
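On the client side, it can still be worth asserting that the returned text matches the pattern you asked for (a defensive check; the constraint itself is enforced during decoding):

```python
import re

# Defensive validation of guided-decoding output against the same pattern
date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}")

def matches_constraint(text):
    return date_pattern.fullmatch(text) is not None

print(matches_constraint("2024-06-01"))  # → True
```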

Still Not Working?

vLLM vs Other Inference Engines

  • vLLM — Best throughput via continuous batching and PagedAttention. Good general-purpose choice.
  • Text Generation Inference (TGI) — HuggingFace’s server, similar performance. Better HF ecosystem integration.
  • llama.cpp — CPU inference, quantized models. Best for edge/low-resource.
  • Ollama — Wrapper over llama.cpp with easy model management. For Ollama-specific patterns, see Ollama not working.

Monitoring and Metrics

vLLM exposes Prometheus metrics:

vllm serve meta-llama/Meta-Llama-3-8B-Instruct
# Metrics at http://localhost:8000/metrics (enabled by default)

Key metrics:

  • vllm:num_requests_running — active requests
  • vllm:num_requests_waiting — queued requests
  • vllm:gpu_cache_usage_perc — KV cache utilization
  • vllm:time_to_first_token_seconds — TTFT latency
  • vllm:e2e_request_latency_seconds — end-to-end latency
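The endpoint serves the standard Prometheus text exposition format, which you can scrape ad hoc. A minimal parser sketch (it skips HELP/TYPE comment lines and keeps any label string as part of the metric name):

```python
# Minimal parser for Prometheus text format: "metric_name value" per line,
# "#"-prefixed comment lines ignored. Labels, if present, stay in the name.
def parse_metrics(text):
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            pass
    return metrics

sample = """# HELP vllm:num_requests_running Running requests
vllm:num_requests_running 3.0
vllm:num_requests_waiting 7.0"""
print(parse_metrics(sample)["vllm:num_requests_waiting"])  # → 7.0
```

For production, a real Prometheus scrape plus Grafana dashboards is the usual setup; this is only for quick debugging.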

Fine-Tuning and LoRA Adapters

Load LoRA adapters at runtime:

vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
    --enable-lora \
    --lora-modules my-lora=/path/to/lora/adapter

response = client.chat.completions.create(
    model="my-lora",   # Use the LoRA adapter
    messages=[...],
)

For LoRA training patterns with HuggingFace PEFT, see HuggingFace Transformers not working.

Production Deployment on Kubernetes

vLLM images are available from the official registry:

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          command: ["vllm", "serve"]
          args: ["meta-llama/Meta-Llama-3-8B-Instruct", "--tensor-parallel-size=1"]
          resources:
            limits:
              nvidia.com/gpu: 1

For Kubernetes pod and resource management issues, see Kubernetes OOMKilled.


FixDevs

Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.
