Fix: Outlines Not Working — Backend Setup, Pydantic Schemas, Regex, Choice, and Slow Sampling
Quick Answer
How to fix Python Outlines errors — model backend missing, JSON schema vs Pydantic, regex pattern compilation slow, choice list timing, vLLM/Transformers/Ollama wiring, and streaming structured outputs.
The Error
You try to load a model with Outlines and it can’t find a backend:
import outlines
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
# ImportError: To use the transformers integration, please install
# transformers, sentencepiece and torch.

Or your JSON-constrained generation returns plain text:
generator = outlines.generate.json(model, MySchema)
result = generator("Extract user data: ...")
# Returns a string, not a parsed object: you forgot to pass a Pydantic model.

Or the regex generator takes 30+ seconds before sampling the first token:
gen = outlines.generate.regex(model, r"\d{4}-\d{2}-\d{2}")
out = gen("Today is")
# Long pause before any output.

Or choice returns an unexpected value not in your list:
gen = outlines.generate.choice(model, ["yes", "no"])
out = gen("Is the sky blue?")
print(out)  # "Yes." is not in the list!

Why This Happens
Outlines constrains the LLM’s token sampling so output must match a grammar (regex, JSON schema, choice list). Most issues come from:
- Backend selection. Outlines supports transformers, vllm, llamacpp, mlx, and (less natively) openai. Each has its own install path and API. Choosing the wrong one or missing dependencies breaks at load.
- First-time grammar compilation is slow. Outlines converts the constraint into a finite-state machine over the model’s tokenizer. This is fast at sampling time, but the build can take 10-60s for complex JSON schemas. Subsequent runs cache it.
- Choice/JSON token boundaries. The constraint operates on tokens, not characters. Capitalization, leading whitespace, and trailing punctuation are all part of a token, so "Yes." and "yes" are different outputs as far as the constraint is concerned. If a value outside the list comes back, check that the list contains the exact strings you expect and that the backend actually exposes logits (see the note on the OpenAI backend in Fix 1).
- Local model selection. Constrained sampling only helps if the underlying model can also write reasonable content. A 1B model forced into a JSON schema produces valid-shaped garbage.
Fix 1: Pick a Backend and Install Its Deps
# Transformers (HF, GPU or CPU):
pip install outlines transformers torch sentencepiece accelerate
# vLLM (GPU, fastest for production):
pip install outlines vllm
# llama.cpp:
pip install outlines llama-cpp-python
# Apple Silicon native:
pip install outlines mlx mlx-lm
# OpenAI (via API, no GPU needed):
pip install outlines openai

Then load:
import outlines
# Transformers:
model = outlines.models.transformers(
    "microsoft/Phi-3-mini-4k-instruct",
    device="cuda",  # or "cpu", "mps"
)
# vLLM:
model = outlines.models.vllm("meta-llama/Meta-Llama-3-8B-Instruct")
# llama.cpp (GGUF):
model = outlines.models.llamacpp(
    "TheBloke/Llama-2-7B-Chat-GGUF",
    "llama-2-7b-chat.Q4_K_M.gguf",
)
# OpenAI:
model = outlines.models.openai("gpt-4o-mini")

Pro Tip: For production deployments, prefer vLLM. It’s significantly faster than Transformers for batched constrained generation because Outlines integrates as a logits processor without per-token Python overhead.
Note: The OpenAI backend can’t do true constrained generation (no logits access). It uses prompt engineering + post-validation under the hood. For real constraint enforcement, you need a model whose logits you control.
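Before loading a model, it can help to check which backend dependencies are importable. The following is a minimal sketch using only the standard library. The module names in BACKEND_MODULES are assumptions derived from the pip commands above (for example, llama-cpp-python is imported as llama_cpp), and missing_modules is a helper invented for this example, not part of Outlines.

```python
from importlib.util import find_spec

# Import names per backend. This mapping is an assumption based on the pip
# commands above; adjust it to match your environment.
BACKEND_MODULES = {
    "transformers": ["transformers", "torch", "sentencepiece"],
    "vllm": ["vllm"],
    "llamacpp": ["llama_cpp"],
    "mlx": ["mlx", "mlx_lm"],
    "openai": ["openai"],
}


def missing_modules(backend):
    """List the top-level modules for `backend` that cannot be found."""
    return [name for name in BACKEND_MODULES[backend] if find_spec(name) is None]


# Print one status line per backend without importing anything heavy.
for backend in BACKEND_MODULES:
    missing = missing_modules(backend)
    status = "ok" if not missing else "missing: " + ", ".join(missing)
    print(f"{backend:12s} {status}")
```

find_spec only locates a module without importing it, so the check stays fast even when torch or vllm is installed.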
Fix 2: JSON Generation With Pydantic
The cleanest way to define a JSON schema:
from pydantic import BaseModel, Field
import outlines
class User(BaseModel):
    name: str
    age: int = Field(ge=0, le=150)
    email: str | None = None

generator = outlines.generate.json(model, User)
result = generator("Extract: John Doe is 30 years old, email [email protected]")
print(result)
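# User(name='John Doe', age=30, email='[email protected]')

Pydantic’s Field(ge=..., le=..., pattern=...) constraints translate to schema constraints that Outlines enforces during sampling. Coverage of schema keywords varies between Outlines versions, so validate numeric bounds on the result rather than assuming they were enforced.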
For non-Pydantic schemas, pass raw JSON schema:
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer", "minimum": 0},
    },
    "required": ["name", "age"],
}
generator = outlines.generate.json(model, schema)
result = generator("...")
# result is a dict, not a Pydantic model

Common Mistake: Schemas with $ref to external definitions. Outlines doesn’t follow $ref outside the schema. Inline everything or use Pydantic’s auto-generated schema (which keeps all definitions local under $defs).
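One way to catch this before handing a schema to a generator is to scan it for references that point outside the document. This is a minimal sketch, assuming that any $ref not starting with "#" counts as external. The external_refs helper and the example.com URL are hypothetical and only serve the illustration.

```python
def external_refs(schema):
    """Collect every $ref value that does not point inside the same document."""
    found = []

    def walk(node):
        if isinstance(node, dict):
            ref = node.get("$ref")
            if isinstance(ref, str) and not ref.startswith("#"):
                found.append(ref)
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(schema)
    return found


schema = {
    "type": "object",
    "properties": {
        "user": {"$ref": "#/$defs/User"},  # local: resolvable within the schema
        "address": {"$ref": "https://example.com/address.json"},  # external (hypothetical URL)
    },
    "$defs": {"User": {"type": "object", "properties": {"name": {"type": "string"}}}},
}

print(external_refs(schema))
```

An empty list means every reference is local; anything else needs to be inlined first.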
Fix 3: Choice With Token-Aware Options
For multi-class classification with strict outputs:
generator = outlines.generate.choice(model, ["positive", "negative", "neutral"])
sentiment = generator("Review: 'Loved the food, hated the service.'")

Outlines constrains sampling so only tokens that lead to one of the choices are allowed. The output is guaranteed to be exactly one of the strings.
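To make the idea of "only tokens that lead to one of the choices" concrete, here is a toy sketch of the filtering step. It is not how Outlines is implemented (Outlines compiles the constraint into a finite-state machine over the real tokenizer); the vocabulary and the allowed_tokens helper are invented for illustration.

```python
def allowed_tokens(generated, choices, vocab):
    """Return the tokens that keep `generated + token` a prefix of some choice."""
    allowed = []
    for token in vocab:
        candidate = generated + token
        if any(choice.startswith(candidate) for choice in choices):
            allowed.append(token)
    return allowed


# Toy vocabulary (an assumption): real tokenizers have tens of thousands of entries.
VOCAB = ["yes", "Yes.", "y", "es", "no", "n", "o", " maybe"]
CHOICES = ["yes", "no"]

first_step = allowed_tokens("", CHOICES, VOCAB)    # nothing generated yet
second_step = allowed_tokens("y", CHOICES, VOCAB)  # after the token "y"
print(first_step)
print(second_step)
```

In this toy setup the token "Yes." is never offered, because no choice starts with it. Matching happens on exact strings, so case and punctuation count.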
Common Mistake: Choices that are prefixes of each other:
generator = outlines.generate.choice(model, ["yes", "yes_strongly"])
# Ambiguous: after "yes", the model might continue to "_strongly" or stop.

Fix by adding terminating context or making choices unambiguous, so that no option is a prefix of another:

generator = outlines.generate.choice(model, ["yes", "strongly_yes", "no"])

If you need free-form output that starts with one of several phrases, use regex instead:
gen = outlines.generate.regex(model, r"(positive|negative|neutral)\b.*")

Fix 4: Regex Generation
For arbitrary patterns:
# Phone number:
gen = outlines.generate.regex(model, r"\d{3}-\d{3}-\d{4}")
# ISO date:
gen = outlines.generate.regex(model, r"\d{4}-\d{2}-\d{2}")
# IP address:
gen = outlines.generate.regex(model, r"(\d{1,3}\.){3}\d{1,3}")

The compilation cost for these is one-time per (model, regex) pair. Outlines caches it across calls for the same generator object, so reuse generators rather than creating fresh ones in a loop.
# Slow: recompiles on every call.
for prompt in prompts:
    gen = outlines.generate.regex(model, pattern)
    out = gen(prompt)

# Fast: compile once, then reuse.
gen = outlines.generate.regex(model, pattern)
for prompt in prompts:
    out = gen(prompt)

For very complex regex (lots of alternation, deep nesting), compilation can take 30s+. Simplify the pattern or use JSON/choice if applicable.
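When generators are requested from several places in a codebase, a small memoized factory keeps the "compile once" rule from being broken by accident. The sketch below uses functools.lru_cache. build_generator is a hypothetical stand-in for outlines.generate.regex(model, pattern) so the example runs without a model; get_generator and build_calls are likewise invented for the illustration.

```python
from functools import lru_cache

build_calls = []  # records each expensive build, for demonstration only


def build_generator(pattern):
    """Hypothetical stand-in for outlines.generate.regex(model, pattern)."""
    build_calls.append(pattern)
    return lambda prompt: f"<output constrained by {pattern}>"


@lru_cache(maxsize=32)
def get_generator(pattern):
    """Build the generator for `pattern` once and reuse it afterwards."""
    return build_generator(pattern)


DATE = r"\d{4}-\d{2}-\d{2}"
for prompt in ["Today is", "The release date is", "Deadline:"]:
    gen = get_generator(DATE)  # only the first call pays the build cost
    out = gen(prompt)

print(len(build_calls))  # number of times the expensive build actually ran
```

With a real model, the cache key would also need to identify the model, since the compiled constraint depends on its tokenizer.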
Fix 5: Streaming Constrained Output
For long outputs you want to display progressively:
generator = outlines.generate.text(model)
for chunk in generator.stream("Write a story about..."):
    print(chunk, end="", flush=True)

For constrained generators that have a stream method:
gen = outlines.generate.json(model, User)
for partial in gen.stream("Extract: ..."):
    print(partial)

Streaming a JSON generator yields tokens; you reconstruct the partial JSON yourself. Not all backends support streaming uniformly: Transformers and vLLM do; llama.cpp and OpenAI depend on the version.
Note: Constrained streaming is rarely a UX improvement for short structured outputs (a User object completes in <500ms). Streaming shines for long-form text or large arrays.
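Reconstructing the JSON from a token stream can be as simple as buffering the chunks and parsing once the stream ends. This is a minimal sketch with a simulated stream: the chunk lists stand in for gen.stream(...), and collect_json is a helper made up for this example.

```python
import json


def collect_json(chunks):
    """Concatenate streamed chunks and return (parsed, raw_text).

    `parsed` is None when the text is not valid JSON, for example when
    generation stopped before the object was closed.
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
    try:
        return json.loads(buffer), buffer
    except json.JSONDecodeError:
        return None, buffer


# Simulated token stream (an assumption standing in for gen.stream(...)).
complete = ['{"name": "', "John", ' Doe", ', '"age": ', "30}"]
truncated = complete[:3]  # stream cut off mid-object

parsed, raw = collect_json(complete)
print(parsed)
cut, raw_cut = collect_json(truncated)
print(cut, repr(raw_cut))
```

Returning the raw text alongside the parsed value makes it easier to see where a truncated stream stopped.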
Fix 6: Multiple Generators Share One Model
Loading a 7B model can take 20-60s and roughly 14 GB of VRAM at 16-bit precision. Don’t reload per generator:
# Load model once:
model = outlines.models.transformers("model-name")
# Reuse for multiple generators:
sentiment_gen = outlines.generate.choice(model, ["positive", "negative"])
date_gen = outlines.generate.regex(model, r"\d{4}-\d{2}-\d{2}")
user_gen = outlines.generate.json(model, User)

Each generator caches its FSM independently. Switching between them is free at runtime.
For batched throughput, prefer vLLM:
model = outlines.models.vllm("model-name", gpu_memory_utilization=0.9)
gen = outlines.generate.json(model, User)
# Batch processing: pass the whole list so vLLM can batch internally.
# A per-prompt loop would run the requests one at a time.
results = gen(prompts)

Fix 7: Sampling Parameters
Control creativity and length:
from outlines.samplers import multinomial, greedy
# Default is multinomial sampling at temp=1.0 — varied outputs.
generator = outlines.generate.text(model, sampler=multinomial(temperature=0.7))
# Greedy: deterministic, picks highest-probability token at each step.
generator = outlines.generate.text(model, sampler=greedy())
# Generate with explicit max tokens:
result = generator("...", max_tokens=512)

For reproducible runs:
import torch
torch.manual_seed(42)
generator = outlines.generate.text(model, sampler=multinomial(temperature=0.7))

Common Mistake: Using multinomial(temperature=0.0). Multinomial with zero temperature is undefined; use greedy() for “always pick the most likely token.”
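The reason zero temperature is undefined shows up in the arithmetic: temperature divides the logits before the softmax. Here is a small standalone sketch of that calculation. The logits are made up, and both functions are illustrative helpers rather than Outlines internals.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by `temperature`."""
    if temperature <= 0:
        # Dividing by zero has no meaning here; greedy selection is the limit case.
        raise ValueError("temperature must be > 0; use greedy selection instead")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Index of the highest logit: the deterministic choice."""
    return max(range(len(logits)), key=lambda i: logits[i])


logits = [2.0, 1.0, 0.1]  # made-up scores for three tokens
warm = softmax_with_temperature(logits, 1.0)
cool = softmax_with_temperature(logits, 0.2)
print([round(p, 3) for p in warm])
print([round(p, 3) for p in cool])
print(greedy_pick(logits))
```

Lower temperatures concentrate probability on the top token, and greedy selection is what that trend converges to.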
Fix 8: Prompt Templates
For chat models, format the prompt with the model’s expected template:
# Manual templating:
prompt = """<|user|>
Extract user info from: John Doe, age 30
<|end|>
<|assistant|>"""
result = generator(prompt)

Or use the tokenizer’s template:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
messages = [
    {"role": "system", "content": "You extract user info as JSON."},
    {"role": "user", "content": "John Doe, age 30"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
result = generator(prompt)

For Outlines + vLLM, pass the chat template via the vLLM model:
model = outlines.models.vllm("model-name")
# vLLM applies the template automatically when you pass messages.

Pro Tip: Mismatched chat templates are the #1 reason a model that “should be smart” returns nonsense. Always check the model’s README for the exact template format.
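If you do template by hand, generating the prompt from a message list is less error-prone than editing a multi-line string. The sketch below reproduces only the <|role|> ... <|end|> layout of the manual example above. format_chat is a hypothetical helper, and the layout is an assumption that will not fit every model, so prefer the tokenizer’s own template when one is available.

```python
def format_chat(messages, add_generation_prompt=True):
    """Render messages in the <|role|> ... <|end|> layout of the manual example."""
    parts = []
    for message in messages:
        parts.append(f"<|{message['role']}|>\n{message['content']}\n<|end|>\n")
    if add_generation_prompt:
        parts.append("<|assistant|>")  # leave the assistant turn open
    return "".join(parts)


messages = [{"role": "user", "content": "Extract user info from: John Doe, age 30"}]
prompt = format_chat(messages)
print(prompt)
```

Comparing this output character by character against the tokenizer’s rendering is a quick way to spot a template mismatch.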
Still Not Working?
A few less-obvious failures:
- Generation starts but produces gibberish for the first 5 tokens. The model is warming up the cache. Discard the first run or use model.warmup() if your backend exposes it.
- json generation returns None for optional fields. That’s correct: your schema marks them optional and the model didn’t fill them. Tighten the prompt if you want all fields populated.
- vLLM Outlines plugin not picked up. vLLM needs the outlines-vllm integration. Install via pip install outlines[vllm] or check vLLM’s --guided-decoding-backend flag.
- mlx backend returns empty strings on M-series Macs. MLX support is newer and less stable. Try transformers with device="mps" as a fallback while you debug.
- max_tokens exceeded with no warning. Outlines silently truncates if the model hits the limit before completing the schema. For structured output, set max_tokens generously.
- GPU runs out of memory immediately. Constrained generation has some overhead. Reduce gpu_memory_utilization (vLLM) or batch size, or use a smaller model.
- outlines.generate.cfg for context-free grammars hangs. CFG support is experimental and the grammar must be in Lark format. Stick to regex/JSON for production.
- Pydantic v1 schemas reject extra fields. Outlines 0.x targets Pydantic v2. Update your models or pin a compatible Outlines version.
For related LLM constraint and validation issues, see Instructor not working, DSPy not working, Pydantic validation error, and vLLM not working.
Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.