Fix: Outlines Not Working — Backend Setup, Pydantic Schemas, Regex, Choice, and Slow Sampling

FixDevs

Quick Answer

How to fix common Python Outlines errors: a missing model backend, JSON schema vs. Pydantic output types, slow regex compilation, unexpected choice outputs, Transformers/vLLM/llama.cpp/OpenAI wiring, and streaming structured outputs.

The Error

You try to load a model with Outlines and it can’t find a backend:

import outlines

model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
# ImportError: To use the transformers integration, please install
# transformers, sentencepiece and torch.

Or your JSON-constrained generation does not return the typed object you expected:

generator = outlines.generate.json(model, MySchema)
result = generator("Extract user data: ...")
# Returns a plain dict, not a typed object: MySchema was a raw JSON schema, not a Pydantic model.

Or the regex generator takes 30+ seconds before sampling the first token:

gen = outlines.generate.regex(model, r"\d{4}-\d{2}-\d{2}")
out = gen("Today is")
# Long pause before any output.

Or choice returns an unexpected value not in your list:

gen = outlines.generate.choice(model, ["yes", "no"])
out = gen("Is the sky blue?")
print(out)  # "Yes." — not in the list!

Why This Happens

Outlines constrains the LLM’s token sampling so output must match a grammar (regex, JSON schema, choice list). Most issues come from:

  • Backend selection. Outlines supports transformers, vllm, llamacpp, mlx, and (less native) openai. Each has its own install path and API. Choosing the wrong one or missing deps breaks at load.
  • First-time grammar compilation is slow. Outlines converts the constraint into a finite-state machine over the model’s tokenizer. Sampling against that machine is fast, but building it can take 10-60s for complex JSON schemas. Later runs reuse the cached result.
  • Choice/JSON token boundaries. The constraint operates on tokens, not characters, so how each choice splits into tokens matters. When the constraint is applied to the logits, a token like "Yes." is masked out because no entry in ["yes", "no"] starts with it. An out-of-list answer therefore usually means the constraint was not enforced at sampling time, which is the case for API backends without logits access.
  • Local model selection. Constrained sampling only helps if the underlying model can also write reasonable content. A 1B model forced into a JSON schema produces valid-shaped garbage.
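To make the token-boundary point concrete, here is a toy sketch of choice masking in plain Python. It is not how Outlines is implemented (Outlines builds a finite-state machine over the real tokenizer); the tiny vocabulary and the `allowed_tokens` helper are made up for illustration.

```python
def allowed_tokens(prefix, choices, vocab):
    """Return the vocab tokens that keep prefix + token a prefix of some choice."""
    return [
        tok for tok in vocab
        if any(choice.startswith(prefix + tok) for choice in choices)
    ]

# Hypothetical tokenizer vocabulary; real ones have tens of thousands of entries.
vocab = ["yes", "Yes.", "y", "es", "no", "n", "o", "."]
choices = ["yes", "no"]

first_step = allowed_tokens("", choices, vocab)    # tokens allowed at the start
second_step = allowed_tokens("y", choices, vocab)  # tokens allowed after "y"

print(first_step)
print(second_step)
```

In this sketch "Yes." is never eligible because no choice starts with it, which is the behavior you should expect whenever the mask is applied to the logits.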

Fix 1: Pick a Backend and Install Its Deps

# Transformers (HF, GPU or CPU):
pip install outlines transformers torch sentencepiece accelerate

# vLLM (GPU, fastest for production):
pip install outlines vllm

# llama.cpp:
pip install outlines llama-cpp-python

# Apple Silicon native:
pip install outlines mlx mlx-lm

# OpenAI (via API, no GPU needed):
pip install outlines openai
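Before loading a model, it can help to check which backend dependencies are importable in the current environment. The sketch below uses only the standard library; the `BACKEND_MODULES` mapping is an assumption based on the pip commands above (for example, llama-cpp-python imports as `llama_cpp` and mlx-lm as `mlx_lm`).

```python
from importlib.util import find_spec

# Assumed import names per backend, derived from the install commands above.
BACKEND_MODULES = {
    "transformers": ["transformers", "torch", "sentencepiece"],
    "vllm": ["vllm"],
    "llamacpp": ["llama_cpp"],
    "mlx": ["mlx", "mlx_lm"],
    "openai": ["openai"],
}

def missing_modules(modules):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]

for backend, modules in BACKEND_MODULES.items():
    missing = missing_modules(modules)
    status = "ok" if not missing else "missing " + ", ".join(missing)
    print(f"{backend}: {status}")
```

Run it inside the same virtual environment as your application; a backend listed as missing here would fail at load time.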

Then load:

import outlines

# Transformers:
model = outlines.models.transformers(
    "microsoft/Phi-3-mini-4k-instruct",
    device="cuda",  # or "cpu", "mps"
)

# vLLM:
model = outlines.models.vllm("meta-llama/Meta-Llama-3-8B-Instruct")

# llama.cpp (GGUF):
model = outlines.models.llamacpp(
    "TheBloke/Llama-2-7B-Chat-GGUF",
    "llama-2-7b-chat.Q4_K_M.gguf",
)

# OpenAI:
model = outlines.models.openai("gpt-4o-mini")

Pro Tip: For production deployments, prefer vLLM. It’s significantly faster than Transformers for batched constrained generation because Outlines integrates as a logits processor without per-token Python overhead.

Note: The OpenAI backend can’t do true constrained generation (no logits access). It uses prompt engineering + post-validation under the hood. For real constraint enforcement, you need a model whose logits you control.

Fix 2: JSON Generation With Pydantic

The cleanest way to define a JSON schema:

from pydantic import BaseModel, Field
import outlines

class User(BaseModel):
    name: str
    age: int = Field(ge=0, le=150)
    email: str | None = None

generator = outlines.generate.json(model, User)
result = generator("Extract: John Doe is 30 years old, email john@example.com")
print(result)
# User(name='John Doe', age=30, email='john@example.com')

Pydantic’s Field(ge=..., le=..., pattern=...) constraints translate to schema constraints that Outlines enforces during sampling.

For non-Pydantic schemas, pass raw JSON schema:

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer", "minimum": 0},
    },
    "required": ["name", "age"],
}

generator = outlines.generate.json(model, schema)
result = generator("...")
# result is a dict, not a Pydantic model

Common Mistake: Schemas with $ref to external definitions. Outlines doesn’t follow $ref outside the schema. Inline everything or use Pydantic’s auto-generated schema (which inlines via $defs).
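One way to catch this before building a generator is to scan the schema for references that point outside the document. This is a minimal standard-library sketch; `external_refs` is a hypothetical helper, and it simply treats any `$ref` that does not start with `#` as external.

```python
def external_refs(schema):
    """Collect every $ref value that does not point inside the same document."""
    found = []
    if isinstance(schema, dict):
        for key, value in schema.items():
            if key == "$ref" and isinstance(value, str) and not value.startswith("#"):
                found.append(value)
            else:
                found.extend(external_refs(value))
    elif isinstance(schema, list):
        for item in schema:
            found.extend(external_refs(item))
    return found

schema = {
    "type": "object",
    "properties": {
        "address": {"$ref": "https://example.com/schemas/address.json"},
        "owner": {"$ref": "#/$defs/User"},
    },
    "$defs": {"User": {"type": "object", "properties": {"name": {"type": "string"}}}},
}

print(external_refs(schema))
```

An empty list means every reference resolves inside the schema itself; anything else needs to be inlined first.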

Fix 3: Choice With Token-Aware Options

For multi-class classification with strict outputs:

generator = outlines.generate.choice(model, ["positive", "negative", "neutral"])
sentiment = generator("Review: 'Loved the food, hated the service.'")

Outlines constrains sampling so only tokens that lead to one of the choices are allowed. The output is guaranteed to be exactly one of the strings.

Common Mistake: Choices that are prefixes of each other:

generator = outlines.generate.choice(model, ["yes", "yes_strongly"])
# Ambiguous: after "yes", the model might continue to "_strongly" or stop.

Fix by renaming the choices so that none is a prefix of another:

generator = outlines.generate.choice(model, ["yes", "strongly_yes", "no"])
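A quick check for prefix collisions can run before the generator is built. The helper below is a hypothetical, standard-library sketch that reports every pair where one choice is a strict prefix of another.

```python
def prefix_collisions(choices):
    """Return (shorter, longer) pairs where one choice is a strict prefix of another."""
    return [
        (a, b)
        for a in choices
        for b in choices
        if a != b and b.startswith(a)
    ]

print(prefix_collisions(["yes", "yes_strongly", "no"]))
print(prefix_collisions(["yes", "strongly_yes", "no"]))
```

The first list reports the ambiguous pair; the second is empty because no label is a prefix of another.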

If you need free-form output that starts with one of several phrases, use regex instead:

gen = outlines.generate.regex(model, r"(positive|negative|neutral)\b.*")

Fix 4: Regex Generation

For arbitrary patterns:

# Phone number:
gen = outlines.generate.regex(model, r"\d{3}-\d{3}-\d{4}")

# ISO date:
gen = outlines.generate.regex(model, r"\d{4}-\d{2}-\d{2}")

# IP address:
gen = outlines.generate.regex(model, r"(\d{1,3}\.){3}\d{1,3}")
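Keep in mind that a regex constrains the shape of the output, not its meaning: the date pattern accepts "2024-13-45" and the IP pattern accepts "999.999.999.999". A small post-validation step, sketched here with the standard library only (`parse_iso_date` is an illustrative helper, not part of Outlines), catches values that match the pattern but are not real dates.

```python
import re
from datetime import date

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

def parse_iso_date(text):
    """Return a date if text matches the pattern and is a real calendar date, else None."""
    if ISO_DATE.fullmatch(text) is None:
        return None
    try:
        return date.fromisoformat(text)
    except ValueError:
        # Right shape, impossible date (month 13, day 45, and so on).
        return None

print(parse_iso_date("2024-02-29"))
print(parse_iso_date("2024-13-45"))
```

If invalid values show up often, tightening the pattern (for example, restricting the month to 01-12) reduces retries at the cost of a more complex regex.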

The compilation cost for these is one-time per (model, regex) pair. Outlines caches it across calls for the same generator object — reuse generators rather than creating fresh ones in a loop.

# Slow — recompiles on every call:
for prompt in prompts:
    gen = outlines.generate.regex(model, pattern)
    out = gen(prompt)

# Fast — compile once:
gen = outlines.generate.regex(model, pattern)
for prompt in prompts:
    out = gen(prompt)

For very complex regex (lots of alternation, deep nesting), compilation can take 30s+. Simplify the pattern or use JSON/choice if applicable.
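If generators are requested from several places in a codebase, memoizing the build step keeps the "compile once" rule from depending on call-site discipline. The sketch below uses `functools.lru_cache`; `build_generator` is a stand-in for an expensive call such as `outlines.generate.regex(model, pattern)`, and the counter exists only to show how often the build runs.

```python
from functools import lru_cache

build_count = 0

def build_generator(pattern):
    """Stand-in for an expensive build step; returns a fake generator function."""
    global build_count
    build_count += 1
    return lambda prompt: f"<output for {prompt!r} constrained by {pattern}>"

@lru_cache(maxsize=None)
def get_generator(pattern):
    # The first call per pattern builds; later calls return the cached object.
    return build_generator(pattern)

for prompt in ["Today is", "Tomorrow is", "Yesterday was"]:
    gen = get_generator(r"\d{4}-\d{2}-\d{2}")
    gen(prompt)

print(build_count)
```

With a real model you would key the cache on both the model and the pattern, since the compiled constraint depends on the tokenizer.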

Fix 5: Streaming Constrained Output

For long outputs you want to display progressively:

generator = outlines.generate.text(model)
for chunk in generator.stream("Write a story about..."):
    print(chunk, end="", flush=True)

For constrained generators that have a stream method:

gen = outlines.generate.json(model, User)
for partial in gen.stream("Extract: ..."):
    print(partial)

Streaming a JSON generator yields tokens; you reconstruct the partial JSON yourself. Not all backends support streaming uniformly — Transformers and vLLM do; llama.cpp and OpenAI depend on the version.
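Reconstructing the object from streamed pieces can be as simple as buffering the text and parsing it once the stream ends. This sketch simulates the chunks with a plain list; in a real run you would iterate over `gen.stream(prompt)` instead, and the chunk boundaries shown here are invented.

```python
import json

def collect_stream(chunks):
    """Join streamed text chunks and parse the result once the stream is complete."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # A UI could render `buffer` here to show the partial text.
    return json.loads(buffer)

# Simulated chunks standing in for the output of a streaming generator.
simulated = ['{"name": "Jo', 'hn Doe", ', '"age": 30', "}"]
user = collect_stream(simulated)
print(user)
```

Parsing only at the end avoids having to handle half-written JSON; if you need live field updates, an incremental JSON parser is a better fit than repeated `json.loads` calls.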

Note: Constrained streaming is rarely a UX improvement for short structured outputs (a User object completes in <500ms). Streaming shines for long-form text or large arrays.

Fix 6: Multiple Generators Share One Model

Loading a 7B model takes 20-60s and roughly 14 GB of VRAM at 16-bit precision. Don’t reload per generator:

# Load model once:
model = outlines.models.transformers("model-name")

# Reuse for multiple generators:
sentiment_gen = outlines.generate.choice(model, ["positive", "negative"])
date_gen = outlines.generate.regex(model, r"\d{4}-\d{2}-\d{2}")
user_gen = outlines.generate.json(model, User)

Each generator caches its FSM independently. Switching between them is free at runtime.

For batched throughput, prefer vLLM:

model = outlines.models.vllm("model-name", gpu_memory_utilization=0.9)
gen = outlines.generate.json(model, User)

# Batch processing: pass the whole list in one call so vLLM can batch it.
# A list comprehension over gen(prompt) would submit prompts one at a time.
results = gen(prompts)

Fix 7: Sampling Parameters

Control creativity and length:

from outlines.samplers import multinomial, greedy

# Default is multinomial sampling at temp=1.0 — varied outputs.
generator = outlines.generate.text(model, sampler=multinomial(temperature=0.7))

# Greedy: deterministic, picks highest-probability token at each step.
generator = outlines.generate.text(model, sampler=greedy())

# Generate with explicit max tokens:
result = generator("...", max_tokens=512)

For reproducible runs:

import torch

torch.manual_seed(42)
generator = outlines.generate.text(model, sampler=multinomial(temperature=0.7))

Common Mistake: Using multinomial(temperature=0.0). Multinomial with zero temperature is undefined; use greedy() for “always pick the most likely token.”
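The reason is visible in the arithmetic: temperature divides the logits before the softmax, so a value of zero means dividing by zero. The pure-Python sketch below illustrates the math only; it is not Outlines' sampler code, and both function names are made up.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after scaling them by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use greedy selection instead")
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Return the index of the largest logit (the limit as temperature approaches 0)."""
    return max(range(len(logits)), key=logits.__getitem__)

logits = [1.0, 3.0, 2.0]
warm = softmax_with_temperature(logits, 1.0)
cool = softmax_with_temperature(logits, 0.5)
print(warm, cool, greedy_pick(logits))
```

Lowering the temperature concentrates probability on the top token, and greedy selection is what that process converges to.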

Fix 8: Prompt Templates

For chat models, format the prompt with the model’s expected template:

# Manual templating:
prompt = """<|user|>
Extract user info from: John Doe, age 30
<|end|>
<|assistant|>"""

result = generator(prompt)
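If you build prompts by hand in more than one place, a small formatter keeps the markers consistent. The helper below is a sketch that reproduces the layout of the manual prompt shown above; the exact tokens are an assumption taken from that example, so check them against your model's documentation.

```python
def format_chat(messages):
    """Render messages in the <|role|> ... <|end|> layout used in the manual prompt."""
    parts = [f"<|{m['role']}|>\n{m['content']}\n<|end|>" for m in messages]
    parts.append("<|assistant|>")  # generation prompt: the model continues from here
    return "\n".join(parts)

prompt = format_chat([
    {"role": "user", "content": "Extract user info from: John Doe, age 30"},
])
print(prompt)
```

For anything beyond a quick experiment, the tokenizer's own template is the safer source of truth because it ships with the model.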

Or use the tokenizer’s template:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
messages = [
    {"role": "system", "content": "You extract user info as JSON."},
    {"role": "user", "content": "John Doe, age 30"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

result = generator(prompt)

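For Outlines + vLLM, the generator still takes a plain prompt string, so build the prompt with the tokenizer’s chat template as shown above:

model = outlines.models.vllm("model-name")
# Format the prompt with apply_chat_template first, then call the generator with that string.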

Pro Tip: Mismatched chat templates are the #1 reason a model that “should be smart” returns nonsense. Always check the model’s README for the exact template format.

Still Not Working?

A few less-obvious failures:

  • Generation starts but produces gibberish for the first few tokens. This usually points to a prompt-format problem rather than to the constraint itself. Check that the prompt uses the model’s chat template (see Fix 8) before debugging anything else.
  • json generation returns None for optional fields. That’s correct — your schema marks them optional and the model didn’t fill them. Tighten the prompt if you want all fields populated.
  • vLLM Outlines plugin not picked up. vLLM needs the outlines-vllm integration. Install via pip install outlines[vllm] or check vLLM’s --guided-decoding-backend flag.
  • mlx backend returns empty strings on M-series Macs. MLX support is newer and less stable. Try transformers with device="mps" as a fallback while you debug.
  • max_tokens exceeded with no warning. Outlines silently truncates if the model hits the limit before completing the schema. For structured output, set max_tokens generously.
  • GPU runs out of memory immediately. Constrained generation has some overhead. Reduce gpu_memory_utilization (vLLM) or batch size, or use a smaller model.
  • outlines.generate.cfg for context-free grammars hangs. CFG support is experimental and the grammar must be in Lark format. Stick to regex/JSON for production.
  • Pydantic v1 schemas reject extra fields. Outlines 0.x targets Pydantic v2. Update your models or pin a compatible Outlines version.
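For the max_tokens case in the list above, a defensive parse makes truncation visible instead of silent. This is a standard-library sketch that assumes you have the raw generated text in hand; `parse_or_flag` is a hypothetical helper.

```python
import json

def parse_or_flag(raw):
    """Return (parsed, None) on success or (None, reason) if the text is not complete JSON."""
    try:
        return json.loads(raw), None
    except json.JSONDecodeError as exc:
        return None, f"invalid or truncated JSON at char {exc.pos}"

complete = '{"name": "John Doe", "age": 30}'
truncated = '{"name": "John Doe", "ag'

print(parse_or_flag(complete))
print(parse_or_flag(truncated))
```

A flagged result is a signal to raise max_tokens or shorten the schema, not to retry with the same settings.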

For related LLM constraint and validation issues, see Instructor not working, DSPy not working, Pydantic validation error, and vLLM not working.

FixDevs

Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.
