Fix: Cloudflare Workers AI Not Working — AI Binding, Model IDs, Streaming, and Vectorize Integration
Quick Answer
How to fix Cloudflare Workers AI errors — env.AI binding setup, model ID format, text-generation streaming with ReadableStream, AI Gateway, Vectorize embeddings, region availability, and Neuron-based pricing.
The Error
You try to call env.AI and it’s undefined:
const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
  messages: [{ role: "user", content: "Hello" }],
});
// TypeError: Cannot read properties of undefined (reading 'run')
Or the model ID errors:
AI request failed: Model not found
Or streaming returns the entire response in one chunk:
const stream = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
  messages: [...],
  stream: true,
});
// Iterating reveals only one chunk with the full response.
Or Vectorize queries return empty results despite having indexed data:
const matches = await env.VECTORIZE.query(vector, { topK: 10 });
console.log(matches.matches); // []
Why This Happens
Cloudflare Workers AI is a Worker binding (env.AI) that runs hosted models, including LLMs, on Cloudflare’s GPU infrastructure. Two auxiliary services sit alongside it: AI Gateway (caching and observability) and Vectorize (a vector database). Most issues map to one of these:
- AI binding requires explicit declaration. Without [ai] in wrangler.toml, env.AI is undefined.
- Model IDs use a @cf/... prefix. This differs from the OpenAI/Anthropic format. Models come and go, so pin to specific IDs you’ve tested.
- Streaming returns a ReadableStream<Uint8Array> with SSE-formatted events. You need to parse the SSE wire format, not just iterate.
- Vectorize is separate from Workers AI. Workers AI generates embeddings; Vectorize stores and queries them. You need both bindings.
Fix 1: Bind Workers AI in wrangler.toml
name = "my-worker"
main = "src/worker.ts"
compatibility_date = "2026-05-01"
[ai]
binding = "AI"
That’s it: no model lists, no per-region config. The binding is a single object that supports all Workers AI models.
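A missing binding otherwise surfaces as a generic TypeError deep inside a handler. The following is a minimal sketch of a guard that fails fast with an actionable message. The helper name requireAi and the AiLike interface are hypothetical stand-ins introduced for illustration; in a real Worker you would use the generated Ai type instead.

```typescript
// Minimal stand-in for the binding type (assumption: only run() is needed here).
interface AiLike {
  run(model: string, input: unknown): Promise<unknown>;
}

// Hypothetical helper: throw a descriptive error when the binding is absent.
function requireAi(env: { AI?: AiLike }): AiLike {
  if (!env.AI) {
    throw new Error(
      'env.AI is undefined: add an [ai] block with binding = "AI" to wrangler.toml, then restart or redeploy',
    );
  }
  return env.AI;
}
```

Calling requireAi(env) at the top of fetch turns the undefined-binding case into a single, readable error.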
Type generation:
wrangler types
This generates worker-configuration.d.ts with the Ai type for the binding.
Basic call:
export interface Env {
  AI: Ai;
}
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
      messages: [
        { role: "system", content: "You are a helpful assistant." },
        { role: "user", content: "What's the capital of France?" },
      ],
    });
    return Response.json(response);
  },
};
The response shape depends on the model. For text generation:
type TextGenerationResponse = {
  response: string; // The generated text
};
For embeddings:
type EmbeddingResponse = {
  shape: number[];
  data: number[][]; // Array of embedding vectors
};
Pro Tip: Pin model IDs in a constants file. Cloudflare’s model catalog evolves; new versions of “the same model” get new IDs. Pinning avoids surprises.
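One way such a constants file might look is sketched below. The object name MODELS, its keys, and the ModelId type are illustrative choices, not part of any Cloudflare API; the IDs themselves are the ones used elsewhere in this article.

```typescript
// Hypothetical constants module: one place to pin the model IDs you have tested.
const MODELS = {
  chat: "@cf/meta/llama-3.1-8b-instruct",
  embedding: "@cf/baai/bge-base-en-v1.5",
} as const;

// Union of the pinned IDs, so a typo fails at compile time instead of at request time.
type ModelId = (typeof MODELS)[keyof typeof MODELS];

// Example of a function that only accepts pinned IDs.
function describeModel(id: ModelId): string {
  return `using pinned model ${id}`;
}
```

In a real project you would export MODELS from a shared module and pass MODELS.chat to env.AI.run, so upgrading a model is a one-line change.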
Fix 2: Use the Right Model
Workers AI categorizes models by task:
Text generation (chat):
- @cf/meta/llama-3.1-8b-instruct: general-purpose chat
- @cf/meta/llama-3.1-70b-instruct: larger, slower
- @cf/mistral/mistral-7b-instruct-v0.1: Mistral
- @cf/qwen/qwen1.5-14b-chat-awq: Qwen
Text embeddings:
- @cf/baai/bge-small-en-v1.5: fast, smaller dims
- @cf/baai/bge-base-en-v1.5: balanced
- @cf/baai/bge-large-en-v1.5: higher quality
Image generation:
- @cf/stabilityai/stable-diffusion-xl-base-1.0
- @cf/runwayml/stable-diffusion-v1-5-inpainting
Image-to-text:
- @cf/llava-hf/llava-1.5-7b-hf
- @cf/unum/uform-gen2-qwen-500m
Speech-to-text:
- @cf/openai/whisper
To list available models programmatically:
curl "https://api.cloudflare.com/client/v4/accounts/$ACCOUNT_ID/ai/models/search" \
  -H "Authorization: Bearer $API_TOKEN"
Common Mistake: Using OpenAI-style model IDs (gpt-4o, claude-3-sonnet). Workers AI runs Cloudflare-hosted models with their own IDs (@cf/meta/..., @cf/baai/...). Use the AI Gateway or direct API for OpenAI/Anthropic models.
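A cheap format check can catch OpenAI-style IDs before they reach the binding. The sketch below is an assumption about the naming pattern (a prefix such as @cf/ or @hf/, then a vendor and a model name), not an official validation rule, and the function name is hypothetical. It only checks the shape of the string; it cannot tell whether the model still exists in the catalog.

```typescript
// Assumed pattern: "@cf/<vendor>/<model>" (some catalog entries use "@hf/" instead).
const WORKERS_AI_ID = /^@(cf|hf)\/[\w.-]+\/[\w.-]+$/;

// Hypothetical helper: true when the string looks like a Workers AI model ID.
function isWorkersAiModelId(id: string): boolean {
  return WORKERS_AI_ID.test(id);
}
```

Running this check at startup or in a unit test flags IDs like gpt-4o immediately, instead of as a "Model not found" error at request time.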
Fix 3: Stream Responses Correctly
Streaming returns a ReadableStream<Uint8Array> with SSE-formatted events:
const stream = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
  messages: [...],
  stream: true,
});
// stream is a ReadableStream<Uint8Array>, NOT an iterable of objects.
return new Response(stream, {
  headers: { "content-type": "text/event-stream" },
});
For the simplest case, pipe the stream directly to the client; it’s already SSE-formatted.
To parse on the server (e.g. accumulate or transform):
const stream = await env.AI.run(model, { messages, stream: true });
const reader = stream.getReader();
const decoder = new TextDecoder();
let buffer = "";
let fullText = "";
let finished = false;
while (!finished) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split("\n");
  buffer = lines.pop() || ""; // Keep partial line
  for (const line of lines) {
    if (line.startsWith("data: ")) {
      const data = line.slice(6).trim();
      if (data === "[DONE]") {
        finished = true; // A bare break here would only exit the inner for loop
        break;
      }
      try {
        const parsed = JSON.parse(data);
        if (parsed.response) {
          fullText += parsed.response;
          // Use the token, e.g. send to client.
        }
      } catch {
        // Ignore malformed lines
      }
    }
  }
}
For piping to a client with transformation:
const aiStream = await env.AI.run(model, { messages, stream: true });
const transformed = aiStream.pipeThrough(new TransformStream({
  transform(chunk, controller) {
    // Process each chunk if needed.
    controller.enqueue(chunk);
  },
}));
return new Response(transformed, {
  headers: { "content-type": "text/event-stream" },
});
Common Mistake: Treating the stream as an async iterable of message objects. It’s bytes: parse the SSE format yourself or pipe directly.
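The parsing logic is easier to test when it is separated from the stream plumbing. Below is a minimal sketch of an incremental parser that accepts decoded text chunks and returns the tokens completed so far. The class name SseTokenParser is hypothetical, and the sketch assumes each event is a single data: line carrying JSON with a response field, as in the examples above.

```typescript
// Hypothetical incremental parser for "data: {...}" SSE lines.
class SseTokenParser {
  private buffer = "";
  private done = false;

  // Feed one decoded text chunk; returns the tokens completed by this chunk.
  push(chunk: string): string[] {
    if (this.done) return [];
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    this.buffer = lines.pop() ?? ""; // keep the trailing partial line for the next chunk
    const tokens: string[] = [];
    for (const line of lines) {
      if (!line.startsWith("data:")) continue; // skip blank separators and other fields
      const data = line.slice(5).trim();
      if (data === "[DONE]") {
        this.done = true;
        break;
      }
      try {
        const parsed = JSON.parse(data) as { response?: unknown };
        if (typeof parsed.response === "string") tokens.push(parsed.response);
      } catch {
        // Malformed JSON: skip the line rather than abort the whole stream.
      }
    }
    return tokens;
  }

  // True once the [DONE] sentinel has been seen.
  get finished(): boolean {
    return this.done;
  }
}
```

In a Worker you would call parser.push(decoder.decode(value, { stream: true })) inside the read loop and stop once parser.finished is true.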
Fix 4: AI Gateway for Caching and Observability
AI Gateway sits between your Worker and any LLM provider (OpenAI, Anthropic, Workers AI). Adds caching, rate limiting, and observability:
# wrangler.toml
[ai]
binding = "AI"
In the Worker, pass the gateway options as the third argument:
const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
  messages: [...],
}, {
  gateway: {
    id: "my-ai-gateway", // Created in CF dashboard → AI Gateway
    skipCache: false, // Use the cache if available
    cacheTtl: 3600, // Cache for 1 hour
  },
});
Without a gateway, Workers AI calls go direct. With one, identical prompts get cached responses, which can cut costs substantially for repeated queries.
For non-Workers-AI providers via the gateway:
const response = await fetch(
  `https://gateway.ai.cloudflare.com/v1/${accountId}/my-gateway/openai/chat/completions`,
  {
    method: "POST",
    headers: {
      "content-type": "application/json",
      "authorization": `Bearer ${env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({ model: "gpt-4o-mini", messages: [...] }),
  },
);
The gateway URL replaces api.openai.com: same OpenAI API shape, but with caching and logging through Cloudflare.
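If several providers go through the same gateway, building the URL in one place avoids typos in the path segments. This is a small sketch that follows the URL shape shown above; the function name gatewayUrl and its parameter names are hypothetical.

```typescript
// Hypothetical helper: builds a gateway URL in the shape
// https://gateway.ai.cloudflare.com/v1/<accountId>/<gatewayId>/<provider>/<path>
function gatewayUrl(
  accountId: string,
  gatewayId: string,
  provider: string,
  path: string,
): string {
  const segments = [accountId, gatewayId, provider].map((s) => encodeURIComponent(s));
  const cleanPath = path.replace(/^\/+/, ""); // tolerate a leading slash
  return `https://gateway.ai.cloudflare.com/v1/${segments.join("/")}/${cleanPath}`;
}
```

For example, gatewayUrl(accountId, "my-gateway", "openai", "chat/completions") produces the same URL as the template string in the fetch call above.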
Pro Tip: AI Gateway is free for basic use (cache, retries, analytics). The infrastructure cost is just the cache storage. For apps with overlapping prompts (chatbots with FAQs), the gateway pays for itself.
Fix 5: Embeddings + Vectorize for RAG
Vectorize is Cloudflare’s vector database. Combine with Workers AI for end-to-end RAG:
[ai]
binding = "AI"
[[vectorize]]
binding = "VECTORIZE"
index_name = "my-knowledge-base"
Create the index:
wrangler vectorize create my-knowledge-base \
  --dimensions=768 \
  --metric=cosine
dimensions must match your embedding model output (bge-small-en-v1.5 = 384, bge-base-en-v1.5 = 768, bge-large-en-v1.5 = 1024). The examples below use bge-base-en-v1.5, so the index is created with 768 dimensions.
Insert documents:
async function indexDocument(env: Env, doc: { id: string; text: string }) {
  // Generate embedding:
  const embeddingResp = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
    text: [doc.text],
  });
  // Insert into Vectorize:
  await env.VECTORIZE.upsert([
    {
      id: doc.id,
      values: embeddingResp.data[0],
      metadata: { text: doc.text },
    },
  ]);
}
Query for similar documents:
async function search(env: Env, query: string) {
  const queryEmbedding = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
    text: [query],
  });
  const results = await env.VECTORIZE.query(queryEmbedding.data[0], {
    topK: 5,
    returnMetadata: true,
  });
  return results.matches.map((m) => ({
    score: m.score,
    text: m.metadata?.text,
  }));
}
For RAG (retrieval + generation):
async function answer(env: Env, question: string) {
  const contexts = await search(env, question);
  const prompt = contexts.map((c) => c.text).join("\n\n");
  const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
    messages: [
      { role: "system", content: `Answer based on this context:\n${prompt}` },
      { role: "user", content: question },
    ],
  });
  return response.response;
}
Common Mistake: Mismatched dimensions between the model and Vectorize index. If you insert 384-dim vectors into a 768-dim index, insertion fails (or worse, silently truncates). Always match the model’s output dimensions.
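A guard in front of upsert makes a dimension mismatch fail loudly in your own code. The sketch below uses the three dimension values quoted in this article; the table name, the function name, and the idea of passing the index dimension in as a parameter are all illustrative assumptions.

```typescript
// Output dimensions as quoted in this article for the bge family.
const EMBEDDING_DIMENSIONS: Record<string, number> = {
  "@cf/baai/bge-small-en-v1.5": 384,
  "@cf/baai/bge-base-en-v1.5": 768,
  "@cf/baai/bge-large-en-v1.5": 1024,
};

// Hypothetical guard: throws when model, index, and vector disagree on dimensions.
function assertDimensionsMatch(
  model: string,
  indexDimensions: number,
  vector: number[],
): void {
  const expected = EMBEDDING_DIMENSIONS[model];
  if (expected === undefined) {
    throw new Error(`No known dimension for model ${model}`);
  }
  if (expected !== indexDimensions) {
    throw new Error(
      `Model ${model} outputs ${expected} dims but the index expects ${indexDimensions}`,
    );
  }
  if (vector.length !== indexDimensions) {
    throw new Error(
      `Vector has ${vector.length} dims but the index expects ${indexDimensions}`,
    );
  }
}
```

Calling it right before env.VECTORIZE.upsert keeps the check next to the write it protects.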
Fix 6: Cost — Neurons
Workers AI is priced in Neurons (Cloudflare’s compute unit). Each request consumes Neurons according to the model and the amount of work it does, such as the number of tokens processed. The free tier includes a daily quota (~10K Neurons/day at the time of writing).
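For a rough sense of how far a daily quota goes, the arithmetic can be sketched as follows. The per-request figure in the usage note is a made-up placeholder, not a real price; take your own average from the usage dashboard. The function name is hypothetical.

```typescript
// Approximate free daily quota quoted above; check current pricing before relying on it.
const FREE_DAILY_NEURONS = 10000;

// Hypothetical helper: whole requests that fit into a daily Neuron budget.
function requestsWithinQuota(dailyNeurons: number, avgNeuronsPerRequest: number): number {
  if (avgNeuronsPerRequest <= 0) {
    throw new RangeError("avgNeuronsPerRequest must be positive");
  }
  return Math.floor(dailyNeurons / avgNeuronsPerRequest);
}
```

With a placeholder average of 25 Neurons per request, requestsWithinQuota(FREE_DAILY_NEURONS, 25) gives 400 requests per day.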
To monitor:
- Dashboard → Workers AI → Usage.
Strategies to control cost:
- Cache via AI Gateway. Identical prompts return cached responses (~0 cost).
- Pick smaller models. llama-3.1-8b is much cheaper than llama-3.1-70b per request.
- Batch embedding requests. Embedding models accept arrays of text:
const response = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
  text: ["text 1", "text 2", "text 3"], // Batched
});
// response.data: number[][], one embedding per input
One call, multiple embeddings: fewer round trips and less per-request overhead than N separate calls.
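For a large corpus you would typically split the texts into fixed-size batches rather than send everything in one call. The sketch below shows one way to do that. The helper names, the EmbeddingRunner interface, and the default batch size of 50 are assumptions made for illustration; check the model's documented input limits before choosing a size.

```typescript
// Split an array into consecutive batches of at most `size` items.
function toBatches<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Minimal stand-in for the part of the binding used here (assumption).
interface EmbeddingRunner {
  run(model: string, input: { text: string[] }): Promise<{ data: number[][] }>;
}

// Hypothetical helper: embeds all texts, one request per batch, preserving order.
async function embedAll(
  ai: EmbeddingRunner,
  model: string,
  texts: string[],
  batchSize = 50, // placeholder value, not a documented limit
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of toBatches(texts, batchSize)) {
    const { data } = await ai.run(model, { text: batch });
    vectors.push(...data);
  }
  return vectors;
}
```

Because batches are processed in order, vectors[i] corresponds to texts[i], which keeps the later upsert step simple.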
For high-volume production, consider the AI Gateway’s per-key rate limits to prevent runaway costs:
- Dashboard → AI Gateway → your gateway → Settings → Rate limiting.
Pro Tip: Set up cost alerts in Cloudflare’s billing settings. Workers AI bills can surprise you on a chatty app — alerts give early warning.
Fix 7: Function Calling and Tool Use
Some Workers AI models support function calling (e.g. Llama 3.1):
const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
  messages: [{ role: "user", content: "What's the weather in Tokyo?" }],
  tools: [
    {
      type: "function",
      function: {
        name: "get_weather",
        description: "Get the current weather for a city",
        parameters: {
          type: "object",
          properties: {
            city: { type: "string" },
          },
          required: ["city"],
        },
      },
    },
  ],
});
// If the model wants to call a tool:
if (response.tool_calls) {
  for (const call of response.tool_calls) {
    // The call shape varies by model: arguments may arrive as an object on the
    // call itself, or as an OpenAI-style JSON string under call.function.
    const raw = (call as any).function?.arguments ?? (call as any).arguments;
    const args = typeof raw === "string" ? JSON.parse(raw) : raw;
    // Execute the function, e.g. fetch weather API.
    // Then send the result back in the next message.
  }
}
Function calling support varies by model: Llama 3.1+ has it; older models may not. Check the model’s docs.
Common Mistake: Assuming function calling response format matches OpenAI’s. Workers AI’s format is similar but not identical. Test before assuming compatibility.
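Since the tool-call format may differ between models, one defensive option is to normalize every call into a single shape before dispatching it. The sketch below accepts both a flat { name, arguments } object and an OpenAI-style { function: { name, arguments } } object with a JSON string. Both shapes are assumptions about what a model may return, and the type and function names are hypothetical.

```typescript
// Normalized shape used by the rest of the application.
type NormalizedToolCall = { name: string; args: Record<string, unknown> };

// Loose view of the shapes a tool call might take (assumption, not an official type).
type RawToolCall = {
  name?: unknown;
  arguments?: unknown;
  function?: { name?: unknown; arguments?: unknown };
};

// Hypothetical helper: converts either shape into NormalizedToolCall.
function normalizeToolCall(call: unknown): NormalizedToolCall {
  const c = (call ?? {}) as RawToolCall;
  const name = c.function?.name ?? c.name;
  if (typeof name !== "string") {
    throw new Error("Tool call has no function name");
  }
  const raw = c.function?.arguments ?? c.arguments;
  // Arguments may be a JSON string or an already-parsed object.
  const args: unknown = typeof raw === "string" ? JSON.parse(raw) : (raw ?? {});
  return { name, args: args as Record<string, unknown> };
}
```

A dispatcher can then look up normalizeToolCall(call).name in a map of handlers without caring which format the model produced.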
Fix 8: Region Availability and Latency
Workers AI runs in select Cloudflare data centers with GPU hardware. Not every region has every model. Cloudflare routes requests to the nearest available data center automatically.
For latency-sensitive apps:
- Test from your users’ regions. Workers AI’s response time varies by available capacity in that region.
- Consider @cf/baai/bge-small-en-v1.5 (smaller models = faster, more available).
- Use AI Gateway caching aggressively to avoid hitting AI for repeated queries.
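Local Miniflare doesn’t simulate Workers AI (no GPU), so inference always runs on Cloudflare’s infrastructure, even from a dev session. If your Wrangler version doesn’t proxy the AI binding from plain wrangler dev, run wrangler dev --remote. Either way, requests made during development count toward your usage.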
Common Mistake: Benchmarking in dev mode and assuming production latency. Always benchmark with wrangler dev --remote or in a deploy preview.
Still Not Working?
A few less-obvious failures:
- Model is currently overloaded. GPU capacity at the region is exhausted. Retry with backoff or fall back to a smaller model.
- Empty embedding result. Input was empty or only whitespace. Validate before calling.
- AI is not defined despite binding. The wrangler.toml change didn’t apply; re-run wrangler deploy. Check that AI is also typed in your Env interface.
- Streaming first chunk takes seconds. Cold start of the GPU pipeline. Subsequent streams are fast. Use AI Gateway to cache common patterns.
- Vectorize query returns matches with wrong text. Metadata wasn’t included in the upsert. Use metadata: { ... } when inserting; query with returnMetadata: true.
- Function call response missing tool_calls. Model didn’t decide to call a tool. Make the prompt more directive (“Call the weather function for…”).
- Invalid model: .... Typo in model ID or model deprecated. Check the current model catalog in the Cloudflare dashboard.
- Embeddings cache stale. AI Gateway cache key is by full prompt + parameters. Subtle changes (extra space, different temperature) invalidate the cache.
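The "retry with backoff" advice for overloaded models can be sketched as below. This is a generic exponential backoff wrapper, not a Cloudflare API; the function names and default values are hypothetical. As written it retries on any error, whereas in practice you would likely retry only on errors that indicate overload.

```typescript
// Delay schedule: baseMs, 2*baseMs, 4*baseMs, ... capped at maxMs.
function backoffDelays(retries: number, baseMs: number, maxMs: number): number[] {
  return Array.from({ length: retries }, (_, i) => Math.min(maxMs, baseMs * 2 ** i));
}

// Hypothetical wrapper: runs fn, retrying with exponential backoff on failure.
async function withRetry<T>(
  fn: () => Promise<T>,
  retries = 3,
  baseMs = 250,
  maxMs = 4000,
): Promise<T> {
  const delays = backoffDelays(retries, baseMs, maxMs);
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= delays.length) throw err; // retries exhausted: surface the last error
      await new Promise((resolve) => setTimeout(resolve, delays[attempt]));
    }
  }
}
```

Usage might look like withRetry(() => env.AI.run(model, input)); a fallback to a smaller model can be added in a catch around that call.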
For related Cloudflare and AI/LLM issues, see Cloudflare D1 not working, Cloudflare R2 not working, Cloudflare Durable Objects not working, and LiteLLM not working.
Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.