Fix: Cloudflare Workers AI Not Working — AI Binding, Model IDs, Streaming, and Vectorize Integration

How to fix Cloudflare Workers AI errors — env.AI binding setup, model ID format, text-generation streaming with ReadableStream, AI Gateway, Vectorize embeddings, region availability, and Neuron-based pricing.

The Error

You try to call env.AI and it’s undefined:

const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
  messages: [{ role: "user", content: "Hello" }],
});
// TypeError: Cannot read properties of undefined (reading 'run')

Or the model ID errors:

AI request failed: Model not found

Or streaming returns the entire response in one chunk:

const stream = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
  messages: [...],
  stream: true,
});
// Iterating reveals only one chunk with the full response.

Or Vectorize queries return empty results despite having indexed data:

const matches = await env.VECTORIZE.query(vector, { topK: 10 });
console.log(matches.matches);  // []

Why This Happens

Cloudflare Workers AI is a Worker binding (env.AI) that runs hosted models (LLMs, embedding models, image and speech models) on Cloudflare’s GPU-enabled data centers. Around it sit two auxiliary services: AI Gateway (caching, rate limiting, and observability for any LLM provider) and Vectorize (Cloudflare’s vector database for embeddings). The wins are real: inference at the edge, no key management for first-party models, and a Worker-friendly streaming API. But each of these surfaces has its own configuration that doesn’t transfer from OpenAI or Anthropic mental models.

The most common failures map to four categories. The AI binding requires explicit declaration in wrangler.toml: without [ai] binding = "AI", env.AI is undefined, and your code crashes on the first method call. Model IDs use Cloudflare’s own @cf/<vendor>/<model> format, which is not interchangeable with other providers’ names such as OpenAI’s gpt-4o or Anthropic’s claude-3-sonnet; passing one of those returns “Model not found.” Streaming returns a ReadableStream<Uint8Array> of SSE-formatted events, not an async iterable of message objects, so code that expects parsed messages from a for await loop gets raw bytes, often a single chunk containing the whole response. And Vectorize is a separate binding from Workers AI; you generate embeddings via env.AI.run(...) and store them via env.VECTORIZE.upsert(...), and both need to be bound.

The deeper traps come from how Workers AI scales. Models are not available in every Cloudflare data center — Cloudflare routes your request to the nearest GPU region, which can add 50-300ms of routing latency. “Free tier” is metered in Neurons rather than tokens, and the daily quota is small (~10K Neurons) — easy to exhaust without realizing. Account-level model permissions can override Worker-level bindings, so a request that works for the account owner may fail under a service token with restricted scopes. The fixes below cover each of these in turn.

Diagnostic Timeline

Trace a “my Worker AI streaming endpoint returns one giant chunk and then closes” failure.

Minute 0 — first suspicion: check the API key. With OpenAI mental models you’d verify OPENAI_API_KEY next. Workers AI doesn’t use a per-request API key — the binding handles auth automatically through your account. There’s no key to check.

Minute 3 — first evidence: log env.AI. Add console.log(typeof env.AI). If it logs undefined, the binding wasn’t declared in wrangler.toml or you forgot to redeploy after editing it. If it logs object, the binding exists and the failure is downstream.

Minute 6 — next check: model availability per region. Open the Cloudflare dashboard → Workers AI → Models. Verify the model ID you’re passing is listed and not marked deprecated. Then check the “Regional availability” column. If your request is routing through a region that doesn’t host that model, Cloudflare may queue or fail it — even for an account that has access elsewhere. Switch to a model marked “globally available” to confirm the region is the issue.

Minute 9 — discriminating evidence: account-level scope. A team account often has model access scoped per-user. Open Account → API Tokens → check that the token (or the Worker binding) has Workers AI: Read and the specific model class permission. A scope mismatch returns a 403 that the runtime sometimes serializes as “Model not found.”

Minute 12 — actual root cause: free tier Neuron quota exhausted. Open Workers AI → Usage. The daily Neuron quota has hit its cap. New requests succeed but get a degraded “single chunk” response — the streaming pipeline returns the cached or queued aggregate instead of token-by-token output. Wait for the daily reset, upgrade to a paid plan, or move to AI Gateway with prompt caching to amortize cost. Streaming returns to normal once you have headroom.

Fix 1: Bind Workers AI in `wrangler.toml`

name = "my-worker"
main = "src/worker.ts"
compatibility_date = "2026-05-01"

[ai]
binding = "AI"

That’s it — no model lists, no per-region config. The binding is a single object that supports all Workers AI models.

Type generation:

wrangler types

Generates worker-configuration.d.ts with the Ai type for the binding.

Basic call:

export interface Env {
  AI: Ai;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
      messages: [
        { role: "system", content: "You are a helpful assistant." },
        { role: "user", content: "What's the capital of France?" },
      ],
    });
    return Response.json(response);
  },
};
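
If you want a clearer failure than "Cannot read properties of undefined", you can check for the binding before calling it. The following is a minimal sketch, not part of any Cloudflare API: hasAiBinding, describeMissingBinding, and the AiLike interface are hypothetical names used only for illustration (the real Ai type comes from wrangler types).

```typescript
// Minimal structural stand-in for the binding (assumption for this sketch);
// in a real project, use the `Ai` type generated by `wrangler types`.
interface AiLike {
  run(model: string, inputs: unknown): Promise<unknown>;
}

// Hypothetical helper: true only when env carries an object with a run() method under `AI`.
function hasAiBinding(env: unknown): env is { AI: AiLike } {
  if (typeof env !== "object" || env === null) return false;
  const candidate = (env as { AI?: unknown }).AI;
  return (
    typeof candidate === "object" &&
    candidate !== null &&
    typeof (candidate as { run?: unknown }).run === "function"
  );
}

// Hypothetical helper: a readable message to log or return as a 500 body.
function describeMissingBinding(bindingName: string): string {
  return `Missing "${bindingName}" binding: add [ai] binding = "${bindingName}" to wrangler.toml and redeploy.`;
}
```

In a fetch handler, you would call hasAiBinding(env) first and return a 500 response with describeMissingBinding("AI") when it is false.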

The response shape depends on the model. For text generation:

type TextGenerationResponse = {
  response: string;  // The generated text
};

For embeddings:

type EmbeddingResponse = {
  shape: number[];
  data: number[][];  // Array of embedding vectors
};

Pro Tip: Pin model IDs in a constants file. Cloudflare’s model catalog evolves; new versions of “the same model” get new IDs. Pinning avoids surprises.

Fix 2: Use the Right Model

Workers AI categorizes models by task:

Text generation (chat):

@cf/meta/llama-3.1-8b-instruct — general-purpose chat
@cf/meta/llama-3.1-70b-instruct — larger, slower
@cf/mistral/mistral-7b-instruct-v0.1 — Mistral
@cf/qwen/qwen1.5-14b-chat-awq — Qwen

Text embeddings:

@cf/baai/bge-small-en-v1.5 — fast, smaller dims
@cf/baai/bge-base-en-v1.5 — balanced
@cf/baai/bge-large-en-v1.5 — higher quality

Image generation:

@cf/stabilityai/stable-diffusion-xl-base-1.0
@cf/runwayml/stable-diffusion-v1-5-inpainting

Image-to-text:

@cf/llava-hf/llava-1.5-7b-hf
@cf/unum/uform-gen2-qwen-500m

Speech-to-text:

@cf/openai/whisper

To list available models programmatically:

curl "https://api.cloudflare.com/client/v4/accounts/$ACCOUNT_ID/ai/models/search" \
  -H "Authorization: Bearer $API_TOKEN"

Common Mistake: Using another provider’s model IDs (OpenAI’s gpt-4o, Anthropic’s claude-3-sonnet). Workers AI runs Cloudflare-hosted models with their own IDs (@cf/meta/..., @cf/baai/...). Use the AI Gateway or the provider’s own API for OpenAI/Anthropic models.
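
One way to catch this early is to keep the pinned IDs in one place and check their shape at startup. This is a sketch under assumptions: MODELS and isWorkersAiModelId are hypothetical names, and the regular expression only checks the @cf/<vendor>/<model> shape (it also accepts the @hf/ prefix that some catalog entries use). It does not verify that the model actually exists in the catalog.

```typescript
// Pinned model IDs (the ones used in this article), kept in one place.
const MODELS = {
  chat: "@cf/meta/llama-3.1-8b-instruct",
  embedding: "@cf/baai/bge-base-en-v1.5",
} as const;

// Assumed shape: "@cf/<vendor>/<model>" or "@hf/<vendor>/<model>".
const WORKERS_AI_MODEL_ID = /^@(cf|hf)\/[a-z0-9._-]+\/[a-z0-9._-]+$/i;

// Hypothetical helper: a shape check only, not a catalog lookup.
function isWorkersAiModelId(id: string): boolean {
  return WORKERS_AI_MODEL_ID.test(id);
}

// Collect any pinned ID that does not look like a Workers AI ID.
const badIds = Object.values(MODELS).filter((id) => !isWorkersAiModelId(id));
```

An ID like gpt-4o fails the check immediately, before any request is sent.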

Fix 3: Stream Responses Correctly

Streaming returns a ReadableStream<Uint8Array> with SSE-formatted events:

const stream = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
  messages: [...],
  stream: true,
});

// stream is a ReadableStream<Uint8Array>, NOT an iterable of objects.

return new Response(stream, {
  headers: { "content-type": "text/event-stream" },
});

For the simplest case, pipe the stream directly to the client — it’s already SSE-formatted.

To parse on the server (e.g. accumulate or transform):

const stream = await env.AI.run(model, { messages, stream: true });

const reader = stream.getReader();
const decoder = new TextDecoder();
let buffer = "";
let fullText = "";

readLoop: while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  // { stream: true } keeps multi-byte characters intact across chunk boundaries.
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split("\n");
  buffer = lines.pop() || "";  // Keep the trailing partial line for the next read

  for (const line of lines) {
    if (!line.startsWith("data: ")) continue;
    const data = line.slice(6).trim();
    if (data === "[DONE]") break readLoop;  // Exit both loops, not just the for
    try {
      const parsed = JSON.parse(data);
      if (parsed.response) {
        fullText += parsed.response;
        // Use the token, e.g. send to client.
      }
    } catch {
      // Ignore malformed lines
    }
  }
}

For piping to a client with transformation:

const aiStream = await env.AI.run(model, { messages, stream: true });

const transformed = aiStream.pipeThrough(new TransformStream({
  transform(chunk, controller) {
    // Process each chunk if needed.
    controller.enqueue(chunk);
  },
}));

return new Response(transformed, {
  headers: { "content-type": "text/event-stream" },
});

Common Mistake: Treating the stream as an async iterable of message objects. It’s bytes — parse the SSE format yourself or pipe directly.
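
The reason for buffering partial lines is that chunk boundaries are arbitrary: a single SSE event can arrive split across two reads. Here is a small sketch that isolates that logic so you can test it without a Worker. SseTextAccumulator is a hypothetical helper, and it assumes each event is a data: line carrying JSON with a response field, as in the examples above.

```typescript
// Hypothetical helper: accumulates `response` tokens from SSE text that may
// arrive split at arbitrary points. Feed it text already decoded with
// TextDecoder (using { stream: true }).
class SseTextAccumulator {
  private buffer = "";
  text = "";
  done = false;

  push(chunk: string): void {
    if (this.done) return;
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    // The last element is an incomplete line (or ""); keep it for the next push.
    this.buffer = lines.pop() ?? "";
    for (const line of lines) {
      if (!line.startsWith("data:")) continue;
      const data = line.slice(5).trim();
      if (data === "[DONE]") {
        this.done = true;
        return;
      }
      try {
        const parsed: unknown = JSON.parse(data);
        if (typeof parsed === "object" && parsed !== null) {
          const token = (parsed as { response?: unknown }).response;
          if (typeof token === "string") this.text += token;
        }
      } catch {
        // Skip lines that are not valid JSON.
      }
    }
  }
}

// Events deliberately split mid-JSON to mimic real chunk boundaries.
const acc = new SseTextAccumulator();
acc.push('data: {"response":"Hel');
acc.push('lo"}\n\ndata: {"respon');
acc.push('se":" world"}\n\ndata: [DONE]\n\n');
console.log(acc.text); // "Hello world"
```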

Fix 4: AI Gateway for Caching and Observability

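AI Gateway sits between your Worker and any LLM provider (OpenAI, Anthropic, Workers AI). It adds caching, rate limiting, and observability. The binding in wrangler.toml stays the same; you pass the gateway as a third argument to env.AI.run: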

# wrangler.toml
[ai]
binding = "AI"

const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
  messages: [...],
}, {
  gateway: {
    id: "my-ai-gateway",         // Created in CF dashboard → AI Gateway
    skipCache: false,             // Use the cache if available
    cacheTtl: 3600,               // Cache for 1 hour
  },
});

Without a gateway, Workers AI calls go direct. With one, identical prompts get cached responses — huge cost savings for repeated queries.

For non-Workers-AI providers via the gateway:

const response = await fetch(
  `https://gateway.ai.cloudflare.com/v1/${accountId}/my-gateway/openai/chat/completions`,
  {
    method: "POST",
    headers: {
      "content-type": "application/json",
      "authorization": `Bearer ${env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({ model: "gpt-4o-mini", messages: [...] }),
  },
);

The gateway URL replaces api.openai.com — same OpenAI API shape, but caching/logging through Cloudflare.
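
If you call several providers through the gateway, a tiny URL builder avoids hand-assembled strings. This is a sketch: gatewayUrl is a hypothetical helper, and the layout simply follows the example URL above (account ID, gateway ID, provider, then the provider's own path).

```typescript
// Hypothetical helper. Layout assumed from the example above:
// https://gateway.ai.cloudflare.com/v1/<accountId>/<gatewayId>/<provider>/<path>
function gatewayUrl(
  accountId: string,
  gatewayId: string,
  provider: string,
  path: string,
): string {
  const segments = [accountId, gatewayId, provider].map((s) => encodeURIComponent(s));
  const cleanPath = path.replace(/^\/+/, ""); // tolerate a leading slash
  return `https://gateway.ai.cloudflare.com/v1/${segments.join("/")}/${cleanPath}`;
}
```

For example, gatewayUrl(accountId, "my-gateway", "openai", "chat/completions") reproduces the URL used in the fetch call above.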

Pro Tip: AI Gateway is free for basic use (cache, retries, analytics). The infrastructure cost is just the cache storage. For apps with overlapping prompts (chatbots with FAQs), the gateway pays for itself.

Fix 5: Embeddings + Vectorize for RAG

Vectorize is Cloudflare’s vector database. Combine it with Workers AI for end-to-end RAG. Both bindings go in wrangler.toml:

[ai]
binding = "AI"

[[vectorize]]
binding = "VECTORIZE"
index_name = "my-knowledge-base"

Create the index:

wrangler vectorize create my-knowledge-base \
  --dimensions=768 \
  --metric=cosine

dimensions must match your embedding model output (bge-small-en-v1.5 = 384, bge-base-en-v1.5 = 768, bge-large-en-v1.5 = 1024). The examples below use bge-base-en-v1.5, so the index is created with 768.

Insert documents:

async function indexDocument(env: Env, doc: { id: string; text: string }) {
  // Generate embedding:
  const embeddingResp = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
    text: [doc.text],
  });
  
  // Insert into Vectorize:
  await env.VECTORIZE.upsert([
    {
      id: doc.id,
      values: embeddingResp.data[0],
      metadata: { text: doc.text },
    },
  ]);
}

Query for similar documents:

async function search(env: Env, query: string) {
  const queryEmbedding = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
    text: [query],
  });
  
  const results = await env.VECTORIZE.query(queryEmbedding.data[0], {
    topK: 5,
    returnMetadata: true,
  });
  
  return results.matches.map((m) => ({
    score: m.score,
    text: m.metadata?.text,
  }));
}

For RAG (retrieval + generation):

async function answer(env: Env, question: string) {
  const contexts = await search(env, question);
  const prompt = contexts.map((c) => c.text).join("\n\n");
  
  const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
    messages: [
      { role: "system", content: `Answer based on this context:\n${prompt}` },
      { role: "user", content: question },
    ],
  });
  
  return response.response;
}
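
One detail worth guarding in the RAG step: a match without metadata yields undefined text, which would otherwise end up in the prompt as the literal string "undefined". Here is a sketch of a context builder that skips such entries and caps the total size. buildContext, RetrievedContext, and the 4000-character default are assumptions for illustration, not Cloudflare limits.

```typescript
// Shape of the items returned by the search() function above (text may be missing).
interface RetrievedContext {
  score: number;
  text?: unknown;
}

// Hypothetical helper: joins usable context snippets in the order given,
// stopping before the total length would exceed maxChars.
function buildContext(contexts: readonly RetrievedContext[], maxChars = 4000): string {
  const parts: string[] = [];
  let used = 0;
  for (const c of contexts) {
    if (typeof c.text !== "string" || c.text.trim() === "") continue;
    const cost = c.text.length + (parts.length > 0 ? 2 : 0); // 2 = "\n\n" separator
    if (used + cost > maxChars) break;
    parts.push(c.text);
    used += cost;
  }
  return parts.join("\n\n");
}
```

In answer(), you would use buildContext(contexts) in place of the map/join line.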

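Common Mistake: Mismatched dimensions between the model and Vectorize index. If you try to insert 384-dim vectors into a 768-dim index, expect the upsert to fail with a dimension error. An index’s dimensions are fixed when it is created, so switching embedding models means creating a new index and re-embedding your documents. Always match the model’s output dimensions.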
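
A cheap pre-flight check makes this mistake obvious before any vector is sent. The sketch below uses the dimension figures listed earlier in this article; checkDimensions and EMBEDDING_DIMENSIONS are hypothetical names, and the index dimension is passed in by hand because this sketch does not query Vectorize for it.

```typescript
// Output dimensions as listed in this article.
const EMBEDDING_DIMENSIONS: Record<string, number> = {
  "@cf/baai/bge-small-en-v1.5": 384,
  "@cf/baai/bge-base-en-v1.5": 768,
  "@cf/baai/bge-large-en-v1.5": 1024,
};

// Hypothetical helper: returns an error message, or null when everything lines up.
function checkDimensions(
  model: string,
  indexDimensions: number,
  vector: readonly number[],
): string | null {
  const expected = EMBEDDING_DIMENSIONS[model];
  if (expected === undefined) return `Unknown embedding model: ${model}`;
  if (expected !== indexDimensions) {
    return `${model} outputs ${expected} dims but the index expects ${indexDimensions}`;
  }
  if (vector.length !== indexDimensions) {
    return `Vector has ${vector.length} dims but the index expects ${indexDimensions}`;
  }
  return null;
}
```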

Fix 6: Cost — Neurons

Workers AI is priced in Neurons (Cloudflare’s compute unit). Each model has its own Neuron rate, and the cost of a request scales with the work done (for text models, the input and output tokens). Free tier includes a daily quota (~10K Neurons/day at the time of writing).
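
As a rough worked example of the arithmetic, here is a sketch of a daily cost estimate. The free allocation is the approximate figure quoted above; the price per 1,000 Neurons is an assumption that you should check against Cloudflare's current pricing page before relying on it.

```typescript
const FREE_NEURONS_PER_DAY = 10_000; // approximate free allocation quoted above
const USD_PER_1000_NEURONS = 0.011; // ASSUMPTION: verify against current pricing

// Hypothetical helper: only Neurons beyond the daily free allocation are billed.
function estimateDailyCostUsd(neuronsUsed: number): number {
  const billable = Math.max(0, neuronsUsed - FREE_NEURONS_PER_DAY);
  return (billable / 1000) * USD_PER_1000_NEURONS;
}
```

Under these assumptions, a day that uses 110,000 Neurons bills 100,000 of them, which is 100 × 0.011 = about 1.10 USD.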

To monitor:

Dashboard → Workers AI → Usage.

Strategies to control cost:

Cache via AI Gateway. Identical prompts return cached responses (~0 cost).
Pick smaller models. llama-3.1-8b is much cheaper than llama-3.1-70b per request.
Batch embedding requests. Embeddings models accept arrays of text:

const response = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
  text: ["text 1", "text 2", "text 3"],  // Batched
});
// response.data: number[][] — one embedding per input

One call, multiple embeddings: fewer round trips and less per-request overhead than N separate calls. The Neuron cost still depends on how much text you embed.
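
For larger document sets you will likely want to split the input into fixed-size batches. The following is a sketch: chunk and embedAll are hypothetical helpers, EmbeddingRunner is a minimal stand-in for the env.AI binding, and the default batch size of 50 is an assumption (check the model's documented input limits).

```typescript
// Split a list into consecutive batches of at most `size` items.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Minimal stand-in for the parts of env.AI this sketch uses (assumption).
interface EmbeddingRunner {
  run(model: string, inputs: { text: string[] }): Promise<{ data: number[][] }>;
}

// Hypothetical helper: embeds all texts batch by batch, preserving input order.
async function embedAll(
  ai: EmbeddingRunner,
  model: string,
  texts: readonly string[],
  batchSize = 50, // ASSUMPTION: tune to the model's documented limits
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of chunk(texts, batchSize)) {
    const result = await ai.run(model, { text: batch });
    vectors.push(...result.data);
  }
  return vectors;
}
```

Batches run sequentially here to keep the sketch simple; whether to parallelize them depends on your rate limits.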

For high-volume production, consider the AI Gateway’s per-key rate limits to prevent runaway costs:

Dashboard → AI Gateway → your gateway → Settings → Rate limiting.

Pro Tip: Set up cost alerts in Cloudflare’s billing settings. Workers AI bills can surprise you on a chatty app — alerts give early warning.

Fix 7: Function Calling and Tool Use

Some Workers AI models support function calling (e.g. Llama 3.1):

const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
  messages: [{ role: "user", content: "What's the weather in Tokyo?" }],
  tools: [
    {
      type: "function",
      function: {
        name: "get_weather",
        description: "Get the current weather for a city",
        parameters: {
          type: "object",
          properties: {
            city: { type: "string" },
          },
          required: ["city"],
        },
      },
    },
  ],
});

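// If the model wants to call a tool:
if (response.tool_calls) {
  for (const call of response.tool_calls) {
    // Assumption: Workers AI returns { name, arguments } with arguments already
    // parsed into an object. Tolerate a JSON string in case a model emits
    // OpenAI-style output instead.
    const args = typeof call.arguments === "string"
      ? JSON.parse(call.arguments)
      : call.arguments;
    // Execute the function named call.name, e.g. fetch a weather API.
    // Then send the result back in the next message.
  }
}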

Function calling support varies by model — Llama 3.1+ has it; older models may not. Check the model’s docs.

Common Mistake: Assuming function calling response format matches OpenAI’s. Workers AI’s format is similar but not identical. Test before assuming compatibility.
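
Because the response shape can differ between models and providers, it can help to normalize tool calls before dispatching them. This sketch accepts two shapes: a flat { name, arguments } entry and an OpenAI-style { function: { name, arguments } } entry where arguments is a JSON string. normalizeToolCalls and parseArgs are hypothetical helpers; which shape a given model actually returns is something to confirm in its docs.

```typescript
interface NormalizedToolCall {
  name: string;
  args: Record<string, unknown>;
}

// Accepts an object or a JSON string; anything else becomes an empty object.
function parseArgs(raw: unknown): Record<string, unknown> {
  let value: unknown = raw;
  if (typeof value === "string") {
    try {
      value = JSON.parse(value);
    } catch {
      return {};
    }
  }
  return typeof value === "object" && value !== null && !Array.isArray(value)
    ? (value as Record<string, unknown>)
    : {};
}

// Hypothetical helper: maps either shape to { name, args }; skips unusable entries.
function normalizeToolCalls(toolCalls: unknown): NormalizedToolCall[] {
  if (!Array.isArray(toolCalls)) return [];
  const out: NormalizedToolCall[] = [];
  for (const call of toolCalls as unknown[]) {
    if (typeof call !== "object" || call === null) continue;
    const c = call as { name?: unknown; arguments?: unknown; function?: unknown };
    const fn =
      typeof c.function === "object" && c.function !== null
        ? (c.function as { name?: unknown; arguments?: unknown })
        : undefined;
    const name = fn ? fn.name : c.name;
    if (typeof name !== "string") continue;
    out.push({ name, args: parseArgs(fn ? fn.arguments : c.arguments) });
  }
  return out;
}
```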

Fix 8: Region Availability and Latency

Workers AI runs in select Cloudflare data centers with GPU hardware. Not every region has every model. Cloudflare routes requests to the nearest available data center automatically.

For latency-sensitive apps:

Test from your users’ regions. Workers AI’s response time varies by available capacity in that region.
Consider @cf/baai/bge-small-en-v1.5 (smaller models = faster, more available).
Use AI Gateway caching aggressively to avoid hitting AI for repeated queries.

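For local development, keep in mind that Miniflare doesn’t simulate Workers AI (no GPU): inference always runs on Cloudflare’s infrastructure. Current Wrangler versions forward env.AI calls from wrangler dev to your account, so they need a logged-in session and count toward usage. If the binding is unavailable in your local session, run wrangler dev --remote.

Common Mistake: Benchmarking in dev mode and assuming production latency. Local calls add a round trip from your machine to Cloudflare. Always benchmark with wrangler dev --remote or in a deploy preview.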
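
When you do benchmark, single measurements are noisy, so collect several samples and look at percentiles. The following is a sketch with hypothetical helper names (timeCall, percentile); it uses the nearest-rank method, which is one common definition among several.

```typescript
// Hypothetical wrapper: times one async call, such as () => env.AI.run(...).
async function timeCall<T>(call: () => Promise<T>): Promise<{ result: T; ms: number }> {
  const start = Date.now();
  const result = await call();
  return { result, ms: Date.now() - start };
}

// Nearest-rank percentile over latency samples in milliseconds.
function percentile(samplesMs: readonly number[], p: number): number {
  if (samplesMs.length === 0) throw new RangeError("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1] as number;
}
```

Comparing the median (p = 50) with p = 95 from each user region shows whether slow responses are the norm or the tail.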

Still Not Working?

A few less-obvious failures:

Model is currently overloaded. GPU capacity at the region is exhausted. Retry with backoff or fall back to a smaller model.
Empty embedding result. Input was empty or only whitespace. Validate before calling.
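AI is not defined despite binding. The wrangler.toml change didn’t apply: re-run wrangler deploy (or restart wrangler dev). Check that the binding name in wrangler.toml matches the property you read (env.AI), and that AI is declared in your Env interface so TypeScript catches a misspelled name.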
Streaming first chunk takes seconds. Cold start of the GPU pipeline. Subsequent streams are fast. Use AI Gateway to cache common patterns.
Vectorize query returns matches with wrong text. Metadata wasn’t included in the upsert. Use metadata: { ... } when inserting; query with returnMetadata: true.
Function call response missing tool_calls. Model didn’t decide to call a tool. Make the prompt more directive (“Call the weather function for…”).
Invalid model: .... Typo in model ID or model deprecated. Check the current model catalog in the Cloudflare dashboard.
Embeddings cache stale. AI Gateway cache key is by full prompt + parameters. Subtle changes (extra space, different temperature) invalidate the cache.
Account-level vs Worker-level scope mismatch. A user-scoped Workers AI permission can let dashboard tests pass while a Worker binding using a more restricted service token fails with Model not found. Recreate the binding with the API token from the dashboard, or grant the service token the matching scope.
Free tier Neuron quota silently caps streaming. When the daily quota is exhausted, requests don’t fail outright — they return a degraded single-chunk response that looks like a bug in your client parsing. Watch the Usage tab and upgrade or add AI Gateway caching before the next reset.
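wrangler dev local mode behaves differently from production. Miniflare cannot simulate GPU inference, so env.AI calls from a local session are forwarded to Cloudflare; they need a logged-in Wrangler session and are billed like any other request. If the binding is unavailable locally, use wrangler dev --remote (or just wrangler deploy to a preview) for any real testing of env.AI behavior.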
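
For the overloaded-model case in the list above, retry with backoff plus a fallback model can be sketched as follows. runWithFallback and backoffDelays are hypothetical helpers, the delay values are arbitrary defaults, and this sketch retries on any error; in real code you would probably retry only on capacity errors.

```typescript
// Exponential backoff delays in ms: base, 2x base, 4x base, ... capped at capMs.
function backoffDelays(attempts: number, baseMs = 200, capMs = 5000): number[] {
  return Array.from({ length: attempts }, (_, i) => Math.min(capMs, baseMs * 2 ** i));
}

// Hypothetical helper: tries each model in order, retrying with backoff before
// moving on to the next one. `call` wraps your own env.AI.run(...) invocation.
async function runWithFallback<T>(
  models: readonly string[],
  call: (model: string) => Promise<T>,
  retriesPerModel = 2,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<{ model: string; result: T }> {
  let lastError: unknown = new Error("no models given");
  for (const model of models) {
    const delays = backoffDelays(retriesPerModel);
    for (let attempt = 0; attempt <= retriesPerModel; attempt++) {
      try {
        return { model, result: await call(model) };
      } catch (err) {
        lastError = err;
        const delay = delays[attempt];
        // No delay after the final attempt; fall through to the next model.
        if (delay !== undefined) await sleep(delay);
      }
    }
  }
  throw lastError;
}
```

A typical call would pass the larger model first and a smaller one second, with call set to (model) => env.AI.run(model, { messages }).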

For related Cloudflare and AI/LLM issues, see Cloudflare D1 not working, Cloudflare R2 not working, Cloudflare Durable Objects not working, and LiteLLM not working.

Related Articles

Fix: AWS Bedrock Not Working — Model Access, IAM, Converse API, Streaming, and Cross-Region

Fix: Cloudflare Durable Objects Not Working — ID Strategy, Storage API, WebSocket Hibernation, Alarms

Fix: Cloudflare Pages Not Working — Build Output, Functions Routing, _redirects, and Bindings

Fix: Cloudflare Queues Not Working — Producer Binding, Consumer Worker, Batching, and Dead Letter
