Fix: Cloudflare Workers AI Not Working — AI Binding, Model IDs, Streaming, and Vectorize Integration

FixDevs

Quick Answer

How to fix Cloudflare Workers AI errors — env.AI binding setup, model ID format, text-generation streaming with ReadableStream, AI Gateway, Vectorize embeddings, region availability, and Neuron-based pricing.

The Error

You try to call env.AI and it’s undefined:

const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
  messages: [{ role: "user", content: "Hello" }],
});
// TypeError: Cannot read properties of undefined (reading 'run')

Or the model ID errors:

AI request failed: Model not found

Or streaming returns the entire response in one chunk:

const stream = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
  messages: [...],
  stream: true,
});
// Iterating reveals only one chunk with the full response.

Or Vectorize queries return empty results despite having indexed data:

const matches = await env.VECTORIZE.query(vector, { topK: 10 });
console.log(matches.matches);  // []

Why This Happens

Cloudflare Workers AI is a Worker binding (env.AI) that runs hosted models on Cloudflare’s GPU infrastructure. It is usually paired with two auxiliary services: AI Gateway (caching and observability) and Vectorize (a vector database). Most issues come down to one of the following:

  • AI binding requires explicit declaration. Without [ai] in wrangler.toml, env.AI is undefined.
  • Model IDs use a @cf/... prefix. Different from OpenAI/Anthropic format. Models come and go — pin to specific IDs you’ve tested.
  • Streaming returns a ReadableStream<Uint8Array> with SSE-formatted events. You need to parse the SSE wire format, not just iterate.
  • Vectorize is separate from Workers AI. Workers AI generates embeddings; Vectorize stores and queries them. Both bindings needed.

Fix 1: Bind Workers AI in wrangler.toml

name = "my-worker"
main = "src/worker.ts"
compatibility_date = "2026-05-01"

[ai]
binding = "AI"

That’s it — no model lists, no per-region config. The binding is a single object that supports all Workers AI models.
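When the binding is missing, the failure shows up as an opaque TypeError somewhere inside the handler. A minimal sketch of a guard that fails early with a clearer message follows. The helper name requireAiBinding and the AiLike interface are illustrative assumptions, not part of the Workers runtime:

```typescript
// Minimal structural type for illustration only; in a real Worker, use the
// `Ai` type generated by `wrangler types` instead.
interface AiLike {
  run(model: string, inputs: unknown): Promise<unknown>;
}

// Hypothetical helper: returns the binding, or throws a descriptive error
// when env.AI is missing or does not look like the AI binding.
function requireAiBinding(env: { AI?: unknown }): AiLike {
  const ai = env.AI as Partial<AiLike> | undefined;
  if (!ai || typeof ai.run !== "function") {
    throw new Error(
      'env.AI is not bound. Add an [ai] section with binding = "AI" to wrangler.toml and redeploy.',
    );
  }
  return ai as AiLike;
}
```

Calling requireAiBinding(env) at the top of fetch turns a confusing runtime error into one that names the missing configuration.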

Type generation:

wrangler types

Generates worker-configuration.d.ts with the Ai type for the binding.

Basic call:

export interface Env {
  AI: Ai;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
      messages: [
        { role: "system", content: "You are a helpful assistant." },
        { role: "user", content: "What's the capital of France?" },
      ],
    });
    return Response.json(response);
  },
};

The response shape depends on the model. For text generation:

type TextGenerationResponse = {
  response: string;  // The generated text
};

For embeddings:

type EmbeddingResponse = {
  shape: number[];
  data: number[][];  // Array of embedding vectors
};

Pro Tip: Pin model IDs in a constants file. Cloudflare’s model catalog evolves; new versions of “the same model” get new IDs. Pinning avoids surprises.
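One possible shape for such a constants file is sketched below. The object name MODELS, the type PinnedModel, and the helper isPinnedModel are assumptions made for illustration; the two IDs are simply the ones used elsewhere in this article:

```typescript
// models.ts: single source of truth for the model IDs this Worker uses.
// Replace these with whichever models you have actually tested.
const MODELS = {
  chat: "@cf/meta/llama-3.1-8b-instruct",
  embedding: "@cf/baai/bge-base-en-v1.5",
} as const;

// Union of the pinned IDs, so a typo at a call site fails type-checking.
type PinnedModel = (typeof MODELS)[keyof typeof MODELS];

// Hypothetical helper: narrows an arbitrary string (for example, a value read
// from configuration) to one of the pinned IDs.
function isPinnedModel(id: string): id is PinnedModel {
  return (Object.values(MODELS) as string[]).includes(id);
}
```

Call sites then use MODELS.chat or MODELS.embedding, and upgrading a model becomes a one-line change that is easy to review.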

Fix 2: Use the Right Model

Workers AI categorizes models by task:

Text generation (chat):

  • @cf/meta/llama-3.1-8b-instruct — general-purpose chat
  • @cf/meta/llama-3.1-70b-instruct — larger, slower
  • @cf/mistral/mistral-7b-instruct-v0.1 — Mistral
  • @cf/qwen/qwen1.5-14b-chat-awq — Qwen

Text embeddings:

  • @cf/baai/bge-small-en-v1.5 — fast, smaller dims
  • @cf/baai/bge-base-en-v1.5 — balanced
  • @cf/baai/bge-large-en-v1.5 — higher quality

Image generation:

  • @cf/stabilityai/stable-diffusion-xl-base-1.0
  • @cf/runwayml/stable-diffusion-v1-5-inpainting

Image-to-text:

  • @cf/llava-hf/llava-1.5-7b-hf
  • @cf/unum/uform-gen2-qwen-500m

Speech-to-text:

  • @cf/openai/whisper

To list available models programmatically:

curl "https://api.cloudflare.com/client/v4/accounts/$ACCOUNT_ID/ai/models/search" \
  -H "Authorization: Bearer $API_TOKEN"

Common Mistake: Using OpenAI-style model IDs (gpt-4o, claude-3-sonnet). Workers AI runs Cloudflare-hosted models with their own IDs (@cf/meta/..., @cf/baai/...). Use the AI Gateway or direct API for OpenAI/Anthropic models.
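A cheap way to catch this mistake before making a request is a format check on the ID. The sketch below is an assumption about what a reasonable pattern looks like, not an official validation rule; it also accepts the @hf/ prefix, which some catalog entries use:

```typescript
// Shape check only: "@cf/<vendor>/<model>" (or "@hf/<vendor>/<model>").
// Passing this check does not mean the model exists or is still available.
const WORKERS_AI_ID = /^@(cf|hf)\/[A-Za-z0-9._-]+\/[A-Za-z0-9._-]+$/;

// Hypothetical helper: true when the string is shaped like a Workers AI model ID.
function looksLikeWorkersAiModelId(id: string): boolean {
  return WORKERS_AI_ID.test(id);
}
```

This rejects IDs such as gpt-4o immediately, while a well-formed but retired ID still has to be caught by the API’s own error.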

Fix 3: Stream Responses Correctly

Streaming returns a ReadableStream<Uint8Array> with SSE-formatted events:

const stream = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
  messages: [...],
  stream: true,
});

// stream is a ReadableStream<Uint8Array>, NOT an iterable of objects.

return new Response(stream, {
  headers: { "content-type": "text/event-stream" },
});

For the simplest case, pipe the stream directly to the client — it’s already SSE-formatted.

To parse on the server (e.g. accumulate or transform):

const stream = await env.AI.run(model, { messages, stream: true });

const reader = stream.getReader();
const decoder = new TextDecoder();
let buffer = "";
let fullText = "";

readLoop: while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split("\n");
  buffer = lines.pop() || "";  // Keep partial line
  
  for (const line of lines) {
    if (line.startsWith("data: ")) {
      const data = line.slice(6).trim();
      if (data === "[DONE]") break readLoop;  // End of stream: exit both loops
      try {
        const parsed = JSON.parse(data);
        if (parsed.response) {
          fullText += parsed.response;
          // Use the token, e.g. send to client.
        }
      } catch {
        // Ignore malformed lines
      }
    }
  }
}
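The line-splitting logic in the loop above can also be pulled out into a pure function, which makes it easy to unit test without a real stream. This is a sketch under the same assumptions as the loop (events arrive as data: lines carrying JSON with a response field, terminated by [DONE]); the names parseSseBuffer and SseParseResult are hypothetical:

```typescript
interface SseParseResult {
  tokens: string[]; // text fragments found in complete lines
  rest: string;     // trailing partial line to prepend to the next chunk
  done: boolean;    // true once the [DONE] sentinel has been seen
}

function parseSseBuffer(buffer: string): SseParseResult {
  const lines = buffer.split("\n");
  const rest = lines.pop() ?? ""; // the last element may be an incomplete line
  const tokens: string[] = [];
  let done = false;

  for (const line of lines) {
    if (!line.startsWith("data: ")) continue; // skip blank separators and other fields
    const data = line.slice(6).trim();
    if (data === "[DONE]") {
      done = true;
      break;
    }
    try {
      const parsed = JSON.parse(data) as { response?: unknown } | null;
      if (typeof parsed?.response === "string") tokens.push(parsed.response);
    } catch {
      // Ignore malformed lines rather than aborting the whole stream.
    }
  }
  return { tokens, rest, done };
}
```

Inside the read loop, call parseSseBuffer(buffer), append tokens to the output, assign rest back to buffer, and stop reading when done is true.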

For piping to a client with transformation:

const aiStream = await env.AI.run(model, { messages, stream: true });

const transformed = aiStream.pipeThrough(new TransformStream({
  transform(chunk, controller) {
    // Process each chunk if needed.
    controller.enqueue(chunk);
  },
}));

return new Response(transformed, {
  headers: { "content-type": "text/event-stream" },
});

Common Mistake: Treating the stream as an async iterable of message objects. It’s bytes — parse the SSE format yourself or pipe directly.

Fix 4: AI Gateway for Caching and Observability

AI Gateway sits between your Worker and any LLM provider (OpenAI, Anthropic, Workers AI). It adds caching, rate limiting, and observability. The binding itself is declared as before:

# wrangler.toml
[ai]
binding = "AI"

Then pass the gateway options as the third argument to env.AI.run:

const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
  messages: [...],
}, {
  gateway: {
    id: "my-ai-gateway",         // Created in CF dashboard → AI Gateway
    skipCache: false,             // Use the cache if available
    cacheTtl: 3600,               // Cache for 1 hour
  },
});

Without a gateway, Workers AI calls go direct. With one, identical prompts get cached responses — huge cost savings for repeated queries.
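Because caching only helps when requests are identical, it can be worth normalizing user input before it goes into the prompt. A minimal sketch follows; the function name and the specific normalization rules are assumptions, and you should only apply rules that do not change the meaning of the input in your application:

```typescript
// Hypothetical helper: makes trivially different inputs identical so that
// they can share a cache entry.
function normalizeForCache(text: string): string {
  return text
    .normalize("NFC")     // unify equivalent Unicode sequences
    .replace(/\s+/g, " ") // collapse runs of whitespace, including newlines
    .trim();              // drop leading and trailing whitespace
}
```

With this in place, inputs that differ only by an extra space or a trailing newline produce the same prompt text.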

For non-Workers-AI providers via the gateway:

const response = await fetch(
  `https://gateway.ai.cloudflare.com/v1/${accountId}/my-gateway/openai/chat/completions`,
  {
    method: "POST",
    headers: {
      "content-type": "application/json",
      "authorization": `Bearer ${env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({ model: "gpt-4o-mini", messages: [...] }),
  },
);

The gateway URL replaces api.openai.com — same OpenAI API shape, but caching/logging through Cloudflare.
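To avoid hand-assembling that URL in several places, a small builder can help. The sketch below only reproduces the URL pattern shown above; the function name gatewayUrl and its parameter names are assumptions:

```typescript
// Builds "https://gateway.ai.cloudflare.com/v1/<account>/<gateway>/<provider>/<path>".
function gatewayUrl(
  accountId: string,
  gatewayId: string,
  provider: string,
  path: string,
): string {
  // Encode each identifier so unexpected characters cannot alter the URL structure.
  const segments = [accountId, gatewayId, provider].map((s) => encodeURIComponent(s));
  const cleanPath = path.replace(/^\/+/, ""); // tolerate a leading slash
  return `https://gateway.ai.cloudflare.com/v1/${segments.join("/")}/${cleanPath}`;
}
```

For example, gatewayUrl(accountId, "my-gateway", "openai", "chat/completions") yields the URL used in the fetch call above.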

Pro Tip: AI Gateway is free for basic use (cache, retries, analytics). The infrastructure cost is just the cache storage. For apps with overlapping prompts (chatbots with FAQs), the gateway pays for itself.

Fix 5: Embeddings + Vectorize for RAG

Vectorize is Cloudflare’s vector database. Combine with Workers AI for end-to-end RAG:

[ai]
binding = "AI"

[[vectorize]]
binding = "VECTORIZE"
index_name = "my-knowledge-base"

Create the index:

wrangler vectorize create my-knowledge-base \
  --dimensions=768 \
  --metric=cosine

dimensions must match your embedding model’s output (bge-small-en-v1.5 = 384, bge-base-en-v1.5 = 768, bge-large-en-v1.5 = 1024). The examples below use bge-base-en-v1.5, so the index is created with 768 dimensions.
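The mapping between model and dimension count can be kept in code so that a mismatch is caught before anything is sent to Vectorize. A minimal sketch, using the three dimension values listed above; the names EMBEDDING_DIMENSIONS, expectedDimensions, and assertVectorMatchesIndex are hypothetical:

```typescript
// Output dimensions for the BGE models mentioned in this article.
const EMBEDDING_DIMENSIONS: Record<string, number> = {
  "@cf/baai/bge-small-en-v1.5": 384,
  "@cf/baai/bge-base-en-v1.5": 768,
  "@cf/baai/bge-large-en-v1.5": 1024,
};

// Looks up the dimension count for a model, failing loudly for unknown IDs.
function expectedDimensions(model: string): number {
  const dims = EMBEDDING_DIMENSIONS[model];
  if (dims === undefined) throw new Error(`Unknown embedding model: ${model}`);
  return dims;
}

// Throws if a vector does not have the length the index was created with.
function assertVectorMatchesIndex(vector: number[], indexDimensions: number): void {
  if (vector.length !== indexDimensions) {
    throw new Error(
      `Vector has ${vector.length} dimensions, but the index expects ${indexDimensions}.`,
    );
  }
}
```

Running assertVectorMatchesIndex on each embedding before an upsert or query gives a clear local error instead of a failure from the index.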

Insert documents:

async function indexDocument(env: Env, doc: { id: string; text: string }) {
  // Generate embedding:
  const embeddingResp = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
    text: [doc.text],
  });
  
  // Insert into Vectorize:
  await env.VECTORIZE.upsert([
    {
      id: doc.id,
      values: embeddingResp.data[0],
      metadata: { text: doc.text },
    },
  ]);
}

Query for similar documents:

async function search(env: Env, query: string) {
  const queryEmbedding = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
    text: [query],
  });
  
  const results = await env.VECTORIZE.query(queryEmbedding.data[0], {
    topK: 5,
    returnMetadata: true,
  });
  
  return results.matches.map((m) => ({
    score: m.score,
    text: m.metadata?.text,
  }));
}

For RAG (retrieval + generation):

async function answer(env: Env, question: string) {
  const contexts = await search(env, question);
  const prompt = contexts.map((c) => c.text).join("\n\n");
  
  const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
    messages: [
      { role: "system", content: `Answer based on this context:\n${prompt}` },
      { role: "user", content: question },
    ],
  });
  
  return response.response;
}
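The prompt-assembly step in answer is a good candidate for a separate, testable function, since it is where weak or empty matches tend to leak into the prompt. The sketch below is one way to do it; the function name, the numbering of passages, the wording of the system prompt, and the default minScore of 0.5 are all assumptions to tune for your data:

```typescript
interface RetrievedContext {
  score: number;
  text?: unknown; // metadata values are not guaranteed to be strings
}

interface ChatMessage {
  role: "system" | "user";
  content: string;
}

function buildRagMessages(
  contexts: RetrievedContext[],
  question: string,
  minScore = 0.5, // hypothetical threshold, not a recommended value
): ChatMessage[] {
  // Keep only matches that are similar enough and carry usable text.
  const passages = contexts
    .filter((c) => c.score >= minScore && typeof c.text === "string" && c.text.trim() !== "")
    .map((c, i) => `[${i + 1}] ${String(c.text).trim()}`);

  const system =
    passages.length > 0
      ? `Answer using only the context below. If the answer is not there, say so.\n\n${passages.join("\n\n")}`
      : "No relevant context was found. Say that you do not know.";

  return [
    { role: "system", content: system },
    { role: "user", content: question },
  ];
}
```

The returned array can be passed directly as messages to env.AI.run, and the explicit empty-context branch avoids sending the model a blank context block.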

Common Mistake: Mismatched dimensions between the model and Vectorize index. If you insert 384-dim vectors into a 768-dim index, insertion fails (or worse, silently truncates). Always match the model’s output dimensions.

Fix 6: Cost — Neurons

Workers AI is priced in Neurons (Cloudflare’s compute unit). Each model has a Neuron cost per request. Free tier includes a daily quota (~10K Neurons/day at the time of writing).
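For a back-of-the-envelope budget, the arithmetic is simple division. The sketch below treats both numbers as inputs: the per-request Neuron cost is something you need to read from the pricing page or your usage dashboard for your model and typical prompt size, and the value 25 in the usage note is purely hypothetical:

```typescript
// Rough estimate of how many requests fit into a daily Neuron allowance.
// Hypothetical helper; neither argument is looked up automatically.
function requestsPerDay(dailyNeurons: number, neuronsPerRequest: number): number {
  if (neuronsPerRequest <= 0) {
    throw new RangeError("neuronsPerRequest must be positive");
  }
  return Math.floor(dailyNeurons / neuronsPerRequest);
}
```

With a 10,000 Neuron allowance and an assumed 25 Neurons per request, requestsPerDay(10000, 25) gives 400 requests.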

To monitor:

  • Dashboard → Workers AI → Usage.

Strategies to control cost:

  • Cache via AI Gateway. Identical prompts return cached responses (~0 cost).
  • Pick smaller models. llama-3.1-8b is much cheaper than llama-3.1-70b per request.
  • Batch embedding requests. Embedding models accept arrays of text:

const response = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
  text: ["text 1", "text 2", "text 3"],  // Batched
});
// response.data: number[][] — one embedding per input

One call, multiple embeddings — much cheaper than N separate calls.
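For larger document sets, the inputs still have to be split into batches of a manageable size, and empty strings are best removed first. A minimal sketch, assuming a batch size of 50; that default is an assumption, so check the documented input limit for the embedding model you use. The function name is hypothetical:

```typescript
// Trims inputs, drops empty ones, and groups the rest into batches.
// Note: dropping entries changes positions, so keep your own ID mapping
// if you need to match embeddings back to the original documents.
function prepareEmbeddingBatches(texts: string[], batchSize = 50): string[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const cleaned = texts.map((t) => t.trim()).filter((t) => t.length > 0);
  const batches: string[][] = [];
  for (let i = 0; i < cleaned.length; i += batchSize) {
    batches.push(cleaned.slice(i, i + batchSize));
  }
  return batches;
}
```

Each inner array can then be passed as the text field of a single env.AI.run call.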

For high-volume production, consider the AI Gateway’s per-key rate limits to prevent runaway costs:

  • Dashboard → AI Gateway → your gateway → Settings → Rate limiting.

Pro Tip: Set up cost alerts in Cloudflare’s billing settings. Workers AI bills can surprise you on a chatty app — alerts give early warning.

Fix 7: Function Calling and Tool Use

Some Workers AI models support function calling (e.g. Llama 3.1):

const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
  messages: [{ role: "user", content: "What's the weather in Tokyo?" }],
  tools: [
    {
      type: "function",
      function: {
        name: "get_weather",
        description: "Get the current weather for a city",
        parameters: {
          type: "object",
          properties: {
            city: { type: "string" },
          },
          required: ["city"],
        },
      },
    },
  ],
});

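// If the model wants to call a tool:
if (response.tool_calls) {
  for (const call of response.tool_calls) {
    // Workers AI typically returns { name, arguments } with arguments already
    // parsed, not OpenAI's { function: { name, arguments: "<JSON string>" } }.
    const args = typeof call.arguments === "string"
      ? JSON.parse(call.arguments)
      : call.arguments;
    // Execute the function named call.name with args, e.g. fetch a weather API.
    // Then send the result back in the next message.
  }
}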

Function calling support varies by model — Llama 3.1+ has it; older models may not. Check the model’s docs.

Common Mistake: Assuming function calling response format matches OpenAI’s. Workers AI’s format is similar but not identical. Test before assuming compatibility.
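One defensive option is to normalize tool calls into a single internal shape before dispatching them. The sketch below accepts both a flat { name, arguments } entry and an OpenAI-style { function: { name, arguments } } entry; which shape you actually receive depends on the model, so treat both branches as assumptions to verify. The names NormalizedToolCall and normalizeToolCall are hypothetical:

```typescript
interface NormalizedToolCall {
  name: string;
  args: Record<string, unknown>;
}

// Returns a normalized call, or null if the entry cannot be interpreted.
function normalizeToolCall(call: unknown): NormalizedToolCall | null {
  if (typeof call !== "object" || call === null) return null;
  const c = call as {
    name?: unknown;
    arguments?: unknown;
    function?: { name?: unknown; arguments?: unknown };
  };

  // Prefer the nested OpenAI-style fields when present, else the flat ones.
  const name = c.function?.name ?? c.name;
  const rawArgs = c.function?.arguments ?? c.arguments;
  if (typeof name !== "string") return null;

  // Arguments may arrive as a JSON string or as an already-parsed object.
  let args: unknown = rawArgs;
  if (typeof rawArgs === "string") {
    try {
      args = JSON.parse(rawArgs);
    } catch {
      return null;
    }
  }
  if (typeof args !== "object" || args === null || Array.isArray(args)) return null;
  return { name, args: args as Record<string, unknown> };
}
```

Dispatch code can then switch on the normalized name and read args without caring which format the model produced.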

Fix 8: Region Availability and Latency

Workers AI runs in select Cloudflare data centers with GPU hardware. Not every region has every model. Cloudflare routes requests to the nearest available data center automatically.

For latency-sensitive apps:

  • Test from your users’ regions. Workers AI’s response time varies by available capacity in that region.
  • Consider @cf/baai/bge-small-en-v1.5 (smaller models = faster, more available).
  • Use AI Gateway caching aggressively to avoid hitting AI for repeated queries.

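Workers AI models never run on your machine: local Miniflare does not simulate the GPU. During wrangler dev, calls through the AI binding are forwarded to Cloudflare’s network, so they need an authenticated Wrangler session and count toward your usage. wrangler dev --remote goes further and runs the entire Worker on Cloudflare’s network, which is closer to production behavior.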

Common Mistake: Benchmarking in dev mode and assuming production latency. Always benchmark with wrangler dev --remote or in a deploy preview.

Still Not Working?

A few less-obvious failures:

  • Model is currently overloaded. GPU capacity at the region is exhausted. Retry with backoff or fall back to a smaller model.
  • Empty embedding result. Input was empty or only whitespace. Validate before calling.
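  • AI is not defined despite binding. The wrangler.toml change has not been applied: restart wrangler dev or re-run wrangler deploy. Also confirm you read the binding from the handler’s env argument (there is no global AI), and that AI is typed in your Env interface so typos are caught at compile time.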
  • Streaming first chunk takes seconds. Cold start of the GPU pipeline. Subsequent streams are fast. Use AI Gateway to cache common patterns.
  • Vectorize query returns matches with wrong text. Metadata wasn’t included in the upsert. Use metadata: { ... } when inserting; query with returnMetadata: true.
  • Function call response missing tool_calls. Model didn’t decide to call a tool. Make the prompt more directive (“Call the weather function for…”).
  • Invalid model: .... Typo in model ID or model deprecated. Check the current model catalog in the Cloudflare dashboard.
  • Unexpected cache misses. The AI Gateway cache key covers the full prompt plus parameters, so subtle changes (an extra space, a different temperature) produce a miss instead of a cached response.
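For the transient failures in this list, such as an overloaded model, retrying with exponential backoff is the usual remedy. The sketch below is a generic pattern rather than anything Workers AI specific; the function names and the default delays are assumptions, and it retries on any error, so in practice you may want to inspect the error first and add jitter:

```typescript
// Delay schedule: baseMs, 2*baseMs, 4*baseMs, ... capped at maxMs.
function backoffDelays(retries: number, baseMs = 250, maxMs = 4000): number[] {
  return Array.from({ length: retries }, (_, i) => Math.min(maxMs, baseMs * 2 ** i));
}

// Runs fn, retrying up to `retries` times with the schedule above.
async function withRetry<T>(fn: () => Promise<T>, retries = 3): Promise<T> {
  const delays = backoffDelays(retries);
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= delays.length) throw err; // out of retries: surface the error
      await new Promise((resolve) => setTimeout(resolve, delays[attempt]));
    }
  }
}
```

A call site would look like withRetry(() => env.AI.run(model, inputs)), optionally falling back to a smaller model if the final attempt also fails.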

For related Cloudflare and AI/LLM issues, see Cloudflare D1 not working, Cloudflare R2 not working, Cloudflare Durable Objects not working, and LiteLLM not working.

FixDevs

Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.
