Fix: Replicate Not Working — Model Versions, Prediction Polling, Webhooks, and Cog Build
Quick Answer
How to fix Replicate API errors — model version ID required, prediction polling vs streaming, webhook signature verification, file inputs and HTTPS URLs, cold start latency, Cog deployment, and deployments vs predictions.
The Error
You call the Replicate API and get a version error:
HTTPError: 422 Client Error: Unprocessable Entity
{"detail": "version is required"}
Or prediction.output is None:
prediction = replicate.predictions.create(...)
print(prediction.output) # None
print(prediction.status) # "starting"
Or your webhook never fires:
[webhook handler] No predictions arriving...
Or cog build fails:
cog: error: building image failed: layer ... too large
Why This Happens
Replicate hosts ML models accessible via HTTP API. Most issues map to:
- Predictions are async. predictions.create returns immediately with a job in the starting state. You either poll, use webhooks, or call the convenience run() method (which polls internally).
- Model versions are required. A model URL like replicate/stable-diffusion is ambiguous because versions are commit-like IDs. Either pin a version or use the helper that picks the latest.
- File inputs need URLs or base64. Local file paths passed as strings don’t work over HTTP. Either upload to your own storage and pass the URL, or base64-encode inline (size-limited).
- Cog (Replicate’s containerization tool) builds Docker images of your ML code. Large images and GPU dependencies make builds slow and can make them fail.
Fix 1: Specify the Model Version
import replicate
# Use the model:version shorthand:
output = replicate.run(
    "stability-ai/stable-diffusion-3:abcdef0123456789",
    input={"prompt": "a cat on a roof", "width": 1024, "height": 1024},
)
print(output)
The reference has the form owner/model:version_id, here stability-ai/stable-diffusion-3:abcdef0123456789. The version ID is a hash that identifies one pushed version of the model.
To find the latest version:
model = replicate.models.get("stability-ai/stable-diffusion-3")
latest_version = model.latest_version.id
print(latest_version)
Or skip the version (uses model.latest_version):
output = replicate.run(
    "stability-ai/stable-diffusion-3",  # No version: uses latest
    input={"prompt": "..."},
)
For production, pin specific versions to avoid surprises:
STABLE_DIFFUSION_VERSION = "stability-ai/stable-diffusion-3:abcdef0123456789"
output = replicate.run(STABLE_DIFFUSION_VERSION, input={...})
For Node:
import Replicate from "replicate";
const replicate = new Replicate({ auth: process.env.REPLICATE_API_TOKEN });
const output = await replicate.run(
  "stability-ai/stable-diffusion-3:abcdef0123456789",
  { input: { prompt: "a cat" } },
);
Pro Tip: Pin versions per environment. Dev can track latest; production should pin so that a model update doesn’t silently change your output.
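One way to pin per environment is a small lookup keyed by an environment name. This is only a sketch: the APP_ENV variable, the MODEL_REFS table, and both function names are hypothetical, and the version ID is the same placeholder used above.

```python
import os

# Hypothetical table: production pins a version, dev tracks the latest.
MODEL_REFS = {
    "production": "stability-ai/stable-diffusion-3:abcdef0123456789",
    "dev": "stability-ai/stable-diffusion-3",
}


def model_ref_for(env: str) -> str:
    """Return the model reference for env, refusing an unpinned production ref."""
    ref = MODEL_REFS[env]
    if env == "production" and ":" not in ref:
        raise ValueError("production must pin a model version")
    return ref


def current_model_ref() -> str:
    # APP_ENV is an assumed variable name; fall back to dev when it is unset.
    return model_ref_for(os.environ.get("APP_ENV", "dev"))
```

The returned string can be passed straight to replicate.run() as its first argument.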
Fix 2: run() vs Manual Polling
The run() method blocks until the prediction completes:
output = replicate.run(model_version, input={...})
# Returns the final output, polling internally.
For more control (e.g. to show progress to users), call predictions.create and poll:
import time
import replicate

# Here model_version is the bare version ID, not the owner/model:version string.
prediction = replicate.predictions.create(
    version=model_version,
    input={"prompt": "..."},
)
# Poll until the prediction reaches a terminal state.
while prediction.status not in ("succeeded", "failed", "canceled"):
    time.sleep(1)
    prediction.reload()
    print(prediction.status)  # starting → processing → succeeded
if prediction.status == "succeeded":
    print(prediction.output)
elif prediction.status == "failed":
    print(prediction.error)
For Node:
const prediction = await replicate.predictions.create({
  version: "abcdef0123456789",
  input: { prompt: "..." },
});
let status = prediction.status;
while (status === "starting" || status === "processing") {
  await new Promise((r) => setTimeout(r, 1000));
  const updated = await replicate.predictions.get(prediction.id);
  status = updated.status;
}
Common Mistake: Polling without backoff. Hammering the API every 100ms can hit rate limits. Use 1-second intervals or exponential backoff.
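The backoff advice can be sketched as a small helper that doubles its wait after every non-terminal status. It is an illustration under assumptions: poll_with_backoff and its parameters are hypothetical names, and get_status stands in for whatever fetches the current status in your code.

```python
import time
from typing import Callable

TERMINAL = ("succeeded", "failed", "canceled")


def poll_with_backoff(
    get_status: Callable[[], str],
    initial: float = 1.0,
    factor: float = 2.0,
    max_delay: float = 30.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call get_status until it returns a terminal state, waiting longer each round."""
    delay, waited = initial, 0.0
    while True:
        status = get_status()
        if status in TERMINAL:
            return status
        if waited + delay > timeout:
            raise TimeoutError(f"still {status!r} after {waited:.0f}s")
        sleep(delay)
        waited += delay
        # Grow the delay geometrically, capped at max_delay.
        delay = min(delay * factor, max_delay)
```

With the Python client, get_status could be a function that calls prediction.reload() and returns prediction.status. The sleep parameter exists only so the helper can be tested without real waiting.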
Fix 3: Webhooks Instead of Polling
For predictions that take minutes, webhooks are cheaper than polling:
prediction = replicate.predictions.create(
    version=model_version,
    input={...},
    webhook="https://app.example.com/api/replicate-webhook",
    webhook_events_filter=["completed"],  # Or "start", "output", "logs", "completed"
)
The webhook fires when the prediction reaches one of the filtered events. completed covers every terminal state, including both succeeded and failed.
Your handler:
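import json
from fastapi import FastAPI, Request, Response  # Assumes a FastAPI app.

app = FastAPI()

@app.post("/api/replicate-webhook")
async def handle_webhook(request: Request):
    body = await request.body()
    # Verify signature (recommended). Always verify against the raw body bytes.
    if not verify_signature(body, request.headers):
        return Response(status_code=401)
    payload = json.loads(body)
    # save_output and record_failure are your own persistence helpers.
    if payload["status"] == "succeeded":
        await save_output(payload["id"], payload["output"])
    elif payload["status"] == "failed":
        await record_failure(payload["id"], payload["error"])
    return {"ok": True}
For signature verification, Replicate signs each delivery with HMAC-SHA256. The version below follows the scheme described in Replicate’s webhook documentation: the signing secret (prefixed whsec_) is fetched from the webhooks/default/secret API endpoint, the signed content is the webhook-id, webhook-timestamp, and raw body joined by dots, and the webhook-signature header carries one or more base64 signatures. Confirm the details against the current docs before relying on it:
import base64
import hashlib
import hmac
import os

def verify_signature(body: bytes, headers) -> bool:
    webhook_id = headers.get("webhook-id", "")
    timestamp = headers.get("webhook-timestamp", "")
    signatures = headers.get("webhook-signature", "")
    # The secret looks like "whsec_<base64 key>"; decode the part after the prefix.
    secret = os.environ["REPLICATE_WEBHOOK_SECRET"]
    key = base64.b64decode(secret.removeprefix("whsec_"))
    # Signed content is "<webhook-id>.<webhook-timestamp>.<raw body>".
    signed = f"{webhook_id}.{timestamp}.".encode() + body
    expected = base64.b64encode(hmac.new(key, signed, hashlib.sha256).digest()).decode()
    # The header may hold several space-separated "v1,<signature>" entries.
    return any(
        hmac.compare_digest(expected, entry.split(",", 1)[-1])
        for entry in signatures.split()
    )
Pro Tip: Combine the webhook with a polling fallback. Webhooks can fail (a network blip, an app restart), so fall back to polling for predictions that have been pending too long.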
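The polling fallback needs a way to decide which predictions have waited too long. A minimal sketch, assuming you record each prediction ID with its creation time when you submit it; stale_prediction_ids, the pending mapping, and the ten-minute default are all hypothetical choices.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def stale_prediction_ids(
    pending: Dict[str, datetime],
    now: datetime,
    max_age: timedelta = timedelta(minutes=10),
) -> List[str]:
    """Return IDs whose webhook has not arrived within max_age.

    pending maps prediction ID -> creation time (timezone-aware).
    """
    return sorted(pid for pid, created in pending.items() if now - created > max_age)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    pending = {"old": now - timedelta(minutes=30), "new": now}
    print(stale_prediction_ids(pending, now))
```

A periodic job could fetch each returned ID with replicate.predictions.get() and pass the result through the same success and failure handling as the webhook.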
Fix 4: Streaming Outputs
For models that support streaming (LLMs, some image gen):
# Python:
for event in replicate.stream(
    "meta/meta-llama-3-70b-instruct",
    input={"prompt": "Tell me about Python"},
):
    print(event, end="", flush=True)
// Node:
for await (const event of replicate.stream("...", { input: {...} })) {
  process.stdout.write(event.toString());
}
stream() yields server-sent events as the model produces them, which is useful for chat UIs with token-by-token output.
Not all models support streaming. Check the model’s documentation under “API Examples.”
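For SSE-based streaming via fetch directly, create the prediction first, then read the event stream from the URL returned in urls.stream. The POST itself returns JSON, not the stream. Check Replicate’s streaming docs for whether the stream request also needs the Authorization header:
const response = await fetch("https://api.replicate.com/v1/predictions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.REPLICATE_API_TOKEN}`,
    "content-type": "application/json",
  },
  body: JSON.stringify({
    version: "...",
    input: {...},
    stream: true,
  }),
});
// The POST response describes the prediction; it is not the event stream.
const prediction = await response.json();
// urls.stream is present only for models that support streaming.
const events = await fetch(prediction.urls.stream, {
  headers: { Accept: "text/event-stream" },
});
const reader = events.body.getReader();
const decoder = new TextDecoder();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // stream: true keeps multi-byte characters intact across chunks.
  console.log(decoder.decode(value, { stream: true }));
}
Fix 5: File Inputs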
For inputs like images, audio, video:
Option A — public URL:
output = replicate.run(
    "ai-forever/kandinsky-2.2",
    input={"image": "https://example.com/cat.jpg"},
)
Replicate fetches from the URL. Must be HTTPS and publicly accessible.
Option B — base64 data URL:
import base64
with open("cat.jpg", "rb") as f:
    img_bytes = f.read()
data_url = f"data:image/jpeg;base64,{base64.b64encode(img_bytes).decode()}"
output = replicate.run(
    "ai-forever/kandinsky-2.2",
    input={"image": data_url},
)
Size-limited (typically 5-25 MB per input).
Option C — Replicate’s file upload helper:
output = replicate.run(
    "...",
    input={"image": open("cat.jpg", "rb")},
)
The Python client uploads the file to Replicate’s hosted storage and passes the URL.
For Node, use Buffer or stream:
import fs from "node:fs";
const output = await replicate.run("...", {
  input: { image: fs.createReadStream("cat.jpg") },
});
The client uploads automatically.
Common Mistake: Passing local file paths as strings. Replicate’s HTTP API has no access to your filesystem. Use one of the three patterns above.
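Option B can be wrapped in a helper that refuses files too large to inline. This is a sketch under assumptions: file_to_data_url and MAX_INLINE_BYTES are hypothetical names, and the 5 MB threshold is a conservative guess taken from the range quoted above, not an official limit.

```python
import base64
import mimetypes
from pathlib import Path

MAX_INLINE_BYTES = 5 * 1024 * 1024  # Assumed threshold, not an official limit.


def file_to_data_url(path: str, max_bytes: int = MAX_INLINE_BYTES) -> str:
    """Read a local file and return it as a base64 data URL."""
    data = Path(path).read_bytes()
    if len(data) > max_bytes:
        raise ValueError(
            f"{path} is {len(data)} bytes; upload it and pass a URL instead"
        )
    # Guess the MIME type from the extension; fall back to a generic type.
    mime = mimetypes.guess_type(path)[0] or "application/octet-stream"
    return f"data:{mime};base64,{base64.b64encode(data).decode()}"
```

The result goes wherever the model expects a file input, for example input={"image": file_to_data_url("cat.jpg")}.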
Fix 6: Deployments for Lower Cold Starts
A “prediction” runs on shared infrastructure with potential cold starts. A “deployment” is a pinned model with reserved hardware — no cold starts, predictable cost.
In the Replicate Dashboard → Deployments → New deployment:
Model: my-username/my-model
Version: abcdef0123456789
Min instances: 1
Max instances: 10
Hardware: A100 (80GB)
Then call via the deployment endpoint:
deployment = replicate.deployments.get("my-username/production-deploy")
prediction = deployment.predictions.create(input={...})
prediction.wait()
print(prediction.output)
Some client versions may also accept the deployment name in run(). Verify this against your installed client first, because run() normally resolves owner/name as a model reference:
output = replicate.run(
    "my-username/production-deploy",
    input={"prompt": "..."},
    # Deployments are addressed by name; version is implicit.
)
Deployments cost the reserved hardware’s hourly rate, even when idle. For sporadic traffic, predictions are cheaper. For latency-sensitive endpoints, deployments win.
Pro Tip: Use min_instances: 0 for cost — but expect cold-start latency on the first request after idle. For 24/7 readiness, min_instances: 1 with the smallest hardware tier.
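The cost trade-off can be estimated with a few lines of arithmetic. The sketch is deliberately simplified: the function names are hypothetical, every rate in the usage note is made up for illustration, and it only compares the idle floor of a deployment against per-run billing, so take real prices from Replicate’s pricing page.

```python
def deployment_monthly_cost(
    hourly_rate: float, min_instances: int, hours: float = 730.0
) -> float:
    """Idle floor: reserved instances bill for every hour, busy or not."""
    return hourly_rate * min_instances * hours


def prediction_monthly_cost(
    per_second_rate: float, seconds_per_run: float, runs: int
) -> float:
    """Pay-per-use cost: billed only for the seconds each run takes."""
    return per_second_rate * seconds_per_run * runs


def break_even_runs(
    hourly_rate: float,
    min_instances: int,
    per_second_rate: float,
    seconds_per_run: float,
    hours: float = 730.0,
) -> float:
    """Monthly run count at which per-run billing matches the idle floor."""
    floor = deployment_monthly_cost(hourly_rate, min_instances, hours)
    return floor / (per_second_rate * seconds_per_run)
```

With an invented rate of 3.60 per hour (0.001 per second) and 10-second runs, one always-on instance costs about 2,628 per month, which per-run billing only reaches at roughly 262,800 runs.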
Fix 7: Building Custom Models With Cog
Cog packages your model as a Docker image Replicate can run.
cog.yaml:
build:
  gpu: true
  python_version: "3.11"
  python_packages:
    - "torch==2.1.0"
    - "transformers==4.40.0"
    - "diffusers==0.27.0"
  system_packages:
    - "ffmpeg"
predict: "predict.py:Predictor"
predict.py:
from cog import BasePredictor, Input, Path

class Predictor(BasePredictor):
    def setup(self):
        """Loaded once at startup. Slow loads (model weights) go here."""
        from diffusers import StableDiffusionPipeline
        self.pipe = StableDiffusionPipeline.from_pretrained("...")
        self.pipe.to("cuda")

    def predict(
        self,
        prompt: str = Input(description="Prompt for generation"),
        steps: int = Input(default=50, ge=1, le=100),
    ) -> Path:
        image = self.pipe(prompt, num_inference_steps=steps).images[0]
        output_path = Path("/tmp/output.png")
        image.save(output_path)
        return output_path
Build and test locally:
cog build
cog predict -i prompt="a cat on a roof"
Push to Replicate:
cog login
cog push r8.im/my-username/my-model
Now my-username/my-model is callable via the API.
Common Mistake: Loading weights in predict() instead of setup(). Every prediction reloads — slow. Put expensive init in setup(); it runs once when the container starts.
For big models, download weights from an object store such as Cloudflare R2 or S3 at runtime (in setup()) instead of baking them into the image. This keeps the image smaller and makes rebuilds faster.
Fix 8: Rate Limits and Errors
Replicate’s API has rate limits per token:
- Free tier: limited concurrent predictions.
- Paid: higher concurrency.
Common errors and handling:
import time
import replicate
from replicate.exceptions import ModelError, ReplicateError

try:
    output = replicate.run("...", input={...})
except ModelError as e:
    # Model itself errored (e.g. invalid prompt, OOM).
    print("Model error:", e)
except ReplicateError as e:
    # API error (rate limit, auth, network).
    if e.status == 429:
        time.sleep(60)
        # Retry
    elif e.status == 401:
        # Bad token
        ...
For retries with exponential backoff:
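import time
import replicate
from replicate.exceptions import ReplicateError

for attempt in range(5):
    try:
        output = replicate.run("...", input={...})
        break
    except ReplicateError as e:
        # Retry only rate limits and transient gateway errors.
        if e.status in (429, 502, 503):
            time.sleep(2 ** attempt)
            continue
        raise
else:
    # All five attempts hit a retryable error; fail loudly instead of continuing.
    raise RuntimeError("Replicate request failed after 5 attempts")
For production traffic, queue requests on your side (BullMQ, Sidekiq, etc.) and pull at a rate Replicate can handle.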
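For small workloads, the same idea can be approximated in-process by capping how many requests run at once. This is a stand-in for a real job queue, not a replacement: run_bounded is a hypothetical helper, and worker is whatever function wraps your replicate.run() call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_bounded(
    inputs: Iterable[T],
    worker: Callable[[T], R],
    max_concurrency: int = 4,
) -> List[R]:
    """Run worker over inputs with at most max_concurrency calls in flight.

    Results come back in input order; the first worker exception propagates.
    """
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(worker, inputs))
```

Set max_concurrency at or below the concurrency your plan allows, and keep the retry loop inside worker so a rate-limited request backs off without blocking the others.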
Still Not Working?
A few less-obvious failures:
- No webhook events received. Replicate sends webhooks at specific lifecycle moments. Check webhook_events_filter. Also verify your webhook endpoint is HTTPS and publicly accessible (no localhost).
- Output is a URL, not the data. Image/audio/video outputs are URLs to Replicate-hosted files. Download to your storage if you need long-term retention, because Replicate’s hosted files may expire.
- File too large for image upload. ~5-10 MB limit on inputs via base64. Use a public URL for larger files.
- Prediction timed out. Default timeout is per-model. For long-running predictions, check the model’s predict_timeout in cog.yaml.
- Cog build slow. Large images take a long time to build and push. Use cog build --use-cuda-base-image and pin dependencies for caching.
- Webhook signatures don’t match. Replicate uses a specific signature format. Use compare_digest for timing-safe comparison, compute the signature over the raw request body, and verify that the secret your handler uses matches the one Replicate holds.
- Streaming events out of order. SSE is in-order at the network level but client parsing may buffer. Use a proper SSE parser.
- Predictions on shared infrastructure are slow. Cold start. Move the model to a deployment with min_instances >= 1 for predictable latency.
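For the streaming item, a proper SSE parser buffers fields until the blank line that ends each event rather than treating every network chunk as a message. A simplified sketch of that rule follows; parse_sse is a hypothetical helper, it handles only the event, id, and data fields, and the event names in the usage note are illustrative.

```python
from typing import Dict, Iterable, Iterator, List


def parse_sse(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per server-sent event from an iterable of decoded lines."""
    event: Dict[str, str] = {}
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data or event:
                event["data"] = "\n".join(data)
                yield event
            event, data = {}, []
            continue
        if line.startswith(":"):
            continue  # Comment line, often used as a keep-alive.
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # Only one leading space is stripped.
        if field == "data":
            data.append(value)  # Multiple data lines join with newlines.
        elif field in ("event", "id"):
            event[field] = value
```

Feeding it the lines "event: output", "data: Hello", and a blank line yields {"event": "output", "data": "Hello"}. An event that never receives its closing blank line is not emitted.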
For related ML inference and serving issues, see Modal not working, Cloudflare Workers AI not working, LiteLLM not working, and HuggingFace Transformers not working.
Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.