Fix: Replicate Not Working — Model Versions, Prediction Polling, Webhooks, and Cog Build

FixDevs ·

Quick Answer

How to fix Replicate API errors — model version ID required, prediction polling vs streaming, webhook signature verification, file inputs and HTTPS URLs, cold start latency, Cog deployment, and deployments vs predictions.

The Error

You call the Replicate API and get a version error:

HTTPError: 422 Client Error: Unprocessable Entity
{"detail": "version is required"}

Or prediction.output is None:

prediction = replicate.predictions.create(...)
print(prediction.output)  # None
print(prediction.status)  # "starting"

Or your webhook never fires:

[webhook handler] No predictions arriving...

Or cog build fails:

cog: error: building image failed: layer ... too large

Why This Happens

Replicate hosts ML models accessible via HTTP API. Most issues map to:

  • Predictions are async. predictions.create returns immediately with a job in starting state. You either poll, use webhooks, or call the convenience run() method (which polls internally).
  • Model versions are required. A model URL like replicate/stable-diffusion is ambiguous — versions are commit-like IDs. Either pin a version or use the helper that picks the latest.
  • File inputs need URLs or base64. Local file paths don’t work over HTTP. Either upload to your own storage and pass the URL, or base64-encode inline (size-limited).
  • Cog (Replicate’s containerization tool) builds Docker images of your ML code. Large images and GPU dependencies make those builds slow and prone to size-related failures.

Fix 1: Specify the Model Version

import replicate

# Use the model:version shorthand:
output = replicate.run(
    "stability-ai/stable-diffusion-3:abcdef0123456789",
    input={"prompt": "a cat on a roof", "width": 1024, "height": 1024},
)
print(output)

stability-ai/stable-diffusion-3:abcdef0123456789 follows the format username/model:version_id. The version ID is a hash of the deployed model.

To find the latest version:

model = replicate.models.get("stability-ai/stable-diffusion-3")
latest_version = model.latest_version.id
print(latest_version)

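Or skip the version and let Replicate resolve the latest one. This shorthand is supported for official models; community models may still require an explicit version ID: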

output = replicate.run(
    "stability-ai/stable-diffusion-3",  # No version — uses latest
    input={"prompt": "..."},
)

For production, pin specific versions to avoid surprises:

STABLE_DIFFUSION_VERSION = "stability-ai/stable-diffusion-3:abcdef0123456789"

output = replicate.run(STABLE_DIFFUSION_VERSION, input={...})

For Node:

import Replicate from "replicate";

const replicate = new Replicate({ auth: process.env.REPLICATE_API_TOKEN });

const output = await replicate.run(
  "stability-ai/stable-diffusion-3:abcdef0123456789",
  { input: { prompt: "a cat" } },
);

Pro Tip: Pin versions per environment. Dev can track latest; production should pin so a model update doesn’t accidentally change your output.
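One way to keep the pin per environment is a small lookup keyed on an environment variable. This is a minimal sketch: APP_ENV, MODEL_REFS, and model_ref_for are hypothetical names, not part of the Replicate client, and the version ID is the placeholder used throughout this article.

```python
import os
from typing import Optional

# Hypothetical mapping: dev tracks the latest version, production is pinned.
# The version ID is the article's placeholder, not a real hash.
MODEL_REFS = {
    "development": "stability-ai/stable-diffusion-3",
    "production": "stability-ai/stable-diffusion-3:abcdef0123456789",
}


def model_ref_for(env: Optional[str] = None) -> str:
    """Return the model reference for an environment (APP_ENV is an assumed name)."""
    env = env or os.environ.get("APP_ENV", "development")
    ref = MODEL_REFS[env]
    # Refuse to run production against an unpinned model.
    if env == "production" and ":" not in ref:
        raise ValueError("production must pin a model version")
    return ref
```

The returned string can be passed as the first argument to replicate.run(), so the pin lives in one place instead of being repeated at every call site.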

Fix 2: run() vs Manual Polling

The run() method blocks until the prediction completes:

output = replicate.run(model_version, input={...})
# Returns the final output, polling internally.

For more control (e.g. show progress to users), call predictions.create and poll:

import time

import replicate

prediction = replicate.predictions.create(
    version=model_version,
    input={"prompt": "..."},
)

while prediction.status not in ("succeeded", "failed", "canceled"):
    time.sleep(1)
    prediction.reload()
    print(prediction.status)  # starting → processing → succeeded

if prediction.status == "succeeded":
    print(prediction.output)
elif prediction.status == "failed":
    print(prediction.error)

For Node:

const prediction = await replicate.predictions.create({
  version: "abcdef0123456789",
  input: { prompt: "..." },
});

let status = prediction.status;
while (status === "starting" || status === "processing") {
  await new Promise((r) => setTimeout(r, 1000));
  const updated = await replicate.predictions.get(prediction.id);
  status = updated.status;
}

Common Mistake: Polling without backoff. Hammering the API every 100ms can hit rate limits. Use 1-second intervals or exponential backoff.
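The backoff idea can be sketched as a small helper that is independent of the client library. It is an illustration under assumptions: poll_until_done and its parameters are hypothetical names, and get_status is any callable you supply that returns the current status string.

```python
import time
from typing import Callable

TERMINAL = {"succeeded", "failed", "canceled"}


def poll_until_done(
    get_status: Callable[[], str],
    initial_delay: float = 1.0,
    max_delay: float = 10.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() with exponential backoff until a terminal status."""
    delay = initial_delay
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL:
            return status
        if waited >= timeout:
            raise TimeoutError(f"still {status!r} after {waited:.0f}s")
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay)  # double the wait, capped at max_delay
```

With the Python client, get_status could be a small function that calls prediction.reload() and returns prediction.status. The sleep parameter is injectable so the loop can be tested without waiting.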

Fix 3: Webhooks Instead of Polling

For predictions that take minutes, webhooks are cheaper than polling:

prediction = replicate.predictions.create(
    version=model_version,
    input={...},
    webhook="https://app.example.com/api/replicate-webhook",
    webhook_events_filter=["completed"],  # Or "start", "output", "logs", "completed"
)

The webhook fires when the prediction reaches the filtered states. completed includes both succeeded and failed.

Your handler:

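import json
import os

from fastapi import FastAPI, Request, Response

app = FastAPI()

@app.post("/api/replicate-webhook")
async def handle_webhook(request: Request):
    body = await request.body()

    # Verify signature (recommended). Pass the raw bytes, not re-serialized JSON:
    if not verify_signature(body, request.headers):
        return Response(status_code=401)

    payload = json.loads(body)
    if payload["status"] == "succeeded":
        await save_output(payload["id"], payload["output"])
    elif payload["status"] == "failed":
        await record_failure(payload["id"], payload["error"])

    return {"ok": True}

For signature verification, Replicate signs the string "<webhook-id>.<webhook-timestamp>.<raw body>" with HMAC-SHA256. The key is the base64-decoded part of the signing secret after its whsec_ prefix (fetch the secret from the API at GET /v1/webhooks/default/secret), and the webhook-signature header carries one or more space-separated "v1,<base64 signature>" values:

import base64
import hashlib
import hmac

def verify_signature(body: bytes, headers) -> bool:
    secret = os.environ["REPLICATE_WEBHOOK_SECRET"]  # the full "whsec_..." value
    key = base64.b64decode(secret.split("_", 1)[1])
    signed = f"{headers.get('webhook-id')}.{headers.get('webhook-timestamp')}.".encode() + body
    expected = base64.b64encode(hmac.new(key, signed, hashlib.sha256).digest()).decode()
    # The header may list several signatures, each formatted as "v1,<signature>".
    received = [s.split(",", 1)[-1] for s in (headers.get("webhook-signature") or "").split()]
    return any(hmac.compare_digest(expected, sig) for sig in received)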

Pro Tip: Combine webhook with a polling fallback. Webhooks can fail (network blip, your app restart) — fall back to polling for predictions that have been pending too long.
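The fallback needs a way to decide which predictions have been pending too long. Here is a minimal sketch, assuming your app records each prediction's ID and creation time when it calls predictions.create; stale_prediction_ids and the 10-minute threshold are illustrative choices, not Replicate defaults.

```python
from datetime import datetime, timedelta, timezone


def stale_prediction_ids(pending, now=None, max_age=timedelta(minutes=10)):
    """Return IDs of predictions that have been pending longer than max_age.

    `pending` maps prediction ID -> timezone-aware datetime of creation.
    """
    now = now or datetime.now(timezone.utc)
    return sorted(pid for pid, created in pending.items() if now - created > max_age)
```

A scheduled job could call this every few minutes and fetch each returned ID with replicate.predictions.get(...) to pick up results whose webhook was lost.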

Fix 4: Streaming Outputs

For models that support streaming (LLMs, some image gen):

# Python:
for event in replicate.stream(
    "meta/meta-llama-3-70b-instruct",
    input={"prompt": "Tell me about Python"},
):
    print(event, end="", flush=True)

// Node:
for await (const event of replicate.stream("...", { input: {...} })) {
  process.stdout.write(event.toString());
}

stream() yields server-sent events as the model produces them — useful for chat UIs with token-by-token output.

Not all models support streaming. Check the model’s documentation under “API Examples.”

For SSE-based streaming via fetch directly:

const response = await fetch("https://api.replicate.com/v1/predictions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.REPLICATE_API_TOKEN}`,
    "content-type": "application/json",
  },
  body: JSON.stringify({
    version: "...",
    input: {...},
    stream: true,
  }),
});

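// The POST response is the prediction as JSON; the event stream is served from urls.stream.
const prediction = await response.json();
const events = await fetch(prediction.urls.stream, {
  headers: {
    Authorization: `Bearer ${process.env.REPLICATE_API_TOKEN}`,
    Accept: "text/event-stream",
  },
});

const reader = events.body.getReader();
const decoder = new TextDecoder();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // stream: true keeps multi-byte characters intact across chunk boundaries.
  console.log(decoder.decode(value, { stream: true }));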
}
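Raw chunks like these still have to be split into events. A minimal sketch of an SSE parser in Python follows, covering only the event and data fields; parse_sse is a hypothetical helper, and a maintained SSE library is likely the safer choice in production.

```python
def parse_sse(lines):
    """Yield (event, data) tuples from an iterable of decoded SSE lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":  # a blank line terminates the current event
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):  # comment line, often used as a keep-alive
            continue
        else:
            field, _, value = line.partition(":")
            value = value[1:] if value.startswith(" ") else value
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)  # multiple data lines are joined with "\n"
```

Because events are only emitted on the blank-line terminator, a chunk boundary in the middle of an event does not produce a partial event, provided the input has already been split into complete lines.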

Fix 5: File Inputs

For inputs like images, audio, video:

Option A — public URL:

output = replicate.run(
    "ai-forever/kandinsky-2.2",
    input={"image": "https://example.com/cat.jpg"},
)

Replicate fetches from the URL. Must be HTTPS and publicly accessible.

Option B — base64 data URL:

import base64

with open("cat.jpg", "rb") as f:
    img_bytes = f.read()

data_url = f"data:image/jpeg;base64,{base64.b64encode(img_bytes).decode()}"

output = replicate.run(
    "ai-forever/kandinsky-2.2",
    input={"image": data_url},
)

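Size-limited: base64 inflates the payload by about a third, so this suits small files only (a few MB at most). Use a URL or the upload helper for anything larger.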

Option C — Replicate’s file upload helper:

output = replicate.run(
    "...",
    input={"image": open("cat.jpg", "rb")},
)

The Python client uploads the file to Replicate’s hosted storage and passes the URL.

For Node, use Buffer or stream:

import fs from "node:fs";

const output = await replicate.run("...", {
  input: { image: fs.createReadStream("cat.jpg") },
});

The client uploads automatically.

Common Mistake: Passing local file paths as strings. Replicate’s HTTP API has no access to your filesystem. Use one of the three patterns above.
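A guard at the boundary of your app can catch the local-path mistake before the request is sent. This sketch assumes a hypothetical file_input helper and an arbitrary MAX_INLINE_BYTES cutoff, which is not a documented Replicate limit.

```python
import base64
import mimetypes
from pathlib import Path

MAX_INLINE_BYTES = 1_000_000  # assumed cutoff for this sketch, not a documented limit


def file_input(path_or_url: str):
    """Turn a user-supplied string into a value that is safe to pass as a file input."""
    if path_or_url.startswith(("https://", "data:")):
        return path_or_url  # already usable as-is (Option A or B)
    path = Path(path_or_url)
    if not path.is_file():
        raise FileNotFoundError(path_or_url)
    if path.stat().st_size > MAX_INLINE_BYTES:
        return path.open("rb")  # hand the client a file object to upload (Option C)
    mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    encoded = base64.b64encode(path.read_bytes()).decode()
    return f"data:{mime};base64,{encoded}"
```

When the helper returns an open file object, the caller is responsible for closing it after the prediction has been created.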

Fix 6: Deployments for Lower Cold Starts

A “prediction” runs on shared infrastructure with potential cold starts. A “deployment” is a pinned model with reserved hardware — no cold starts, predictable cost.

In the Replicate Dashboard → Deployments → New deployment:

Model:        my-username/my-model
Version:      abcdef0123456789
Min instances: 1
Max instances: 10
Hardware:     A100 (80GB)

Then call via the deployment endpoint:

deployment = replicate.deployments.get("my-username/production-deploy")
prediction = deployment.predictions.create(input={...})
prediction.wait()
print(prediction.output)

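Note that replicate.run() resolves its first argument as owner/model[:version], so passing a deployment name there looks up a model with that name rather than the deployment. Create deployment predictions through replicate.deployments as shown above; the version is implicit in the deployment's configuration.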

Deployments cost the reserved hardware’s hourly rate, even when idle. For sporadic traffic, predictions are cheaper. For latency-sensitive endpoints, deployments win.

Pro Tip: Set min_instances: 0 to save cost, but expect cold-start latency on the first request after an idle period. For 24/7 readiness, set min_instances: 1 on the smallest hardware tier that fits the model.
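The trade-off comes down to how much idle time you pay for. The following is a rough sketch under loud assumptions: the same per-second hardware rate in both modes, on-demand predictions billed only while running, and a made-up rate that is not a Replicate price. daily_cost is a hypothetical helper.

```python
SECONDS_PER_DAY = 86_400


def daily_cost(rate_per_second: float, busy_seconds: float, always_on: bool) -> float:
    """Estimate daily cost for one instance.

    always_on=True models min_instances: 1 (billed all day, busy or idle);
    always_on=False models paying only for seconds spent running predictions.
    """
    billed = SECONDS_PER_DAY if always_on else busy_seconds
    return rate_per_second * billed


# Hypothetical numbers: $0.001/s hardware, one busy hour per day.
on_demand = daily_cost(0.001, busy_seconds=3600, always_on=False)
reserved = daily_cost(0.001, busy_seconds=3600, always_on=True)
```

Under these assumptions the reserved instance costs 24 times as much for one busy hour a day, which is the price of removing cold starts. The gap narrows as utilization approaches 100%.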

Fix 7: Building Custom Models With Cog

Cog packages your model as a Docker image Replicate can run.

cog.yaml:

build:
  gpu: true
  python_version: "3.11"
  python_packages:
    - "torch==2.1.0"
    - "transformers==4.40.0"
    - "diffusers==0.27.0"
  system_packages:
    - "ffmpeg"

predict: "predict.py:Predictor"

predict.py:

from cog import BasePredictor, Input, Path

class Predictor(BasePredictor):
    def setup(self):
        """Loaded once at startup. Slow loads (model weights) go here."""
        from diffusers import StableDiffusionPipeline
        self.pipe = StableDiffusionPipeline.from_pretrained("...")
        self.pipe.to("cuda")
    
    def predict(
        self,
        prompt: str = Input(description="Prompt for generation"),
        steps: int = Input(default=50, ge=1, le=100),
    ) -> Path:
        image = self.pipe(prompt, num_inference_steps=steps).images[0]
        output_path = Path("/tmp/output.png")
        image.save(output_path)
        return output_path

Build and test locally:

cog build
cog predict -i prompt="a cat on a roof"

Push to Replicate:

cog login
cog push r8.im/my-username/my-model

Now my-username/my-model is callable via the API.

Common Mistake: Loading weights in predict() instead of setup(). Every prediction reloads — slow. Put expensive init in setup(); it runs once when the container starts.

For big models, download weights from a Cloudflare R2 / S3 bucket at runtime (in setup()) instead of baking them into the image. This keeps the image smaller and makes rebuilds faster.
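Fetching weights at startup can be sketched with the standard library alone. ensure_weights is a hypothetical helper, and the URL and destination path are whatever your bucket and container layout use; a real setup would probably add retries and checksum verification.

```python
import os
import shutil
import tempfile
import urllib.request
from pathlib import Path


def ensure_weights(url: str, dest: Path) -> Path:
    """Download url to dest unless dest already exists; return dest."""
    if dest.exists():
        return dest  # cached from an earlier start of the same container
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file first so an interrupted download never leaves
    # a partial file at dest.
    fd, tmp = tempfile.mkstemp(dir=dest.parent)
    with os.fdopen(fd, "wb") as out, urllib.request.urlopen(url) as resp:
        shutil.copyfileobj(resp, out)
    os.replace(tmp, dest)
    return dest
```

Calling this at the top of setup() and then loading the model from the returned path keeps the download out of predict().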

Fix 8: Rate Limits and Errors

Replicate’s API has rate limits per token:

  • Free tier: limited concurrent predictions.
  • Paid: higher concurrency.

Common errors and handling:

import time

import replicate
from replicate.exceptions import ModelError, ReplicateError

try:
    output = replicate.run("...", input={...})
except ModelError as e:
    # Model itself errored (e.g. invalid prompt, OOM).
    print("Model error:", e)
except ReplicateError as e:
    # API error (rate limit, auth, network).
    if e.status == 429:
        time.sleep(60)
        # Retry
    elif e.status == 401:
        # Bad token
        ...

For retries with exponential backoff:

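import time

import replicate
from replicate.exceptions import ReplicateError

for attempt in range(5):
    try:
        output = replicate.run("...", input={...})
        break
    except ReplicateError as e:
        # Retry only on rate limiting and transient gateway errors.
        if e.status in (429, 502, 503):
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, 8s, 16s
            continue
        raise
else:
    # The loop finished without a successful break.
    raise RuntimeError("Replicate still failing after 5 attempts")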

For production traffic, queue requests on your side (BullMQ, Sidekiq, etc.) and pull at a rate Replicate can handle.
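Within a single Python process, a bounded thread pool gives similar back-pressure to an external queue. This is a minimal sketch: run_bounded is a hypothetical helper, worker is any function you supply (for example one that wraps replicate.run), and the default of 4 is arbitrary rather than a Replicate limit.

```python
from concurrent.futures import ThreadPoolExecutor


def run_bounded(jobs, worker, max_concurrent=4):
    """Run worker(job) for every job, never more than max_concurrent at once."""
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        # pool.map returns results in the same order as jobs.
        return list(pool.map(worker, jobs))
```

This only limits concurrency inside one process. With several app instances, a shared queue such as the ones named above is still the place to enforce a global rate.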

Still Not Working?

A few less-obvious failures:

  • No webhook events received. Replicate sends webhooks at specific lifecycle moments. Check webhook_events_filter. Also verify your webhook endpoint is HTTPS and publicly accessible (no localhost).
  • Output is a URL, not the data. Image/audio/video outputs are URLs to Replicate-hosted files. Download to your storage if you need long-term retention — Replicate’s hosted files may expire.
  • File too large for image upload. Base64 data URLs are only suitable for small inputs (a few MB at most). Use a public URL or the client’s file upload for larger files.
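  • Prediction timed out. Time limits are applied per model. For long-running predictions, check the model’s documentation for its limit; if it is your own model, confirm that slow work such as loading weights happens in setup() rather than predict().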
  • Cog build slow. Each build pushes the full image. Use cog build --use-cuda-base-image and pin dependencies for caching.
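  • Webhook signatures don’t match. Replicate signs the string "<webhook-id>.<webhook-timestamp>.<body>" using the raw request bytes (not re-serialized JSON) and the base64-decoded part of the whsec_ secret, then sends base64 signatures prefixed with "v1," in the webhook-signature header. Use compare_digest for timing-safe comparison, and verify that the secret you sign with is the one Replicate holds for your account.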
  • Streaming events out of order. SSE is in-order at the network level but client parsing may buffer. Use a proper SSE parser.
  • Predictions on shared infrastructure are slow. Cold start. Deploy to a deployment with min_instances >= 1 for predictable latency.

For related ML inference and serving issues, see Modal not working, Cloudflare Workers AI not working, LiteLLM not working, and HuggingFace Transformers not working.

FixDevs

Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.
