
Fix: Modal Not Working — App vs Stub, Image Build, Volumes, GPU Selection, and Cold Starts

FixDevs ·

Quick Answer

How to fix Modal Labs errors — modal.App vs modal.Stub deprecation, image dependencies missing, Volume vs NetworkFileSystem, GPU type mismatch, .remote vs .local invocation, web endpoint URL, and cold start tuning.

The Error

You write a Modal app from an older tutorial, and the Stub syntax fails:

import modal

stub = modal.Stub("my-app")  # DeprecationWarning or AttributeError

@stub.function()
def hello():
    print("hi")

Or modal run fails because the image is missing a dependency:

ModuleNotFoundError: No module named 'transformers'

Or you allocate a GPU and Modal says it’s unavailable:

modal.exception.ResourceExhaustedError: A100 capacity is unavailable in the region

Or .local() runs locally instead of on Modal:

result = my_function.local(...)
# Runs in your terminal, not in the cloud.

Why This Happens

Modal is Python-native serverless: you write Python, decorate it, and Modal handles container build, GPU allocation, scaling, and queues. Most issues map to one of:

  • Stub → App rename. Modal 0.62+ renamed Stub to App. Both still work in transition versions but new code should use App. Older tutorials use Stub.
  • Images are declarative. You build an image with modal.Image.debian_slim().pip_install(...) chains. The image is built remotely; if a dep is missing, you have to add it to the image — not just import locally.
  • Invocation styles. .remote(...) runs on Modal; .local(...) runs in your local Python; .map(...) runs many in parallel on Modal. Mixing them produces confusing results.
  • GPU types differ by availability. gpu="a100" may not be available; gpu="any" lets Modal pick. Pinning a specific GPU can fail at runtime when capacity runs out.

Fix 1: Use modal.App (Not Stub)

import modal

app = modal.App("my-app")  # New name

@app.function()
def hello():
    print("hi")

@app.local_entrypoint()
def main():
    hello.remote()

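Older code that still uses Stub only runs on clients that keep the deprecated alias. Recent clients have removed it, which is where the AttributeError comes from:

# App is the current name:
app = modal.App("my-app")
# Deprecated alias: DeprecationWarning on transition versions, AttributeError once removed:
stub = modal.Stub("my-app")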
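If one codebase has to run on both older and newer clients during a migration, a small shim can pick whichever class the installed client exposes. This is a minimal sketch: make_app is a hypothetical helper, not part of Modal's API, and it takes the module as an argument so it can be tested without Modal installed.

```python
from typing import Any


def make_app(modal_module: Any, name: str) -> Any:
    """Build an app object with whichever class the installed modal client provides.

    Prefers modal.App (the current name) and falls back to modal.Stub only on
    clients old enough not to have App. Hypothetical helper, not a Modal API.
    """
    app_cls = getattr(modal_module, "App", None) or getattr(modal_module, "Stub", None)
    if app_cls is None:
        raise AttributeError("this modal client exposes neither App nor Stub")
    return app_cls(name)
```

Usage: `import modal` then `app = make_app(modal, "my-app")`. Once every environment is on a current client, delete the shim and call modal.App directly.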

Run from CLI:

modal run my_app.py
# or
modal run my_app.py::main   # Specify entry point

For deployment (persistent functions accessible via API):

modal deploy my_app.py

run is one-shot; deploy makes the functions callable from anywhere with API credentials.

Pro Tip: Use modal serve my_app.py during development. It runs the app's web endpoints as a temporary deployment and reloads them automatically when you change the file.

Fix 2: Build the Image With Your Dependencies

Modal runs each function in its own container. The container’s image must include every package your function imports:

import modal

image = (
    modal.Image.debian_slim(python_version="3.12")
    .pip_install("transformers", "torch", "numpy")
    .apt_install("git")
    .env({"MY_VAR": "value"})
)

app = modal.App("my-app", image=image)

@app.function()
def use_transformers():
    from transformers import pipeline  # Now available
    pipe = pipeline("sentiment-analysis")
    return pipe("I love this!")

The image is built remotely the first time you run. Subsequent runs reuse the cached image.
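Modal caches the chain step by step, so changing one step rebuilds that step and every step after it. The toy model below is plain Python, not Modal's actual hashing; it only shows why editing an early step costs more than appending a late one.

```python
import hashlib


def layer_keys(steps: list[str]) -> list[str]:
    """Give each step a cache key that depends on the step AND every step before it."""
    keys, prev = [], ""
    for step in steps:
        prev = hashlib.sha256(f"{prev}\n{step}".encode()).hexdigest()[:12]
        keys.append(prev)
    return keys


def steps_to_rebuild(old: list[str], new: list[str]) -> list[str]:
    """Return the steps of `new` whose cache key is not found at the same position in `old`."""
    old_keys, new_keys = layer_keys(old), layer_keys(new)
    return [
        step
        for i, (step, key) in enumerate(zip(new, new_keys))
        if i >= len(old_keys) or old_keys[i] != key
    ]


base = ['debian_slim(python_version="3.12")', 'apt_install("git")',
        'pip_install("torch")', 'pip_install("transformers")']

# Editing an early step: that step and everything after it rebuild.
edited = [base[0], base[1], 'pip_install("torch==2.3.1")', base[3]]
print(steps_to_rebuild(base, edited))
# ['pip_install("torch==2.3.1")', 'pip_install("transformers")']

# Appending a step: only the new step builds.
print(steps_to_rebuild(base, base + ['env({"MY_VAR": "value"})']))
# ['env({"MY_VAR": "value"})']
```

This is why rarely-changing system packages belong near the top of the chain and frequently-edited pins near the bottom.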

For project-local code:

image = (
    modal.Image.debian_slim()
    .pip_install("transformers")
    .add_local_dir("./my_package", remote_path="/root/my_package")
)

add_local_dir syncs a local directory into the image. For Python source you want to call from a function:

import sys

@app.function(image=image)
def my_func():
    sys.path.append("/root")  # If using add_local_dir
    from my_package import thing
    return thing()

Common Mistake: Installing pip packages outside the image chain:

# Wrong — only installs in your local venv:
# pip install transformers  

# Right — installs in the Modal image:
image = modal.Image.debian_slim().pip_install("transformers")

Fix 3: Pick the Right GPU

Modal supports several GPU types. As of 2025 (pick one gpu= value per function; these are alternatives, not a stack):

@app.function(gpu="any")          # Any available type; Modal picks
@app.function(gpu="T4")           # T4 (16 GB)
@app.function(gpu="L4")           # L4 (24 GB)
@app.function(gpu="A10G")         # A10G (24 GB)
@app.function(gpu="A100")         # A100 (40 GB)
@app.function(gpu="A100-80GB")    # A100 80 GB
@app.function(gpu="H100")         # H100 (80 GB)
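A rough way to shortlist a GPU is to compare the model's weight size (parameter count × bytes per parameter) with each card's memory. The sketch below uses the memory figures from the list above; the 1.2× headroom factor for activations and KV cache is an assumption, and real needs vary with batch size and sequence length.

```python
from typing import Optional

# Memory per GPU type in GB, from the list above, smallest first.
GPU_MEMORY_GB = [
    ("T4", 16), ("L4", 24), ("A10G", 24),
    ("A100", 40), ("A100-80GB", 80), ("H100", 80),
]


def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weight size in GB: (params_billion * 1e9) * bytes_per_param / 1e9."""
    return params_billion * bytes_per_param


def smallest_fitting_gpu(params_billion: float, bytes_per_param: float = 2.0,
                         headroom: float = 1.2) -> Optional[str]:
    """First (smallest) GPU whose memory covers weights x headroom, or None if none fits."""
    needed = weights_gb(params_billion, bytes_per_param) * headroom
    for name, mem in GPU_MEMORY_GB:
        if mem >= needed:
            return name
    return None  # Too big for one card: consider multi-GPU or quantization.
```

For an 8B-parameter model in 16-bit precision, smallest_fitting_gpu(8) returns "L4" (about 19 GB needed with headroom); a 70B model returns None, which is the signal to look at multi-GPU.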

For multi-GPU (newer clients also accept the string form gpu="A100:4"):

@app.function(gpu=modal.gpu.A100(count=4))
def train():
    # 4x A100 in one container
    pass

If your preferred GPU isn’t available (ResourceExhaustedError), Modal will queue or fail. Two strategies:

# Fallback list (Modal tries in order):
@app.function(gpu=["H100", "A100-80GB", "A100"])
def train():
    pass

# Or just "any" and detect inside:
@app.function(gpu="any")
def train():
    import torch
    print(torch.cuda.get_device_name(0))

Pro Tip: Don't pin to H100 unless you need what it adds (for example, FP8 support or higher memory bandwidth). T4, L4, and A10G cost less and are usually easier to get.

Fix 4: .remote(), .local(), .map()

@app.function()
def square(x: int) -> int:
    return x * x

@app.local_entrypoint()
def main():
    # Runs on Modal:
    result = square.remote(5)
    print(result)  # 25

    # Runs locally in your terminal:
    result = square.local(5)
    print(result)  # 25, but didn't use Modal

    # Runs many in parallel on Modal:
    results = list(square.map(range(10)))
    print(results)  # [0, 1, 4, 9, ..., 81]

  • .remote(): one call on Modal; returns the result.
  • .local(): runs in your local Python process (useful for testing).
  • .map(): parallel batch on Modal; returns a generator of results.
  • .spawn(): fire-and-forget; returns a FunctionCall handle you can poll later (for example with .get()).

For thousands of parallel calls:

results = list(square.map(range(10000), order_outputs=False))

order_outputs=False lets results return as they finish (faster). Default is True (matches input order, slower for skewed durations).
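The trade-off is the same one Python's concurrent.futures draws between Executor.map (input order) and as_completed (completion order). Here is a local analogy, with a thread pool standing in for Modal containers; it is not how Modal schedules work.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def slow_square(x: int) -> int:
    # Sleep time varies with x to mimic tasks of uneven duration.
    time.sleep(0.01 * (5 - x % 5))
    return x * x


def in_input_order(xs):
    # Like order_outputs=True: results arrive in input order,
    # so a slow early task delays delivery of later results.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(slow_square, xs))


def in_completion_order(xs):
    # Like order_outputs=False: each result is yielded as soon as it finishes.
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(slow_square, x) for x in xs]
        return [f.result() for f in as_completed(futures)]


print(in_input_order(range(10)))  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
print(in_completion_order(range(10)))  # same values, order depends on timing
```

With unordered results you lose the position-to-input mapping, so if you need it, have the function return the input alongside the output (for example, an (x, x * x) tuple).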

Common Mistake: Calling square(5) directly, without .remote() or .local(). Depending on the client version this either runs the bare function locally or raises an error; either way it does not run on Modal, so spell out .remote() or .local().

Fix 5: Volumes for Persistent Storage

Containers are ephemeral. For persistent data, use modal.Volume:

volume = modal.Volume.from_name("my-vol", create_if_missing=True)

@app.function(volumes={"/data": volume})
def write_data():
    with open("/data/file.txt", "w") as f:
        f.write("hello")
    volume.commit()  # Persist changes

@app.function(volumes={"/data": volume})
def read_data():
    volume.reload()  # Get latest state
    with open("/data/file.txt") as f:
        return f.read()

Two important calls:

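  • volume.commit(): publishes your writes so other containers can see them. Modal also commits in the background and when the container shuts down, but an explicit commit() is the reliable way to make data visible before another function reads it.
  • volume.reload(): pulls the latest committed state before reading. Without it, a running container may keep reading its older snapshot.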
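A toy in-memory model makes the two calls concrete: each container works on its own copy, commit() publishes that copy, and reload() picks up what others published. This is an illustration only; Modal's real Volumes also commit in the background and on shutdown, and this model ignores deletes and concurrent-write conflicts.

```python
from typing import Optional


class ToyVolume:
    """Shared store holding the last committed snapshot of every file."""

    def __init__(self) -> None:
        self.committed: dict[str, str] = {}


class ContainerView:
    """One container's working copy, snapshotted from the volume at mount time."""

    def __init__(self, volume: ToyVolume) -> None:
        self.volume = volume
        self.files = dict(volume.committed)  # snapshot at "mount"

    def write(self, path: str, data: str) -> None:
        self.files[path] = data  # local only until commit()

    def read(self, path: str) -> Optional[str]:
        return self.files.get(path)

    def commit(self) -> None:
        self.volume.committed.update(self.files)  # publish local writes (toy: no deletes)

    def reload(self) -> None:
        self.files = dict(self.volume.committed)  # replace local view with latest snapshot


vol = ToyVolume()
writer, reader = ContainerView(vol), ContainerView(vol)
writer.write("/data/file.txt", "hello")
print(reader.read("/data/file.txt"))  # None: not committed yet
writer.commit()
print(reader.read("/data/file.txt"))  # None: reader still holds its old snapshot
reader.reload()
print(reader.read("/data/file.txt"))  # hello
```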

For caching model weights:

weights_vol = modal.Volume.from_name("model-weights", create_if_missing=True)

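@app.function(
    volumes={"/cache": weights_vol},
    image=image,
    gpu="A10G",  # Example choice: 24 GB fits 16-bit 8B weights
)
def inference(prompt: str):
    import os
    os.environ["HF_HOME"] = "/cache/huggingface"
    # The first call downloads roughly 16 GB of 16-bit weights to /cache; later calls reuse them.
    from transformers import pipeline
    # Llama 3 is gated on Hugging Face: attach a Secret that provides HF_TOKEN.
    pipe = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B")
    weights_vol.commit()  # Publish the downloaded weights to other containers
    return pipe(prompt)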

The first inference downloads the model; subsequent inferences (even in fresh containers) reuse the Volume.
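To confirm a warm cache from inside the function (for example, to log whether a call downloaded weights), you can check for the repository's folder. This sketch assumes huggingface_hub's default cache layout, $HF_HOME/hub/models--{org}--{name}/snapshots/...; hub_repo_dir and is_cached are hypothetical helpers, not Hugging Face APIs.

```python
from pathlib import Path


def hub_repo_dir(hf_home: str, repo_id: str) -> Path:
    """Where huggingface_hub caches a model repo, e.g. hub/models--meta-llama--Meta-Llama-3-8B."""
    return Path(hf_home) / "hub" / ("models--" + repo_id.replace("/", "--"))


def is_cached(hf_home: str, repo_id: str) -> bool:
    """True if at least one snapshot folder for the repo exists under hf_home."""
    snapshots = hub_repo_dir(hf_home, repo_id) / "snapshots"
    return snapshots.is_dir() and any(snapshots.iterdir())
```

Inside inference, a line such as print("cache hit" if is_cached("/cache/huggingface", "meta-llama/Meta-Llama-3-8B") else "downloading") before building the pipeline shows which calls paid for the download. A partially downloaded snapshot also counts as present, so treat this as a hint, not a guarantee.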

Pro Tip: Use a separate Volume for each large dataset/model. Volumes have per-Volume read/write caches, so isolating them gives the best cold-start times.

Fix 6: Web Endpoints

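Expose a function as an HTTPS endpoint. The function's image must include FastAPI (for example, .pip_install("fastapi[standard]")), and newer Modal releases rename @modal.web_endpoint to @modal.fastapi_endpoint: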

@app.function(image=image)
@modal.web_endpoint(method="POST")
def predict(payload: dict):
    result = my_model.predict(payload["input"])
    return {"result": result}

After modal deploy, the URL is in the deploy output. Or via CLI:

modal app list
modal app show my-app

For FastAPI integration (more powerful):

from fastapi import FastAPI

web_app = FastAPI()

@web_app.post("/predict")
def predict(payload: dict):
    return {"result": my_model.predict(payload["input"])}

@app.function(image=image)
@modal.asgi_app()
def fastapi_app():
    return web_app

This exposes the FastAPI app at one Modal-assigned URL. All routes work as in any FastAPI deploy.

Common Mistake: Calling a protected endpoint without the right headers. Modal web endpoints are public by default, or you can require a proxy auth token:

@app.function(image=image)
@modal.web_endpoint(method="POST", requires_proxy_auth=True)
def secure_endpoint(payload: dict): ...

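requires_proxy_auth=True makes Modal's edge reject any request that doesn't carry a valid proxy auth token (created in the dashboard) in its Modal-Key and Modal-Secret headers.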
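On the caller side, the token goes in request headers. Below is a standard-library sketch that assumes the Modal-Key / Modal-Secret header names from Modal's proxy-auth docs (verify them against your client's documentation); the URL and token values are placeholders.

```python
import json
import urllib.request
from typing import Optional


def build_request(url: str, payload: dict, token_id: Optional[str] = None,
                  token_secret: Optional[str] = None) -> urllib.request.Request:
    """Build (but don't send) a JSON POST to a Modal web endpoint."""
    headers = {"Content-Type": "application/json"}
    if token_id and token_secret:
        # Proxy-auth credentials; header names assumed from Modal's docs.
        headers["Modal-Key"] = token_id
        headers["Modal-Secret"] = token_secret
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Usage (placeholders, not a real endpoint or token):
# req = build_request("https://YOUR-ENDPOINT-URL.modal.run", {"input": "hello"},
#                     token_id="YOUR_TOKEN_ID", token_secret="YOUR_TOKEN_SECRET")
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

Keep the token values in environment variables or your own secret store rather than in source.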

Fix 7: Secrets

Don’t put API keys in your code. Use Modal Secrets:

modal secret create openai-secret OPENAI_API_KEY=sk-...

Or via the dashboard.

Reference in functions:

@app.function(
    image=image,
    secrets=[modal.Secret.from_name("openai-secret")],
)
def call_openai(prompt: str):
    import os
    from openai import OpenAI
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    return client.chat.completions.create(...)

The secret's environment variables are injected at runtime; they aren't baked into your image and won't appear in logs unless your code prints them.
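If the secret isn't attached to the function (or a key name is misspelled), os.environ["OPENAI_API_KEY"] fails with a bare KeyError. A small fail-fast check names every missing variable at once; require_env is a hypothetical helper sketched here, not part of Modal.

```python
import os
from typing import Iterable, Mapping, Optional


def require_env(names: Iterable[str],
                env: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Return the requested variables, or raise listing every one that is missing or empty."""
    source = os.environ if env is None else env
    names = list(names)
    missing = [n for n in names if not source.get(n)]  # empty strings count as missing
    if missing:
        raise RuntimeError(
            "missing environment variables: " + ", ".join(missing)
            + " (is the Modal Secret attached to this function?)"
        )
    return {n: source[n] for n in names}
```

Calling keys = require_env(["OPENAI_API_KEY"]) at the top of call_openai turns a missing secret into a readable error in the function logs.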

For multiple secrets:

secrets=[
    modal.Secret.from_name("openai-secret"),
    modal.Secret.from_name("anthropic-secret"),
    modal.Secret.from_name("aws-creds"),
]

For dynamic secrets (per-deployment overrides):

import os

secret = modal.Secret.from_dict({"DB_URL": os.environ["LOCAL_DB_URL"]})

from_dict is fine for development, but never hard-code secret values this way in committed code.

Fix 8: Cold Starts and container_idle_timeout

A cold start means pulling the image, booting the container, and running any startup work (imports, model loading) before your function runs. For latency-sensitive workloads, keep containers warm:

@app.function(
    image=image,
    gpu="A10G",
    container_idle_timeout=300,  # Keep idle containers alive for 5 minutes
    allow_concurrent_inputs=10,  # One container handles up to 10 concurrent calls
    keep_warm=1,                 # Always have 1 container ready
)
def predict(payload: dict):
    ...

Three controls:

  • container_idle_timeout — how long a container sits idle before being killed.
  • allow_concurrent_inputs — concurrent requests per container. Higher means fewer cold starts but more memory pressure.
  • keep_warm: number of containers kept running at all times. You pay for them even when idle, but requests they serve skip the cold start; traffic beyond their capacity still triggers cold starts.

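For GPU functions, even keep_warm=1 is expensive, because you pay for an idle GPU around the clock. Use it for production endpoints; for batch jobs, accept cold starts. Newer Modal releases rename these options (scaledown_window, min_containers, and a separate @modal.concurrent decorator), so check which names your client version accepts.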
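The idle cost is simple arithmetic: a warm container is billed for all the time it runs, whether or not it serves requests. A back-of-the-envelope sketch; the hourly rate is a made-up placeholder, so substitute the current price for your GPU from Modal's pricing page.

```python
HOURS_PER_MONTH = 730  # 365 days * 24 h / 12 months


def idle_cost_per_month(hourly_rate_usd: float, warm_containers: int = 1) -> float:
    """Baseline monthly cost of keeping containers warm, before any request traffic."""
    return hourly_rate_usd * warm_containers * HOURS_PER_MONTH


# Hypothetical rate, NOT a quoted Modal price:
PLACEHOLDER_GPU_RATE = 1.00  # USD per GPU-hour
print(idle_cost_per_month(PLACEHOLDER_GPU_RATE))  # 730.0
print(idle_cost_per_month(PLACEHOLDER_GPU_RATE, warm_containers=2))  # 1460.0
```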

Pro Tip: Measure cold-start time by timing the first call after modal deploy (or after containers have scaled down), then compare it with the next call, which lands on a warm container.
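A small timing helper makes the cold/warm gap visible. It works with any callable; in a local_entrypoint you would pass something like predict.remote, using the predict function defined above. time_calls is a hypothetical helper, not a Modal API.

```python
import time
from typing import Any, Callable


def time_calls(fn: Callable[..., Any], *args: Any, n: int = 3) -> list[float]:
    """Call fn(*args) n times in a row and return each call's wall-clock latency in seconds."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        fn(*args)
        latencies.append(time.perf_counter() - start)
    return latencies


# Example inside a local_entrypoint (assumes the `predict` function defined above):
# lat = time_calls(predict.remote, {"input": "hello"})
# print(f"first (likely cold): {lat[0]:.2f}s, later (warm): {min(lat[1:]):.2f}s")
```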

Still Not Working?

A few less-obvious failures:

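  • ImportError despite pip_install. First check that the function actually uses the image with that package (a function's own image= overrides the App's). Also note that Modal caches each chain step as a layer: changing a step rebuilds it and every step after it, so order steps from least-changing to most-changing.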
  • Function times out after a few minutes. The default timeout is 300 seconds (5 minutes). Raise it with @app.function(timeout=3600) (1 hour); Modal documents 24 hours as the maximum.
  • GPU function runs on CPU. The decorator has no gpu= argument; it must include gpu="any" or a specific type.
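  • add_local_dir not picking up changes. Local files are uploaded when the app starts, so a deployed app keeps the copy from its last deploy. Redeploy (or rerun modal run) after editing; with copy=True the files are baked into the image instead, and that layer rebuilds when they change.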
  • Volumes diverge across regions. Volumes are region-scoped. Functions in different regions reading the same Volume name access different physical volumes. Pin function region or use a different storage backend.
  • modal token set fails in CI. CI runners should supply credentials through the MODAL_TOKEN_ID and MODAL_TOKEN_SECRET environment variables instead. Generate a token from the dashboard and store both values as CI secrets.
  • modal.exception.InvalidError: function does not exist. Either the function isn't deployed yet (run modal deploy) or the name in the lookup doesn't match. Use exact names, for example modal.Function.from_name("my-app", "my-func") (older clients: Function.lookup("my-app", "my-func")).
  • @stub.cls no longer works. Classes moved to @app.cls() along with the Stub → App rename; update the decorator.

For related Python deployment and ML serving issues, see vLLM not working, AWS Lambda timeout, PyTorch not working, and Docker no space left on device.


FixDevs

Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.

