Fix: Weights & Biases (wandb) Not Working — Login Errors, Init Hangs, and Sync Failures

FixDevs · (Updated: )

Quick Answer

How to fix common wandb errors: API key not set, login failures, wandb.init() hangs, offline-mode sync, artifact upload failures, runs not showing in the dashboard, image logging size limits, and sweep agents not starting.

The Error

You try to log in and wandb complains about missing credentials:

wandb: ERROR Unable to authenticate with W&B. Please check your API key.
wandb.errors.UsageError: api_key not configured

Or wandb.init() hangs forever without connecting:

import wandb
wandb.init(project="my-project")
# Hangs for 5+ minutes, no output

Or training runs succeed but don’t appear in the dashboard:

wandb: Run history: (empty)
wandb: Run summary: (empty)

Or you try to log a large image and get a size error:

wandb: WARNING Media of type images exceeded 100MB per-step limit

Or hyperparameter sweeps start but agents don’t pick up runs:

wandb: ERROR sweep agent could not connect to sweep controller

wandb is a hosted MLOps platform with a Python SDK that communicates with its backend over HTTPS. When the SDK can’t reach the backend, can’t authenticate, or encounters a version/environment mismatch, it fails in specific ways. This guide covers each failure mode.

Why This Happens

wandb runs two processes per training job: the main Python process calling wandb.log(), and a background process that uploads data to the W&B backend. When wandb.init() is called, the SDK spawns the background process and opens a connection to the backend. If the process can’t start (missing dependencies, permission issues) or the connection can’t be made (firewall, wrong API key), the SDK may hang for a long time before raising.

Authentication requires an API key from wandb.ai. The SDK reads it from ~/.netrc or the WANDB_API_KEY environment variable, and otherwise prompts interactively, but only when running in a TTY. In CI or Docker there is no TTY, so no prompt appears and the run fails with the api_key not configured error shown above.
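
The lookup described above can be checked before wandb.init() is ever called. The following is a minimal sketch using only the standard library; the function name find_wandb_credentials is hypothetical, and the netrc host api.wandb.ai is an assumption that holds for the hosted service but not for self-hosted instances.

```python
import netrc
import os
from pathlib import Path


def find_wandb_credentials(env=None, netrc_path=None, host="api.wandb.ai"):
    """Return "env", "netrc", or None depending on where an API key was found.

    Hypothetical helper: `host` is an assumption for the hosted service.
    """
    env = os.environ if env is None else env
    if env.get("WANDB_API_KEY"):
        return "env"
    path = Path(netrc_path) if netrc_path else Path.home() / ".netrc"
    try:
        auth = netrc.netrc(str(path)).authenticators(host)
    except (OSError, netrc.NetrcParseError):
        # Missing, unreadable, or malformed netrc file: treat as "no key"
        return None
    # authenticators() returns (login, account, password); the key is the password
    return "netrc" if auth and auth[2] else None
```

Calling this at the top of a CI entry point and exiting with a clear message when it returns None turns a confusing authentication failure into an explicit one.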

Diagnostic Timeline: “My Run Isn’t in the Dashboard”

The first instinct is to re-run wandb.init() with a fresh run name. Don’t. The run almost always succeeded — it just landed somewhere you are not looking. Here is the actual triage path.

Minute 0 — Capture the run URL the SDK printed. When wandb.init() returns, it prints wandb: 🚀 View run at https://wandb.ai/<entity>/<project>/runs/<id>. Copy that URL exactly. If the URL loads and shows metrics, the run is fine and your dashboard filter is hiding it (most common: filtered by tag, group, or job_type from a previous session).

Minute 1 — Check WANDB_MODE. Run import os; print(os.environ.get("WANDB_MODE")). If it returns "offline" or "disabled", your run never uploaded. Offline mode writes to ./wandb/offline-run-*/ and waits for an explicit wandb sync. Disabled mode does nothing. Both are easy to set accidentally — a teammate’s .env file or a CI job that exports WANDB_MODE=offline for tests will leak into your runs.

Minute 2 — Verify entity vs project mismatch. Open the URL and check whether the <entity> is your username or a team name. If you intended to log to a team project but the SDK fell back to your personal account, the project exists in two places — and you are watching the wrong one. Pass entity="<team>" explicitly to wandb.init().

Minute 4 — Check distributed run grouping. Under PyTorch DDP, only rank 0 should call wandb.init() without group; ranks > 0 should either skip wandb or join the same group. If every rank logs as a separate run, you see one short run per rank (four on a 4-GPU job) instead of a single run, and metrics like train_loss look truncated.

Minute 7 — Confirm wandb.finish() was called. Without it, the sync process may exit before the last wandb.log() calls flush. Tail metrics vanish, the run appears “stuck running” in the UI for an hour, and the last epoch numbers never appear.

The first guess (“re-init the run”) is usually wrong. Real causes: WANDB_MODE=offline left over from another script, entity mismatch sending the run to the wrong project, or DDP launching N parallel runs that look like one missing run.
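
The environment checks from the timeline can be bundled into one preflight call. This is a sketch, not part of the wandb SDK: wandb_preflight and its expected_entity parameter are hypothetical names, and it only inspects the WANDB_MODE and WANDB_ENTITY environment variables, so an entity passed directly to wandb.init() is not covered.

```python
import os


def wandb_preflight(env=None, expected_entity=None):
    """Return a list of human-readable warnings about the wandb environment."""
    env = os.environ if env is None else env
    warnings = []

    mode = env.get("WANDB_MODE")
    if mode == "offline":
        warnings.append("WANDB_MODE=offline: run data stays in ./wandb until `wandb sync` is run")
    elif mode == "disabled":
        warnings.append("WANDB_MODE=disabled: wandb calls are no-ops, nothing is recorded")

    # Only compare entities when the caller says which one they expect
    entity = env.get("WANDB_ENTITY")
    if expected_entity and entity != expected_entity:
        warnings.append(f"WANDB_ENTITY={entity!r}, expected {expected_entity!r}")
    return warnings
```

Printing the returned list at the start of a training script makes a leaked WANDB_MODE visible in the first lines of the log rather than an hour later in an empty dashboard.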

Fix 1: API Key and Login

wandb: ERROR Unable to authenticate with W&B. Please check your API key.

Get your API key from wandb.ai/authorize. Three ways to provide it:

Method 1: Interactive login (dev machine, one-time setup):

wandb login
# Prompts for your API key and saves to ~/.netrc

Method 2: Environment variable (CI, Docker, cloud):

export WANDB_API_KEY=your-key-here
python train.py
# In a script: fail fast if the key is missing (never hardcode a fallback key)
import os
if not os.environ.get("WANDB_API_KEY"):
    raise RuntimeError("WANDB_API_KEY is not set")

import wandb
wandb.init(project="my-project")

Method 3: Explicit login in Python:

import wandb
wandb.login(key="your-key-here")
wandb.init(project="my-project")

Check current login status:

import wandb
api = wandb.Api()
print(api.viewer.username)   # Prints your username if logged in

In Docker/Kubernetes, pass the key at runtime (for example from a secret) instead of baking it into the image:

# Dockerfile
FROM python:3.12
RUN pip install wandb
# Never bake the key into the image; pass it at runtime
CMD ["python", "train.py"]

# Shell
docker run -e WANDB_API_KEY=xxx my-image

Common Mistake: Committing ~/.netrc or a script with a hardcoded API key to a public repo. W&B keys grant access to all your projects. If you accidentally commit one, rotate it immediately at wandb.ai/settings → Danger Zone → Reset API key.

Fix 2: wandb.init() Hangs

The SDK tries to sync immediately. If the network is slow, the backend is unreachable, or disk I/O is slow, init can hang for minutes.

Set a timeout:

import wandb

wandb.init(
    project="my-project",
    settings=wandb.Settings(
        init_timeout=60,   # Give up after 60 seconds (default: 90)
    ),
)

Use offline mode when internet is unreliable:

import wandb

# Environment variable
import os
os.environ["WANDB_MODE"] = "offline"

wandb.init(project="my-project")
# Runs locally, saves to ./wandb/ directory

# Later, sync manually
# wandb sync ./wandb/offline-run-YYYYMMDD_HHMMSS-xxxxx/
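
The timeout and offline mode can be combined so that a failed connection degrades to local logging instead of stopping the job. This is a minimal sketch under stated assumptions: init_with_offline_fallback is a hypothetical wrapper, it takes the init function as an argument so it can be exercised without a network, and it catches Exception broadly because the exact exception raised on a timeout may differ between SDK versions.

```python
def init_with_offline_fallback(init_fn, **init_kwargs):
    """Try a normal init; on any failure, retry the same call in offline mode."""
    try:
        return init_fn(**init_kwargs)
    except Exception as exc:  # broad on purpose: the exception type varies by version
        print(f"wandb init failed ({exc!r}); falling back to offline mode")
        # Override any caller-supplied mode rather than passing it twice
        return init_fn(**{**init_kwargs, "mode": "offline"})


# Intended use (requires wandb to be installed):
#   import wandb
#   run = init_with_offline_fallback(wandb.init, project="my-project")
```

Runs started through the fallback still need a later wandb sync, so it is worth logging which path was taken.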

Disable wandb entirely for debug runs:

export WANDB_MODE=disabled
# wandb calls become no-ops — no network, no disk
python train.py
# Or per-run, from Python (set before wandb.init() is called)
import os
os.environ["WANDB_MODE"] = "disabled"

Check network connectivity to wandb:

curl -I https://api.wandb.ai
# Expected: HTTP/2 200

If this fails, your firewall is blocking wandb. Corporate networks often need:

# Proxy environment variables
export HTTPS_PROXY=http://proxy.company.com:8080
export HTTP_PROXY=http://proxy.company.com:8080

# Or set the wandb base URL if your org has a self-hosted instance
export WANDB_BASE_URL=https://wandb.internal.company.com

Fix 3: Runs Don’t Appear in Dashboard

You log metrics but nothing shows in the wandb UI.

Step 1: Verify the run actually started:

import wandb

run = wandb.init(project="my-project")
print(f"Run URL: {run.url}")
# https://wandb.ai/your-username/my-project/runs/abc123

Copy that URL — if the page loads with “Run not found”, the init failed despite no error.

Step 2: Verify metrics are logged with wandb.log():

import wandb

wandb.init(project="my-project")

for step in range(100):
    loss = compute_loss()
    wandb.log({"loss": loss, "step": step})   # Must be a dict
    # wandb.log(loss)   # WRONG — must be a dict

wandb.finish()   # Explicitly close the run

wandb.finish() is important at the end of your script. Without it, the sync process may not flush the last metrics before Python exits.
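
One way to make sure the run is always closed is to put finish() in a finally block. The sketch below assumes a run object exposing log(data, step=...) and finish(), which matches the object returned by wandb.init(); RecordingRun is a hypothetical stand-in used only to dry-run the loop without a network.

```python
def log_losses(run, losses):
    """Log each loss to `run` and always call finish(), even if logging raises."""
    try:
        for step, loss in enumerate(losses):
            run.log({"loss": float(loss)}, step=step)
    finally:
        run.finish()


class RecordingRun:
    """Hypothetical stand-in for a wandb run; records calls instead of uploading."""

    def __init__(self):
        self.logged = []
        self.finished = False

    def log(self, data, step=None):
        self.logged.append((step, data))

    def finish(self):
        self.finished = True


# Intended use (requires wandb to be installed):
#   import wandb
#   log_losses(wandb.init(project="my-project"), [0.9, 0.7, 0.5])
```

The run object can also be used as a context manager (with wandb.init(...) as run:), which closes the run on exit in the same way.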

Step 3: Check the project and entity match your account:

wandb.init(
    project="my-project",
    entity="my-username",   # Or team name — defaults to your personal account
)

If entity is wrong (e.g., you’re trying to log to a team you’re not on), init fails silently in some versions.

Step 4: Check for offline mode accidentally enabled:

import os
print(os.environ.get("WANDB_MODE"))   # Should be None or "online"

If it’s “offline” or “disabled”, unset it:

unset WANDB_MODE

Fix 4: Logging Different Data Types

wandb logs more than scalars — but each type has constraints.

Scalars:

wandb.log({"loss": 0.23, "accuracy": 0.94, "lr": 1e-4})

Images:

import wandb
from PIL import Image
import numpy as np

# PIL Image
wandb.log({"example": wandb.Image(pil_image, caption="Ground truth")})

# NumPy array (H, W, C) uint8
img_arr = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)   # upper bound is exclusive
wandb.log({"random": wandb.Image(img_arr)})

# File path
wandb.log({"saved": wandb.Image("plot.png")})

# Multiple images in one log call
wandb.log({
    "examples": [wandb.Image(img, caption=f"Sample {i}") for i, img in enumerate(images)]
})

Tables (for complex data):

import wandb
import pandas as pd

df = pd.DataFrame({
    "image": [wandb.Image(img) for img in images],
    "label": labels,
    "prediction": preds,
    "correct": [l == p for l, p in zip(labels, preds)],
})

table = wandb.Table(dataframe=df)
wandb.log({"results": table})

Media size limits:

  • 100 MB per-step for images/video/audio (hard limit)
  • 10,000 rows per table (performance warning above this)
  • Artifact size limits vary by plan

Pro Tip: Log images at training resolution, not raw sensor size. A 4K image at every step pushes you over the 100MB limit fast. Resize to the actual input size your model sees (224x224 for ImageNet, 512x512 for segmentation) before logging. You can always log a few full-resolution samples separately as an artifact.
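
The resize step in the tip above needs a target size that keeps the aspect ratio. A small sketch of that calculation follows; fit_within is a hypothetical helper, and the 512-pixel limit in the usage comment is only an example value.

```python
def fit_within(width, height, max_side):
    """Scale (width, height) down so the longer side equals max_side, keeping aspect ratio."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough; never upscale
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


# Intended use with Pillow (PIL sizes are (width, height)):
#   small = pil_image.resize(fit_within(*pil_image.size, 512))
#   wandb.log({"example": wandb.Image(small)})
```

A 3840x2160 frame comes out as 512x288, roughly 56 times fewer pixels per logged image.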

Custom x-axis:

# By default, wandb plots against the log step number
wandb.log({"loss": 0.5})            # Step 0
wandb.log({"loss": 0.4})            # Step 1

# Custom step from your training loop (use instead of the implicit counter; steps must not go backwards)
for epoch in range(100):
    wandb.log({"loss": loss, "epoch": epoch}, step=epoch)

# Define a metric's x-axis explicitly
wandb.define_metric("epoch")
wandb.define_metric("train_loss", step_metric="epoch")
wandb.log({"epoch": 5, "train_loss": 0.2})
# train_loss plots against epoch, not the internal step counter

Fix 5: Artifacts — Upload and Download

Artifacts are versioned datasets, models, and other files tracked by wandb.

Upload a model as an artifact:

import wandb

wandb.init(project="my-project")

# Log a model file
artifact = wandb.Artifact(
    name="my-model",
    type="model",
    description="Best checkpoint from epoch 42",
    metadata={"accuracy": 0.94, "dataset": "imagenet-v2"},
)
artifact.add_file("model.pt")
wandb.log_artifact(artifact)

wandb.finish()

Download an artifact in a later run:

import wandb

run = wandb.init(project="my-project")

# Download the latest version
artifact = run.use_artifact("my-model:latest")
artifact_dir = artifact.download()   # Returns local path
print(f"Downloaded to: {artifact_dir}")

# Specific version
artifact = run.use_artifact("my-model:v3")

# From another project or entity
artifact = run.use_artifact("other-team/other-project/my-model:latest")

Aliases for easy version management:

# Promote a version to a named alias
artifact = run.use_artifact("my-model:v42")
artifact.aliases.append("production")
artifact.save()

# Later, load by alias
artifact = run.use_artifact("my-model:production")

Avoid re-uploading identical files — wandb deduplicates by content hash:

artifact = wandb.Artifact("dataset", type="dataset")
artifact.add_dir("./data")   # Uploads only new/changed files
wandb.log_artifact(artifact)
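
To illustrate the idea behind content-hash deduplication, the sketch below builds a digest manifest for a directory and compares two manifests. It is an illustration of the concept only, not wandb's implementation: the function names are hypothetical, and the hash algorithm and manifest format wandb actually uses may differ.

```python
import hashlib
from pathlib import Path


def content_manifest(root):
    """Map each file's relative path to the SHA-256 hex digest of its bytes."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()  # read_bytes() loads the whole file; fine for a sketch
    }


def changed_files(old, new):
    """Return paths that are new or whose digest differs from the old manifest."""
    return sorted(path for path, digest in new.items() if old.get(path) != digest)
```

Files whose digest is unchanged between two versions are the ones a content-addressed store does not need to upload again.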

Fix 6: Offline Mode and Syncing

For training on air-gapped clusters or when you want to decouple training from upload:

import os
os.environ["WANDB_MODE"] = "offline"

import wandb
wandb.init(project="my-project")
# Metrics save locally to ./wandb/offline-run-*/

Sync manually when network becomes available:

# Sync a specific run
wandb sync ./wandb/offline-run-20250409_143022-abc123

# Sync all offline runs
wandb sync ./wandb/

# Sync and delete local copy after upload
wandb sync --clean ./wandb/
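
Before syncing, it can help to see which offline run directories exist. The sketch below lists them using the offline-run-* naming shown above; pending_offline_runs is a hypothetical helper, and it only lists directories, so it cannot tell whether a given run has already been synced.

```python
from pathlib import Path


def pending_offline_runs(wandb_dir="./wandb"):
    """Return offline run directories under wandb_dir, sorted by name."""
    root = Path(wandb_dir)
    if not root.is_dir():
        return []
    # Directory names embed a timestamp, so name order is also chronological
    return sorted(p for p in root.glob("offline-run-*") if p.is_dir())


# Intended use:
#   for run_dir in pending_offline_runs():
#       print("wandb sync", run_dir)
```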

Dryrun mode (a legacy alias for offline mode; prefer "offline" in new scripts):

os.environ["WANDB_MODE"] = "dryrun"   # Behaves like offline: saves locally, no upload until wandb sync

For MLflow-style experiment tracking with similar offline patterns, see MLflow not working.

Fix 7: Hyperparameter Sweeps

Sweeps run hyperparameter searches across multiple workers. Configuration is stored on the wandb backend; agents poll for trials to run.

Define the sweep:

sweep_config = {
    "method": "bayes",   # random, grid, or bayes
    "metric": {
        "name": "val_loss",
        "goal": "minimize",
    },
    "parameters": {
        "learning_rate": {"distribution": "log_uniform_values", "min": 1e-5, "max": 1e-1},
        "batch_size": {"values": [16, 32, 64, 128]},
        "optimizer": {"values": ["adam", "sgd"]},
        "dropout": {"distribution": "uniform", "min": 0.0, "max": 0.5},
    },
}

sweep_id = wandb.sweep(sweep_config, project="my-project")
print(f"Sweep ID: {sweep_id}")

Training function:

def train():
    wandb.init()
    config = wandb.config   # Hyperparameters chosen by the sweep controller

    model = build_model(
        lr=config.learning_rate,
        batch_size=config.batch_size,
        optimizer=config.optimizer,
        dropout=config.dropout,
    )

    for epoch in range(10):
        train_loss = train_one_epoch(model)
        val_loss = validate(model)
        wandb.log({"train_loss": train_loss, "val_loss": val_loss, "epoch": epoch})

    wandb.finish()

Run the agent — each agent picks up trials from the sweep:

# Run 10 trials
wandb.agent(sweep_id, function=train, count=10)

# Or from CLI
# wandb agent your-username/my-project/sweep-id

Multiple agents for parallelism — run the agent command on different machines, all pointing at the same sweep_id:

# Machine 1
wandb agent your-username/my-project/sweep-id

# Machine 2 (same sweep)
wandb agent your-username/my-project/sweep-id

Sweep agent won’t start — usually a connection issue. Check:

wandb verify   # Checks network and auth

For Optuna-based hyperparameter tuning as an alternative, see Optuna not working.

Fix 8: Framework Integrations

wandb integrates with PyTorch, TensorFlow, Lightning, Transformers, and sklearn.

PyTorch Lightning:

from pytorch_lightning.loggers import WandbLogger
import pytorch_lightning as pl

wandb_logger = WandbLogger(project="my-project", log_model="all")

trainer = pl.Trainer(
    logger=wandb_logger,
    max_epochs=10,
)
trainer.fit(model, train_loader, val_loader)
# wandb_logger auto-captures all pl.Trainer metrics

HuggingFace Transformers:

import os
os.environ["WANDB_PROJECT"] = "my-project"

from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    output_dir="./output",
    report_to="wandb",   # Auto-logs to wandb
    run_name="experiment-1",
    num_train_epochs=3,
    logging_steps=10,
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds)
trainer.train()

For HuggingFace-specific issues like HF_TOKEN authentication and model loading, see HuggingFace Transformers not working.

PyTorch (manual):

import wandb
import torch

wandb.init(project="my-project")
wandb.watch(model, log="all", log_freq=100)   # Track gradients and weights

for epoch in range(epochs):
    for batch in train_loader:
        optimizer.zero_grad()
        loss = model(batch).loss
        loss.backward()
        optimizer.step()
        wandb.log({"loss": loss.item()})

wandb.finish()

For PyTorch gradient logging and CUDA memory issues that affect wandb tracking, see PyTorch not working.

Still Not Working?

Self-Hosted wandb

For enterprise / on-premise deployments:

export WANDB_BASE_URL=https://wandb.internal.company.com
export WANDB_API_KEY=key-from-your-internal-instance

Resuming a Run After a Crash

import wandb

# Resume by run ID
run = wandb.init(project="my-project", id="abc123", resume="must")

# Resume if exists, create new if not
run = wandb.init(project="my-project", id="abc123", resume="allow")
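
Resuming needs the same run ID on every restart, so the ID has to survive the crash. One possible approach is to keep it in a small file next to the checkpoints. In this sketch the helper name, the file name wandb_run_id.txt, and the use of uuid to generate the ID are all assumptions.

```python
import uuid
from pathlib import Path


def load_or_create_run_id(path="wandb_run_id.txt"):
    """Return the run ID stored at `path`, creating and saving a new one if absent."""
    id_file = Path(path)
    if id_file.exists():
        return id_file.read_text().strip()
    run_id = uuid.uuid4().hex[:8]  # short random ID; any unique string works
    id_file.write_text(run_id)
    return run_id


# Intended use (requires wandb to be installed):
#   import wandb
#   run = wandb.init(project="my-project", id=load_or_create_run_id(), resume="allow")
```

Deleting the file starts a fresh run on the next launch.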

Reducing Recorded Metadata and Console Output

import os

os.environ["WANDB_DISABLE_GIT"] = "true"      # Don't record git info
os.environ["WANDB_SILENT"] = "true"            # Suppress output messages
os.environ["WANDB_CONSOLE"] = "off"            # Don't capture stdout/stderr

Cleaning Up Local wandb Directories

Each run creates a directory in ./wandb/. They accumulate fast:

# List runs
ls ./wandb/

# Trim the artifact cache to at most 1GB
wandb artifact cache cleanup 1GB
# Remove local run directories (only after their data has synced to the cloud)
rm -rf ./wandb/run-*

Grouping Related Runs

When you run the same training script multiple times (different seeds, slight config tweaks), group them so the dashboard treats them as one experiment:

wandb.init(
    project="my-project",
    group="resnet50-sweep",      # All runs with this group show together
    job_type="train",            # eval, train, preprocess — for filtering
    tags=["baseline", "cuda"],   # Free-form labels
)

Groups let you aggregate metrics (mean/std across seeds) and compare entire experiment sets in the dashboard.

Programmatic Access to Past Runs

import wandb

api = wandb.Api()

# Get all runs in a project
runs = api.runs("your-username/my-project")
for run in runs:
    print(run.name, run.state, run.summary.get("val_loss"))

# Download files from a run
run = api.run("your-username/my-project/abc123")
run.file("model.pt").download(replace=True)

# Best run by metric
import pandas as pd
df = pd.DataFrame([{
    "name": r.name,
    "val_loss": r.summary.get("val_loss"),
    "lr": r.config.get("learning_rate"),
} for r in runs if r.state == "finished"])
best = df.sort_values("val_loss").head(5)
print(best)

Distributed Training Producing Phantom Runs

In PyTorch DDP, FSDP, or DeepSpeed, every rank executes the training script. If every rank calls wandb.init(), you get N parallel runs — and metrics like loss are logged N times per step, often with conflicting values. Standard pattern: only call wandb.init() when local_rank == 0 (or rank == 0 for multi-node), and pass group="ddp-run-<timestamp>" so the run can be aggregated. If you do need per-rank logs (rare), use the same group and set job_type=f"rank-{rank}" so the UI can compare them as siblings rather than treating them as unrelated experiments.
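
The rank-gating pattern can be sketched as a helper that decides what each process passes to wandb.init(). It assumes the launcher sets torchrun-style RANK and LOCAL_RANK environment variables; other launchers may use different names, and both function names are hypothetical.

```python
import os


def global_rank(env=None):
    """Read the process rank from torchrun-style variables, defaulting to 0."""
    env = os.environ if env is None else env
    # RANK is the global rank; LOCAL_RANK is only unique within one node
    return int(env.get("RANK", env.get("LOCAL_RANK", "0")))


def wandb_init_kwargs(group, env=None, per_rank=False):
    """Return kwargs for wandb.init(), or None if this rank should skip wandb."""
    rank = global_rank(env)
    if per_rank:
        # Rare case: every rank logs, grouped together as siblings
        return {"group": group, "job_type": f"rank-{rank}"}
    return {"group": group} if rank == 0 else None


# Intended use (requires wandb to be installed):
#   kwargs = wandb_init_kwargs("ddp-run-20250409")
#   run = wandb.init(project="my-project", **kwargs) if kwargs is not None else None
```

Code that logs metrics then needs to check that run is not None before calling run.log().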

Offline Mode Default Inherited From .env

A subtle source of “the run never showed up”: a teammate’s .env file (loaded by python-dotenv early in your training script) exports WANDB_MODE=offline. The SDK obeys it without warning. The local directory ./wandb/offline-run-*/ fills with run data, but wandb sync is never invoked. Add assert os.environ.get("WANDB_MODE") in (None, "online") at the top of production training scripts, and document the env var explicitly in any .env.example.

Project Name Case Sensitivity

W&B project names are case-sensitive in the API but case-insensitive in the URL. wandb.init(project="MyProject") and wandb.init(project="myproject") log to two separate projects, but both URLs https://wandb.ai/.../MyProject and .../myproject resolve. Pick one canonical casing per project and lint it in your training launcher. The first run after a casing typo silently creates a duplicate project that you have to clean up by hand.
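
The casing lint mentioned above could look like the following in a training launcher. This is a sketch: check_project_casing and the list of canonical names are hypothetical, and the list would be maintained by your team rather than fetched from wandb.

```python
def check_project_casing(name, canonical_names):
    """Return `name` if acceptable; raise if it differs from a known name only by case."""
    if name in canonical_names:
        return name
    for known in canonical_names:
        if known.lower() == name.lower():
            raise ValueError(f"project {name!r} differs only in case from {known!r}")
    return name  # genuinely new project name


# Intended use:
#   project = check_project_casing("MyProject", ["MyProject", "eval-suite"])
#   wandb.init(project=project)
```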

FixDevs

Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.
