Fix: Weights & Biases (wandb) Not Working — Login Errors, Init Hangs, and Sync Failures

FixDevs

Quick Answer

How to fix common wandb errors: missing API key and login failures, wandb.init() hanging, offline-mode syncing, artifact upload failures, runs not appearing in the dashboard, image logging size limits, and sweep agents not starting.

The Error

You try to log in and wandb complains about missing credentials:

wandb: ERROR Unable to authenticate with W&B. Please check your API key.
wandb.errors.UsageError: api_key not configured

Or wandb.init() hangs forever without connecting:

import wandb
wandb.init(project="my-project")
# Hangs for 5+ minutes, no output

Or training runs succeed but don’t appear in the dashboard:

wandb: Run history: (empty)
wandb: Run summary: (empty)

Or you try to log a large image and get a size error:

wandb: WARNING Media of type images exceeded 100MB per-step limit

Or hyperparameter sweeps start but agents don’t pick up runs:

wandb: ERROR sweep agent could not connect to sweep controller

wandb is a hosted MLOps platform with a Python SDK that communicates with its backend via HTTP and WebSockets. When the SDK can’t reach the backend, can’t authenticate, or encounters a version/environment mismatch, it fails in specific ways. This guide covers each failure mode.

Why This Happens

wandb runs two processes per training job: the main Python process calling wandb.log(), and a background process that uploads data to the W&B backend. When wandb.init() is called, the SDK spawns the background process and opens a WebSocket connection. If the process can’t start (missing dependencies, permission issues) or the connection can’t be made (firewall, wrong API key), the SDK may hang for a long time before raising.

Authentication requires an API key from wandb.ai. The SDK reads it from ~/.netrc, the WANDB_API_KEY environment variable, or prompts interactively — but only when running in a TTY. In CI or Docker, prompting fails silently.
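The paragraph above can be turned into a preflight check (a sketch; `require_wandb_key` is a hypothetical helper, not part of the wandb SDK):

```python
import os
import sys

def require_wandb_key() -> None:
    """Fail fast where wandb would otherwise hang or die on an interactive prompt."""
    if "WANDB_API_KEY" not in os.environ and not sys.stdin.isatty():
        raise RuntimeError(
            "WANDB_API_KEY is not set and no TTY is available for 'wandb login'"
        )

# Call this before wandb.init() in CI / Docker entrypoints
```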

Fix 1: API Key and Login

wandb: ERROR Unable to authenticate with W&B. Please check your API key.

Get your API key from wandb.ai/authorize. Three ways to provide it:

Method 1: Interactive login (dev machine, one-time setup):

wandb login
# Prompts for your API key and saves to ~/.netrc

Method 2: Environment variable (CI, Docker, cloud):

export WANDB_API_KEY=your-key-here
python train.py
# In a script — fail fast if the env var is missing rather than
# hardcoding a fallback key (see the Common Mistake below)
import os
if "WANDB_API_KEY" not in os.environ:
    raise RuntimeError("Set WANDB_API_KEY before running")

import wandb
wandb.init(project="my-project")

Method 3: Explicit login in Python:

import wandb
wandb.login(key="your-key-here")
wandb.init(project="my-project")

Check current login status:

import wandb
api = wandb.Api()
print(api.viewer.username)   # Prints your username if logged in

In Docker/Kubernetes, mount the key as a secret:

FROM python:3.12
RUN pip install wandb
# Never bake the key into the image — pass at runtime
CMD ["python", "train.py"]
docker run -e WANDB_API_KEY=xxx my-image

Common Mistake: Committing ~/.netrc or a script with hardcoded API key to a public repo. W&B keys grant access to all your projects. If you accidentally commit one, rotate it immediately at wandb.ai/settings → Danger Zone → Reset API key.
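A quick pre-commit scan for strings shaped like W&B keys can catch this before it happens (a sketch; it assumes keys are 40 lowercase hex characters, which matches current keys but is not a documented guarantee):

```python
import re

# W&B API keys currently look like 40-character lowercase hex strings
WANDB_KEY_RE = re.compile(r"\b[0-9a-f]{40}\b")

def find_suspect_keys(text: str) -> list[str]:
    """Return substrings that look like hardcoded W&B API keys."""
    return WANDB_KEY_RE.findall(text)
```

Run it over staged files in a pre-commit hook and block the commit on any match.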

Fix 2: wandb.init() Hangs

The SDK tries to sync immediately. If the network is slow, the backend is unreachable, or disk I/O is slow, init can hang for minutes.

Set a timeout:

import wandb

wandb.init(
    project="my-project",
    settings=wandb.Settings(
        init_timeout=60,   # Give up after 60 seconds (default: 90)
    ),
)

Use offline mode when internet is unreliable:

import wandb

# Environment variable
import os
os.environ["WANDB_MODE"] = "offline"

wandb.init(project="my-project")
# Runs locally, saves to ./wandb/ directory

# Later, sync manually
# wandb sync ./wandb/offline-run-YYYYMMDD_HHMMSS-xxxxx/

Disable wandb entirely for debug runs:

export WANDB_MODE=disabled
# wandb calls become no-ops — no network, no disk
python train.py
# Or per-run
os.environ["WANDB_MODE"] = "disabled"

Check network connectivity to wandb:

curl -I https://api.wandb.ai
# Expected: HTTP/2 200

If this fails, your firewall is blocking wandb. Corporate networks often need:

# Proxy environment variables
export HTTPS_PROXY=http://proxy.company.com:8080
export HTTP_PROXY=http://proxy.company.com:8080

# Or set the wandb base URL if your org has a self-hosted instance
export WANDB_BASE_URL=https://wandb.internal.company.com
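For a programmatic preflight, a plain TCP check catches firewall blocks before init hangs (a sketch using only the standard library; `can_reach` is a hypothetical helper):

```python
import socket

def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # DNS failure, refusal, or timeout
        return False

# if not can_reach("api.wandb.ai"):
#     os.environ["WANDB_MODE"] = "offline"   # fall back rather than hang
```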

Fix 3: Runs Don’t Appear in Dashboard

You log metrics but nothing shows in the wandb UI.

Step 1: Verify the run actually started:

import wandb

run = wandb.init(project="my-project")
print(f"Run URL: {run.url}")
# https://wandb.ai/your-username/my-project/runs/abc123

Copy that URL — if the page loads with “Run not found”, the init failed despite no error.

Step 2: Verify metrics are logged with wandb.log():

import wandb

wandb.init(project="my-project")

for step in range(100):
    loss = compute_loss()
    wandb.log({"loss": loss, "step": step})   # Must be a dict
    # wandb.log(loss)   # WRONG — must be a dict

wandb.finish()   # Explicitly close the run

wandb.finish() is important at the end of your script. Without it, the sync process may not flush the last metrics before Python exits.

Step 3: Check the project and entity match your account:

wandb.init(
    project="my-project",
    entity="my-username",   # Or team name — defaults to your personal account
)

If entity is wrong (e.g., you’re trying to log to a team you’re not on), init fails silently in some versions.

Step 4: Check for offline mode accidentally enabled:

import os
print(os.environ.get("WANDB_MODE"))   # Should be None or "online"

If it’s “offline” or “disabled”, unset it:

unset WANDB_MODE

Fix 4: Logging Different Data Types

wandb logs more than scalars — but each type has constraints.

Scalars:

wandb.log({"loss": 0.23, "accuracy": 0.94, "lr": 1e-4})

Images:

import wandb
from PIL import Image
import numpy as np

# PIL Image
wandb.log({"example": wandb.Image(pil_image, caption="Ground truth")})

# NumPy array (H, W, C) uint8
img_arr = np.random.randint(0, 255, (64, 64, 3), dtype=np.uint8)
wandb.log({"random": wandb.Image(img_arr)})

# File path
wandb.log({"saved": wandb.Image("plot.png")})

# Multiple images in one log call
wandb.log({
    "examples": [wandb.Image(img, caption=f"Sample {i}") for i, img in enumerate(images)]
})

Tables (for complex data):

import wandb
import pandas as pd

df = pd.DataFrame({
    "image": [wandb.Image(img) for img in images],
    "label": labels,
    "prediction": preds,
    "correct": [l == p for l, p in zip(labels, preds)],
})

table = wandb.Table(dataframe=df)
wandb.log({"results": table})

Media size limits:

  • 100 MB per-step for images/video/audio (hard limit)
  • 10,000 rows per table (performance warning above this)
  • Artifact size limits vary by plan

Pro Tip: Log images at training resolution, not raw sensor size. A 4K image at every step pushes you over the 100MB limit fast. Resize to the actual input size your model sees (224x224 for ImageNet, 512x512 for segmentation) before logging. You can always log a few full-resolution samples separately as an artifact.
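To see why resolution matters, here is a back-of-envelope bound on per-step payload (a sketch; real uploads are PNG/JPEG-compressed, so this is a pessimistic upper bound):

```python
def media_bytes_per_step(n_images: int, width: int, height: int, channels: int = 3) -> int:
    """Rough upper bound on uncompressed image payload logged in one step."""
    return n_images * width * height * channels

LIMIT = 100 * 1024 * 1024  # 100 MB per-step media limit

# 32 uncompressed 4K RGB frames blow past the limit...
assert media_bytes_per_step(32, 3840, 2160) > LIMIT
# ...while the same batch at 224x224 is far below it
assert media_bytes_per_step(32, 224, 224) < LIMIT
```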

Custom x-axis:

# By default, wandb plots against the log step number
wandb.log({"loss": 0.5})            # Step 0
wandb.log({"loss": 0.4})            # Step 1

# Custom step from your training loop
for epoch in range(100):
    wandb.log({"loss": loss, "epoch": epoch}, step=epoch)

# Define a metric's x-axis explicitly
wandb.define_metric("epoch")
wandb.define_metric("train_loss", step_metric="epoch")
wandb.log({"epoch": 5, "train_loss": 0.2})
# train_loss plots against epoch, not the internal step counter

Fix 5: Artifacts — Upload and Download

Artifacts are versioned datasets, models, and other files tracked by wandb.

Upload a model as an artifact:

import wandb

wandb.init(project="my-project")

# Log a model file
artifact = wandb.Artifact(
    name="my-model",
    type="model",
    description="Best checkpoint from epoch 42",
    metadata={"accuracy": 0.94, "dataset": "imagenet-v2"},
)
artifact.add_file("model.pt")
wandb.log_artifact(artifact)

wandb.finish()

Download an artifact in a later run:

import wandb

run = wandb.init(project="my-project")

# Download the latest version
artifact = run.use_artifact("my-model:latest")
artifact_dir = artifact.download()   # Returns local path
print(f"Downloaded to: {artifact_dir}")

# Specific version
artifact = run.use_artifact("my-model:v3")

# From another project or entity
artifact = run.use_artifact("other-team/other-project/my-model:latest")

Aliases for easy version management:

# Promote a version to a named alias
artifact = run.use_artifact("my-model:v42")
artifact.aliases.append("production")
artifact.save()

# Later, load by alias
artifact = run.use_artifact("my-model:production")

Avoid re-uploading identical files — wandb deduplicates by content hash:

artifact = wandb.Artifact("dataset", type="dataset")
artifact.add_dir("./data")   # Uploads only new/changed files
wandb.log_artifact(artifact)
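The dedup key is a digest of file contents. A sketch of the idea with hashlib (wandb's actual digest format may differ):

```python
import hashlib

def content_digest(path: str) -> str:
    """MD5 hex digest of a file's bytes — identical content, identical digest."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()
```

Two files with the same bytes hash identically, so re-logging an unchanged dataset directory uploads nothing new.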

Fix 6: Offline Mode and Syncing

For training on air-gapped clusters or when you want to decouple training from upload:

import os
os.environ["WANDB_MODE"] = "offline"

import wandb
wandb.init(project="my-project")
# Metrics save locally to ./wandb/offline-run-*/

Sync manually when network becomes available:

# Sync a specific run
wandb sync ./wandb/offline-run-20250409_143022-abc123

# Sync all offline runs
wandb sync ./wandb/

# Sync and delete local copy after upload
wandb sync --clean ./wandb/
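To script the sync step, enumerate unsynced run directories first (a sketch; `pending_offline_runs` is a hypothetical helper that just globs the naming pattern shown above):

```python
from pathlib import Path

def pending_offline_runs(wandb_dir: str = "./wandb") -> list[Path]:
    """Offline run directories (offline-run-*) awaiting `wandb sync`."""
    return sorted(Path(wandb_dir).glob("offline-run-*"))

# import subprocess
# for run_dir in pending_offline_runs():
#     subprocess.run(["wandb", "sync", str(run_dir)], check=True)
```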

Dryrun mode (a legacy alias for offline mode; prefer "offline" in new code):

os.environ["WANDB_MODE"] = "dryrun"   # Same effect as "offline" in recent versions

For MLflow-style experiment tracking with similar offline patterns, see MLflow not working.

Fix 7: Sweeps

Sweeps run hyperparameter searches across multiple workers. The configuration is stored on the wandb backend; agents poll it for trials to run.

Define the sweep:

sweep_config = {
    "method": "bayes",   # random, grid, or bayes
    "metric": {
        "name": "val_loss",
        "goal": "minimize",
    },
    "parameters": {
        "learning_rate": {"distribution": "log_uniform_values", "min": 1e-5, "max": 1e-1},
        "batch_size": {"values": [16, 32, 64, 128]},
        "optimizer": {"values": ["adam", "sgd"]},
        "dropout": {"distribution": "uniform", "min": 0.0, "max": 0.5},
    },
}

sweep_id = wandb.sweep(sweep_config, project="my-project")
print(f"Sweep ID: {sweep_id}")
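As a sanity check before launching, you can count how many trials a grid over the discrete parameters implies (a sketch; `grid_size` is a hypothetical helper — continuous distributions are sampled, not enumerated, so only "values" entries count):

```python
def grid_size(parameters: dict) -> int:
    """Number of combinations a grid sweep enumerates over discrete 'values' params."""
    n = 1
    for spec in parameters.values():
        if "values" in spec:
            n *= len(spec["values"])
    return n
```

For the config above, batch_size (4 values) times optimizer (2 values) gives 8 discrete combinations.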

Training function:

def train():
    wandb.init()
    config = wandb.config   # Hyperparameters chosen by the sweep controller

    model = build_model(
        lr=config.learning_rate,
        batch_size=config.batch_size,
        optimizer=config.optimizer,
        dropout=config.dropout,
    )

    for epoch in range(10):
        train_loss = train_one_epoch(model)
        val_loss = validate(model)
        wandb.log({"train_loss": train_loss, "val_loss": val_loss, "epoch": epoch})

    wandb.finish()

Run the agent — each agent picks up trials from the sweep:

# Run 10 trials
wandb.agent(sweep_id, function=train, count=10)

# Or from CLI
# wandb agent your-username/my-project/sweep-id

Multiple agents for parallelism — run the agent command on different machines, all pointing at the same sweep_id:

# Machine 1
wandb agent your-username/my-project/sweep-id

# Machine 2 (same sweep)
wandb agent your-username/my-project/sweep-id

Sweep agent won’t start — usually a connection issue. Check:

wandb verify   # Checks network and auth

For Optuna-based hyperparameter tuning as an alternative, see Optuna not working.

Fix 8: Framework Integrations

wandb integrates with PyTorch, TensorFlow, Lightning, Transformers, and sklearn.

PyTorch Lightning:

from pytorch_lightning.loggers import WandbLogger
import pytorch_lightning as pl

wandb_logger = WandbLogger(project="my-project", log_model="all")

trainer = pl.Trainer(
    logger=wandb_logger,
    max_epochs=10,
)
trainer.fit(model, train_loader, val_loader)
# wandb_logger auto-captures all pl.Trainer metrics

HuggingFace Transformers:

import os
os.environ["WANDB_PROJECT"] = "my-project"

from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    output_dir="./output",
    report_to="wandb",   # Auto-logs to wandb
    run_name="experiment-1",
    num_train_epochs=3,
    logging_steps=10,
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds)
trainer.train()

For HuggingFace-specific issues like HF_TOKEN authentication and model loading, see HuggingFace Transformers not working.

PyTorch (manual):

import wandb
import torch

wandb.init(project="my-project")
wandb.watch(model, log="all", log_freq=100)   # Track gradients and weights

for epoch in range(epochs):
    for batch in train_loader:
        optimizer.zero_grad()
        loss = model(batch).loss
        loss.backward()
        optimizer.step()
        wandb.log({"loss": loss.item()})

wandb.finish()

For PyTorch gradient logging and CUDA memory issues that affect wandb tracking, see PyTorch not working.

Still Not Working?

Self-Hosted wandb

For enterprise / on-premise deployments:

export WANDB_BASE_URL=https://wandb.internal.company.com
export WANDB_API_KEY=key-from-your-internal-instance

Resuming a Run After a Crash

import wandb

# Resume by run ID
run = wandb.init(project="my-project", id="abc123", resume="must")

# Resume if exists, create new if not
run = wandb.init(project="my-project", id="abc123", resume="allow")
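A common pattern for crash-safe resuming is to persist the run id on first start and reuse it on restart (a sketch; `stable_run_id` and the `.wandb_run_id` file are assumptions, not wandb conventions):

```python
import os

RUN_ID_FILE = ".wandb_run_id"

def stable_run_id() -> str:
    """Persist a run id locally so a restarted job resumes the same wandb run."""
    if os.path.exists(RUN_ID_FILE):
        with open(RUN_ID_FILE) as f:
            return f.read().strip()
    run_id = os.urandom(4).hex()   # 8 hex chars
    with open(RUN_ID_FILE, "w") as f:
        f.write(run_id)
    return run_id

# wandb.init(project="my-project", id=stable_run_id(), resume="allow")
```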

Disabling Telemetry

os.environ["WANDB_DISABLE_GIT"] = "true"      # Don't record git info
os.environ["WANDB_SILENT"] = "true"            # Suppress output messages
os.environ["WANDB_CONSOLE"] = "off"            # Don't capture stdout/stderr

Cleaning Up Local wandb Directories

Each run creates a directory in ./wandb/. They accumulate fast:

# List runs
ls ./wandb/

# Remove all local run directories (data already synced to cloud)
wandb artifact cache cleanup 1GB   # Keep at most 1GB of cache
rm -rf ./wandb/run-*               # Manual cleanup after syncing
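Before deleting anything, it helps to see how much space local runs actually occupy (a sketch; `dir_size_mb` is a hypothetical helper):

```python
from pathlib import Path

def dir_size_mb(path: str) -> float:
    """Total size of a directory tree in megabytes."""
    return sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file()) / 1e6

# print(f"local wandb data: {dir_size_mb('./wandb'):.1f} MB")
```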

Grouping Related Runs

When you run the same training script multiple times (different seeds, slight config tweaks), group the runs so the dashboard treats them as one experiment:

wandb.init(
    project="my-project",
    group="resnet50-sweep",      # All runs with this group show together
    job_type="train",            # eval, train, preprocess — for filtering
    tags=["baseline", "cuda"],   # Free-form labels
)

Groups let you aggregate metrics (mean/std across seeds) and compare entire experiment sets in the dashboard.
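Locally, the aggregation the dashboard shows for a group is just mean/std across seeds (sketched here with hypothetical numbers):

```python
from statistics import mean, stdev

# Hypothetical final val_loss from three seed runs in one group
seed_losses = {"seed-0": 0.214, "seed-1": 0.221, "seed-2": 0.209}

avg = mean(seed_losses.values())
spread = stdev(seed_losses.values())
print(f"group val_loss: {avg:.3f} +/- {spread:.3f}")
```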

Programmatic Access to Past Runs

import wandb

api = wandb.Api()

# Get all runs in a project
runs = api.runs("your-username/my-project")
for run in runs:
    print(run.name, run.state, run.summary.get("val_loss"))

# Download files from a run
run = api.run("your-username/my-project/abc123")
run.file("model.pt").download(replace=True)

# Best run by metric
import pandas as pd
df = pd.DataFrame([{
    "name": r.name,
    "val_loss": r.summary.get("val_loss"),
    "lr": r.config.get("learning_rate"),
} for r in runs if r.state == "finished"])
best = df.sort_values("val_loss").head(5)
print(best)
FixDevs

Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.
