Fix: Weights & Biases (wandb) Not Working — Login Errors, Init Hangs, and Sync Failures
Quick Answer
How to fix wandb errors — API key not set, login failures, wandb.init hangs, offline mode sync, artifact upload failures, runs not showing in the dashboard, image logging size limits, and sweep agents not starting.
The Error
You try to log in and wandb complains about missing credentials:
wandb: ERROR Unable to authenticate with W&B. Please check your API key.
wandb.errors.UsageError: api_key not configured
Or wandb.init() hangs forever without connecting:
import wandb
wandb.init(project="my-project")
# Hangs for 5+ minutes, no output
Or training runs succeed but don’t appear in the dashboard:
wandb: Run history: (empty)
wandb: Run summary: (empty)
Or you try to log a large image and get a size error:
wandb: WARNING Media of type images exceeded 100MB per-step limit
Or hyperparameter sweeps start but agents don’t pick up runs:
wandb: ERROR sweep agent could not connect to sweep controller
wandb is a hosted MLOps platform with a Python SDK that communicates with its backend over HTTPS. When the SDK can’t reach the backend, can’t authenticate, or hits a version or environment mismatch, it fails in specific ways. This guide covers each failure mode.
Why This Happens
wandb runs two processes per training job: the main Python process calling wandb.log(), and a background process that uploads data to the W&B backend. When wandb.init() is called, the SDK spawns the background process and opens a connection to it and to the backend. If the process can’t start (missing dependencies, permission issues) or the connection can’t be made (firewall, wrong API key), the SDK may hang for a long time before raising an error.
Authentication requires an API key from wandb.ai. The SDK reads it from ~/.netrc, the WANDB_API_KEY environment variable, or prompts interactively — but only when running in a TTY. In CI or Docker, prompting fails silently.
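To avoid that silent failure in CI, you can check both non-interactive credential sources up front. A minimal sketch — the helper name is ours, not part of wandb, and the ~/.netrc check assumes the default api.wandb.ai host:

```python
import os
from pathlib import Path

def wandb_key_available() -> bool:
    """True if wandb can find an API key without prompting.

    Checks the two non-interactive sources the SDK reads:
    the WANDB_API_KEY environment variable and ~/.netrc.
    """
    if os.environ.get("WANDB_API_KEY"):
        return True
    netrc = Path.home() / ".netrc"
    return netrc.exists() and "api.wandb.ai" in netrc.read_text()

print("wandb credentials available:", wandb_key_available())
```

In a CI job you might raise on False instead of letting wandb.init() stall waiting for an interactive prompt that will never be answered.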
Fix 1: API Key and Login
wandb: ERROR Unable to authenticate with W&B. Please check your API key.
Get your API key from wandb.ai/authorize. Three ways to provide it:
Method 1: Interactive login (dev machine, one-time setup):
wandb login
# Prompts for your API key and saves to ~/.netrc
Method 2: Environment variable (CI, Docker, cloud):
export WANDB_API_KEY=your-key-here
python train.py
# In a script — read from env var before wandb.init
import os
assert os.environ.get("WANDB_API_KEY"), "WANDB_API_KEY is not set"  # Fail fast — never hardcode a fallback key
import wandb
wandb.init(project="my-project")
Method 3: Explicit login in Python:
import wandb
wandb.login(key="your-key-here")
wandb.init(project="my-project")
Check current login status:
import wandb
api = wandb.Api()
print(api.viewer.username) # Prints your username if logged in
In Docker/Kubernetes, mount the key as a secret:
FROM python:3.12
RUN pip install wandb
# Never bake the key into the image — pass at runtime
CMD ["python", "train.py"]
docker run -e WANDB_API_KEY=xxx my-image
Common Mistake: Committing ~/.netrc or a script with a hardcoded API key to a public repo. W&B keys grant access to all your projects. If you accidentally commit one, rotate it immediately at wandb.ai/settings → Danger Zone → Reset API key.
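A repo-level guard helps too: ignore wandb’s local run data and any stray credential files so they never reach a commit. A minimal .gitignore fragment (the .env entry is a common convention, not wandb-specific):

```
# local wandb run data and credential files
wandb/
.netrc
.env
```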
Fix 2: wandb.init() Hangs
The SDK tries to sync immediately. If the network is slow, the backend is unreachable, or disk I/O is slow, init can hang for minutes.
Set a timeout:
import wandb
wandb.init(
project="my-project",
settings=wandb.Settings(
init_timeout=60, # Give up after 60 seconds (default: 90)
),
)
Use offline mode when internet is unreliable:
import wandb
# Environment variable
import os
os.environ["WANDB_MODE"] = "offline"
wandb.init(project="my-project")
# Runs locally, saves to ./wandb/ directory
# Later, sync manually
# wandb sync ./wandb/offline-run-YYYYMMDD_HHMMSS-xxxxx/
Disable wandb entirely for debug runs:
export WANDB_MODE=disabled
# wandb calls become no-ops — no network, no disk
python train.py
# Or per-run
os.environ["WANDB_MODE"] = "disabled"Check network connectivity to wandb:
curl -I https://api.wandb.ai
# Expected: HTTP/2 200
If this fails, your firewall is blocking wandb. Corporate networks often need:
# Proxy environment variables
export HTTPS_PROXY=http://proxy.company.com:8080
export HTTP_PROXY=http://proxy.company.com:8080
# Or set the wandb base URL if your org has a self-hosted instance
export WANDB_BASE_URL=https://wandb.internal.company.com
Fix 3: Runs Don’t Appear in Dashboard
You log metrics but nothing shows in the wandb UI.
Step 1: Verify the run actually started:
import wandb
run = wandb.init(project="my-project")
print(f"Run URL: {run.url}")
# https://wandb.ai/your-username/my-project/runs/abc123
Copy that URL — if the page loads with “Run not found”, the init failed despite no error.
Step 2: Verify metrics are logged with wandb.log():
import wandb
wandb.init(project="my-project")
for step in range(100):
loss = compute_loss()
wandb.log({"loss": loss, "step": step}) # Must be a dict
# wandb.log(loss) # WRONG — must be a dict
wandb.finish() # Explicitly close the run
wandb.finish() is important at the end of your script. Without it, the sync process may not flush the last metrics before Python exits.
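One way to guarantee the flush is a try/finally around the training loop (wandb.init() can also be used as a context manager, which calls finish() for you). A runnable sketch of the pattern, with a stub standing in for the real Run object so it works without a login:

```python
class StubRun:
    """Stand-in for a wandb Run, just to make the pattern runnable here."""
    def __init__(self):
        self.logged, self.finished = [], False
    def log(self, metrics):
        self.logged.append(metrics)
    def finish(self):
        self.finished = True

def train(run):
    try:
        for step in range(3):
            run.log({"loss": 1.0 / (step + 1)})
            if step == 2:
                raise RuntimeError("simulated mid-training crash")
    finally:
        run.finish()  # with real wandb, this flushes buffered metrics

run = StubRun()
try:
    train(run)
except RuntimeError:
    pass
print(run.finished)  # True — finish() ran despite the crash
```

With the real SDK, the equivalent is `with wandb.init(project="my-project") as run:` — the run is finished automatically when the block exits, even on an exception.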
Step 3: Check the project and entity match your account:
wandb.init(
project="my-project",
entity="my-username", # Or team name — defaults to your personal account
)
If entity is wrong (e.g., you’re trying to log to a team you’re not on), init fails silently in some versions.
Step 4: Check for offline mode accidentally enabled:
import os
print(os.environ.get("WANDB_MODE")) # Should be None or "online"
If it’s “offline” or “disabled”, unset it:
unset WANDB_MODE
Fix 4: Logging Different Data Types
wandb logs more than scalars — but each type has constraints.
Scalars:
wandb.log({"loss": 0.23, "accuracy": 0.94, "lr": 1e-4})
Images:
import wandb
from PIL import Image
import numpy as np
# PIL Image
wandb.log({"example": wandb.Image(pil_image, caption="Ground truth")})
# NumPy array (H, W, C) uint8
img_arr = np.random.randint(0, 255, (64, 64, 3), dtype=np.uint8)
wandb.log({"random": wandb.Image(img_arr)})
# File path
wandb.log({"saved": wandb.Image("plot.png")})
# Multiple images in one log call
wandb.log({
"examples": [wandb.Image(img, caption=f"Sample {i}") for i, img in enumerate(images)]
})
Tables (for complex data):
import wandb
import pandas as pd
df = pd.DataFrame({
"image": [wandb.Image(img) for img in images],
"label": labels,
"prediction": preds,
"correct": [l == p for l, p in zip(labels, preds)],
})
table = wandb.Table(dataframe=df)
wandb.log({"results": table})
Media size limits:
- 100 MB per-step for images/video/audio (hard limit)
- 10,000 rows per table (performance warning above this)
- Artifact size limits vary by plan
Pro Tip: Log images at training resolution, not raw sensor size. A 4K image at every step pushes you over the 100MB limit fast. Resize to the actual input size your model sees (224x224 for ImageNet, 512x512 for segmentation) before logging. You can always log a few full-resolution samples separately as an artifact.
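Shrinking the array before wrapping it in wandb.Image keeps each step small. A dependency-light sketch using nearest-neighbor sampling — in practice you would use PIL’s Image.resize or torchvision transforms; the helper below is purely illustrative:

```python
import numpy as np

def downsample(img, size=224):
    """Nearest-neighbor resize of an (H, W, C) uint8 array to size x size."""
    h, w = img.shape[:2]
    rows = np.linspace(0, h - 1, size).astype(int)
    cols = np.linspace(0, w - 1, size).astype(int)
    return img[rows][:, cols]

big = np.zeros((2160, 3840, 3), dtype=np.uint8)  # one 4K frame ≈ 24 MB raw
small = downsample(big)                          # 224x224x3 ≈ 150 KB
print(small.shape)  # (224, 224, 3)
```

Then log the result as usual: `wandb.log({"example": wandb.Image(small)})`.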
Custom x-axis:
# By default, wandb plots against the log step number
wandb.log({"loss": 0.5}) # Step 0
wandb.log({"loss": 0.4}) # Step 1
# Custom step from your training loop
for epoch in range(100):
wandb.log({"loss": loss, "epoch": epoch}, step=epoch)
# Define a metric's x-axis explicitly
wandb.define_metric("epoch")
wandb.define_metric("train_loss", step_metric="epoch")
wandb.log({"epoch": 5, "train_loss": 0.2})
# train_loss plots against epoch, not the internal step counter
Fix 5: Artifacts — Upload and Download
Artifacts are versioned datasets, models, and other files tracked by wandb.
Upload a model as an artifact:
import wandb
wandb.init(project="my-project")
# Log a model file
artifact = wandb.Artifact(
name="my-model",
type="model",
description="Best checkpoint from epoch 42",
metadata={"accuracy": 0.94, "dataset": "imagenet-v2"},
)
artifact.add_file("model.pt")
wandb.log_artifact(artifact)
wandb.finish()
Download an artifact in a later run:
import wandb
run = wandb.init(project="my-project")
# Download the latest version
artifact = run.use_artifact("my-model:latest")
artifact_dir = artifact.download() # Returns local path
print(f"Downloaded to: {artifact_dir}")
# Specific version
artifact = run.use_artifact("my-model:v3")
# From another project or entity
artifact = run.use_artifact("other-team/other-project/my-model:latest")
Aliases for easy version management:
# Promote a version to a named alias
artifact = run.use_artifact("my-model:v42")
artifact.aliases.append("production")
artifact.save()
# Later, load by alias
artifact = run.use_artifact("my-model:production")
Avoid re-uploading identical files — wandb deduplicates by content hash:
artifact = wandb.Artifact("dataset", type="dataset")
artifact.add_dir("./data") # Uploads only new/changed files
wandb.log_artifact(artifact)
Fix 6: Offline Mode and Syncing
For training on air-gapped clusters or when you want to decouple training from upload:
import os
os.environ["WANDB_MODE"] = "offline"
import wandb
wandb.init(project="my-project")
# Metrics save locally to ./wandb/offline-run-*/
Sync manually when network becomes available:
# Sync a specific run
wandb sync ./wandb/offline-run-20250409_143022-abc123
# Sync all offline runs
wandb sync ./wandb/
# Sync and delete local copy after upload
wandb sync --clean ./wandb/
Dryrun mode (a legacy alias for offline mode — runs save locally until you sync them):
os.environ["WANDB_MODE"] = "dryrun" # Same as "offline" in current SDK versions
For MLflow-style experiment tracking with similar offline patterns, see MLflow not working.
Fix 7: Sweeps — Hyperparameter Search
Sweeps run hyperparameter searches across multiple workers. Configuration is stored on the wandb backend; agents poll for trials to run.
Define the sweep:
sweep_config = {
"method": "bayes", # random, grid, or bayes
"metric": {
"name": "val_loss",
"goal": "minimize",
},
"parameters": {
"learning_rate": {"distribution": "log_uniform_values", "min": 1e-5, "max": 1e-1},
"batch_size": {"values": [16, 32, 64, 128]},
"optimizer": {"values": ["adam", "sgd"]},
"dropout": {"distribution": "uniform", "min": 0.0, "max": 0.5},
},
}
sweep_id = wandb.sweep(sweep_config, project="my-project")
print(f"Sweep ID: {sweep_id}")
Training function:
def train():
wandb.init()
config = wandb.config # Hyperparameters chosen by the sweep controller
model = build_model(
lr=config.learning_rate,
batch_size=config.batch_size,
optimizer=config.optimizer,
dropout=config.dropout,
)
for epoch in range(10):
train_loss = train_one_epoch(model)
val_loss = validate(model)
wandb.log({"train_loss": train_loss, "val_loss": val_loss, "epoch": epoch})
wandb.finish()
Run the agent — each agent picks up trials from the sweep:
# Run 10 trials
wandb.agent(sweep_id, function=train, count=10)
# Or from CLI
# wandb agent your-username/my-project/sweep-id
Multiple agents for parallelism — run the agent command on different machines, all pointing at the same sweep_id:
# Machine 1
wandb agent your-username/my-project/sweep-id
# Machine 2 (same sweep)
wandb agent your-username/my-project/sweep-idSweep agent won’t start — usually a connection issue. Check:
wandb verify # Checks network and auth
For Optuna-based hyperparameter tuning as an alternative, see Optuna not working.
Fix 8: Framework Integrations
wandb integrates with PyTorch, TensorFlow, Lightning, Transformers, and sklearn.
PyTorch Lightning:
from pytorch_lightning.loggers import WandbLogger
import pytorch_lightning as pl
wandb_logger = WandbLogger(project="my-project", log_model="all")
trainer = pl.Trainer(
logger=wandb_logger,
max_epochs=10,
)
trainer.fit(model, train_loader, val_loader)
# wandb_logger auto-captures all pl.Trainer metrics
HuggingFace Transformers:
import os
os.environ["WANDB_PROJECT"] = "my-project"
from transformers import TrainingArguments, Trainer
args = TrainingArguments(
output_dir="./output",
report_to="wandb", # Auto-logs to wandb
run_name="experiment-1",
num_train_epochs=3,
logging_steps=10,
)
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds)
trainer.train()
For HuggingFace-specific issues like HF_TOKEN authentication and model loading, see HuggingFace Transformers not working.
PyTorch (manual):
import wandb
import torch
wandb.init(project="my-project")
wandb.watch(model, log="all", log_freq=100) # Track gradients and weights
for epoch in range(epochs):
for batch in train_loader:
optimizer.zero_grad()
loss = model(batch).loss
loss.backward()
optimizer.step()
wandb.log({"loss": loss.item()})
wandb.finish()
For PyTorch gradient logging and CUDA memory issues that affect wandb tracking, see PyTorch not working.
Still Not Working?
Self-Hosted wandb
For enterprise / on-premise deployments:
export WANDB_BASE_URL=https://wandb.internal.company.com
export WANDB_API_KEY=key-from-your-internal-instance
Resuming a Run After a Crash
import wandb
# Resume by run ID
run = wandb.init(project="my-project", id="abc123", resume="must")
# Resume if exists, create new if not
run = wandb.init(project="my-project", id="abc123", resume="allow")
Disabling Telemetry
os.environ["WANDB_DISABLE_GIT"] = "true" # Don't record git info
os.environ["WANDB_SILENT"] = "true" # Suppress output messages
os.environ["WANDB_CONSOLE"] = "off" # Don't capture stdout/stderr
Cleaning Up Local wandb Directories
Each run creates a directory in ./wandb/. They accumulate fast:
# List runs
ls ./wandb/
# Remove all local run directories (data already synced to cloud)
wandb artifact cache cleanup 1GB # Keep at most 1GB of cache
rm -rf ./wandb/run-* # Manual cleanup after syncing
Run Grouping for Related Experiments
When you run the same training script multiple times (different seeds, slight config tweaks), group them so the dashboard treats them as one experiment:
wandb.init(
project="my-project",
group="resnet50-sweep", # All runs with this group show together
job_type="train", # eval, train, preprocess — for filtering
tags=["baseline", "cuda"], # Free-form labels
)
Groups let you aggregate metrics (mean/std across seeds) and compare entire experiment sets in the dashboard.
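The same mean/std aggregation the dashboard shows per group can be computed yourself from run summaries (fetchable with wandb.Api().runs(..., filters={"group": ...})). A sketch with hypothetical values standing in for fetched summaries:

```python
from statistics import mean, stdev

# val_loss summaries from three seeds of one group (hypothetical numbers;
# in practice, pull them via the wandb public API)
seed_losses = {"seed-0": 0.212, "seed-1": 0.205, "seed-2": 0.221}

losses = list(seed_losses.values())
print(f"group val_loss: {mean(losses):.3f} ± {stdev(losses):.3f}")
# → group val_loss: 0.213 ± 0.008
```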
Programmatic Access to Past Runs
import wandb
api = wandb.Api()
# Get all runs in a project
runs = api.runs("your-username/my-project")
for run in runs:
print(run.name, run.state, run.summary.get("val_loss"))
# Download files from a run
run = api.run("your-username/my-project/abc123")
run.file("model.pt").download(replace=True)
# Best run by metric
import pandas as pd
df = pd.DataFrame([{
"name": r.name,
"val_loss": r.summary.get("val_loss"),
"lr": r.config.get("learning_rate"),
} for r in runs if r.state == "finished"])
best = df.sort_values("val_loss").head(5)
print(best)
Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.