Fix: PyTorch Not Working — CUDA Out of Memory, Device Mismatch, and NaN Loss

FixDevs · (Updated: )

Quick Answer

How to fix PyTorch errors — CUDA out of memory, expected all tensors on same device, CUDA device-side assert triggered, torch.cuda.is_available() False, inplace gradient errors, DataLoader Windows crash, dtype mismatch, and NaN loss.

The Error

You start training and GPU memory fills up:

RuntimeError: CUDA out of memory. Tried to allocate 2.50 GiB
(GPU 0; 8.00 GiB total capacity; 6.73 GiB already allocated;
1.03 GiB free; 6.89 GiB reserved in total by PyTorch)

Or a forward pass crashes with a device error:

RuntimeError: Expected all tensors to be on the same device,
but found at least two devices, cuda:0 and cpu!

Or CUDA gives you a cryptic error that points to the wrong line:

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,
so the stacktrace below might be incorrect.

Or training runs fine for a few steps and then loss becomes NaN and stays there:

Epoch 1, Step 50: loss = 0.3421
Epoch 1, Step 51: loss = nan
Epoch 1, Step 52: loss = nan  # Never recovers

Each of these is a distinct failure with a specific fix.

Why This Happens

PyTorch is strict about where tensors live (CPU vs. GPU device index), what type they are (float32 vs. float64), and how they’re modified (inplace vs. out-of-place). The GPU introduces additional failure modes: CUDA errors are reported asynchronously by default, which means the stack trace you see often points to the wrong line. GPU memory is managed by a caching allocator that doesn’t always release memory when you expect it to.

Understanding these mechanics makes the errors predictable rather than mysterious.

Fix 1: CUDA Out of Memory

The error tells you how much was requested, how much is allocated, and how much is free. The gap between “allocated” and “reserved” is cached memory that PyTorch holds but isn’t actively using:

Tried to allocate 2.50 GiB
Already allocated: 6.73 GiB
Free: 1.03 GiB               ← not enough for the request
Reserved: 6.89 GiB           ← includes cached but inactive memory
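
To make that reading concrete, here is a minimal sketch that pulls the figures out of the message shown above and computes the cached-but-unused gap. The helper name parse_oom_message and its regular expressions are illustrative assumptions, not part of PyTorch, and the exact wording of the message differs between PyTorch versions.

```python
import re

# Sample message copied from the error shown earlier in this article.
OOM_MESSAGE = (
    "RuntimeError: CUDA out of memory. Tried to allocate 2.50 GiB "
    "(GPU 0; 8.00 GiB total capacity; 6.73 GiB already allocated; "
    "1.03 GiB free; 6.89 GiB reserved in total by PyTorch)"
)


def parse_oom_message(message: str) -> dict:
    """Hypothetical helper: extract the GiB figures from an OOM message."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) GiB",
        "allocated": r"([\d.]+) GiB already allocated",
        "free": r"([\d.]+) GiB free",
        "reserved": r"([\d.]+) GiB reserved",
    }
    stats = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"could not find '{name}' in message")
        stats[name] = float(match.group(1))
    # Cached = reserved by the allocator but not backing any live tensor.
    stats["cached"] = round(stats["reserved"] - stats["allocated"], 2)
    return stats


stats = parse_oom_message(OOM_MESSAGE)
print(stats)
# Even if every cached byte were reusable, would the request fit?
fits = stats["free"] + stats["cached"] >= stats["requested"]
print("fits after reusing cache:", fits)
```

In this example only a small slice of reserved memory is cache, so the pressure comes from live tensors and the remedies below (smaller batches, mixed precision) are the ones that matter.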

Option 1: Gradient accumulation (shrinks the per-step batch, and with it activation memory, while keeping the effective batch size and the model unchanged):

from torch.amp import autocast, GradScaler

accumulation_steps = 4   # Effective batch = actual_batch × 4
scaler = GradScaler(device="cuda")

optimizer.zero_grad()
for i, (X, y) in enumerate(dataloader):
    X, y = X.to(device), y.to(device)

    with autocast(device_type="cuda", dtype=torch.float16):
        loss = criterion(model(X), y) / accumulation_steps

    scaler.scale(loss).backward()

    if (i + 1) % accumulation_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()

Dividing the loss by accumulation_steps keeps gradients on the same scale as if you’d used the full batch.
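
The scaling claim can be checked with plain arithmetic. The sketch below is an illustration only: it assumes a one-parameter model w * x with a mean-squared-error loss and equally sized micro-batches, and the function name mse_grad is made up for this example.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)**2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]

# Reference: one step over the full batch of 8 samples.
full_batch_grad = mse_grad(w, xs, ys)

# Accumulation: 4 micro-batches of 2 samples each.
accumulation_steps = 4
micro = len(xs) // accumulation_steps
accumulated_grad = 0.0
for i in range(accumulation_steps):
    chunk_x = xs[i * micro:(i + 1) * micro]
    chunk_y = ys[i * micro:(i + 1) * micro]
    # Dividing the loss by accumulation_steps divides its gradient too.
    accumulated_grad += mse_grad(w, chunk_x, chunk_y) / accumulation_steps

print(full_batch_grad, accumulated_grad)
```

Without the division, the accumulated gradient would be accumulation_steps times larger, which acts like an unintended learning-rate increase.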

Option 2: Mixed precision. Running most of the forward pass in float16 roughly halves activation memory. Under autocast the parameters stay in float32, so the savings come from activations rather than weights:

from torch.amp import autocast, GradScaler

scaler = GradScaler(device="cuda")

for X, y in dataloader:
    X, y = X.to(device), y.to(device)
    optimizer.zero_grad()

    with autocast(device_type="cuda", dtype=torch.float16):
        output = model(X)
        loss = criterion(output, y)

    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

GradScaler prevents float16 underflow during the backward pass. Don’t skip it.
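
A small standard-library sketch of why scaling helps: Python's struct module can round a value to half precision, which is enough to show a tiny gradient vanishing and then surviving once it is scaled. The gradient value is made up, and the scale of 2**16 is chosen to match what is, to my knowledge, GradScaler's default initial scale; this illustrates the idea and is not GradScaler's implementation.

```python
import struct


def to_float16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


true_grad = 1e-8       # hypothetical small gradient value
loss_scale = 65536.0   # 2**16, illustrative scale factor

# Stored directly in float16, the gradient underflows to zero.
unscaled = to_float16(true_grad)

# Scaled first, it lands in float16's representable range.
scaled = to_float16(true_grad * loss_scale)
recovered = scaled / loss_scale  # unscaling is done in higher precision

print("stored without scaling:", unscaled)
print("recovered after scaling:", recovered)
```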

Option 3: Call torch.cuda.empty_cache() — but understand what it does and doesn’t do:

torch.cuda.empty_cache()

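This releases cached (reserved but inactive) memory back to the CUDA driver so other processes can use it. It does not free memory that’s still held by live tensors, and it rarely fixes an OOM inside the same process because the caching allocator already reuses its cached blocks. If allocated is the problem, not reserved, empty_cache() won’t help. Use torch.cuda.memory_summary() to see which category your memory is in: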

print(torch.cuda.memory_summary(device=0))

Option 4: Reduce batch size — the simplest fix if you’re not compute-bound. As a rough guide, halving the batch size frees roughly half the activation memory.

Pro Tip: Wrap your validation loop in torch.no_grad(). Without it, every forward pass keeps its intermediate activations alive for a backward pass that never runs, so validation uses far more memory than it needs:

model.eval()
with torch.no_grad():
    for X, y in val_loader:
        output = model(X.to(device))
        # No gradients tracked — significantly less memory

Fix 2: Device Mismatch — All Tensors Must Be on the Same Device

RuntimeError: Expected all tensors to be on the same device,
but found at least two devices, cuda:0 and cpu!

The three most common places this happens:

1. Input tensors not moved to GPU:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

for X, y in dataloader:
    # WRONG — X and y are on CPU, model is on cuda:0
    output = model(X)

    # CORRECT — move inputs before every forward pass
    X, y = X.to(device), y.to(device)
    output = model(X)

2. Loading a checkpoint without map_location:

# WRONG — loads tensors to the device they were saved on
checkpoint = torch.load("model.pth")

# CORRECT — redirect to wherever you need them
checkpoint = torch.load("model.pth", map_location=device, weights_only=True)
model.load_state_dict(checkpoint["model_state_dict"])

weights_only=True is recommended in PyTorch 2.x to avoid loading arbitrary Python objects from checkpoint files; it became the default in PyTorch 2.6.

3. Tensors created inside a model without inheriting the device:

class MyModel(nn.Module):
    def forward(self, x):
        # WRONG — hardcoded CPU tensor
        mask = torch.ones(x.shape[0], x.shape[1])

        # CORRECT — match the device of the input tensor
        mask = torch.ones(x.shape[0], x.shape[1], device=x.device)
        return x * mask

Any tensor you create inside a forward() method must explicitly specify device=x.device or be created via operations on existing tensors (which inherit the device automatically).
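
For constants whose shape is known when the module is built, registering them as buffers is an alternative to passing device= on every forward call. The following is a minimal sketch under that assumption; the module name MaskedScale and its sizes are invented for illustration.

```python
import torch
import torch.nn as nn


class MaskedScale(nn.Module):  # hypothetical module name
    def __init__(self, features: int):
        super().__init__()
        # Buffers are not trained, but they follow the module through
        # .to(device) and are saved in state_dict() like parameters.
        self.register_buffer("mask", torch.ones(features))

    def forward(self, x):
        return x * self.mask


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = MaskedScale(4).to(device)  # moves the buffer along with the module
x = torch.randn(2, 4, device=device)
out = model(x)
print(model.mask.device, out.device)
```

A plain attribute such as self.mask = torch.ones(features) would stay on the CPU after model.to(device), which reproduces the mismatch error.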

Fix 3: CUDA Device-Side Assert — Finding the Real Error

This is PyTorch’s most confusing error pattern. The actual cause is hidden because CUDA runs asynchronously:

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,
so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Step 1: Make CUDA synchronous to get the real stack trace:

# Linux / macOS
CUDA_LAUNCH_BLOCKING=1 python train.py

# Windows PowerShell
$env:CUDA_LAUNCH_BLOCKING=1
python train.py

With CUDA_LAUNCH_BLOCKING=1, the error appears at the correct line.

Step 2: The most common real error is an invalid class index in CrossEntropyLoss.

CrossEntropyLoss expects labels in [0, num_classes - 1]. If any label is greater than or equal to num_classes, or is negative (other than ignore_index, which defaults to -100), CUDA triggers the assert:

num_classes = 5
criterion = nn.CrossEntropyLoss()

# WRONG — label 5 is out of range for num_classes=5
labels = torch.tensor([0, 2, 5], device="cuda")  # 5 >= num_classes
logits = model(X)
loss = criterion(logits, labels)  # device-side assert

# CORRECT — labels must be in [0, 4]
labels = torch.tensor([0, 2, 4], device="cuda")
loss = criterion(logits, labels)  # works

Add a validation check before the loss call during debugging:

assert labels.min() >= 0, f"Negative label: {labels.min()}"
assert labels.max() < num_classes, f"Label {labels.max()} >= num_classes {num_classes}"
assert labels.dtype == torch.long, f"Labels must be torch.long, got {labels.dtype}"

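Other common causes of device-side asserts: an index out of bounds in torch.gather(), torch.index_select(), or nn.Embedding (a token id greater than or equal to num_embeddings), and BCELoss inputs outside [0, 1], which includes NaN outputs from an unstable model.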

Fix 4: torch.cuda.is_available() Returns False

After installing PyTorch, this is the first check:

import torch
print(torch.cuda.is_available())  # False — why?
print(torch.version.cuda)         # None — CPU-only build installed

If torch.version.cuda is None, you have a CPU-only PyTorch build installed. The pip install torch default on some systems installs the CPU variant.

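Check the CUDA version your driver supports and, if the toolkit is installed, its version:

nvidia-smi
# The header shows "CUDA Version: 12.x", the highest version the driver supports

nvcc --version
# nvcc: NVIDIA (R) Cuda compiler driver, release 12.1

Reinstall PyTorch with a CUDA build your driver supports. The pip wheels bundle their own CUDA runtime, so the number that has to match is the one reported by nvidia-smi; the nvcc --version output only matters if you compile custom CUDA extensions: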

# Clear old installation and cache
pip uninstall torch torchvision torchaudio -y
pip cache purge

# CUDA 12.1
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# CUDA 11.8
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Verify
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"

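On Windows, a plain pip install torch pulls the CPU-only build from PyPI, so the --index-url flag is required there. On Linux the default PyPI wheel is already a CUDA build, and --index-url is how you pick a specific CUDA version.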

If nvcc is not found, the CUDA toolkit is not installed (only the driver is). NVIDIA drivers and the CUDA toolkit are separate packages, and the PyTorch wheels do not need the toolkit because they ship their own CUDA runtime. Check GPU detection with nvidia-smi: if that works, the driver is present and a missing nvcc is not the cause. The same driver verification steps apply when running local LLMs; see Ollama not working for GPU detection diagnostics you can run independently of PyTorch.

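Common Mistake: Installing a PyTorch build for a newer CUDA major version than your driver supports, for example a cu121 wheel on a driver that reports CUDA 11.x in nvidia-smi. A minor version mismatch is often acceptable (e.g., PyTorch cu121 on a CUDA 12.4 system), and newer drivers generally run older builds, but a driver that is too old for the wheel leaves torch.cuda.is_available() returning False.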
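
The version comparison can be sketched in a few lines. This is a conservative check under stated assumptions: the sample header line imitates the format nvidia-smi prints, the function names are invented, and wheel tags are assumed to end in a single-digit minor version (cu118, cu121).

```python
import re

# Illustrative header line in the format printed by `nvidia-smi`.
SMI_HEADER = "| NVIDIA-SMI 550.54.14    Driver Version: 550.54.14    CUDA Version: 12.4 |"


def driver_cuda_version(smi_output: str) -> tuple:
    """Return the highest CUDA version the driver reports, e.g. (12, 4)."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    if match is None:
        raise ValueError("no 'CUDA Version' field found in nvidia-smi output")
    return int(match.group(1)), int(match.group(2))


def wheel_cuda_version(tag: str) -> tuple:
    """Turn a wheel tag such as 'cu121' into (12, 1). Assumes a 1-digit minor."""
    digits = tag[2:]
    return int(digits[:-1]), int(digits[-1])


def wheel_is_safe_choice(tag: str, smi_output: str) -> bool:
    # Conservative rule: pick a wheel whose CUDA version does not exceed
    # the version the driver reports.
    return wheel_cuda_version(tag) <= driver_cuda_version(smi_output)


for tag in ("cu118", "cu121", "cu126"):
    print(tag, wheel_is_safe_choice(tag, SMI_HEADER))
```

A False result here is deliberately cautious: within the same major version, a build with a newer minor version often still runs.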

Fix 5: Inplace Operation Breaks Gradient Computation

RuntimeError: one of the variables needed for gradient computation has been
modified by an inplace operation: [torch.cuda.FloatTensor [128, 10]], which is
output 0 of LinearBackward0, is at version 2; expected version 0.

PyTorch tracks a version counter on every tensor. Inplace operations (+=, [i] = x, .fill_()) increment the version. If the version doesn’t match what the autograd graph recorded, backprop fails.

Replace inplace operations with their out-of-place equivalents:

# WRONG: in-place edit of a tensor that autograd saved for backward
x = torch.randn(10, requires_grad=True)
y = torch.sigmoid(x)  # sigmoid saves its output y to compute the gradient
y += 1                # Inplace: y goes from version 0 to version 1
y.sum().backward()    # Error: the saved y was modified

# CORRECT: out-of-place creates a new tensor
x = torch.randn(10, requires_grad=True)
y = torch.sigmoid(x)
y = y + 1             # New tensor; the saved sigmoid output is untouched
y.sum().backward()    # Works

Common pattern in RNN loops — collecting hidden states:

# RISKY: inplace writes into a pre-allocated tensor; fails if an op saves outputs for backward between writes
outputs = torch.zeros(seq_len, batch_size, hidden_dim)
for t in range(seq_len):
    h = cell(inputs[t], h)
    outputs[t] = h  # Inplace write into outputs

# CORRECT — accumulate in a list, stack at the end
outputs = []
for t in range(seq_len):
    h = cell(inputs[t], h)
    outputs.append(h)
outputs = torch.stack(outputs)  # Assembles without inplace ops

If you need to write into a pre-allocated buffer and the tensor doesn’t need gradients (output storage, not computation), use .detach() before assignment:

buffer[t] = h.detach()  # Detached — safe to write inplace

Fix 6: DataLoader Crashes on Windows — num_workers Error

RuntimeError: An attempt has been made to start a new process before the current
process has finished its bootstrapping phase. This probably means that you are on
Windows and you forgot to use the proper idiom in the main module:

    if __name__ == '__main__':
        ...

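Windows spawns new processes by re-importing the entire script, which causes recursive spawning unless the entry point is guarded. macOS has also defaulted to spawn since Python 3.8. Linux has traditionally forked processes, which avoids the re-import.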
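
To see which behavior applies on a given machine, the standard library can report the active start method. This is a small diagnostic sketch; the function name describe_start_method is made up for this example.

```python
import multiprocessing as mp
import sys


def describe_start_method() -> str:
    """Report how this interpreter creates worker processes."""
    method = mp.get_start_method()
    if method == "fork":
        note = "children copy the parent's memory; the script is not re-imported"
    elif method == "spawn":
        note = "children start a fresh interpreter and re-import the main module"
    else:
        note = "children come from a separate server process; keep the main guard"
    return f"{sys.platform}: start method '{method}' ({note})"


if __name__ == "__main__":
    print(describe_start_method())
```

Anything other than fork means DataLoader workers import your script again, so unguarded top-level code runs once per worker.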

Fix: wrap everything in if __name__ == '__main__':

import torch
from torch.utils.data import DataLoader, TensorDataset

def train():
    dataset = TensorDataset(torch.randn(1000, 100), torch.randint(0, 10, (1000,)))
    loader = DataLoader(dataset, batch_size=32, num_workers=4, shuffle=True)

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = torch.nn.Linear(100, 10).to(device)

    for X, y in loader:
        X, y = X.to(device), y.to(device)
        output = model(X)

if __name__ == "__main__":    # Required on Windows
    train()

Quick fix for notebooks or scripts where restructuring is inconvenient:

loader = DataLoader(dataset, batch_size=32, num_workers=0)  # Disables multiprocessing

num_workers=0 runs data loading in the main process. It’s slower for I/O-bound datasets but avoids the Windows spawn issue entirely.

For DataLoader multiprocessing behavior differences between platforms, see Python multiprocessing not working.

Fix 7: Dtype Mismatch — Float32 vs. Float64

RuntimeError: expected scalar type Float but found Double
RuntimeError: mat1 and mat2 must have the same dtype, but got Double and Float

PyTorch model parameters default to float32. NumPy arrays default to float64. Converting NumPy to a tensor without specifying the dtype preserves float64:

import numpy as np
import torch

data = np.array([[1.0, 2.0, 3.0]])   # float64 by default
tensor = torch.from_numpy(data)       # still float64 (Double)

model = torch.nn.Linear(3, 1)        # float32
output = model(tensor)               # RuntimeError: float32 ≠ float64

Fix: convert to float32 explicitly:

# Option 1 — call .float() to convert to float32
tensor = torch.from_numpy(data).float()

# Option 2 — specify dtype during tensor creation
tensor = torch.tensor(data, dtype=torch.float32)

# Option 3 — convert the numpy array first
data = data.astype(np.float32)
tensor = torch.from_numpy(data)

If the mismatch is inside your model (e.g., a custom layer creates float64 constants):

class MyLayer(nn.Module):
    def forward(self, x):
        # WRONG: np.linspace returns float64, so x * scale is promoted to float64
        scale = torch.from_numpy(np.linspace(0.5, 1.0, x.shape[-1]))

        # CORRECT: create the constant with the input's dtype and device
        scale = torch.linspace(0.5, 1.0, x.shape[-1], dtype=x.dtype, device=x.device)
        return x * scale

Fix 8: NaN Loss

Loss becoming NaN mid-training and never recovering is almost always one of three causes: a numerical singularity in the loss function, exploding gradients, or a learning rate that’s too high.

First: check where NaN enters. Add detection to your training loop:

for step, (X, y) in enumerate(dataloader):
    X, y = X.to(device), y.to(device)
    optimizer.zero_grad()

    output = model(X)
    loss = criterion(output, y)

    if torch.isnan(loss):
        print(f"NaN at step {step}")
        print(f"  output range: [{output.min():.3f}, {output.max():.3f}]")
        print(f"  output has NaN: {torch.isnan(output).any()}")
        break  # Stop before NaN propagates into weights

    loss.backward()
    optimizer.step()

Gradient clipping is the standard fix for exploding gradients. Apply it after loss.backward() and before optimizer.step():

loss.backward()

# Clip all parameter gradients to a maximum L2 norm of 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

optimizer.step()

clip_grad_norm_ scales the entire gradient vector down if its norm exceeds max_norm. Values between 0.5 and 5.0 are common depending on the architecture.
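
The scaling rule is easy to verify by hand. The sketch below reimplements the idea in pure Python over a flat list of gradient values; it is an approximation for illustration, not PyTorch's code, and the small eps term is an assumption modeled on the usual guard against division by zero.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Pure-Python sketch of global L2-norm clipping over a flat gradient list."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # The coefficient is capped at 1.0 so small gradients are left unchanged.
    coef = min(1.0, max_norm / (total_norm + eps))
    return [g * coef for g in grads], total_norm


grads = [3.0, 4.0, 12.0]  # global L2 norm is 13.0
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)

# Gradients already under the limit pass through untouched.
small, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(small)
```

Because every component is multiplied by the same coefficient, the direction of the update is preserved and only its length changes.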

Learning rate is the other frequent cause. If NaN appears in the first few steps, try reducing lr by 10x:

# Start conservative
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Rather than
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

With mixed precision, GradScaler guards against float16 gradient underflow and skips any optimizer step whose gradients contain inf or NaN. If you’re using float16 AMP without GradScaler, the combination of float16 and very small or very large gradients is a common NaN source:

from torch.amp import autocast, GradScaler

scaler = GradScaler(device="cuda")

for X, y in dataloader:
    optimizer.zero_grad()

    with autocast(device_type="cuda", dtype=torch.float16):
        output = model(X.to(device))
        loss = criterion(output, y.to(device))

    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                             # Unscale before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)
    scaler.update()

Call scaler.unscale_(optimizer) before clip_grad_norm_ — otherwise you’re clipping the scaled gradients, not the true ones.

PyTorch vs TensorFlow vs JAX vs Lightning vs MosaicML vs DeepSpeed

Most of the errors above are about PyTorch’s eager-mode + caching-allocator design. Switching frameworks does not “fix” your code, but knowing how each one handles the same problem helps you decide whether to climb the stack (Lightning, MosaicML) or change the runtime entirely (TensorFlow, JAX).

TensorFlow 2.x executes eagerly by default, but idiomatic TF code (Keras model.fit, tf.function) runs as compiled graphs, so in practice it leans graph-first where PyTorch leans eager. CUDA OOM looks the same, but TF reserves nearly all GPU memory up front by default, and tf.config.experimental.set_memory_growth switches it to on-demand allocation, the closest analogue to PyTorch’s caching allocator. Device placement errors are rarer because TF auto-places tensors, but debugging that auto-placement is harder. NaN losses in TF are tracked via tf.debugging.check_numerics rather than per-tensor isnan. For the TF-specific equivalents of these failure modes, see tensorflow not working.

JAX is functional and trace-compiled. There is no .to(device) on a model; arrays land on the default device and XLA handles placement unless you call jax.device_put. There is no inplace operation problem because arrays are immutable. The flip side: under jit, shapes must be static, control flow that depends on traced values must use jax.lax.cond or jax.lax.scan, and a Python print runs only at trace time (use jax.debug.print for runtime values). JAX errors are concentrated at trace time, not run time, the opposite of CUDA’s async asserts. You trade async surprise for trace-time strictness.

PyTorch Lightning wraps the same PyTorch you already use. It does not change the failure modes; it provides a LightningModule with training_step, validation_step, and built-in distributed strategies. If your bugs come from a missing optimizer.zero_grad(), a missed model.eval(), or validation running without torch.no_grad(), Lightning’s built-in loops eliminate that class of bug. Mixed precision becomes precision="16-mixed" instead of manual GradScaler.

MosaicML Composer is a higher-level training library aimed at LLM-scale runs. It bundles Lightning-style structure with built-in speedups (selective backprop, ALiBi, FSDP defaults). For single-GPU prototypes it is overkill; for 70B-class training it removes a class of distributed setup errors.

DeepSpeed is what you reach for when a 70B model does not fit. ZeRO Stage 3 shards optimizer states, gradients, and parameters across GPUs — so a model that would OOM on one card runs across eight. Activation checkpointing combined with offload-to-CPU/NVMe extends this further. DeepSpeed errors are about communication topology (NCCL timeout, mismatched world size) rather than the local CUDA errors covered above. If you are scaling beyond a single node and seeing the same CUDA OOM no matter what you cut, the answer is sharding, not smaller batches.
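
A back-of-the-envelope sketch shows why sharding changes the picture. It assumes the common mixed-precision Adam accounting of 16 bytes per parameter (2 for float16 weights, 2 for float16 gradients, 12 for float32 optimizer state), ignores activations and framework overhead, and uses a hypothetical 7B-parameter model; the function name is invented for this example.

```python
def training_state_gib(num_params: float, num_gpus: int = 1, shard: bool = False) -> float:
    """Rough per-GPU memory for model states under mixed-precision Adam.

    Assumed accounting per parameter: 2 bytes fp16 weights, 2 bytes fp16
    gradients, 12 bytes fp32 optimizer state (master weights, momentum,
    variance). Activations and temporary buffers are ignored.
    """
    bytes_per_param = 2 + 2 + 12
    total_bytes = num_params * bytes_per_param
    if shard:
        # Sketch of full sharding: all three state types split evenly.
        total_bytes /= num_gpus
    return total_bytes / 2**30


params = 7e9  # hypothetical 7B-parameter model
unsharded = training_state_gib(params)
sharded = training_state_gib(params, num_gpus=8, shard=True)
print(f"unsharded: {unsharded:.1f} GiB per GPU")
print(f"sharded across 8 GPUs: {sharded:.1f} GiB per GPU")
```

Under these assumptions the model states alone exceed a single 80 GB card, while the sharded share per GPU leaves room for activations. No batch-size reduction closes that gap, because model states do not depend on batch size.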

Eager vs graph is the deepest split. PyTorch eager + torch.compile() is the modern compromise — write Python, compile hot paths. TF + tf.function and JAX + jit start from the opposite end. The cost of switching is rarely the API; it is rewriting your debugging mental model.

Still Not Working?

Model Not Training — Loss Not Decreasing

If loss prints but never changes, check these in order:

  1. optimizer.zero_grad() is missing — gradients accumulate across steps and explode
  2. loss.backward() is called on the wrong tensor — you’re differentiating a detached or constant value
  3. Model is in eval mode: model.eval() disables dropout and makes batchnorm use its running statistics instead of updating them; call model.train() at the start of each epoch

model.train()           # Back to training mode
optimizer.zero_grad()
loss = criterion(model(X), y)
loss.backward()
optimizer.step()

Slow Training — GPU Utilization Low

If nvidia-smi shows low GPU utilization (<50%), the bottleneck is usually the CPU data pipeline:

loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=4,           # Parallel data loading (Linux/macOS)
    pin_memory=True,         # Faster host-to-GPU transfer
    persistent_workers=True, # Keep worker processes alive between epochs
    prefetch_factor=2,       # Pre-load 2 batches per worker
)

pin_memory=True pins CPU memory for faster CUDA transfers. Only use it when training on GPU.

RuntimeError: Expected input batch_size to match target batch_size

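The batch dimension doesn’t match between your model output and your labels. Common cause: a reshape in forward() that hard-codes the batch size or the flattened feature count (for example x.view(32, -1)), which breaks when the last batch in an epoch has fewer samples than batch_size. Reshape with x.view(x.size(0), -1) or x.flatten(1) instead; built-in losses already handle variable batch sizes. As a workaround, drop_last=True in your DataLoader discards the short final batch: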

loader = DataLoader(dataset, batch_size=32, drop_last=True)
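
One way a short last batch can produce this error is a reshape with a hard-coded batch size. The arithmetic below is a pure-Python sketch of that case; the dataset size, feature count, and the assumption that the model calls x.view(32, -1) are illustrative and not taken from any particular model.

```python
def shape_after_view(batch: int, features: int, hardcoded_rows: int) -> tuple:
    """Shape produced by x.view(hardcoded_rows, -1) on a (batch, features) tensor."""
    total = batch * features
    if total % hardcoded_rows != 0:
        raise ValueError("view would fail: element count is not divisible")
    return hardcoded_rows, total // hardcoded_rows


dataset_size, batch_size, features = 1000, 32, 64
last_batch = dataset_size % batch_size  # samples left for the final batch

# A full batch reshapes as intended.
print("full batch:", shape_after_view(batch_size, features, 32))

# The short batch still reshapes without error, but into the wrong row
# count, so the output no longer lines up with the targets.
wrong = shape_after_view(last_batch, features, 32)
print("last batch:", wrong, "vs targets:", last_batch)

# Reshaping by the actual batch size keeps rows and targets aligned.
right = shape_after_view(last_batch, features, last_batch)
print("fixed:", right)
```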

Checking PyTorch Compile Status

torch.compile() (introduced in PyTorch 2.0) can speed up training, by up to 2x on some models, but adds compilation overhead on the first batch and again when input shapes change. If the compiled model crashes but the eager model works, disable compile to isolate the issue:

model = MyModel().to(device)
# model = torch.compile(model)  # Comment out to debug

for X, y in dataloader:
    loss = criterion(model(X.to(device)), y.to(device))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

For Python-level concurrency errors that surface in training pipelines, see Python multiprocessing not working.

If you’re building LLM pipelines that call PyTorch models through a chain or agent, see LangChain Python not working for the integration patterns between LangChain and custom torch inference code.

RuntimeError: CUDA error: an illegal memory access was encountered

This is the second most common async CUDA error after device-side assert. The culprit is almost always an out-of-bounds index in a custom CUDA kernel or torch.gather with a malformed index tensor. Running with CUDA_LAUNCH_BLOCKING=1 points the stack trace at the offending line. TORCH_USE_CUDA_DSA, which the error message also mentions, is a build-time option: it enables device-side assertions only in a PyTorch build compiled with it, so setting it as an environment variable on a prebuilt wheel has no effect. If that does not surface a clear cause, suspect mixed-precision overflow: a float16 activation that became inf and was then converted to an index.

Model Loaded From Hugging Face Diverges From Reference

If you load a model via transformers and outputs differ from the official demo, the dtype is the most likely cause. Hugging Face defaults to float32 unless you pass torch_dtype=torch.float16 or torch.bfloat16. Reference benchmarks usually run in bfloat16. Match the dtype before chasing model bugs. The full quantization and device_map story is in huggingface transformers not working.

Distributed Training Hangs on dist.init_process_group

A common single-node deadlock: MASTER_ADDR or MASTER_PORT points at the wrong host or a blocked port, or WORLD_SIZE is larger than the number of processes actually launched, so the worker processes wait for each other indefinitely. Set the rendezvous env vars explicitly (or let torchrun set them) and verify NCCL can reach all peers with NCCL_DEBUG=INFO. For Kubernetes-orchestrated training jobs, the readiness probe firing before the leader is ready also produces an identical hang; the orchestration side is covered in helm not working.
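
A preflight check before dist.init_process_group can turn a silent hang into an immediate error message. The sketch below is hypothetical: the function name is invented, and it assumes the environment-variable rendezvous, where each process reads MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE.

```python
import os

REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")


def check_rendezvous_env(env=None) -> list:
    """Hypothetical preflight check; returns a list of problems found."""
    env = os.environ if env is None else env
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]
    for name in ("MASTER_PORT", "RANK", "WORLD_SIZE"):
        value = env.get(name)
        if value and not value.isdigit():
            problems.append(f"{name}={value!r} is not a non-negative integer")
    if not problems and int(env["RANK"]) >= int(env["WORLD_SIZE"]):
        problems.append("RANK must be smaller than WORLD_SIZE")
    return problems


# Illustrative values for a two-process run on one machine.
example = {"MASTER_ADDR": "127.0.0.1", "MASTER_PORT": "29500", "RANK": "0", "WORLD_SIZE": "2"}
print(check_rendezvous_env(example))
print(check_rendezvous_env({"RANK": "0"}))
```

Calling check_rendezvous_env() with no argument inspects the real environment. Launchers such as torchrun normally set these variables for you, so the check matters most for hand-rolled launch scripts.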

FixDevs

Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.
