Fix: joblib Not Working — Parallel Backends, Memory Cache, and Pickling Errors
Quick Answer
How to fix common joblib problems: Parallel with n_jobs=-1 running slower than a serial loop, Memory cache misses, choosing between the loky, threading, and multiprocessing backends, pickling failures with lambdas, oversized dump files, and hangs under pytest.
The Error
You parallelize a loop with Parallel(n_jobs=-1) and it’s slower than serial:
from joblib import Parallel, delayed
import time
def slow(x):
    return x ** 2
result = Parallel(n_jobs=-1)(delayed(slow)(i) for i in range(100))
# Slower than: [slow(i) for i in range(100)]
Or Memory cache misses for what looks like the same input:
from joblib import Memory
memory = Memory("./cache", verbose=0)
@memory.cache
def expensive(x):
    return x ** 2
expensive(1)  # Computes
expensive(1)  # Computes again; should hit the cache but doesn't
Or pickling lambdas fails inside Parallel:
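results = Parallel(n_jobs=4, backend="multiprocessing")(delayed(lambda x: x ** 2)(i) for i in range(10))
# PicklingError or hang (the default loky backend serializes lambdas with cloudpickle, so this mostly bites on the multiprocessing backend)
Or joblib.dump writes giant files: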
import numpy as np
from joblib import dump
arr = np.zeros((1000, 1000), dtype=np.float32) # 4 MB
dump(arr, "data.joblib")
# File is 4 MB (dump does not compress by default); other tools compress better
Or pytest sessions hang when tests use joblib:
$ pytest tests/
# Tests using Parallel hang or fail with worker errors
joblib is the unsung workhorse of the Python scientific stack: scikit-learn uses it internally for n_jobs=-1, and it also handles caching expensive computations to disk and parallel scatter/gather. The default backend (loky) is robust but adds overhead; the threading backend is fast for I/O but limited by the GIL; multiprocessing has pickling constraints. Picking the right backend for the workload is half the battle. This guide covers the common issues.
Why This Happens
Parallel(n_jobs=N) starts worker processes by default (via loky). Starting a process has a fixed cost, often tens to hundreds of milliseconds depending on the platform, so for tiny tasks the overhead exceeds the savings. Workers also receive the function and its arguments by pickling, and closures over large data, open resources, and (on the multiprocessing backend) lambdas and locally defined functions don't pickle cleanly.
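To see the fixed overhead for yourself, a rough timing sketch like the one below can help. The function name `tiny` and the choice of 200 inputs and 2 workers are illustrative assumptions; absolute timings will vary by machine, so no particular numbers are claimed.

```python
import time
from joblib import Parallel, delayed

def tiny(x):
    # Microsecond-scale work: far cheaper than shipping the task to a worker
    return x * x

inputs = list(range(200))

start = time.perf_counter()
serial = [tiny(i) for i in inputs]
serial_seconds = time.perf_counter() - start

start = time.perf_counter()
parallel = Parallel(n_jobs=2)(delayed(tiny)(i) for i in inputs)
parallel_seconds = time.perf_counter() - start

# Same answers either way; only the wall-clock cost differs
print(f"serial:   {serial_seconds:.4f}s")
print(f"parallel: {parallel_seconds:.4f}s (includes worker startup and pickling)")
```

If the parallel time is larger, the tasks are too small to amortize the startup and serialization cost.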
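Memory keys each cached result on a hash of the function's arguments and also tracks the function's source code. NumPy arrays, pandas DataFrames, and most built-ins hash consistently, but a call silently misses the cache when the function body has changed since the result was stored, or when an argument's pickled form is not stable between calls (for example, a custom object that carries a timestamp or other changing state).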
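A quick way to check whether a call really hits the cache is to count how often the function body runs. This is a minimal sketch; the `calls` list and the temporary cache directory are assumptions made for the demonstration, not part of joblib.

```python
import tempfile
from joblib import Memory

cache_dir = tempfile.mkdtemp()  # throwaway location for this sketch
memory = Memory(cache_dir, verbose=0)

calls = []  # records every real execution of the function body

def square(x):
    calls.append(x)
    return x * x

cached_square = memory.cache(square)

first = cached_square(3)   # runs the body and stores the result on disk
second = cached_square(3)  # same argument hash: served from the cache
third = cached_square(4)   # different argument: runs the body again

print(calls)
```

If the list grows on a call you expected to be cached, compare the arguments of the two calls and check whether the function source changed in between.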
Fix 1: Basic Parallel Usage
from joblib import Parallel, delayed
import math
def slow_computation(x):
    return math.sqrt(x ** 4 + x ** 3 + x ** 2 + 1)
# Serial
result = [slow_computation(i) for i in range(1000)]
# Parallel: same result, distributed across cores
result = Parallel(n_jobs=-1)(delayed(slow_computation)(i) for i in range(1000))
# n_jobs=-1 means use all cores; -2 means all but one; etc.
delayed() wraps the function call into a "task" object. Without it, the function executes immediately (defeating the parallelism).
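To make the "task" idea concrete, the sketch below inspects what a delayed call produces. In the joblib releases I am aware of, it is a plain (function, args, kwargs) tuple; treat that shape as an observed detail, not a documented guarantee. The `power` function is a hypothetical example.

```python
from joblib import delayed

def power(base, exponent=2):
    return base ** exponent

# Calling the delayed wrapper does not run power(); it only records the call
task = delayed(power)(3, exponent=4)
func, args, kwargs = task

print(func.__name__, args, kwargs)

# A worker later does the equivalent of this
print(func(*args, **kwargs))
```

This is why forgetting delayed breaks things: Parallel expects these recorded calls, not finished results.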
Common Mistake: Forgetting delayed:
# WRONG: each call runs in the parent process as the generator is consumed, and Parallel receives finished values instead of tasks (it raises an error when it tries to unpack them)
results = Parallel(n_jobs=-1)(slow_computation(i) for i in range(1000))
# CORRECT
results = Parallel(n_jobs=-1)(delayed(slow_computation)(i) for i in range(1000))
When parallelism isn't worth it:
# Each task is microseconds, so overhead dominates
results = Parallel(n_jobs=-1)(delayed(lambda x: x * 2)(i) for i in range(100))
# Slower than serial because of pickling + process startup
# Each task is milliseconds or more, so parallelism wins
# (slow_expensive_function stands in for your own costly function)
results = Parallel(n_jobs=-1)(delayed(slow_expensive_function)(i) for i in range(100))
Pro Tip: As a rule of thumb, individual tasks should take more than about 10 ms each for parallelism to pay off with the default loky backend. For shorter tasks, batch many into a single delayed call:
def batch_process(batch):
    return [tiny_compute(x) for x in batch]
# Process 100-item batches in parallel
batches = [range(i, i+100) for i in range(0, 10000, 100)]
results = Parallel(n_jobs=-1)(delayed(batch_process)(b) for b in batches)
flattened = [r for batch in results for r in batch]
Fix 2: Choosing the Right Backend
from joblib import Parallel, delayed
# Default — multiprocessing via loky (robust, isolated)
Parallel(n_jobs=-1, backend="loky")(delayed(fn)(i) for i in range(100))
# Threading — fast for I/O-bound, limited by GIL for CPU
Parallel(n_jobs=-1, backend="threading")(delayed(fn)(i) for i in range(100))
# Pure multiprocessing (less robust than loky, similar perf)
Parallel(n_jobs=-1, backend="multiprocessing")(delayed(fn)(i) for i in range(100))
# Sequential (for debugging, runs serially)
Parallel(n_jobs=1)(delayed(fn)(i) for i in range(100))
Backend selection table:
| Backend | Best for | Tradeoffs |
|---|---|---|
| loky (default) | CPU-bound, robust | High process spawn overhead |
| threading | I/O-bound (network, disk) | GIL prevents CPU parallelism |
| multiprocessing | CPU-bound | Less robust than loky on macOS |
| sequential | Debugging | Just runs serially |
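Common Mistake: Using loky for pure I/O work (file reads, HTTP requests). The process overhead dominates, and threading is much faster because blocking I/O releases the GIL and threads are cheap to start. For CPU-bound pure-Python work, loky is the right choice because the GIL stops threads from running Python bytecode in parallel. NumPy-heavy work is the exception: BLAS/MKL routines release the GIL, so threading can work there too.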
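As a sketch of the I/O-bound case, the example below uses time.sleep as a stand-in for a network or disk wait. The `fake_download` function, the 0.05 s delay, and the 8 threads are illustrative assumptions.

```python
import time
from joblib import Parallel, delayed

def fake_download(i):
    # time.sleep stands in for a blocking I/O call; it releases the GIL
    time.sleep(0.05)
    return i * 2

start = time.perf_counter()
results = Parallel(n_jobs=8, backend="threading")(
    delayed(fake_download)(i) for i in range(16)
)
elapsed = time.perf_counter() - start

# Sixteen sequential waits of 0.05 s would take roughly 0.8 s
print(results[:4], f"{elapsed:.2f}s")
```

Because the waits overlap across threads, the elapsed time should land well under the serial total, with no process startup or pickling involved.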
For NumPy / PyTorch / TensorFlow:
# These libraries' C extensions release the GIL during heavy compute
# threading backend often works well for them
Parallel(n_jobs=-1, backend="threading")(
    delayed(np.dot)(a, b) for a, b in matrix_pairs
)
Fix 3: Pickling Constraints
Workers receive functions and arguments via pickle. Things that don't pickle with the standard pickler (used by the multiprocessing backend):
# WRONG on backend="multiprocessing": a lambda can't be pickled by the stdlib pickler
Parallel(n_jobs=4, backend="multiprocessing")(delayed(lambda x: x ** 2)(i) for i in range(10))
# PicklingError or hang
# WRONG on backend="multiprocessing": local function defined inside another function
def main():
    def helper(x):
        return x ** 2
    Parallel(n_jobs=4, backend="multiprocessing")(delayed(helper)(i) for i in range(10))
# CORRECT on every backend: top-level function
def helper(x):
    return x ** 2
def main():
    Parallel(n_jobs=4)(delayed(helper)(i) for i in range(10))
Use cloudpickle automatically with loky:
# loky uses cloudpickle by default — handles lambdas, local functions
# But still fails on:
# - Open file handles
# - Database connections
# - Thread/process locks
# - GUI objects
cloudpickle is more permissive than stdlib pickle and is loky's default, so most simple closures work. For complex cases, refactor to top-level functions.
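To check ahead of time whether an object would survive the standard pickler, a small probe like this can help. It uses only the standard library; `stdlib_picklable` and `make_local` are hypothetical helper names, and the probe says nothing about what cloudpickle can handle.

```python
import math
import pickle
import threading

def stdlib_picklable(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False

def make_local():
    def local(x):
        return x ** 2
    return local

checks = {
    "importable function (math.sqrt)": stdlib_picklable(math.sqrt),
    "lambda": stdlib_picklable(lambda x: x ** 2),
    "local function": stdlib_picklable(make_local()),
    "thread lock": stdlib_picklable(threading.Lock()),
}
for label, ok in checks.items():
    print(f"{label}: {ok}")
```

Objects that fail here are the ones to keep out of task arguments when using the multiprocessing backend; locks and similar OS-level resources fail under any pickler.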
Common Mistake: Passing a database connection or open file to a worker. These don’t pickle. Either re-open inside the worker, or pass connection parameters instead:
# WRONG
conn = create_connection()
Parallel(n_jobs=4)(delayed(query)(conn, sql) for sql in sqls)
# CORRECT — open connection in each worker
def query_with_new_conn(sql):
    conn = create_connection()
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.close()
Parallel(n_jobs=4)(delayed(query_with_new_conn)(sql) for sql in sqls)
For database connection patterns in parallel code, see SQLAlchemy not working and asyncpg not working.
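The "pass connection parameters, open inside the worker" pattern can be sketched end to end with sqlite3 from the standard library. The table, the `count_team` function, and the temporary database file are assumptions made up for this example.

```python
import os
import sqlite3
import tempfile
from joblib import Parallel, delayed

# Build a small throwaway database for the sketch
db_path = os.path.join(tempfile.mkdtemp(), "demo.sqlite")
setup = sqlite3.connect(db_path)
setup.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, team TEXT)")
setup.executemany(
    "INSERT INTO users (team) VALUES (?)",
    [("red",), ("red",), ("blue",)],
)
setup.commit()
setup.close()

def count_team(path, team):
    # The worker receives only a path (a plain string, trivially picklable)
    conn = sqlite3.connect(path)
    try:
        row = conn.execute(
            "SELECT COUNT(*) FROM users WHERE team = ?", (team,)
        ).fetchone()
        return team, row[0]
    finally:
        conn.close()

counts = Parallel(n_jobs=2)(
    delayed(count_team)(db_path, team) for team in ["red", "blue", "green"]
)
print(counts)
```

Each task opens and closes its own connection, so nothing unpicklable ever crosses the process boundary.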
Fix 4: Memory Cache
from joblib import Memory
memory = Memory("./cache_dir", verbose=0)
@memory.cache
def expensive(x, y):
    print(f"Computing for {x}, {y}")
    return x ** y
expensive(2, 10)  # Prints "Computing..." and returns 1024
expensive(2, 10)  # No print: cache hit, returns 1024
expensive(3, 10)  # Prints "Computing..." because different args create a new cache entry
Cache invalidation:
# Clear all cached results
memory.clear()
# Clear results for a specific function
expensive.clear()
# Force a recompute and overwrite the stored entry
# (call() bypasses the cache lookup; call_and_shelve() only returns a lazy reference and does not force a rerun)
expensive.call(2, 10)
Cache size management:
memory = Memory("./cache_dir", verbose=0)
# Trim to about 10 GB. Pruning is not automatic: call this periodically.
# (Older joblib releases took bytes_limit in the Memory constructor instead.)
memory.reduce_size(bytes_limit=10 * 1024 * 1024 * 1024)
Common Mistake: Caching functions with non-deterministic behavior. The cache assumes that the same args give the same result. If your function depends on:
- Current time (datetime.now())
- Random numbers (without a fixed seed)
- External state (DB rows, file contents)
The cache returns stale results without recomputing. Either avoid @memory.cache on these, or include the variable input as a function argument:
# WRONG
@memory.cache
def get_users():
    return db.fetch_all("SELECT * FROM users")
# First call caches forever; new users never appear
# CORRECT: include a freshness key
@memory.cache
def get_users(as_of_date):
    return db.fetch_all(f"SELECT * FROM users WHERE updated <= '{as_of_date}'")
Pro Tip: For per-process caching (no disk), use functools.lru_cache instead. joblib's Memory is for results that should survive a process restart and benefit from disk persistence (ML model training, expensive simulations). lru_cache is for in-memory deduplication during a single run: much faster, no disk I/O.
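For comparison, here is a minimal lru_cache sketch using only the standard library. The `slow_square` function and the `executions` list are illustrative assumptions used to make cache hits visible.

```python
from functools import lru_cache

executions = []  # records each real run of the function body

@lru_cache(maxsize=128)
def slow_square(x):
    executions.append(x)
    return x * x

slow_square(4)
slow_square(4)  # served from the in-memory cache
slow_square(5)

info = slow_square.cache_info()
print(info.hits, info.misses, executions)
```

The cache lives in the process and disappears when it exits, which is the main difference from joblib's on-disk Memory.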
Fix 5: dump / load for Model Persistence
from joblib import dump, load
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X_train, y_train)
# Save
dump(model, "model.joblib")
# Load later
loaded_model = load("model.joblib")
predictions = loaded_model.predict(X_test)
joblib is a commonly recommended format for scikit-learn models because it handles large NumPy arrays efficiently and can memory-map them on load.
Compression:
dump(model, "model.joblib.gz", compress=3)  # gzip level 3 (chosen from the .gz extension)
dump(model, "model.joblib.xz", compress=("xz", 3))  # LZMA
dump(model, "model.joblib.lz4", compress=("lz4", 1))  # LZ4 (fast; requires the lz4 package)
Compression tradeoffs:
| Format | Speed | Ratio | Use case |
|---|---|---|---|
| None | Fastest | 1.0x | Local dev |
| zlib (default if compress=N; gzip with a .gz filename) | Slow | ~3-4x | Standard |
| lz4 | Fast | ~2-3x | Production, speed matters |
| xz | Slow | ~5-8x | Long-term storage, ratio matters |
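The ratios in the table depend heavily on the data, so it is worth measuring on your own objects. The sketch below compares file sizes for one array; note that an all-zeros array compresses far better than real model weights would, so it only illustrates the mechanics. File names and the temporary directory are assumptions.

```python
import os
import tempfile
import numpy as np
from joblib import dump, load

out_dir = tempfile.mkdtemp()
arr = np.zeros((1000, 1000), dtype=np.float32)  # about 4 MB in memory

raw_path = os.path.join(out_dir, "raw.joblib")
zlib_path = os.path.join(out_dir, "compressed.joblib")

dump(arr, raw_path)               # no compression (the default)
dump(arr, zlib_path, compress=3)  # zlib at level 3

raw_size = os.path.getsize(raw_path)
zlib_size = os.path.getsize(zlib_path)
print(f"raw: {raw_size} bytes, compressed: {zlib_size} bytes")

# Loading is the same call either way; joblib detects the compression
restored = load(zlib_path)
```

Timing the dump and load calls on a representative model is the most reliable way to pick a level.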
Memory-mapped loading for large arrays:
# Don't load into RAM: memory-map from disk (works only for uncompressed dumps)
loaded = load("huge_model.joblib", mmap_mode="r")
# Access loaded.feature_importances_ etc.; pages are read in as accessed
For very large models (multi-GB), memmap avoids loading everything into RAM upfront.
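A small sketch of memory-mapped loading, assuming an uncompressed dump of a single NumPy array (the `weights` name and file path are made up for the example):

```python
import os
import tempfile
import numpy as np
from joblib import dump, load

path = os.path.join(tempfile.mkdtemp(), "weights.joblib")
weights = np.arange(1_000_000, dtype=np.float64)
dump(weights, path)  # left uncompressed so mmap_mode can take effect

# The array data stays on disk; slices are read in on demand
mapped = load(path, mmap_mode="r")
print(type(mapped).__name__)

chunk_sum = float(mapped[:10].sum())  # touches only the first few values
print(chunk_sum)
```

With mode "r" the mapped array is read-only, which is usually what you want for a persisted model.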
Common Mistake: Using pickle for sklearn models instead of joblib.dump. Both work, but joblib is optimized for objects that carry large NumPy arrays (tree ensembles, neural networks, anything with weight matrices): it writes array buffers directly, which is usually faster and lighter on memory, and it supports compression and memory-mapped loading. Use joblib unless you have a specific reason for pickle.
For NumPy-specific patterns that interact with joblib’s array handling, see NumPy not working.
Fix 6: Progress Bars and Verbose Output
from joblib import Parallel, delayed
# Built-in verbose mode prints progress messages (to stderr unless verbose is above 50)
result = Parallel(n_jobs=-1, verbose=10)(
    delayed(slow)(i) for i in range(100)
)
# [Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 0.1s
# [Parallel(n_jobs=-1)]: Done 10 tasks | elapsed: 0.5s
# ...
Verbose levels:
- 0: silent
- 1 to 10: periodic progress, more frequent as the level rises
- above 10: every task reported
- above 50: output goes to stdout
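Use tqdm for a nice progress bar:
from tqdm import tqdm
from joblib import Parallel, delayed
def run_with_progress(tasks, fn):
    tasks = list(tasks)
    # Updating a tqdm bar from inside the task does not work with process workers:
    # each worker would update its own copy. Consume results in the parent instead.
    # return_as="generator" needs joblib 1.3 or newer.
    results = Parallel(n_jobs=-1, return_as="generator")(delayed(fn)(t) for t in tasks)
    return list(tqdm(results, total=len(tasks)))
results = run_with_progress(range(1000), slow_computation)
Or use tqdm_joblib: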
pip install tqdm-joblib
from tqdm_joblib import tqdm_joblib
from joblib import Parallel, delayed
with tqdm_joblib(desc="Processing", total=1000):
    results = Parallel(n_jobs=-1)(delayed(slow)(i) for i in range(1000))
Cleaner integration: the progress bar updates as workers finish.
Fix 7: pytest Integration
joblib workers can interfere with pytest’s worker management:
$ pytest tests/
# Hangs or fails in tests that use Parallel(n_jobs=-1)
Use n_jobs=1 during testing:
# my_module.py
import os
from joblib import Parallel, delayed
def compute_parallel(items):
    # TESTING is an environment variable you set yourself in the test run
    n_jobs = 1 if os.environ.get("TESTING") else -1
    return Parallel(n_jobs=n_jobs)(delayed(work)(i) for i in items)
Or set joblib's global default:
# conftest.py
import os
os.environ["JOBLIB_TEMP_FOLDER"] = "/tmp/joblib-tests"
# Optionally cap the CPU count joblib sees, so n_jobs=-1 resolves to a single worker
# (LOKY_MAX_CPU_COUNT is read by loky's cpu_count; confirm the behavior on your joblib version)
os.environ["LOKY_MAX_CPU_COUNT"] = "1"
Common Mistake: Mixing pytest-xdist (pytest -n auto) with joblib's n_jobs=-1. Both spawn workers, so combined you get too many processes, slowdown, and sometimes deadlock. Disable joblib parallelism in tests (set n_jobs=1 or use an env var to switch).
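One way to implement the env-var switch without a custom variable is to look for the variables pytest and pytest-xdist set themselves. This is a sketch under that assumption: `choose_n_jobs` is a hypothetical helper, and it relies on pytest setting PYTEST_CURRENT_TEST while a test runs and on pytest-xdist setting PYTEST_XDIST_WORKER in its worker processes.

```python
import os

def choose_n_jobs(requested=-1, environ=None):
    """Return 1 when running under pytest or a pytest-xdist worker."""
    env = os.environ if environ is None else environ
    if "PYTEST_XDIST_WORKER" in env or "PYTEST_CURRENT_TEST" in env:
        return 1
    return requested

# Outside of tests the requested value passes through unchanged
print(choose_n_jobs(-1, environ={}))
# Inside an xdist worker the helper falls back to sequential execution
print(choose_n_jobs(-1, environ={"PYTEST_XDIST_WORKER": "gw0"}))
```

Application code can then call Parallel(n_jobs=choose_n_jobs()) and stay sequential during test runs. Passing `environ` explicitly keeps the helper itself easy to test.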
For pytest async fixture patterns that complement joblib testing, see pytest fixture not found.
Fix 8: Memory and Temp File Management
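joblib workers write large arrays to shared memory or a temp folder for efficient transfer:
import os
os.environ["JOBLIB_TEMP_FOLDER"] = "/path/to/fast-disk"
By default joblib prefers /dev/shm when it is available and otherwise falls back to the system temp directory (usually /tmp). On systems where that location is small, large parallel jobs fill it up.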
Use shared memory for read-only large arrays:
from joblib import Parallel, delayed
import numpy as np
big_array = np.zeros((100_000, 100_000), dtype=np.float32)
# 40 GB array — would be costly to pickle to each worker
# Use memmap so workers share memory
np.save("big_array.npy", big_array)
arr = np.load("big_array.npy", mmap_mode="r")
def process(idx):
    return arr[idx].sum()
results = Parallel(n_jobs=8)(delayed(process)(i) for i in range(100_000))
# Workers access shared memory, with no per-worker copy
The max_nbytes parameter controls when joblib auto-memmaps:
Parallel(n_jobs=-1, max_nbytes="1M")(
    delayed(fn)(big_array) for _ in range(100)
)
# Args larger than 1MB are memmapped instead of pickled
Default is 1M, which is usually right; lower it so more arrays are shared through memmaps when memory is tight, or raise it when writing temp files costs more than pickling.
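The shared-memory pattern in this section uses a 40 GB array, which is impractical to try out. The following scaled-down sketch shows the same idea with a tiny array; it passes the memmap as an explicit argument instead of relying on a global, and the `row_sum` function and file name are assumptions for the example.

```python
import os
import tempfile
import numpy as np
from joblib import Parallel, delayed

npy_path = os.path.join(tempfile.mkdtemp(), "table.npy")
np.save(npy_path, np.arange(12, dtype=np.float64).reshape(4, 3))

# Reopen as a read-only memmap; data is read only when rows are touched
table = np.load(npy_path, mmap_mode="r")

def row_sum(data, idx):
    return float(data[idx].sum())

row_sums = Parallel(n_jobs=2)(
    delayed(row_sum)(table, i) for i in range(table.shape[0])
)
print(row_sums)
```

joblib hands memmapped arrays to workers by file reference, so each worker maps the same file instead of receiving its own copy of the data.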
Still Not Working?
joblib vs concurrent.futures vs multiprocessing.Pool
- joblib — Pickling-friendly, integrated with scikit-learn, memory cache. Best for scientific Python.
- concurrent.futures — Stdlib, simpler API, less integrated with sklearn. Best for general async work.
- multiprocessing.Pool — Stdlib, more options, more boilerplate. Use when you need its specific features.
For sklearn / NumPy / SciPy ecosystems, joblib is the path of least resistance. For pure Python with no scientific stack, concurrent.futures is lighter.
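For a sense of what the lighter standard-library route looks like, here is a minimal concurrent.futures sketch of the same map-style loop. The `work` function and the worker count are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def work(x):
    return x * x

items = list(range(10))

# Thread-based map over the inputs, comparable in spirit to
# Parallel(n_jobs=4, backend="threading")(delayed(work)(i) for i in items)
with ThreadPoolExecutor(max_workers=4) as pool:
    squares = list(pool.map(work, items))  # map preserves input order

print(squares)
```

Swapping in ProcessPoolExecutor gives process-based parallelism, at which point the standard pickling limits on lambdas and local functions apply.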
Distributed Joblib (Dask Backend)
For scaling beyond one machine:
pip install dask distributed
from joblib import Parallel, delayed, parallel_backend
from dask.distributed import Client
client = Client("scheduler-address:8786")
with parallel_backend("dask"):
    results = Parallel(n_jobs=100)(delayed(fn)(i) for i in range(10000))
The Dask backend distributes work across a cluster, so scikit-learn's n_jobs=-1 with the Dask backend can scale to hundreds of cores. For Dask-specific patterns, see Dask not working.
Threading Backend with NumPy
import os
# Limit NumPy/BLAS threads BEFORE importing numpy
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
import numpy as np
from joblib import Parallel, delayed
# Now joblib's threading backend can parallelize NumPy work cleanly
# Without limiting BLAS, NumPy operations multi-thread internally
# and joblib threading + BLAS threading = oversubscription
Pro Tip: Set OPENBLAS_NUM_THREADS=1 (or MKL_NUM_THREADS=1) when using joblib's threading backend with NumPy. Otherwise NumPy spawns BLAS threads on top of joblib's threads, the OS thrashes between them, and performance tanks. With BLAS limited to 1 thread, joblib threading is far more likely to deliver the expected speedup.
Worker Process Lifetime
from joblib import Parallel, delayed
# Default: loky workers are reused across tasks and across Parallel calls
Parallel(n_jobs=4)(...)
# Cap the threads each worker may start (BLAS, OpenMP); this does not change worker lifetime
from joblib import parallel_backend
with parallel_backend("loky", inner_max_num_threads=1):
    Parallel(n_jobs=4)(...)
inner_max_num_threads=1 is useful when workers themselves spawn threads (BLAS, etc.) and you want to limit total parallelism.
Integrating with scikit-learn
scikit-learn uses joblib internally: when you construct an estimator with n_jobs=-1 (for example RandomForestClassifier(n_jobs=-1)), its fit and predict calls use joblib's Parallel under the hood:
from sklearn.ensemble import RandomForestClassifier
from joblib import parallel_backend
# Use joblib's threading backend for sklearn
with parallel_backend("threading"):
    model = RandomForestClassifier(n_jobs=-1)
    model.fit(X_train, y_train)
For scikit-learn patterns that benefit from joblib tuning, see scikit-learn not working.
Debugging Worker Failures
from joblib import Parallel, delayed
# Force sequential for debugging
Parallel(n_jobs=1)(delayed(fn)(i) for i in range(10))
# Set a timeout so a stuck task raises instead of hanging forever
Parallel(n_jobs=4, timeout=300)(delayed(fn)(i) for i in range(10))
# For backend="multiprocessing" only, the start method can be set via an env var
import os
os.environ["JOBLIB_START_METHOD"] = "spawn"
If a worker silently fails (no error, just hangs), try n_jobs=1 first to surface the actual exception. The parallel wrapper sometimes obscures the underlying error.
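When one input out of many is the culprit, it can help to capture exceptions per task instead of letting the first failure abort the whole run. The sketch below is one possible approach, not a joblib feature; `risky` and `capture` are hypothetical names.

```python
from joblib import Parallel, delayed

def risky(x):
    # Stand-in for a task that fails on some inputs
    if x % 4 == 3:
        raise ValueError(f"bad input: {x}")
    return 10 / (x + 1)

def capture(fn, arg):
    """Run fn(arg) and return (arg, result, error) instead of raising."""
    try:
        return arg, fn(arg), None
    except Exception as exc:
        return arg, None, repr(exc)

outcomes = Parallel(n_jobs=2, backend="threading")(
    delayed(capture)(risky, i) for i in range(8)
)
failed = [(arg, err) for arg, _, err in outcomes if err is not None]
print(failed)
```

Once the failing inputs are known, rerunning just those with n_jobs=1 gives a normal traceback to debug from.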
Caching in Notebooks
import pandas as pd
from joblib import Memory
memory = Memory(".cache", verbose=0)
@memory.cache
def expensive_query():
    # conn is an existing database connection defined earlier in the notebook
    return pd.read_sql("SELECT * FROM huge_table", conn)
# First cell run: queries the DB (slow)
df = expensive_query()
# Re-running the cell: cache hit (instant)
df = expensive_query()
Particularly useful in Jupyter where re-running cells is the dev workflow. For Jupyter-specific patterns, see Jupyter not working.
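If you would rather pass the connection in as an argument than rely on a notebook global, Memory's `ignore` option can keep it out of the cache key. This is a minimal sketch using sqlite3 in place of a real database; `fetch_rows`, the table, and the `db_hits` counter are assumptions for the demonstration.

```python
import sqlite3
import tempfile
from joblib import Memory

memory = Memory(tempfile.mkdtemp(), verbose=0)
db_hits = []  # records each query that actually reaches the database

def fetch_rows(query, conn):
    db_hits.append(query)
    return conn.execute(query).fetchall()

# Only the query text is hashed; the connection object is ignored
cached_fetch = memory.cache(fetch_rows, ignore=["conn"])

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (n INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])

rows_first = cached_fetch("SELECT n FROM t ORDER BY n", conn)
rows_again = cached_fetch("SELECT n FROM t ORDER BY n", conn)
print(rows_again, len(db_hits))
```

The same staleness caveat applies as elsewhere: the cached rows do not change when the table does, so clear the cache or add a freshness argument when the data moves.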
Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.