
Fix: joblib Not Working — Parallel Backends, Memory Cache, and Pickling Errors

FixDevs ·

Quick Answer

How to fix common joblib problems: Parallel(n_jobs=-1) running slower than a serial loop, Memory cache misses, choosing between the loky, threading, and multiprocessing backends, lambdas that fail to pickle, oversized dump/load files, and hangs under pytest.

The Error

You parallelize a loop with Parallel(n_jobs=-1) and it’s slower than serial:

from joblib import Parallel, delayed
import time

def slow(x):
    return x ** 2

result = Parallel(n_jobs=-1)(delayed(slow)(i) for i in range(100))
# Slower than: [slow(i) for i in range(100)]

Or Memory cache misses for what looks like the same input:

from joblib import Memory

memory = Memory("./cache", verbose=0)

@memory.cache
def expensive(x):
    return x ** 2

expensive(1)        # Computes
expensive(1)        # Computes again — should hit cache but doesn't

Or pickling lambdas fails inside Parallel when the backend relies on the standard pickle module:

results = Parallel(n_jobs=4, backend="multiprocessing")(delayed(lambda x: x ** 2)(i) for i in range(10))
# PicklingError or hangs

Or joblib.dump writes giant files:

import numpy as np
from joblib import dump

arr = np.zeros((1000, 1000), dtype=np.float32)   # 4 MB
dump(arr, "data.joblib")
# File is 4MB — but other tools compress better

Or pytest sessions hang when tests use joblib:

$ pytest tests/
# Tests using Parallel hang or fail with worker errors

joblib is the unsung workhorse of the Python scientific stack — used internally by scikit-learn for n_jobs=-1, for caching expensive computations to disk, and for parallel scatter/gather. The default backend (loky) is robust but adds overhead; the threading backend is fast for I/O but limited by the GIL; multiprocessing has pickling constraints. Picking the right backend for the workload is half the battle. This guide covers the common issues.

Why This Happens

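Parallel(n_jobs=N) starts workers (processes by default, via loky). Starting processes has a fixed overhead (roughly 50-200ms each, depending on the platform), and every task’s function and arguments must be pickled and sent to a worker. For tiny tasks, that overhead exceeds the savings. Pickling is also a source of failures: closures over large data are slow to transfer, and lambdas and locally defined functions cannot be serialized by the standard pickle module (loky works around this with cloudpickle; the multiprocessing backend does not).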

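Memory keys cached results on a hash of the function’s arguments, and it also invalidates entries when the function’s source code changes. NumPy arrays, Pandas DataFrames, and most built-ins hash consistently, but mutable objects that were modified between calls, or arguments that carry incidental state (timestamps, random IDs), hash differently from one call to the next and silently miss the cache.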
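To see why a call misses, it can help to hash the arguments directly. The sketch below uses joblib.hash, a content hash of the same kind Memory relies on. The request_id field is hypothetical, invented here to stand in for incidental state.

```python
import numpy as np
import joblib

# Arrays with equal contents produce equal hashes, so they map to the same cache entry
a = np.arange(5)
b = np.arange(5)
same_array_hash = joblib.hash(a) == joblib.hash(b)

# Same logical payload, different incidental field (request_id is hypothetical):
# the hashes differ, so a cached function would recompute for each of these
first = {"value": 1, "request_id": 0}
second = {"value": 1, "request_id": 1}
same_request_hash = joblib.hash(first) == joblib.hash(second)

print(same_array_hash, same_request_hash)
```

If two calls that should share a cache entry produce different hashes, strip the varying field before calling the cached function.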

Fix 1: Basic Parallel Usage

from joblib import Parallel, delayed
import math

def slow_computation(x):
    return math.sqrt(x ** 4 + x ** 3 + x ** 2 + 1)

# Serial
result = [slow_computation(i) for i in range(1000)]

# Parallel — same result, distributed across cores
result = Parallel(n_jobs=-1)(delayed(slow_computation)(i) for i in range(1000))
# n_jobs=-1 means use all cores; -2 means all but one; etc.

delayed() wraps the function call into a “task” object. Without it, the function executes immediately (defeating the parallelism).
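A small sketch of what delayed actually produces, assuming current joblib releases where the task is a plain (function, args, kwargs) tuple. That layout is an implementation detail, and power is a hypothetical example function.

```python
from joblib import delayed

def power(base, exponent=2):
    return base ** exponent

# delayed(power)(3, exponent=4) does not call power; it only records the call
task = delayed(power)(3, exponent=4)
func, args, kwargs = task

# A worker later performs exactly this call
result = func(*args, **kwargs)
print(args, kwargs, result)
```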

Common Mistake: Forgetting delayed:

# WRONG: slow_computation(i) runs in the main process, and Parallel receives plain values instead of tasks (typically a TypeError)
results = Parallel(n_jobs=-1)(slow_computation(i) for i in range(1000))

# CORRECT
results = Parallel(n_jobs=-1)(delayed(slow_computation)(i) for i in range(1000))

When parallelism isn’t worth it:

# Each task is microseconds — overhead dominates
results = Parallel(n_jobs=-1)(delayed(lambda x: x * 2)(i) for i in range(100))
# Slower than serial because of pickling + process spawn

# Each task is milliseconds+ — parallelism wins
results = Parallel(n_jobs=-1)(delayed(slow_expensive_function)(i) for i in range(100))

Pro Tip: As a rule of thumb, individual tasks should take >10ms each for parallelism to pay off with the default loky backend. For shorter tasks, batch many into a single delayed call:

def batch_process(batch):
    return [tiny_compute(x) for x in batch]

# Process 100-item batches in parallel
batches = [range(i, i+100) for i in range(0, 10000, 100)]
results = Parallel(n_jobs=-1)(delayed(batch_process)(b) for b in batches)
flattened = [r for batch in results for r in batch]
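The batches above are built from integer ranges. For arbitrary iterables (generators, file lines), a chunking helper along these lines could feed the same pattern; chunked is a hypothetical name, not a joblib function.

```python
from itertools import islice

def chunked(iterable, size):
    """Yield lists of up to `size` items from any iterable."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk

batches = list(chunked(range(10), 4))
print(batches)
```

Each chunk can then be passed to a single delayed call, so the per-task overhead is paid once per chunk rather than once per item.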

Fix 2: Choosing the Right Backend

from joblib import Parallel, delayed

# Default — multiprocessing via loky (robust, isolated)
Parallel(n_jobs=-1, backend="loky")(delayed(fn)(i) for i in range(100))

# Threading — fast for I/O-bound, limited by GIL for CPU
Parallel(n_jobs=-1, backend="threading")(delayed(fn)(i) for i in range(100))

# Pure multiprocessing (less robust than loky, similar perf)
Parallel(n_jobs=-1, backend="multiprocessing")(delayed(fn)(i) for i in range(100))

# Sequential (for debugging — runs serially)
Parallel(n_jobs=1)(delayed(fn)(i) for i in range(100))

Backend selection table:

Backend         | Best for                  | Tradeoffs
loky (default)  | CPU-bound, robust         | High process spawn overhead
threading       | I/O-bound (network, disk) | GIL prevents CPU parallelism
multiprocessing | CPU-bound                 | Less robust than loky on macOS
sequential      | Debugging                 | Just runs serially

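Common Mistake: Using loky for pure I/O work (file reads, HTTP requests). The process overhead dominates; threading is much faster because blocking I/O releases the GIL and threads are cheap to start. For CPU-bound pure-Python work, loky is the right choice because the GIL keeps threads from executing Python bytecode in parallel. NumPy-heavy work is a middle case: BLAS/MKL routines release the GIL, so threading can also work there.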
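A minimal sketch of the threading backend on I/O-style work. Here fake_download is a hypothetical stand-in that sleeps instead of touching the network, which releases the GIL in the same way blocking I/O does.

```python
import time
from joblib import Parallel, delayed

def fake_download(i):
    """Stand-in for an I/O call; time.sleep releases the GIL like real I/O does."""
    time.sleep(0.05)
    return i * 2

start = time.perf_counter()
results = Parallel(n_jobs=4, backend="threading")(
    delayed(fake_download)(i) for i in range(8)
)
elapsed = time.perf_counter() - start

print(results, f"{elapsed:.2f}s")
```

With four threads, the eight sleeps overlap, so the elapsed time should land well under the serial total of about 0.4 seconds. Exact numbers depend on the machine.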

For NumPy / PyTorch / TensorFlow:

# These libraries' C extensions release the GIL during heavy compute
# threading backend often works well for them
Parallel(n_jobs=-1, backend="threading")(
    delayed(np.dot)(a, b) for a, b in matrix_pairs
)

Fix 3: Pickling Constraints

Workers receive functions and arguments via pickle. Things that don’t pickle:

# WRONG with backend="multiprocessing": standard pickle can't serialize a lambda
Parallel(n_jobs=4, backend="multiprocessing")(delayed(lambda x: x ** 2)(i) for i in range(10))
# PicklingError or hang

# WRONG with backend="multiprocessing": local function inside another function
def main():
    def helper(x):
        return x ** 2
    Parallel(n_jobs=4, backend="multiprocessing")(delayed(helper)(i) for i in range(10))

# CORRECT on every backend: top-level function
def helper(x):
    return x ** 2

def main():
    Parallel(n_jobs=4)(delayed(helper)(i) for i in range(10))

Use cloudpickle automatically with loky:

# loky uses cloudpickle by default — handles lambdas, local functions
# But still fails on:
# - Open file handles
# - Database connections
# - Thread/process locks
# - GUI objects

cloudpickle is more permissive than stdlib pickle and is loky’s default — most simple closures work. For complex cases, refactor to top-level functions.
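To check ahead of time whether an object survives the standard pickle module (the stricter case, relevant to the multiprocessing backend), a helper like the following can be used. can_pickle is a hypothetical name, and the example swaps a lambda for functools.partial over an importable function.

```python
import functools
import operator
import pickle

def can_pickle(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False

double_lambda = lambda x: x * 2                      # anonymous: no importable name
double_partial = functools.partial(operator.mul, 2)  # built from an importable function

print(can_pickle(double_lambda), can_pickle(double_partial))
```

The partial behaves like the lambda but pickles by reference to operator.mul, so it works on backends that do not use cloudpickle.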

Common Mistake: Passing a database connection or open file to a worker. These don’t pickle. Either re-open inside the worker, or pass connection parameters instead:

# WRONG
conn = create_connection()
Parallel(n_jobs=4)(delayed(query)(conn, sql) for sql in sqls)

# CORRECT — open connection in each worker
def query_with_new_conn(sql):
    conn = create_connection()
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.close()

Parallel(n_jobs=4)(delayed(query_with_new_conn)(sql) for sql in sqls)

For database connection patterns in parallel code, see SQLAlchemy not working and asyncpg not working.

Fix 4: Memory Cache

from joblib import Memory

memory = Memory("./cache_dir", verbose=0)

@memory.cache
def expensive(x, y):
    print(f"Computing for {x}, {y}")
    return x ** y

expensive(2, 10)   # Prints "Computing..." and returns 1024
expensive(2, 10)   # No print — cache hit, returns 1024
expensive(3, 10)   # Prints "Computing..." — different args, new cache entry

Cache invalidation:

# Clear all cached results
memory.clear()

# Clear results for a specific function
expensive.clear()

# Force a recompute and overwrite the cached entry
result = expensive.call(2, 10)   # Re-runs, stores fresh (some joblib versions return (output, metadata))

Cache size management:

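memory = Memory("./cache_dir", verbose=0)
memory.reduce_size(bytes_limit=10 * 1024 * 1024 * 1024)
# 10 GB cap; least recently accessed entries are pruned when reduce_size() is called
# (joblib 1.3+; older versions take bytes_limit in the Memory constructor)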

Common Mistake: Caching functions with non-deterministic behavior. Cache assumes that same args → same result. If your function depends on:

  • Current time (datetime.now())
  • Random numbers (without fixed seed)
  • External state (DB rows, file contents)

The cache returns stale results without recomputing. Either avoid @memory.cache on these, or include the variable input as a function argument:

# WRONG
@memory.cache
def get_users():
    return db.fetch_all("SELECT * FROM users")
# First call caches forever; new users never appear

# CORRECT — include a freshness key
@memory.cache
def get_users(as_of_date):
    return db.fetch_all(f"SELECT * FROM users WHERE updated <= '{as_of_date}'")

Pro Tip: For per-process caching (no disk), use functools.lru_cache instead. joblib’s Memory is for results that survive process restart and benefit from disk persistence (ML model training, expensive simulations). lru_cache is for in-memory deduplication during a single run — much faster, no disk I/O.
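A short sketch of the in-memory alternative. slow_square and the calls list are hypothetical, used only to count how often the body really runs.

```python
from functools import lru_cache

calls = []  # records each real execution of the function body

@lru_cache(maxsize=128)
def slow_square(x):
    calls.append(x)
    return x * x

slow_square(4)
slow_square(4)   # served from the in-memory cache, body not executed
slow_square(5)

info = slow_square.cache_info()
print(len(calls), info.hits, info.misses)
```

Unlike Memory, this cache disappears when the process exits, and arguments must be hashable, so NumPy arrays cannot be passed directly.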

Fix 5: dump / load for Model Persistence

from joblib import dump, load
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X_train, y_train)

# Save
dump(model, "model.joblib")

# Load later
loaded_model = load("model.joblib")
predictions = loaded_model.predict(X_test)

joblib is a commonly recommended format for persisting scikit-learn models: it stores large NumPy arrays efficiently and supports memory-mapped loading.

Compression:

dump(model, "model.joblib.gz", compress=3)   # gzip level 3
dump(model, "model.joblib.xz", compress=("xz", 3))   # LZMA
dump(model, "model.joblib.lz4", compress=("lz4", 1))   # LZ4 (fast; requires the lz4 package)

Compression tradeoffs:

Format                              | Speed   | Ratio | Use case
None                                | Fastest | 1.0x  | Local dev
zlib (default if compress=N), gzip  | Slow    | ~3-4x | Standard
lz4                                 | Fast    | ~2-3x | Production, speed matters
xz                                  | Slow    | ~5-8x | Long-term storage, ratio matters
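The ratios in the table are rough and depend heavily on the data. A quick way to measure them for your own objects is to dump twice and compare sizes. This is a sketch using a zero-filled array, which is a best case for any compressor; the file names are arbitrary.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load

arr = np.zeros((1000, 1000), dtype=np.float32)   # 4 MB of highly compressible zeros

tmp_dir = tempfile.mkdtemp()
raw_path = os.path.join(tmp_dir, "raw.joblib")
packed_path = os.path.join(tmp_dir, "packed.joblib")

dump(arr, raw_path)                  # no compression
dump(arr, packed_path, compress=3)   # zlib level 3

raw_size = os.path.getsize(raw_path)
packed_size = os.path.getsize(packed_path)
restored = load(packed_path)         # load detects the compression automatically

print(raw_size, packed_size, np.array_equal(arr, restored))
```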

Memory-mapped loading for large arrays:

# Don't load into RAM: memory-map from disk (only applies to files saved without compression)
loaded = load("huge_model.joblib", mmap_mode="r")
# Access loaded.feature_importances_ etc. — pages in as accessed

For very large models (multi-GB), memmap avoids loading everything into RAM upfront.
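A minimal sketch of memory-mapped loading on a bare array, assuming the file was dumped without compression. The path and array are made up for illustration.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load

data = np.arange(1_000_000, dtype=np.float64)

path = os.path.join(tempfile.mkdtemp(), "data.joblib")
dump(data, path)                      # uncompressed, so mmap_mode can apply

mapped = load(path, mmap_mode="r")    # read-only memory map; pages load on access
is_memmap = isinstance(mapped, np.memmap)
tail_sum = float(mapped[-3:].sum())   # reads only the pages holding the last items

print(type(mapped).__name__, tail_sum)
```

Because the map is opened read-only, attempts to write into the array raise an error; use mmap_mode="r+" or copy the slice if you need to modify it.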

Common Mistake: Using pickle for sklearn models instead of joblib.dump. Both work, but joblib is optimized for objects that carry large NumPy arrays (tree ensembles, neural networks, anything with weight matrices): it writes array buffers efficiently, offers built-in compression, and supports memory-mapped loading. Use joblib unless you have a specific reason for pickle.

For NumPy-specific patterns that interact with joblib’s array handling, see NumPy not working.

Fix 6: Progress Bars and Verbose Output

from joblib import Parallel, delayed

# Built-in verbose mode: prints progress messages (to stderr unless verbose > 50)
result = Parallel(n_jobs=-1, verbose=10)(
    delayed(slow)(i) for i in range(100)
)
# [Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    0.1s
# [Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    0.5s
# ...

Verbose levels:

  • 0: silent
  • 1-10: periodic progress, more frequent as the level rises
  • above 10: every task is reported
  • above 50: output goes to stdout instead of stderr

Use tqdm for a nice progress bar:

from tqdm import tqdm
from joblib import Parallel, delayed

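def run_with_progress(tasks, fn):
    tasks = list(tasks)
    # Worker processes can't update a bar that lives in the parent process,
    # so advance it here as results stream back (return_as needs joblib >= 1.3)
    results = Parallel(n_jobs=-1, return_as="generator")(
        delayed(fn)(t) for t in tasks
    )
    return list(tqdm(results, total=len(tasks)))

results = run_with_progress(range(1000), slow_computation)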

Or use tqdm_joblib:

$ pip install tqdm-joblib

from tqdm_joblib import tqdm_joblib
from joblib import Parallel, delayed

with tqdm_joblib(desc="Processing", total=1000):
    results = Parallel(n_jobs=-1)(delayed(slow)(i) for i in range(1000))

Cleaner integration — progress bar updates as workers finish.

Fix 7: pytest Integration

joblib workers can interfere with pytest’s worker management:

$ pytest tests/
# Hangs or fails in tests that use Parallel(n_jobs=-1)

Use n_jobs=1 during testing:

# my_module.py
import os
from joblib import Parallel, delayed

def compute_parallel(items):
    n_jobs = 1 if os.environ.get("TESTING") else -1
    return Parallel(n_jobs=n_jobs)(delayed(work)(i) for i in items)

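Or cap joblib’s worker count for the whole test session:

# conftest.py
import os

os.environ["JOBLIB_TEMP_FOLDER"] = "/tmp/joblib-tests"
# Optionally make n_jobs=-1 resolve to a single worker during tests
# (LOKY_MAX_CPU_COUNT caps the CPU count that joblib detects)
os.environ["LOKY_MAX_CPU_COUNT"] = "1"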

Common Mistake: Mixing pytest-xdist (pytest -n auto) with joblib’s n_jobs=-1. Both spawn workers — combined, you get too many processes, slowdown, sometimes deadlock. Disable joblib parallelism in tests (set n_jobs=1 or use env var to switch).
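One way to keep that switch in a single place is a small helper that every Parallel call goes through. This is a sketch: resolve_n_jobs is a hypothetical name, and TESTING is the same environment variable used in the module example above.

```python
import os

def resolve_n_jobs(default=-1, env=None):
    """Return 1 when the TESTING variable is set, otherwise the requested default."""
    env = os.environ if env is None else env
    return 1 if env.get("TESTING") else default

# Passing a dict instead of os.environ keeps the helper easy to unit-test
print(resolve_n_jobs(env={"TESTING": "1"}), resolve_n_jobs(env={}))
```

Application code would then call Parallel(n_jobs=resolve_n_jobs()), and the test runner only needs TESTING=1 in its environment.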

For pytest async fixture patterns that complement joblib testing, see pytest fixture not found.

Fix 8: Memory and Temp File Management

joblib workers write large arrays to shared memory or /tmp for efficient transfer:

import os
os.environ["JOBLIB_TEMP_FOLDER"] = "/path/to/fast-disk"

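By default joblib uses /dev/shm when it exists and is writable (a RAM-backed filesystem on Linux) and otherwise falls back to the system temp folder, typically /tmp. On systems where that location is small, large parallel jobs fill it up.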

Use shared memory for read-only large arrays:

from joblib import Parallel, delayed
import numpy as np

big_array = np.zeros((100_000, 100_000), dtype=np.float32)
# 40 GB array — would be costly to pickle to each worker

# Use memmap so workers share memory
np.save("big_array.npy", big_array)
arr = np.load("big_array.npy", mmap_mode="r")

def process(idx):
    return arr[idx].sum()

results = Parallel(n_jobs=8)(delayed(process)(i) for i in range(100_000))
# Workers access shared memory — no per-worker copy

max_nbytes parameter controls when joblib auto-memmaps:

Parallel(n_jobs=-1, max_nbytes="1M")(
    delayed(fn)(big_array) for _ in range(100)
)
# Args larger than 1MB are memmapped instead of pickled

Default is 1M — usually right; lower for tight memory or higher when pickling overhead matters.

Still Not Working?

joblib vs concurrent.futures vs multiprocessing.Pool

  • joblib — Pickling-friendly, integrated with scikit-learn, memory cache. Best for scientific Python.
  • concurrent.futures — Stdlib, simpler API, less integrated with sklearn. Best for general async work.
  • multiprocessing.Pool — Stdlib, more options, more boilerplate. Use when you need its specific features.

For sklearn / NumPy / SciPy ecosystems, joblib is the path of least resistance. For pure Python with no scientific stack, concurrent.futures is lighter.
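For a sense of how close the two APIs are on simple jobs, here is the same computation written both ways. This is a sketch with threads on both sides so the comparison is like for like.

```python
import math
from concurrent.futures import ThreadPoolExecutor

from joblib import Parallel, delayed

values = [1, 4, 9, 16]

# joblib: a generator of delayed calls; results come back as an ordered list
joblib_results = Parallel(n_jobs=2, backend="threading")(
    delayed(math.sqrt)(v) for v in values
)

# concurrent.futures: executor.map also preserves input order
with ThreadPoolExecutor(max_workers=2) as executor:
    futures_results = list(executor.map(math.sqrt, values))

print(joblib_results, futures_results)
```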

Distributed Joblib (Dask Backend)

For scaling beyond one machine:

$ pip install dask distributed

from joblib import Parallel, delayed, parallel_backend
from dask.distributed import Client

client = Client("scheduler-address:8786")

with parallel_backend("dask"):
    results = Parallel(n_jobs=100)(delayed(fn)(i) for i in range(10000))

The Dask backend distributes work across a cluster — scikit-learn’s n_jobs=-1 with the Dask backend scales to hundreds of cores. For Dask-specific patterns, see Dask not working.

Threading Backend with NumPy

import os
# Limit NumPy/BLAS threads BEFORE importing numpy
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import numpy as np
from joblib import Parallel, delayed

# Now joblib's threading backend gives true parallelism
# Without limiting BLAS, NumPy operations multi-thread internally
# and joblib threading + BLAS threading = oversubscription

Pro Tip: Always set OPENBLAS_NUM_THREADS=1 (or MKL_NUM_THREADS=1) when using joblib’s threading backend with NumPy. Otherwise NumPy spawns BLAS threads on top of joblib’s threads — the OS thrashes between them and performance tanks. With BLAS limited to 1 thread, joblib threading achieves the expected speedup.

Worker Process Lifetime

from joblib import Parallel, delayed

# Default: workers reused for many tasks
Parallel(n_jobs=4)(...)

# Cap native thread pools (BLAS, OpenMP) inside each worker at one thread
from joblib import parallel_backend

with parallel_backend("loky", inner_max_num_threads=1):
    Parallel(n_jobs=4)(...)

inner_max_num_threads=1 is useful when workers themselves spawn threads (BLAS, etc.) and you want to limit total parallelism.

Integrating with scikit-learn

scikit-learn uses joblib internally: when you construct an estimator with n_jobs=-1 (n_jobs is a constructor parameter, not an argument to fit), it uses joblib’s Parallel under the hood:

from sklearn.ensemble import RandomForestClassifier
from joblib import parallel_backend

# Use joblib's threading backend for sklearn
with parallel_backend("threading"):
    model = RandomForestClassifier(n_jobs=-1)
    model.fit(X_train, y_train)

For scikit-learn patterns that benefit from joblib tuning, see scikit-learn not working.

Debugging Worker Failures

from joblib import Parallel, delayed

# Force sequential for debugging
Parallel(n_jobs=1)(delayed(fn)(i) for i in range(10))

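# Or fail fast on stuck tasks: raises a timeout error if a task exceeds 300 seconds
Parallel(n_jobs=4, timeout=300)(delayed(fn)(i) for i in range(10))

# JOBLIB_START_METHOD selects the start method for the multiprocessing backend only
# (set it before importing joblib)
import os
os.environ["JOBLIB_START_METHOD"] = "spawn"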

If a worker silently fails (no error, just hangs), try n_jobs=1 first to surface the actual exception. The parallel wrapper sometimes obscures the underlying error.
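A small sketch of that workflow, where fragile is a hypothetical task that fails on one input:

```python
from joblib import Parallel, delayed

def fragile(i):
    """Hypothetical task that fails on one specific input."""
    if i == 3:
        raise ValueError(f"bad input: {i}")
    return i

try:
    Parallel(n_jobs=1)(delayed(fragile)(i) for i in range(5))
    error_message = None
except ValueError as exc:
    # With n_jobs=1 the task runs in the calling process, so the original
    # exception and its traceback surface directly
    error_message = str(exc)

print(error_message)
```

Running sequentially also makes it possible to set breakpoints inside the task, which is not practical inside worker processes.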

Caching in Notebooks

import pandas as pd
from joblib import Memory

memory = Memory(".cache", verbose=0)

@memory.cache
def expensive_query():
    return pd.read_sql("SELECT * FROM huge_table", conn)

# First cell run: queries the DB (slow)
df = expensive_query()

# Re-running the cell: cache hit (instant)
df = expensive_query()

Particularly useful in Jupyter where re-running cells is the dev workflow. For Jupyter-specific patterns, see Jupyter not working.

FixDevs

Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.
