Fix: Python threading Not Running in Parallel (GIL Limitations)
Quick Answer
How to fix Python threading not achieving parallelism due to the GIL — when to use multiprocessing, concurrent.futures, or asyncio instead, and what the GIL actually blocks.
The Error
You add threads to speed up your Python code but see no performance improvement — or worse, it runs slower:
```python
import threading
import time

def cpu_task(n):
    # Heavy computation
    total = sum(i * i for i in range(n))
    return total

start = time.time()
threads = [threading.Thread(target=cpu_task, args=(10_000_000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(f"Threaded: {time.time() - start:.2f}s")  # ~8s — SLOWER than sequential!

start = time.time()
for _ in range(4): cpu_task(10_000_000)
print(f"Sequential: {time.time() - start:.2f}s")  # ~2s — faster
```

Threads run but provide no speedup for CPU-bound work.
Why This Happens
CPython (the standard Python interpreter) has the Global Interpreter Lock (GIL) — a mutex that allows only one thread to execute Python bytecode at a time. Even on a multi-core CPU, Python threads cannot run Python code simultaneously.
The GIL exists because CPython’s memory management (reference counting for garbage collection) is not thread-safe. Removing it would require fundamental changes to the interpreter.
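Reference counting is visible from Python itself. A small sketch using sys.getrefcount (note that the reported count includes the temporary reference created by the function call):

```python
import sys

x = []
a = x  # a second reference to the same list

# x, a, and the temporary argument reference all count
print(sys.getrefcount(x))
```

Every thread that touches an object must update this counter, which is why unsynchronized parallel bytecode execution would corrupt it.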
What the GIL blocks:
- Parallel execution of Python bytecode across CPU cores.
- CPU-bound tasks (computation, data processing, string manipulation) gain nothing from threads.
What the GIL does NOT block:
- I/O operations — when a thread waits for network, disk, or sleep, it releases the GIL, allowing other threads to run.
- C extensions that release the GIL (NumPy, OpenSSL, database drivers).
- Multiple processes — each process has its own GIL.
So threads are useful for I/O-bound work but useless (and harmful) for CPU-bound work.
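The I/O exception is easy to verify: a sleeping thread releases the GIL, so four overlapping one-second waits take about one second total, not four. A minimal sketch:

```python
import threading
import time

def io_task():
    time.sleep(1)  # releases the GIL while waiting

start = time.perf_counter()
threads = [threading.Thread(target=io_task) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
elapsed = time.perf_counter() - start
print(f"4 overlapping sleeps: {elapsed:.2f}s")  # ~1s, not ~4s
```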
Fix 1: Use multiprocessing for CPU-Bound Work
Replace threading with multiprocessing for CPU-bound parallelism. Each process has its own Python interpreter and GIL:
Broken — threading for CPU work:

```python
import threading

results = []
lock = threading.Lock()

def cpu_task(n):
    total = sum(i * i for i in range(n))
    with lock:
        results.append(total)

threads = [threading.Thread(target=cpu_task, args=(10_000_000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
# No speedup — GIL prevents parallel execution
```

Fixed — multiprocessing:

```python
from multiprocessing import Pool

def cpu_task(n):
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with Pool(4) as pool:  # 4 worker processes
        results = pool.map(cpu_task, [10_000_000] * 4)
    print(results)
    # ~4x faster on a 4-core machine
```

multiprocessing.Pool runs each task in a separate process with its own GIL, giving true parallel execution on multiple CPU cores.
When to use which:
| Workload | Use |
| --- | --- |
| CPU-bound (computation, data processing) | multiprocessing |
| I/O-bound (network, disk, database) | threading or asyncio |
| Mixed (CPU + I/O) | concurrent.futures with the appropriate executor |
| Many lightweight concurrent tasks | asyncio |
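For the CPU-bound row, a reasonable default is one worker per core. A sketch using os.cpu_count(), which can return None in some environments:

```python
import os
from multiprocessing import Pool

def cpu_task(n):
    return sum(i * i for i in range(n))

# One worker per core; fall back to 1 if the count is unavailable
n_workers = os.cpu_count() or 1

if __name__ == "__main__":
    with Pool(n_workers) as pool:
        results = pool.map(cpu_task, [100_000] * n_workers)
    print(f"{n_workers} workers, {len(results)} results")
```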
Fix 2: Use concurrent.futures for a Unified API
concurrent.futures provides a consistent interface that works with both threads and processes:
```python
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import urllib.request

def io_task(url):
    """I/O-bound — use ThreadPoolExecutor"""
    with urllib.request.urlopen(url) as response:
        return len(response.read())

def cpu_task(n):
    """CPU-bound — use ProcessPoolExecutor"""
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    urls = ["https://example.com"] * 10
    numbers = [10_000_000] * 4

    # I/O-bound: threads work well
    with ThreadPoolExecutor(max_workers=10) as executor:
        io_results = list(executor.map(io_task, urls))

    # CPU-bound: use processes
    with ProcessPoolExecutor(max_workers=4) as executor:
        cpu_results = list(executor.map(cpu_task, numbers))
```

concurrent.futures is higher-level than using threading or multiprocessing directly. Use as_completed() to handle results as they finish:
```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def process_chunk(chunk):
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data_chunks = [range(i, i + 1_000_000) for i in range(0, 10_000_000, 1_000_000)]

    with ProcessPoolExecutor() as executor:
        futures = {executor.submit(process_chunk, chunk): i for i, chunk in enumerate(data_chunks)}
        for future in as_completed(futures):
            chunk_index = futures[future]
            result = future.result()
            print(f"Chunk {chunk_index} done: {result}")
```

Fix 3: Use asyncio for I/O-Bound Concurrency
For I/O-bound tasks (HTTP requests, database queries, file operations), asyncio is more efficient than threads — one thread handles thousands of concurrent operations:
Threading for I/O (works, but has overhead):
```python
import threading
import urllib.request

def fetch(url):
    with urllib.request.urlopen(url) as r:
        return r.read()

threads = [threading.Thread(target=fetch, args=("https://example.com",)) for _ in range(100)]
for t in threads: t.start()
for t in threads: t.join()
# 100 threads — high memory usage, OS context switching overhead
```

asyncio for I/O (more efficient):
```python
import asyncio
import aiohttp  # third-party: pip install aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.read()

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, "https://example.com") for _ in range(100)]
        results = await asyncio.gather(*tasks)
        return results

asyncio.run(main())
# 100 concurrent requests, 1 thread — much lower overhead
```

asyncio uses cooperative multitasking: one OS thread handles all I/O concurrency by switching between tasks when they wait for I/O. There is no GIL contention because only one task runs at a time, but the I/O waits are overlapped.
Fix 4: Use NumPy / C Extensions That Release the GIL
Many scientific computing libraries (NumPy, SciPy, pandas) release the GIL during heavy C-level operations. Threading works for these:
```python
import threading
import numpy as np  # third-party: pip install numpy

def matrix_multiply(size):
    a = np.random.rand(size, size)
    b = np.random.rand(size, size)
    return np.dot(a, b)  # NumPy releases GIL during computation

# This actually runs in parallel because NumPy releases the GIL
threads = [threading.Thread(target=matrix_multiply, args=(1000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
# Faster than sequential — NumPy C code runs in parallel
```

Check if a library releases the GIL by profiling with multiple threads. If adding threads speeds up the work, the library releases the GIL during the heavy operation.
Fix 5: Profile to Confirm the Bottleneck
Before switching from threading to multiprocessing, confirm the bottleneck is CPU-bound (GIL-limited) vs I/O-bound:
```python
import cProfile
import pstats

def my_task():
    # Your code here
    result = sum(i * i for i in range(10_000_000))
    return result

with cProfile.Profile() as pr:
    my_task()

stats = pstats.Stats(pr)
stats.sort_stats("cumulative")
stats.print_stats(10)  # Top 10 slowest functions
```

Check CPU usage during threading:
```sh
# Run your threaded program, then in another terminal:
top -p $(pgrep -f "python your_script.py")
# If CPU usage is ~100% (one core), it's GIL-limited
# If CPU usage is ~400% (four cores), threads ARE running in parallel (I/O or C extension work)
```

Use py-spy for low-overhead profiling:
```sh
pip install py-spy
py-spy top --pid $(pgrep -f python)
```

Fix 6: Python 3.13+ Free-Threaded Mode (GIL Disabled)
Python 3.13 introduced an experimental build option to disable the GIL (--disable-gil). This is available as a separate build of CPython and is not the default:
```sh
# Install free-threaded Python (Python 3.13+)
# On Ubuntu via pyenv:
PYTHON_CONFIGURE_OPTS="--disable-gil" pyenv install 3.13.0

# Verify GIL is disabled
python -c "import sys; print(sys._is_gil_enabled())"
# False — GIL is disabled
```

```python
import threading

def cpu_task():
    return sum(i * i for i in range(10_000_000))

# With GIL disabled, threads run in true parallel
threads = [threading.Thread(target=cpu_task) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
# ~4x faster on 4 cores
```

Warning: Free-threaded Python 3.13 is experimental. Many third-party libraries (especially C extensions) are not yet compatible. Use it for testing and exploration, not production workloads.
Fix 7: Practical Patterns for Real-World Use
Web scraping — use threads (I/O-bound):
```python
from concurrent.futures import ThreadPoolExecutor
import requests  # third-party: pip install requests

def scrape(url):
    return requests.get(url).text

urls = [f"https://example.com/page/{i}" for i in range(100)]
with ThreadPoolExecutor(max_workers=20) as executor:
    pages = list(executor.map(scrape, urls))
```

Data processing pipeline — use processes (CPU-bound):
```python
from concurrent.futures import ProcessPoolExecutor
import pandas as pd  # third-party: pip install pandas

def process_chunk(filepath):
    df = pd.read_csv(filepath)
    # Heavy transformation
    return df.groupby("category").sum()

if __name__ == "__main__":
    files = [f"data_{i}.csv" for i in range(20)]
    with ProcessPoolExecutor() as executor:
        results = list(executor.map(process_chunk, files))
    combined = pd.concat(results)
```

Mixed I/O and CPU — chain executors:
```python
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import requests  # third-party: pip install requests

def download(url):
    return requests.get(url).content  # I/O-bound

def process(data):
    return len(data) * 2  # CPU-bound (simplified)

if __name__ == "__main__":
    urls = [f"https://example.com/file/{i}" for i in range(10)]

    # Step 1: download in threads (I/O-bound)
    with ThreadPoolExecutor(max_workers=10) as executor:
        raw_data = list(executor.map(download, urls))

    # Step 2: process in processes (CPU-bound)
    with ProcessPoolExecutor() as executor:
        results = list(executor.map(process, raw_data))
```

Still Not Working?
Benchmark before optimizing. Use time.perf_counter() to measure actual execution time with and without parallelism. If the task is fast enough that overhead dominates, parallelism makes it slower.
Check pickling overhead for multiprocessing. Data passed between processes must be pickled (serialized). For large datasets, pickling time can exceed the parallelism benefit. Pass file paths or database queries instead of raw data when possible.
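To estimate that cost before committing to multiprocessing, time pickle.dumps() on a representative payload. A rough sketch:

```python
import pickle
import time

data = list(range(1_000_000))  # representative payload to ship to a worker

start = time.perf_counter()
payload = pickle.dumps(data)
elapsed = time.perf_counter() - start
print(f"pickled {len(payload) / 1e6:.1f} MB in {elapsed:.3f}s")
```

If serialization time approaches the per-task compute time, parallelism will not pay off; ship references (paths, keys, queries) instead of raw data.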
Consider joblib for scientific computing. joblib provides a high-level parallel computing interface commonly used with scikit-learn:
```python
from joblib import Parallel, delayed  # third-party: pip install joblib

def cpu_task(n):
    return sum(i * i for i in range(n))

results = Parallel(n_jobs=4)(
    delayed(cpu_task)(n) for n in [10_000_000] * 4
)
```

For multiprocessing-specific errors (freeze_support, pickle errors), see Fix: Python multiprocessing Not Working.
Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.