
Fix: Python threading Not Running in Parallel (GIL Limitations)

FixDevs

Quick Answer

How to fix Python threading not achieving parallelism due to the GIL — when to use multiprocessing, concurrent.futures, or asyncio instead, and what the GIL actually blocks.

The Error

You add threads to speed up your Python code but see no performance improvement — or even a slight slowdown from thread-management overhead:

import threading
import time

def cpu_task(n):
    # Heavy computation
    total = sum(i * i for i in range(n))
    return total

start = time.perf_counter()
threads = [threading.Thread(target=cpu_task, args=(10_000_000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(f"Threaded: {time.perf_counter() - start:.2f}s")  # e.g. ~2.2s — no faster than sequential

start = time.perf_counter()
for _ in range(4): cpu_task(10_000_000)
print(f"Sequential: {time.perf_counter() - start:.2f}s")  # e.g. ~2.1s — the threads bought nothing

Threads run but provide no speedup for CPU-bound work.

Why This Happens

CPython (the standard Python interpreter) has the Global Interpreter Lock (GIL) — a mutex that allows only one thread to execute Python bytecode at a time. Even on a multi-core CPU, Python threads cannot run Python code simultaneously.

The GIL exists because CPython’s memory management (reference counting) is not thread-safe without it. Removing it requires fundamental changes to the interpreter and its C extensions.

What the GIL blocks:

  • Parallel execution of Python bytecode across CPU cores.
  • Any speedup from threads for CPU-bound tasks (computation, data processing, string manipulation).

What the GIL does NOT block:

  • I/O operations — when a thread waits for network, disk, or sleep, it releases the GIL, allowing other threads to run.
  • C extensions that release the GIL (NumPy, OpenSSL, database drivers).
  • Multiple processes — each process has its own GIL.

So threads are useful for I/O-bound work but provide no speedup (only overhead) for CPU-bound work.
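You can see the GIL being released during waits with a quick sketch (time.sleep stands in for real I/O; the 1-second wait is arbitrary):

```python
import threading
import time

def wait_one_second():
    time.sleep(1)  # sleep releases the GIL, so the waits overlap

start = time.perf_counter()
threads = [threading.Thread(target=wait_one_second) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
elapsed = time.perf_counter() - start
print(f"{elapsed:.2f}s")  # ~1s, not ~4s: the four waits ran concurrently
```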

Fix 1: Use multiprocessing for CPU-Bound Work

Replace threading with multiprocessing for CPU-bound parallelism. Each process has its own Python interpreter and GIL:

Broken — threading for CPU work:

import threading

results = []
lock = threading.Lock()

def cpu_task(n):
    total = sum(i * i for i in range(n))
    with lock:
        results.append(total)

threads = [threading.Thread(target=cpu_task, args=(10_000_000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
# No speedup — GIL prevents parallel execution

Fixed — multiprocessing:

from multiprocessing import Pool

def cpu_task(n):
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with Pool(4) as pool:  # 4 worker processes
        results = pool.map(cpu_task, [10_000_000] * 4)
    print(results)
    # ~4x faster on a 4-core machine

multiprocessing.Pool runs each task in a separate process with its own GIL. True parallel execution on multiple CPU cores.

When to use which:

  • CPU-bound (computation, data processing): multiprocessing
  • I/O-bound (network, disk, database): threading or asyncio
  • Mixed (CPU + I/O): concurrent.futures with the appropriate executor
  • Many lightweight concurrent tasks: asyncio

Fix 2: Use concurrent.futures for a Unified API

concurrent.futures provides a consistent interface that works with both threads and processes:

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import time

def io_task(url):
    """I/O-bound — use ThreadPoolExecutor"""
    import urllib.request
    with urllib.request.urlopen(url) as response:
        return len(response.read())

def cpu_task(n):
    """CPU-bound — use ProcessPoolExecutor"""
    return sum(i * i for i in range(n))

urls = ["https://example.com"] * 10
numbers = [10_000_000] * 4

if __name__ == "__main__":  # guard so spawned worker processes can safely re-import this module
    # I/O-bound: threads work well
    with ThreadPoolExecutor(max_workers=10) as executor:
        io_results = list(executor.map(io_task, urls))

    # CPU-bound: use processes
    with ProcessPoolExecutor(max_workers=4) as executor:
        cpu_results = list(executor.map(cpu_task, numbers))

concurrent.futures is a higher-level interface than using threading or multiprocessing directly. Use as_completed() to handle results as they finish:

from concurrent.futures import ProcessPoolExecutor, as_completed

def process_chunk(chunk):
    return sum(x * x for x in chunk)

data_chunks = [range(i, i + 1_000_000) for i in range(0, 10_000_000, 1_000_000)]

if __name__ == "__main__":  # guard required for process pools on spawn platforms
    with ProcessPoolExecutor() as executor:
        futures = {executor.submit(process_chunk, chunk): i for i, chunk in enumerate(data_chunks)}
        for future in as_completed(futures):
            chunk_index = futures[future]
            print(f"Chunk {chunk_index} done: {future.result()}")

Fix 3: Use asyncio for I/O-Bound Concurrency

For I/O-bound tasks (HTTP requests, database queries, file operations), asyncio is more efficient than threads — one thread handles thousands of concurrent operations:

Threading for I/O (works, but has overhead):

import threading
import urllib.request

def fetch(url):
    with urllib.request.urlopen(url) as r:
        return r.read()

threads = [threading.Thread(target=fetch, args=("https://example.com",)) for _ in range(100)]
for t in threads: t.start()
for t in threads: t.join()
# 100 threads — high memory usage, OS context switching overhead

asyncio for I/O (more efficient):

import asyncio
import aiohttp  # third-party: pip install aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.read()

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, "https://example.com") for _ in range(100)]
        results = await asyncio.gather(*tasks)
    return results

asyncio.run(main())
# 100 concurrent requests, 1 thread — much lower overhead

asyncio uses cooperative multitasking — one OS thread handles all I/O concurrency by switching between tasks when they wait for I/O. No GIL contention because only one task runs at a time, but I/O waits are overlapped.
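If an async program needs to call a blocking function from a library without async support, asyncio.to_thread (Python 3.9+) runs it in a worker thread so the event loop keeps running. A minimal sketch, with time.sleep standing in for the blocking call:

```python
import asyncio
import time

def blocking_io():
    time.sleep(0.5)  # stands in for a blocking library call
    return "done"

async def main():
    # each blocking call runs in a worker thread; the event loop stays free
    return await asyncio.gather(*(asyncio.to_thread(blocking_io) for _ in range(4)))

results = asyncio.run(main())
print(results)  # ['done', 'done', 'done', 'done'] in ~0.5s total, not ~2s
```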

Fix 4: Use NumPy / C Extensions That Release the GIL

Many scientific computing libraries (NumPy, SciPy, pandas) release the GIL during heavy C-level operations. Threading works for these:

import threading
import numpy as np

def matrix_multiply(size):
    a = np.random.rand(size, size)
    b = np.random.rand(size, size)
    return np.dot(a, b)  # NumPy releases GIL during computation

# This actually runs in parallel because NumPy releases the GIL
threads = [threading.Thread(target=matrix_multiply, args=(1000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
# Faster than sequential — the NumPy C code can run in parallel
# (note: some BLAS builds already use multiple threads internally)

Check if a library releases the GIL by profiling with multiple threads. If adding threads speeds up the work, the library releases the GIL during the heavy operation.
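That check can be made concrete with a small harness that times the same function sequentially and across threads (a sketch; a speedup ratio near 1x means the GIL stayed held, a ratio approaching the thread count means it was released):

```python
import threading
import time

def gil_release_ratio(fn, n_threads=4):
    """Time fn run n_threads times sequentially vs in n_threads threads."""
    start = time.perf_counter()
    for _ in range(n_threads):
        fn()
    sequential = time.perf_counter() - start

    start = time.perf_counter()
    threads = [threading.Thread(target=fn) for _ in range(n_threads)]
    for t in threads: t.start()
    for t in threads: t.join()
    threaded = time.perf_counter() - start
    return sequential / threaded

def pure_python():
    sum(i * i for i in range(500_000))

print(f"{gil_release_ratio(pure_python):.2f}x")  # ~1x: the GIL was held throughout
```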

Fix 5: Profile to Confirm the Bottleneck

Before switching from threading to multiprocessing, confirm the bottleneck is CPU-bound (GIL-limited) vs I/O-bound:

import cProfile
import pstats

def my_task():
    # Your code here
    result = sum(i * i for i in range(10_000_000))
    return result

with cProfile.Profile() as pr:
    my_task()

stats = pstats.Stats(pr)
stats.sort_stats("cumulative")
stats.print_stats(10)  # Top 10 slowest functions

Check CPU usage during threading:

# Run your threaded program, then in another terminal:
top -p "$(pgrep -f your_script.py | head -1)"
# If CPU usage is ~100% (one core), it's GIL-limited
# If CPU usage is ~400% (four cores), threads ARE running in parallel (I/O or C extension work)
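The same diagnosis can be made from inside the program by comparing CPU time (time.process_time, summed over all threads of the process) against wall-clock time: a ratio near 1 while several threads run means only one core was ever busy. A sketch:

```python
import threading
import time

def cpu_task():
    sum(i * i for i in range(2_000_000))

wall_start = time.perf_counter()
cpu_start = time.process_time()  # CPU time across all threads of this process

threads = [threading.Thread(target=cpu_task) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()

wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start
print(f"cores busy: ~{cpu / wall:.1f}")  # ~1.0 under the GIL, near 4.0 if truly parallel
```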

Use py-spy for low-overhead profiling:

pip install py-spy
py-spy top --pid "$(pgrep -f your_script.py | head -1)"

Fix 6: Python 3.13+ Free-Threaded Mode (GIL Disabled)

Python 3.13 introduced an experimental build option to disable the GIL (--disable-gil). This is available as a separate build of CPython and is not the default:

# Install free-threaded Python (Python 3.13+)
# On Ubuntu via pyenv:
PYTHON_CONFIGURE_OPTS="--disable-gil" pyenv install 3.13.0

# Verify GIL is disabled
python -c "import sys; print(sys._is_gil_enabled())"
# False — GIL is disabled

With the GIL disabled, plain threading achieves true CPU parallelism:

import threading

def cpu_task():
    return sum(i * i for i in range(10_000_000))

# With GIL disabled, threads run in true parallel
threads = [threading.Thread(target=cpu_task) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
# ~4x faster on 4 cores

Warning: Free-threaded Python 3.13 is experimental. Many third-party libraries (especially C extensions) are not yet compatible. Use it for testing and exploration, not production workloads.
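Code that must adapt at runtime can check for a free-threaded build defensively; sys._is_gil_enabled only exists on 3.13+, so this sketch treats older versions as GIL-enabled:

```python
import sys

def gil_enabled():
    """Report whether the GIL is active; assume yes on builds without the check."""
    check = getattr(sys, "_is_gil_enabled", None)
    return bool(check()) if check is not None else True

print(gil_enabled())
```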

Fix 7: Practical Patterns for Real-World Use

Web scraping — use threads (I/O-bound):

from concurrent.futures import ThreadPoolExecutor
import requests

def scrape(url):
    return requests.get(url).text

urls = [f"https://example.com/page/{i}" for i in range(100)]

with ThreadPoolExecutor(max_workers=20) as executor:
    pages = list(executor.map(scrape, urls))

Data processing pipeline — use processes (CPU-bound):

from concurrent.futures import ProcessPoolExecutor
import pandas as pd

def process_chunk(filepath):
    df = pd.read_csv(filepath)
    # Heavy transformation
    return df.groupby("category").sum()

files = [f"data_{i}.csv" for i in range(20)]

if __name__ == "__main__":  # guard required for process pools on spawn platforms
    with ProcessPoolExecutor() as executor:
        results = list(executor.map(process_chunk, files))

    combined = pd.concat(results)

Mixed I/O and CPU — chain executors:

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import requests

def download(url):
    return requests.get(url).content  # I/O-bound

def process(data):
    return len(data) * 2  # CPU-bound (simplified)

urls = [f"https://example.com/file/{i}" for i in range(10)]

if __name__ == "__main__":  # guard required for process pools on spawn platforms
    # Step 1: download in threads (I/O-bound)
    with ThreadPoolExecutor(max_workers=10) as executor:
        raw_data = list(executor.map(download, urls))

    # Step 2: process in processes (CPU-bound)
    with ProcessPoolExecutor() as executor:
        results = list(executor.map(process, raw_data))

Still Not Working?

Benchmark before optimizing. Use time.perf_counter() to measure actual execution time with and without parallelism. If the task is fast enough that overhead dominates, parallelism makes it slower.
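A minimal timing helper for such before/after comparisons (a sketch; the workload shown is arbitrary):

```python
import time

def timed(label, fn):
    """Run fn once and report its wall-clock time via time.perf_counter."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f}s")
    return result, elapsed

total, seconds = timed("sequential", lambda: sum(i * i for i in range(1_000_000)))
```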

Check pickling overhead for multiprocessing. Data passed between processes must be pickled (serialized). For large datasets, pickling time can exceed the parallelism benefit. Pass file paths or database queries instead of raw data when possible.
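You can measure the serialization cost directly with pickle before committing to multiprocessing (a sketch; the 5-million-element list stands in for your real data):

```python
import pickle
import time

data = list(range(5_000_000))  # stand-in for a large dataset

start = time.perf_counter()
payload = pickle.dumps(data)
elapsed = time.perf_counter() - start
print(f"pickled {len(payload) / 1e6:.1f} MB in {elapsed:.2f}s")
# If this rivals the computation itself, hand workers file paths
# or queries instead of the raw data.
```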

Consider joblib for scientific computing. joblib provides a high-level parallel computing interface commonly used with scikit-learn:

from joblib import Parallel, delayed  # third-party: pip install joblib

def cpu_task(n):
    return sum(i * i for i in range(n))

results = Parallel(n_jobs=4)(
    delayed(cpu_task)(n) for n in [10_000_000] * 4
)

For multiprocessing-specific errors (freeze_support, pickle errors), see Fix: Python multiprocessing not working.


FixDevs

Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.
