Fix: Hypothesis Not Working — Strategy Errors, Flaky Tests, and Shrinking Issues

Q: How do I fix "Hypothesis Not Working — Strategy Errors, Flaky Tests, and Shrinking Issues"?

How to fix Hypothesis errors — Unsatisfied assumption, Flaky test detected, HealthCheck data_too_large, strategy composition failing, example database stale, settings profile not found, and stateful testing errors.

The Error

You run a property-based test and Hypothesis gives up:

hypothesis.errors.Unsatisfied: Unable to satisfy assumptions of hypothesis test_my_function

Or a test that passed yesterday is now flagged as flaky:

hypothesis.errors.Flaky: Hypothesis test_foo produces unreliable results:
Falsified on the first call but did not on a subsequent one

Or Hypothesis complains about slow data generation:

hypothesis.errors.FailedHealthCheck:
data generation is extremely slow: Only produced 4 valid examples in 1.00 seconds

Or @given decorators conflict with pytest fixtures:

InvalidArgument: Got unsatisfiable strategy. Hypothesis cannot generate examples.

Or your stateful test fails deep into a long sequence and the shrinking hangs:

Shrinking...
# 30 minutes later, still shrinking

Hypothesis is Python’s dominant property-based testing library — instead of hand-writing inputs, you describe the space of valid inputs and Hypothesis generates hundreds of random examples, automatically shrinking failing cases to minimal reproductions. This is more powerful than example-based testing but introduces error modes that don’t exist in regular pytest. This guide covers each.

Why This Happens

Hypothesis uses strategies — descriptions of data spaces — to generate test inputs. Strategies compose (integers, lists, dictionaries, complex types) but can become unsatisfiable if your filters reject most generated values. When a test fails, Hypothesis shrinks the failing input to a minimal reproduction — but shrinking a complex strategy can take a long time, and a test that’s non-deterministic triggers the Flaky error.

The example database (.hypothesis/) stores past failures so subsequent runs re-try them. Stale entries or different code paths can cause unexpected behavior.

Diagnostic Timeline: When Hypothesis “Hangs”

Your first instinct is to bump max_examples higher — Hypothesis must just need more time, right? Wrong. A hang is rarely about throughput; it is almost always about shrinking, a deadline conflict, or stateful state explosion. Here is the actual triage.

Minute 0 — Check whether you are in generation or shrinking. Run with pytest -s --hypothesis-verbosity=verbose. Generation prints Trying example: ... lines. Shrinking prints Shrinking ... and then nothing for a long time. The next steps depend entirely on which phase.

Minute 1 — If you are shrinking, the test already failed. Hypothesis is now searching for the minimal failing input, which for complex strategies (recursive JSON, stateful machines) can take 30+ minutes. Hit Ctrl-C and check the partial output — it usually contains a good-enough failing example. To skip shrinking permanently for an expensive strategy, add @settings(phases=[Phase.explicit, Phase.reuse, Phase.generate]).

Minute 2: If you are generating, check deadline vs max_examples. A deadline=200 setting means every example must finish in 200ms. If your code under test is slower than that, Hypothesis raises DeadlineExceeded, treats it as a failure, and starts shrinking and re-running the input. When the timing varies between runs, the result is often reported as Flaky instead. Asking for more examples only multiplies the problem. Bump the deadline before bumping examples.
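One way to choose a deadline is to time the code under test on a few representative inputs first. The sketch below uses only the standard library; slow_function and the 5x safety factor are hypothetical stand-ins chosen for illustration, not Hypothesis defaults.

```python
import time
from datetime import timedelta


def slow_function(n):
    """Hypothetical code under test: sleeps to simulate real work."""
    time.sleep(0.01)
    return n * 2


def suggest_deadline(func, inputs, safety_factor=5.0):
    """Time func on each input and return a deadline as a timedelta.

    The deadline is the slowest observed call multiplied by a safety
    factor (an arbitrary choice for this sketch).
    """
    worst = 0.0
    for value in inputs:
        start = time.perf_counter()
        func(value)
        worst = max(worst, time.perf_counter() - start)
    return timedelta(seconds=worst * safety_factor)


deadline = suggest_deadline(slow_function, [0, 1, 10, 1000])
print(f"suggested deadline: {deadline.total_seconds() * 1000:.0f} ms")
```

The resulting timedelta can be passed straight to @settings(deadline=...), which accepts a timedelta as well as a number of milliseconds.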

Minute 4 — Check for stateful test state explosion. RuleBasedStateMachine runs sequences of operations. If your invariant is expensive (e.g., scans a list that grows with every deposit rule), each step gets slower than the last. After 50 steps the test machine takes minutes per example. Limit stateful_step_count=20 to bound the worst case.
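The cost growth is easy to see with a counter. This plain-Python sketch involves no Hypothesis at all, and both ledger classes are hypothetical. It compares an invariant that rescans the full deposit history with one that checks a running total.

```python
class ScanningLedger:
    """Invariant walks the whole history, so it costs O(steps) per check."""

    def __init__(self):
        self.deposits = []
        self.items_visited = 0

    def deposit(self, amount):
        self.deposits.append(amount)

    def check_invariant(self):
        for amount in self.deposits:
            self.items_visited += 1
            assert amount > 0


class RunningTotalLedger:
    """Invariant checks one number, so it costs O(1) per check."""

    def __init__(self):
        self.balance = 0
        self.items_visited = 0

    def deposit(self, amount):
        assert amount > 0
        self.balance += amount

    def check_invariant(self):
        self.items_visited += 1
        assert self.balance >= 0


def run(ledger, steps):
    """Apply `steps` deposits, checking the invariant after each one."""
    for _ in range(steps):
        ledger.deposit(10)
        ledger.check_invariant()
    return ledger.items_visited


print(run(ScanningLedger(), 50))      # 1 + 2 + ... + 50 = 1275
print(run(RunningTotalLedger(), 50))  # 50
```

The scanning invariant does quadratic work over a run, which is why capping the step count helps so much when the invariant itself cannot be made cheaper.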

Minute 6 — Check .hypothesis/examples/ for stale entries. Hypothesis re-runs every saved counterexample at the start of each test. If you accumulated thousands across a refactor, startup becomes painful. rm -rf .hypothesis/ resets the database. Do this when a previously fast test suddenly takes minutes to even start.

The first guess (“just give it more examples”) is wrong. Real causes: shrinking a complex strategy, deadline mismatched to test cost, or a stateful machine whose state grows unboundedly per rule.

Fix 1: Unsatisfied Strategy — Too Much Filtering

from hypothesis import given, strategies as st

@given(st.integers().filter(lambda x: x > 100 and x < 110))
def test_narrow(n):
    ...
# hypothesis.errors.Unsatisfied: Unable to generate any examples

.filter() rejects generated values that don’t match. When the filter rejects most of the random space, Hypothesis runs out of attempts and fails.

Use a bounded strategy instead:

# WRONG — filters out 99.9999% of integers
@given(st.integers().filter(lambda x: 100 < x < 110))
def test_narrow(n): ...

# CORRECT — generate the exact range
@given(st.integers(min_value=101, max_value=109))
def test_narrow(n): ...

Composition for complex constraints:

# Generate even integers directly, not via filter
@given(st.integers(min_value=0, max_value=1000).map(lambda x: x * 2))
def test_even(n):
    assert n % 2 == 0

assume() inside the test — filters after generation:

from hypothesis import given, assume, strategies as st

@given(st.lists(st.integers()))
def test_sort_non_empty(lst):
    assume(len(lst) > 0)   # Skip empty lists
    assume(len(set(lst)) > 1)   # Skip lists with all same values
    sorted_lst = sorted(lst)
    assert sorted_lst[0] <= sorted_lst[-1]

assume() is clearer than .filter() for test-specific conditions. If most examples trigger assume(False), you get the same Unsatisfied error — but with easier debugging (add a print before the assume).

Common Mistake: Using .filter(lambda x: valid_condition(x)) when valid_condition is highly restrictive. Always generate valid data directly when possible. Filters are cheap if they reject <10% of values; expensive and flaky if they reject >50%.
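To get a feel for how restrictive a predicate is before putting it in .filter(), you can estimate its rejection rate. The sketch below is plain Python and makes a loud assumption: it samples integers uniformly from a fixed range, which is not how Hypothesis draws values, so treat the numbers as a rough guide only. The helper names are hypothetical.

```python
import random


def rejection_rate(predicate, sampler, trials=10_000, seed=0):
    """Estimate the fraction of sampled values that a predicate rejects."""
    rng = random.Random(seed)
    rejected = sum(1 for _ in range(trials) if not predicate(sampler(rng)))
    return rejected / trials


def wide_ints(rng):
    """Hypothetical stand-in for a wide integer source (uniform, unlike Hypothesis)."""
    return rng.randint(-10**6, 10**6)


narrow = rejection_rate(lambda x: 100 < x < 110, wide_ints)
even = rejection_rate(lambda x: x % 2 == 0, wide_ints)

# Nearly everything is rejected: use min_value/max_value instead
print(f"100 < x < 110 rejects {narrow:.2%}")
# Roughly half is rejected: use .map(lambda x: x * 2) instead
print(f"x % 2 == 0 rejects {even:.2%}")
```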

Fix 2: `Flaky` — Non-Deterministic Tests

hypothesis.errors.Flaky: Hypothesis test_foo produces unreliable results:
Falsified on the first call but did not on a subsequent one

Hypothesis re-runs a failing input to confirm the failure — if the re-run passes, the test is flaky.

Common causes:

Test uses global state (time, random seed, environment variables)
External system dependency (DB, network, filesystem with uncleaned files)
Mutable default argument that accumulates state
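The mutable default argument case is the least obvious of the three, so here is a minimal plain-Python illustration. The function names are hypothetical; the point is that the same input produces a different result on the second call, which is exactly what Hypothesis sees when it re-runs a failing example.

```python
def tag_value(value, seen=[]):
    """Buggy on purpose: the default list is shared across calls."""
    seen.append(value)
    return len(seen)


def tag_value_fixed(value, seen=None):
    """Fixed: a fresh list is created whenever none is passed in."""
    if seen is None:
        seen = []
    seen.append(value)
    return len(seen)


first = tag_value(7)
second = tag_value(7)   # same input, different result
print(first, second)    # 1 2

fixed_first = tag_value_fixed(7)
fixed_second = tag_value_fixed(7)
print(fixed_first, fixed_second)  # 1 1
```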

Example — flaky test with time:

import time

@given(st.integers())
def test_time_based(n):
    result = some_function(n)
    assert result.timestamp < time.time()   # Flaky — time.time() changes between calls

Fix: freeze time or mock dependencies:

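import time

from freezegun import freeze_time
from hypothesis import given, strategies as st

@freeze_time("2025-01-01")
@given(st.integers())
def test_time_based(n):
    result = some_function(n)
    # Time is frozen, so both calls see the same instant: compare with <=, not <
    assert result.timestamp <= time.time()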

Example — flaky test with uncleaned files:

@given(st.text())
def test_file_write(content):
    with open("/tmp/test.txt", "w") as f:
        f.write(content)
    # ... test reads /tmp/test.txt
    # Previous run's content may still be around

Fix: write to a fresh temporary file for each example:

import tempfile, os

@given(st.text())
def test_file_write(content):
    with tempfile.NamedTemporaryFile(mode="w", delete=False) as f:
        f.write(content)
        path = f.name
    try:
        # test logic
        pass
    finally:
        os.unlink(path)

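Do not try to silence the error. Flaky is raised as an exception, not as a health check, so suppress_health_check cannot hide it. The closest related setting is the function_scoped_fixture health check, which fires when a fixture is set up once and then shared by every example. Suppress it only after confirming that the fixture carries no state from one example to the next:

from hypothesis import given, strategies as st, settings, HealthCheck

@given(st.integers())
@settings(suppress_health_check=[HealthCheck.function_scoped_fixture])
def test_with_stateless_fixture(n):
    ...

Use this sparingly. Suppressing a health check without understanding why it fired just hides a real bug.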

Fix 3: `FailedHealthCheck` — Slow or Biased Generation

hypothesis.errors.FailedHealthCheck:
data generation is extremely slow: Only produced 4 valid examples in 1.00 seconds.

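As the message says, Hypothesis spent a full second generating data and ended up with only 4 usable examples. Generation should take a small fraction of the time the test body does; when it dominates, Hypothesis fails the health check.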

Common causes:

Expensive strategy composition — many nested .map() / .filter()
Expensive side effect in strategy — reading files, network calls

# WRONG — filesystem read in every example
@given(st.text().map(lambda s: open(f"/tmp/{s}.txt").read()))
def test_bad(content): ...

# WRONG — HTTP call in strategy
@given(st.integers().map(lambda i: requests.get(f"https://api.example.com/{i}").json()))
def test_bad(data): ...

Move expensive setup outside the strategy:

# CORRECT — pre-compute data once
cached_data = {i: requests.get(f"https://api.example.com/{i}").json() for i in range(100)}

@given(st.sampled_from(list(cached_data.keys())))
def test_good(key):
    data = cached_data[key]
    ...

Suppress specific health checks when you understand the cost:

from hypothesis import given, settings, HealthCheck, strategies as st

@given(st.integers())
@settings(suppress_health_check=[HealthCheck.too_slow])
def test_known_slow(n):
    ...

Common health check types:

HealthCheck	Meaning
`too_slow`	Generation is slow
`data_too_large`	Generated data exceeds size limit
`filter_too_much`	Too many `.filter()` rejections
`function_scoped_fixture`	Function-scoped pytest fixture is set up once per test, not once per example
`return_value`	Test function returns non-None
`differing_executors`	The same test ran under different executors, e.g. as a method on different instances

Pro Tip: Rather than suppressing too_slow, fix it. A slow strategy wastes CPU on every test run and often hides real issues (e.g., generating data that’s too complex). Trim the strategy to only what the test actually needs.

Fix 4: Strategy Composition Patterns

Basic composition:

from hypothesis import given, strategies as st

# Tuple of (int, str)
@given(st.tuples(st.integers(), st.text()))
def test_tuple(t):
    i, s = t

# Dict with fixed keys
@given(st.fixed_dictionaries({
    "name": st.text(min_size=1),
    "age": st.integers(min_value=0, max_value=150),
    "email": st.emails(),
}))
def test_user(user): ...

# Dict with dynamic keys
@given(st.dictionaries(
    keys=st.text(min_size=1, max_size=10),
    values=st.integers(),
    min_size=1, max_size=100,
))
def test_dict(d): ...

@composite for custom strategies:

from hypothesis import strategies as st

@st.composite
def valid_user(draw):
    name = draw(st.text(min_size=1, max_size=50))
    age = draw(st.integers(min_value=0, max_value=120))
    email = f"{draw(st.text(alphabet='abcdefghij', min_size=3, max_size=10))}@example.com"
    return {"name": name, "age": age, "email": email}

@given(valid_user())
def test_user_creation(user):
    ...

Recursive strategies for tree-like data:

from hypothesis import given, strategies as st

json_strategy = st.recursive(
    st.one_of(
        st.none(),
        st.booleans(),
        st.integers(),
        st.floats(allow_nan=False, allow_infinity=False),
        st.text(),
    ),
    lambda children: st.one_of(
        st.lists(children),
        st.dictionaries(st.text(), children),
    ),
    max_leaves=10,
)

@given(json_strategy)
def test_json_roundtrip(data):
    import json
    assert json.loads(json.dumps(data)) == data

Dataclass and Pydantic model generation:

from dataclasses import dataclass
from hypothesis import given, strategies as st

@dataclass
class User:
    name: str
    age: int

@given(st.builds(User, name=st.text(min_size=1), age=st.integers(min_value=0, max_value=120)))
def test_user(user):
    ...

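For Pydantic models, st.builds() works the same way, because it simply calls the model class with the generated keyword arguments, so validation runs on every example. Hypothesis does not appear to publish a pydantic extra, so pip install hypothesis[pydantic] is not expected to add anything. Pass an explicit strategy for each constrained field, as in the dataclass example above.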

Fix 5: `@settings` for Test Configuration

from hypothesis import given, settings, strategies as st

@given(st.integers())
@settings(
    max_examples=500,        # Generate 500 examples (default 100)
    deadline=1000,            # Each example must complete in 1000ms
    derandomize=False,        # Use random seeds (True = deterministic)
    print_blob=True,          # Print failure reproduction blob
)
def test_with_settings(n):
    ...

Named profiles for different environments:

from hypothesis import settings, Verbosity

settings.register_profile("ci", max_examples=1000, deadline=5000)
settings.register_profile("dev", max_examples=10, verbosity=Verbosity.verbose)
settings.register_profile("quick", max_examples=5)

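# Select a profile with the pytest option, or with an env var that conftest.py reads
# pytest --hypothesis-profile=ci
# HYPOTHESIS_PROFILE=ci pytest

Then in conftest.py, load a default and let the environment variable override it. Hypothesis does not read HYPOTHESIS_PROFILE by itself; the variable name is a project convention:

import os
from hypothesis import settings

settings.load_profile(os.getenv("HYPOTHESIS_PROFILE", "dev"))   # "dev" is the default for this project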

Common Mistake: Setting max_examples=10 to “speed up” tests in CI. You lose Hypothesis’s main benefit (finding edge cases). Instead, register a fast dev profile and use a thorough CI profile — CI has time to run many examples, developers need fast feedback.

Fix 6: Shrinking and Reproducing Failures

When a test fails, Hypothesis tries to shrink — find a smaller failing input. This often produces surprisingly minimal reproductions.

@given(st.lists(st.integers(), min_size=1))
def test_sort(lst):
    assert sorted(lst) == lst   # Wrong test — fails

# Hypothesis output:
# Falsifying example: test_sort(lst=[1, 0])

The minimal failing input is [1, 0] — two elements is the smallest possible counterexample.
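To make the idea concrete, here is a toy greedy shrinker in plain Python. It is not Hypothesis's actual algorithm, which is considerably more sophisticated; it only illustrates the "try something simpler, keep it if the test still fails" loop. All function names are hypothetical.

```python
def fails(lst):
    """The property from the test above, inverted: True when the test would fail."""
    return sorted(lst) != lst


def candidates(lst):
    """Yield simpler variants: drop one element, or move one element toward 0."""
    for i in range(len(lst)):
        yield lst[:i] + lst[i + 1:]
    for i, value in enumerate(lst):
        if value != 0:
            smaller = value - 1 if value > 0 else value + 1
            yield lst[:i] + [smaller] + lst[i + 1:]


def shrink(lst):
    """Greedily replace lst with the first simpler variant that still fails."""
    improved = True
    while improved:
        improved = False
        for candidate in candidates(lst):
            if fails(candidate):
                lst = candidate
                improved = True
                break
    return lst


print(shrink([5, -3, 12, 7, 0, 9]))  # [1, 0]
```

A greedy loop like this can stop at a local minimum. Starting from [0, -1], for instance, it has nowhere to go and returns the input unchanged. Avoiding such dead ends is part of what makes real shrinking expensive.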

Reproducing a specific failure:

Hypothesis prints a “blob” (reproduction token) on failure:

You can reproduce this example by temporarily adding:
@reproduce_failure('6.100.0', b'...base64...')

from hypothesis import reproduce_failure, given, strategies as st

@reproduce_failure('6.100.0', b'AXic...')
@given(st.lists(st.integers(), min_size=1))   # Must match the strategy that produced the blob
def test_sort(lst):
    assert sorted(lst) == lst
# Always runs the same failing example

Shrinking is slow for complex strategies. If shrinking takes > 10 minutes, either simplify your strategy or disable shrinking temporarily:

from hypothesis import Phase, given, settings, strategies as st
@given(st.integers())
@settings(phases=[Phase.explicit, Phase.reuse, Phase.generate])   # Skip shrinking
def test_expensive(x): ...

Example database — past failures are cached:

# Stored in .hypothesis/examples/
ls .hypothesis/examples/

Delete it to reset:

rm -rf .hypothesis/

Fix 7: Stateful Testing

from hypothesis.stateful import RuleBasedStateMachine, rule, invariant
from hypothesis import strategies as st

class BankAccount:
    def __init__(self):
        self.balance = 0

    def deposit(self, amount):
        self.balance += amount

    def withdraw(self, amount):
        if amount > self.balance:
            raise ValueError("insufficient funds")
        self.balance -= amount

class BankAccountTest(RuleBasedStateMachine):
    def __init__(self):
        super().__init__()
        self.account = BankAccount()

    @rule(amount=st.integers(min_value=1, max_value=1000))
    def deposit(self, amount):
        self.account.deposit(amount)

    @rule(amount=st.integers(min_value=1, max_value=1000))
    def withdraw(self, amount):
        try:
            self.account.withdraw(amount)
        except ValueError:
            pass

    @invariant()
    def balance_never_negative(self):
        assert self.account.balance >= 0

# Run via pytest
TestBankAccount = BankAccountTest.TestCase

@rule defines operations; @invariant runs after every rule to check consistency. Hypothesis generates random sequences of operations and checks invariants hold.

Stateful testing shrinking can be slow — limit run size:

from hypothesis.stateful import RuleBasedStateMachine
from hypothesis import settings

class MyState(RuleBasedStateMachine):
    ...

TestMyState = MyState.TestCase
TestMyState.settings = settings(max_examples=50, stateful_step_count=30)

Fix 8: Integration with pytest

Hypothesis integrates seamlessly with pytest:

# test_math.py
from operator import add, mul

import pytest
from hypothesis import given, strategies as st

@given(st.integers(), st.integers())
def test_add_commutative(a, b):
    assert a + b == b + a

# operator.add and operator.mul stand in for the functions under test
@pytest.mark.parametrize("fn", [add, mul])
@given(st.integers(), st.integers())
def test_operations(fn, a, b):
    assert fn(a, b) == fn(b, a)

Fixtures with Hypothesis — avoid function-scoped fixtures if they hold state:

# WRONG — fixture creates shared state across Hypothesis examples
@pytest.fixture
def db():
    conn = create_db()
    yield conn
    conn.close()

@given(st.text())
def test_db_insert(db, value):   # WARNING: function_scoped_fixture
    db.insert(value)

Fix — use module scope or re-create inside the test:

@pytest.fixture(scope="module")
def db():
    conn = create_db()
    yield conn
    conn.close()

@given(st.text())
def test_db_insert(db, value):
    # Reset state inside the test
    db.clear()
    db.insert(value)

For pytest fixture lifecycle patterns that interact with Hypothesis, see pytest fixture not found. For mypy type-checking of test files using Hypothesis strategies, see Python mypy type error.

Still Not Working?

Hypothesis vs Regular pytest Parametrize

Regular @pytest.mark.parametrize — Explicit, small input sets. Best when you know exactly which inputs matter.
Hypothesis @given — Generative, finds edge cases automatically. Best for general-purpose invariants and transformations.

Use both: parametrize for specific known-tricky inputs, given for broader property coverage.

Targeted Search with `@example`

Add specific inputs that must always be tested:

from hypothesis import given, example, strategies as st

@given(st.integers())
@example(0)
@example(-1)
@example(2**63 - 1)   # Max int64
def test_func(n):
    ...

@example always runs these specific values on every test execution, alongside random generation. Use for known-tricky edge cases.

Coverage and Optimization

Hypothesis shrinks for minimality by default. For faster test runs, leave out the shrink phase:

from hypothesis import Phase, given, settings, strategies as st

@given(st.integers())
@settings(phases=[Phase.explicit, Phase.reuse, Phase.generate])   # Skip shrinking; @example cases still run
def test_fast(x): ...

Use this in CI when you only need to know whether the test fails, not the minimal input.

Type-Based Generation with `from_type`

from hypothesis import given, strategies as st
from typing import List, Optional

@given(st.from_type(List[int]))
def test_list(lst): ...

@given(st.from_type(Optional[str]))
def test_optional(s): ...

Let Hypothesis infer strategies from type annotations. Works for most built-in types and many third-party types.

Custom Type Strategies

from hypothesis import given, strategies as st

class Money:
    def __init__(self, amount, currency):
        self.amount = amount
        self.currency = currency

st.register_type_strategy(
    Money,
    st.builds(Money,
        amount=st.integers(min_value=0, max_value=1_000_000),
        currency=st.sampled_from(["USD", "EUR", "JPY"]),
    ),
)

@given(st.from_type(Money))
def test_money(m):
    assert m.amount >= 0

For testing patterns with pre-commit hooks that integrate Hypothesis into the commit workflow, see pre-commit not working. For Ruff-based linting that complements Hypothesis’s property testing, see Ruff not working.

Deadline Errors That Look Like Flakes

A test that passes locally but fails in CI with DeadlineExceeded is almost never flaky — CI runners are slower. The default deadline=200ms is generous for pure functions but breaks for tests that touch the filesystem, spawn subprocesses, or run on shared hardware. Either set deadline=None to disable the check, or use a CI-specific profile: settings.register_profile("ci", deadline=2000); settings.load_profile("ci") based on the CI env var. Disabling per-test (@settings(deadline=None)) is fine but leaks slow tests into the suite over time.
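A conftest.py along these lines is one way to wire that up. It is a sketch under one assumption: that the CI system exports a CI environment variable, as many hosted runners do. Check what yours sets before relying on it.

```python
# conftest.py
import os

from hypothesis import settings

# Relaxed per-example deadline (milliseconds) for slower, shared CI runners
settings.register_profile("ci", deadline=2000)

# Assumption: the CI system sets CI to a non-empty value
if os.environ.get("CI"):
    settings.load_profile("ci")
```

Local runs keep the default 200ms deadline, so a test that has genuinely become slow still shows up on a developer machine.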

Max Examples vs Deadline Trading Off Coverage

Setting max_examples=10000 with deadline=200 and a 50ms test does not cap the run at some overall time limit. deadline is a per-example limit, and there is no overall wall-clock budget for the generate phase, so the run takes roughly max_examples × per-example time: about 500 seconds here, plus shrinking if something fails. To fit a time budget, work backwards: max_examples ≈ time budget / per-example time. If each test must finish in 10 minutes and an example takes about 500ms, @settings(max_examples=1000, deadline=None, derandomize=True) stays under budget. derandomize=True makes CI runs reproducible, at the cost of exploring the same inputs on every run.
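The budgeting arithmetic can be written down as a small helper. This is a plain-Python sketch with a hypothetical function name; it works in whole milliseconds to avoid floating-point rounding, and it ignores shrinking time and test setup overhead.

```python
def examples_for_budget(budget_ms, per_example_ms, cap=None):
    """Return how many examples fit in a time budget, optionally capped."""
    if per_example_ms <= 0:
        raise ValueError("per_example_ms must be positive")
    fit = budget_ms // per_example_ms
    return fit if cap is None else min(cap, fit)


# 10-minute budget, 50 ms per example
print(examples_for_budget(600_000, 50))              # 12000
# Same budget, but never more than 10000 examples
print(examples_for_budget(600_000, 50, cap=10_000))  # 10000
# 1-minute budget
print(examples_for_budget(60_000, 50, cap=10_000))   # 1200
```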

Stateful Machine State Not Resetting Between Examples

RuleBasedStateMachine re-instantiates the class for each example, but if you stash data on a class-level (not instance-level) attribute, that data leaks across examples. The symptom: a test passes in isolation but fails when run after another example of the same class. Always initialize state in __init__, never in class-level defaults. The same applies to module-level globals touched by @rule methods.
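The leak is ordinary Python behavior and can be reproduced without Hypothesis. In this sketch the two machine classes are hypothetical stand-ins for a RuleBasedStateMachine subclass, and constructing two instances stands in for running two examples.

```python
class LeakyMachine:
    """Buggy on purpose: `log` is a class attribute shared by all instances."""

    log = []

    def record(self, event):
        self.log.append(event)


class CleanMachine:
    """Each instance gets its own list because it is created in __init__."""

    def __init__(self):
        self.log = []

    def record(self, event):
        self.log.append(event)


# Simulate two examples, each with a freshly constructed machine
first, second = LeakyMachine(), LeakyMachine()
first.record("deposit")
print(second.log)  # ['deposit'] (state leaked from the first instance)

first, second = CleanMachine(), CleanMachine()
first.record("deposit")
print(second.log)  # []
```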
