Fix: Hypothesis Not Working — Strategy Errors, Flaky Tests, and Shrinking Issues
Quick Answer
How to fix Hypothesis errors — Unsatisfied assumption, Flaky test detected, HealthCheck data_too_large, strategy composition failing, example database stale, settings profile not found, and stateful testing errors.
The Error
You run a property-based test and Hypothesis gives up:
hypothesis.errors.Unsatisfied: Unable to satisfy assumptions of hypothesis test_my_function
Or a test that passed yesterday is now flagged as flaky:
hypothesis.errors.Flaky: Hypothesis test_foo produces unreliable results:
Falsified on the first call but did not on a subsequent one
Or Hypothesis complains about slow data generation:
hypothesis.errors.FailedHealthCheck:
data generation is extremely slow: Only produced 4 valid examples in 1.00 seconds
Or @given decorators conflict with pytest fixtures:
InvalidArgument: Got unsatisfiable strategy. Hypothesis cannot generate examples.
Or your stateful test fails deep into a long sequence and the shrinking hangs:
Shrinking...
# 30 minutes later, still shrinking
Hypothesis is Python’s dominant property-based testing library. Instead of hand-writing inputs, you describe the space of valid inputs and Hypothesis generates hundreds of random examples, automatically shrinking failing cases to minimal reproductions. This is more powerful than example-based testing but introduces error modes that don’t exist in regular pytest. This guide covers each.
Why This Happens
Hypothesis uses strategies — descriptions of data spaces — to generate test inputs. Strategies compose (integers, lists, dictionaries, complex types) but can become unsatisfiable if your filters reject most generated values. When a test fails, Hypothesis shrinks the failing input to a minimal reproduction — but shrinking a complex strategy can take a long time, and a test that’s non-deterministic triggers the Flaky error.
The example database (.hypothesis/) stores past failures so subsequent runs re-try them. Stale entries or different code paths can cause unexpected behavior.
Diagnostic Timeline: When Hypothesis “Hangs”
Your first instinct is to bump max_examples higher — Hypothesis must just need more time, right? Wrong. A hang is rarely about throughput; it is almost always about shrinking, a deadline conflict, or stateful state explosion. Here is the actual triage.
Minute 0 — Check whether you are in generation or shrinking. Run with pytest -s --hypothesis-verbosity=verbose. Generation prints Trying example: ... lines. Shrinking prints Shrinking ... and then nothing for a long time. The next steps depend entirely on which phase.
Minute 1 — If you are shrinking, the test already failed. Hypothesis is now searching for the minimal failing input, which for complex strategies (recursive JSON, stateful machines) can take 30+ minutes. Hit Ctrl-C and check the partial output — it usually contains a good-enough failing example. To skip shrinking permanently for an expensive strategy, add @settings(phases=[Phase.explicit, Phase.reuse, Phase.generate]).
Minute 2 — If you are generating, check deadline vs max_examples. A deadline=200 setting means every example must finish in 200ms. If your code under test is slow, an example that overruns is reported as a failure (DeadlineExceeded), and Hypothesis then tries to shrink that failure, re-running the slow test many times over. Raise or disable the deadline before raising max_examples.
Minute 4 — Check for stateful test state explosion. RuleBasedStateMachine runs sequences of operations. If your invariant is expensive (e.g., scans a list that grows with every deposit rule), each step gets slower than the last. After 50 steps the test machine takes minutes per example. Limit stateful_step_count=20 to bound the worst case.
Minute 6 — Check .hypothesis/examples/ for stale entries. Hypothesis re-runs every saved counterexample at the start of each test. If you accumulated thousands across a refactor, startup becomes painful. rm -rf .hypothesis/ resets the database. Do this when a previously fast test suddenly takes minutes to even start.
The first guess (“just give it more examples”) is wrong. Real causes: shrinking a complex strategy, deadline mismatched to test cost, or a stateful machine whose state grows unboundedly per rule.
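The triage steps above can be collected into one temporary settings object. This is a minimal sketch, not a recommended permanent configuration: the name triage and the test body are hypothetical, and the values are only there to rule out each cause in turn.

```python
from hypothesis import Phase, Verbosity, given, settings, strategies as st

# Hypothetical triage settings; each argument maps to one step of the timeline.
triage = settings(
    verbosity=Verbosity.verbose,  # Minute 0: print each example as it is tried
    phases=[Phase.explicit, Phase.reuse, Phase.generate],  # Minute 1: no shrinking
    deadline=None,  # Minute 2: rule out per-example deadline failures
    database=None,  # Minute 6: do not replay saved counterexamples
    max_examples=20,  # keep the run short while diagnosing
)


@triage
@given(st.lists(st.integers(), max_size=5))
def test_reverse_twice_is_identity(xs):
    # Placeholder property; substitute the test that appears to hang.
    assert list(reversed(list(reversed(xs)))) == xs
```

If the test finishes quickly under these settings, re-enable one option at a time to see which one brings the hang back.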
Fix 1: Unsatisfied Strategy — Too Much Filtering
from hypothesis import given, strategies as st
@given(st.integers().filter(lambda x: x > 100 and x < 110))
def test_narrow(n):
    ...
# hypothesis.errors.Unsatisfied: Unable to generate any examples.
filter() rejects generated values that don’t match. When the filter rejects most of the random space, Hypothesis runs out of attempts and fails.
Use a bounded strategy instead:
# WRONG — filters out 99.9999% of integers
@given(st.integers().filter(lambda x: 100 < x < 110))
def test_narrow(n): ...
# CORRECT — generate the exact range
@given(st.integers(min_value=101, max_value=109))
def test_narrow(n): ...
Composition for complex constraints:
# Generate even integers directly, not via filter
@given(st.integers(min_value=0, max_value=1000).map(lambda x: x * 2))
def test_even(n):
    assert n % 2 == 0
assume() inside the test filters after generation:
from hypothesis import given, assume, strategies as st
@given(st.lists(st.integers()))
def test_sort_non_empty(lst):
    assume(len(lst) > 0)  # Skip empty lists
    assume(len(set(lst)) > 1)  # Skip lists with all same values
    sorted_lst = sorted(lst)
    assert sorted_lst[0] <= sorted_lst[-1]
assume() is clearer than .filter() for test-specific conditions. If most examples trigger assume(False), you get the same Unsatisfied error, but it is easier to debug (add a print before the assume).
Common Mistake: Using .filter(lambda x: valid_condition(x)) when valid_condition is highly restrictive. Always generate valid data directly when possible. Filters are cheap if they reject <10% of values; expensive and flaky if they reject >50%.
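As an illustration of generating valid data directly, here is a minimal sketch for a hypothetical constraint (pairs where the first value must not exceed the second). Filtering random pairs would reject about half of them; mapping each pair into the valid shape rejects none.

```python
from hypothesis import given, strategies as st

# Hypothetical constraint for illustration: pairs (lo, hi) with lo <= hi.
# .map() reshapes every generated pair into a valid one, so nothing is rejected.
ordered_pairs = st.tuples(st.integers(), st.integers()).map(
    lambda pair: (min(pair), max(pair))
)


@given(ordered_pairs)
def test_range_is_well_formed(pair):
    lo, hi = pair
    assert lo <= hi
```

The same idea applies to most structural constraints: build the value from unconstrained parts instead of discarding the ones that do not fit.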
Fix 2: Flaky — Non-Deterministic Tests
hypothesis.errors.Flaky: Hypothesis test_foo produces unreliable results:
Falsified on the first call but did not on a subsequent one
Hypothesis re-runs a failing input to confirm the failure. If the re-run passes, the test is flaky.
Common causes:
- Test uses global state (time, random seed, environment variables)
- External system dependency (DB, network, filesystem with uncleaned files)
- Mutable default argument that accumulates state
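The mutable-default cause is easy to miss, so here is a plain-Python sketch of it. The function names are hypothetical; the point is that the default list is created once and then shared by every call, which is exactly the kind of state that survives between Hypothesis examples.

```python
def record_bad(value, seen=[]):
    # The default list is created once, at definition time, and reused.
    seen.append(value)
    return len(seen)


def record_good(value, seen=None):
    # A new list is created on every call that does not pass one in.
    if seen is None:
        seen = []
    seen.append(value)
    return len(seen)


bad_counts = (record_bad("a"), record_bad("a"))
good_counts = (record_good("a"), record_good("a"))
print(bad_counts, good_counts)
```

A property that depends on the return value of record_bad would pass or fail depending on how many examples ran before it, which Hypothesis reports as Flaky.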
Example — flaky test with time:
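import time
from hypothesis import given, strategies as st
@given(st.integers())
def test_time_based(n):
    result = some_function(n)  # some_function is the code under test
    assert result.timestamp < time.time()  # Flaky: time.time() changes between calls
Fix: freeze time or mock dependencies:
import time
from freezegun import freeze_time
from hypothesis import given, strategies as st
@freeze_time("2025-01-01")
@given(st.integers())
def test_time_based(n):
    result = some_function(n)
    # With the clock frozen, a timestamp taken from time.time() equals it, so compare with <=
    assert result.timestamp <= time.time()
Example: flaky test with uncleaned files: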
@given(st.text())
def test_file_write(content):
    with open("/tmp/test.txt", "w") as f:
        f.write(content)
    # ... test reads /tmp/test.txt
    # Previous run's content may still be around
Fix: use a fresh temporary file per example:
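import os
import tempfile
from hypothesis import given, strategies as st
@given(st.text())
def test_file_write(content):
    # A new file per example, so nothing leaks in from earlier examples
    with tempfile.NamedTemporaryFile(mode="w", encoding="utf-8", delete=False) as f:
        f.write(content)
        path = f.name
    try:
        # test logic
        pass
    finally:
        os.unlink(path)
Flaky is an error, not a health check, so it cannot be switched off through settings. What you can suppress is the function_scoped_fixture health check, which flags a pytest fixture that is set up once and then shared by every example (a common source of leftover state):
from hypothesis import given, strategies as st, settings, HealthCheck
@given(st.integers())
@settings(suppress_health_check=[HealthCheck.function_scoped_fixture])
def test_still_flaky(n):
    ...
Use this sparingly. Suppressing the health check without resetting the fixture's state yourself just hides a real bug.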
Fix 3: FailedHealthCheck — Slow or Biased Generation
hypothesis.errors.FailedHealthCheck:
data generation is extremely slow: Only produced 4 valid examples in 1.00 seconds.
As a rule of thumb, data generation should average well under a millisecond per example. If it is much slower, Hypothesis fails the too_slow health check.
Common causes:
- Expensive strategy composition: many nested .map()/.filter()
- Expensive side effect in strategy: reading files, network calls
import requests
from hypothesis import given, strategies as st
# WRONG: filesystem read in every example
@given(st.text().map(lambda s: open(f"/tmp/{s}.txt").read()))
def test_bad(content): ...
# WRONG: HTTP call in strategy
@given(st.integers().map(lambda i: requests.get(f"https://api.example.com/{i}").json()))
def test_bad(data): ...
Move expensive setup outside the strategy:
# CORRECT — pre-compute data once
cached_data = {i: requests.get(f"https://api.example.com/{i}").json() for i in range(100)}
@given(st.sampled_from(list(cached_data.keys())))
def test_good(key):
    data = cached_data[key]
    ...
Suppress specific health checks when you understand the cost:
from hypothesis import given, settings, HealthCheck, strategies as st
@given(st.integers())
@settings(suppress_health_check=[HealthCheck.too_slow])
def test_known_slow(n):
    ...
Common health check types:
| HealthCheck | Meaning |
|---|---|
| too_slow | Generation is slow |
| data_too_large | Generated data exceeds size limit |
| filter_too_much | Too many .filter() rejections |
| function_scoped_fixture | pytest fixture may not reset between examples |
| return_value | Test function returns non-None |
| differing_executors | Different executor than expected |
Pro Tip: Rather than suppressing too_slow, fix it. A slow strategy wastes CPU on every test run and often hides real issues (e.g., generating data that’s too complex). Trim the strategy to only what the test actually needs.
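Trimming a strategy usually means putting explicit bounds on sizes and ranges. The sketch below uses a hypothetical record shape; the specific limits are illustrative, not recommendations.

```python
from hypothesis import given, strategies as st

# Hypothetical record shape; the bounds keep each example small and quick to build.
small_records = st.lists(
    st.fixed_dictionaries(
        {
            "id": st.integers(min_value=0, max_value=10_000),
            "tag": st.text(max_size=8),
        }
    ),
    max_size=20,
)


@given(small_records)
def test_records_stay_within_bounds(records):
    assert len(records) <= 20
    assert all(0 <= r["id"] <= 10_000 for r in records)
    assert all(len(r["tag"]) <= 8 for r in records)
```

Bounded strategies also shrink faster, because there is less structure for the shrinker to explore.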
Fix 4: Strategy Composition Patterns
Basic composition:
from hypothesis import given, strategies as st
# Tuple of (int, str)
@given(st.tuples(st.integers(), st.text()))
def test_tuple(t):
    i, s = t
# Dict with fixed keys
@given(st.fixed_dictionaries({
    "name": st.text(min_size=1),
    "age": st.integers(min_value=0, max_value=150),
    "email": st.emails(),
}))
def test_user(user): ...
# Dict with dynamic keys
@given(st.dictionaries(
    keys=st.text(min_size=1, max_size=10),
    values=st.integers(),
    min_size=1, max_size=100,
))
def test_dict(d): ...
@composite for custom strategies:
from hypothesis import given, strategies as st
@st.composite
def valid_user(draw):
    # Each draw() call pulls one value from the given strategy
    name = draw(st.text(min_size=1, max_size=50))
    age = draw(st.integers(min_value=0, max_value=120))
    email = f"{draw(st.text(alphabet='abcdefghij', min_size=3, max_size=10))}@example.com"
    return {"name": name, "age": age, "email": email}
@given(valid_user())
def test_user_creation(user):
    ...
Recursive strategies for tree-like data:
import json
from hypothesis import given, strategies as st
json_strategy = st.recursive(
    # Leaves: JSON scalars
    st.one_of(
        st.none(),
        st.booleans(),
        st.integers(),
        st.floats(allow_nan=False, allow_infinity=False),
        st.text(),
    ),
    # Containers built from already-generated children
    lambda children: st.one_of(
        st.lists(children),
        st.dictionaries(st.text(), children),
    ),
    max_leaves=10,
)
@given(json_strategy)
def test_json_roundtrip(data):
    assert json.loads(json.dumps(data)) == data
Dataclass and Pydantic model generation:
from dataclasses import dataclass
from hypothesis import given, strategies as st
@dataclass
class User:
    name: str
    age: int
@given(st.builds(User, name=st.text(min_size=1), age=st.integers(min_value=0, max_value=120)))
def test_user(user):
    ...
For Pydantic models, st.builds() can often infer strategies from the model's annotated fields; pass explicit strategies for constrained fields. Hypothesis has no pydantic install extra, so a plain pip install hypothesis is enough.
Fix 5: @settings for Test Configuration
from hypothesis import given, settings, strategies as st
@given(st.integers())
@settings(
    max_examples=500,  # Generate 500 examples (default 100)
    deadline=1000,  # Each example must complete in 1000ms
    derandomize=False,  # Use random seeds (True = deterministic)
    print_blob=True,  # Print failure reproduction blob
)
def test_with_settings(n):
    ...
Named profiles for different environments:
from hypothesis import settings, Verbosity
settings.register_profile("ci", max_examples=1000, deadline=5000)
settings.register_profile("dev", max_examples=10, verbosity=Verbosity.verbose)
settings.register_profile("quick", max_examples=5)
# Select a profile from the command line:
# pytest --hypothesis-profile=ci
Then in conftest.py, after registering the profiles, load a default. Reading the environment variable explicitly is what makes HYPOTHESIS_PROFILE=ci pytest work:
import os
from hypothesis import settings
settings.load_profile(os.getenv("HYPOTHESIS_PROFILE", "dev"))  # "dev" is the default for this project
Common Mistake: Setting max_examples=10 to “speed up” tests in CI. You lose Hypothesis’s main benefit (finding edge cases). Instead, register a fast dev profile and use a thorough CI profile. CI has time to run many examples; developers need fast feedback.
Fix 6: Shrinking and Reproducing Failures
When a test fails, Hypothesis tries to shrink — find a smaller failing input. This often produces surprisingly minimal reproductions.
@given(st.lists(st.integers(), min_size=1))
def test_sort(lst):
    assert sorted(lst) == lst  # Wrong test: fails for any unsorted list
# Hypothesis output:
# Falsifying example: test_sort(lst=[1, 0])
The minimal failing input is [1, 0]: two elements is the smallest possible counterexample.
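To watch shrinking outside a test run, hypothesis.find() returns the minimal value satisfying a condition. This is a small sketch that reuses the sorting example; the condition is the negation of the test's assertion, and database=None is only there to avoid touching the example database.

```python
from hypothesis import find, settings, strategies as st

# Search for a list that is NOT already sorted, then shrink it to a minimal one.
minimal = find(
    st.lists(st.integers(), min_size=1),
    lambda lst: sorted(lst) != lst,
    settings=settings(database=None),
)
print(minimal)
```

The printed list should be the same kind of small counterexample that the failing test reports.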
Reproducing a specific failure:
Hypothesis prints a “blob” (reproduction token) on failure when print_blob is enabled:
You can reproduce this example by temporarily adding:
@reproduce_failure('6.100.0', b'...base64...')
Add that decorator to the test:
from hypothesis import reproduce_failure, given, strategies as st
@reproduce_failure('6.100.0', b'AXic...')
@given(st.lists(st.integers()))
def test_sort(lst):
    assert sorted(lst) == lst
# Always runs the same failing example
Shrinking is slow for complex strategies. If shrinking takes > 10 minutes, either simplify your strategy or disable shrinking temporarily:
from hypothesis import Phase, given, settings, strategies as st
@settings(phases=[Phase.explicit, Phase.reuse, Phase.generate])  # Skip shrinking
@given(st.integers())  # placeholder strategy
def test_expensive(x): ...
Example database: past failures are cached.
# Stored in .hypothesis/examples/
ls .hypothesis/examples/
Delete it to reset:
rm -rf .hypothesis/
Fix 7: Stateful Testing
from hypothesis.stateful import RuleBasedStateMachine, rule, invariant
from hypothesis import strategies as st
class BankAccount:
    def __init__(self):
        self.balance = 0
    def deposit(self, amount):
        self.balance += amount
    def withdraw(self, amount):
        if amount > self.balance:
            raise ValueError("insufficient funds")
        self.balance -= amount
class BankAccountTest(RuleBasedStateMachine):
    def __init__(self):
        super().__init__()
        self.account = BankAccount()
    @rule(amount=st.integers(min_value=1, max_value=1000))
    def deposit(self, amount):
        self.account.deposit(amount)
    @rule(amount=st.integers(min_value=1, max_value=1000))
    def withdraw(self, amount):
        try:
            self.account.withdraw(amount)
        except ValueError:
            pass
    @invariant()
    def balance_never_negative(self):
        assert self.account.balance >= 0
# Run via pytest
TestBankAccount = BankAccountTest.TestCase
@rule defines operations; @invariant runs after every rule to check consistency. Hypothesis generates random sequences of operations and checks invariants hold.
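One common variation is to track a simple model alongside the real object and compare the two in the invariant. The sketch below is an assumption about how you might extend the example, not part of the original test: the class name ModelledAccountMachine and the expected counter are hypothetical, and @precondition is used so that withdrawals are only attempted when funds exist.

```python
from hypothesis import settings, strategies as st
from hypothesis.stateful import RuleBasedStateMachine, invariant, precondition, rule


class BankAccount:
    def __init__(self):
        self.balance = 0

    def deposit(self, amount):
        self.balance += amount

    def withdraw(self, amount):
        if amount > self.balance:
            raise ValueError("insufficient funds")
        self.balance -= amount


class ModelledAccountMachine(RuleBasedStateMachine):
    def __init__(self):
        super().__init__()
        self.account = BankAccount()
        self.expected = 0  # plain-int model of what the balance should be

    @rule(amount=st.integers(min_value=1, max_value=1000))
    def deposit(self, amount):
        self.account.deposit(amount)
        self.expected += amount

    @precondition(lambda self: self.expected > 0)
    @rule(data=st.data())
    def withdraw(self, data):
        # Draw an amount the model says is affordable, so no exception is expected.
        amount = data.draw(st.integers(min_value=1, max_value=self.expected))
        self.account.withdraw(amount)
        self.expected -= amount

    @invariant()
    def balance_matches_model(self):
        assert self.account.balance == self.expected


TestModelledAccount = ModelledAccountMachine.TestCase
TestModelledAccount.settings = settings(max_examples=50, stateful_step_count=20)
```

Comparing against a model catches wrong balances, not just negative ones, while the invariant itself stays a constant-time check.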
Stateful testing shrinking can be slow — limit run size:
from hypothesis.stateful import RuleBasedStateMachine
from hypothesis import settings
class MyState(RuleBasedStateMachine):
    ...
TestMyState = MyState.TestCase
# Cap both the number of sequences and the number of rule calls per sequence
TestMyState.settings = settings(max_examples=50, stateful_step_count=30)
Fix 8: Integration with pytest
Hypothesis integrates seamlessly with pytest:
# test_math.py
from operator import add, mul as multiply
import pytest
from hypothesis import given, strategies as st
@given(st.integers(), st.integers())
def test_add_commutative(a, b):
    assert a + b == b + a
@pytest.mark.parametrize("fn", [add, multiply])
@given(st.integers(), st.integers())
def test_operations(fn, a, b):
    assert fn(a, b) == fn(b, a)
Fixtures with Hypothesis: avoid function-scoped fixtures if they hold state.
# WRONG — fixture creates shared state across Hypothesis examples
@pytest.fixture
def db():
    conn = create_db()
    yield conn
    conn.close()
@given(st.text())
def test_db_insert(db, value):  # WARNING: function_scoped_fixture
    db.insert(value)
Fix: use module scope or re-create inside the test:
@pytest.fixture(scope="module")
def db():
    conn = create_db()
    yield conn
    conn.close()
@given(st.text())
def test_db_insert(db, value):
    # Reset state inside the test
    db.clear()
    db.insert(value)
For pytest fixture lifecycle patterns that interact with Hypothesis, see pytest fixture not found. For mypy type-checking of test files using Hypothesis strategies, see Python mypy type error.
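For the "re-create inside the test" option, one possible shape is to open the resource in the test body so every example gets its own. The sketch below swaps in an in-memory SQLite database from the standard library as a stand-in for create_db(); the table name and the printable-ASCII alphabet are illustrative choices.

```python
import sqlite3
from contextlib import closing

from hypothesis import given, settings, strategies as st

# Printable ASCII only, to keep the illustration independent of encoding details.
printable = st.text(alphabet=st.characters(min_codepoint=32, max_codepoint=126))


@settings(deadline=None)
@given(printable)
def test_insert_roundtrip(value):
    # A fresh in-memory database per example: nothing survives between examples.
    with closing(sqlite3.connect(":memory:")) as conn:
        conn.execute("CREATE TABLE items (value TEXT)")
        conn.execute("INSERT INTO items VALUES (?)", (value,))
        rows = conn.execute("SELECT value FROM items").fetchall()
    assert rows == [(value,)]
```

This costs a little setup time per example, but it removes the need to remember a reset call in every test.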
Still Not Working?
Hypothesis vs Regular pytest Parametrize
- Regular @pytest.mark.parametrize: Explicit, small input sets. Best when you know exactly which inputs matter.
- Hypothesis @given: Generative, finds edge cases automatically. Best for general-purpose invariants and transformations.
Use both: parametrize for specific known-tricky inputs, given for broader property coverage.
Targeted Search with @example
Add specific inputs that must always be tested:
from hypothesis import given, example, strategies as st
@given(st.integers())
@example(0)
@example(-1)
@example(2**63 - 1) # Max int64
def test_func(n):
    ...
@example always runs these specific values on every test execution, alongside random generation. Use for known-tricky edge cases.
Coverage and Optimization
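Hypothesis shrinks for minimality by default. Shrinking only runs after a failure, so skipping it shortens failing runs, not passing ones:
from hypothesis import Phase, given, settings, strategies as st
# Keep Phase.explicit so @example cases still run; only shrinking is dropped
@settings(phases=[Phase.explicit, Phase.reuse, Phase.generate])
@given(st.integers())  # placeholder strategy
def test_fast(x): ...
Use this in CI when you only need to know whether the test fails, not the minimal input.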
Type-Based Generation with from_type
from hypothesis import given, strategies as st
from typing import List, Optional
@given(st.from_type(List[int]))
def test_list(lst): ...
@given(st.from_type(Optional[str]))
def test_optional(s): ...
Let Hypothesis infer strategies from type annotations. Works for most built-in types and many third-party types.
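The same inference applies to your own annotated classes. As a small sketch, here is a hypothetical Point dataclass generated without registering anything, relying on from_type() reading the field annotations.

```python
from dataclasses import dataclass

from hypothesis import given, strategies as st


@dataclass
class Point:  # hypothetical type, used only for illustration
    x: int
    y: int


@given(st.from_type(Point))
def test_point_fields_are_ints(p):
    # Both fields are filled in from their int annotations.
    assert isinstance(p, Point)
    assert isinstance(p.x, int) and isinstance(p.y, int)
```

When the annotations are too loose for the test (for example, only non-negative coordinates make sense), switch to st.builds() with explicit strategies or register a custom one.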
Custom Type Strategies
Register a strategy for your own types:
from hypothesis import given, strategies as st
class Money:
    def __init__(self, amount, currency):
        self.amount = amount
        self.currency = currency
# After registration, from_type(Money) resolves to this strategy
st.register_type_strategy(
    Money,
    st.builds(Money,
        amount=st.integers(min_value=0, max_value=1_000_000),
        currency=st.sampled_from(["USD", "EUR", "JPY"]),
    ),
)
@given(st.from_type(Money))
def test_money(m):
    assert m.amount >= 0
For testing patterns with pre-commit hooks that integrate Hypothesis into the commit workflow, see pre-commit not working. For Ruff-based linting that complements Hypothesis’s property testing, see Ruff not working.
Deadline Errors That Look Like Flakes
A test that passes locally but fails in CI with DeadlineExceeded is almost never flaky — CI runners are slower. The default deadline=200ms is generous for pure functions but breaks for tests that touch the filesystem, spawn subprocesses, or run on shared hardware. Either set deadline=None to disable the check, or use a CI-specific profile: settings.register_profile("ci", deadline=2000); settings.load_profile("ci") based on the CI env var. Disabling per-test (@settings(deadline=None)) is fine but leaks slow tests into the suite over time.
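A CI-specific profile could be wired up roughly as follows, for example in conftest.py. This sketch assumes your CI system sets a CI environment variable (many do, but check yours); the profile names and numbers are illustrative.

```python
import os

from hypothesis import settings

# Illustrative values: a looser deadline on CI, the default 200ms locally.
settings.register_profile("ci", deadline=2000, max_examples=200)
settings.register_profile("dev", deadline=200, max_examples=50)


def pick_profile(environ):
    """Return "ci" when a non-empty CI variable is present, else "dev"."""
    return "ci" if environ.get("CI") else "dev"


settings.load_profile(pick_profile(os.environ))
```

Keeping the choice in one helper means individual tests do not need their own deadline overrides.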
Max Examples vs Deadline Trading Off Coverage
Setting max_examples=10000 with deadline=200 and a 50ms test does not bound the run time: the deadline applies to each example, not to the test as a whole. Total runtime is roughly max_examples × per-example time, so 10000 examples at 50ms each take about 500 seconds. If a test must finish within a time budget, choose max_examples ≈ budget / per-example time instead of relying on the deadline. For instance, @settings(max_examples=1000, deadline=None, derandomize=True) keeps a 50ms test under a minute. derandomize=True makes CI runs reproducible, at the cost of exploring the same inputs on every run.
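The trade-off between example count and time budget is plain arithmetic, worked through below. The helper name is hypothetical, and the calculation ignores data-generation overhead, so treat the result as an upper bound.

```python
def examples_within_budget(budget_ms, per_example_ms, cap):
    """Largest example count that fits the budget, limited to cap and at least 1."""
    fit = budget_ms // per_example_ms
    return max(1, min(cap, fit))


ten_minutes_ms = 10 * 60 * 1000

# A 50ms test fits 12000 examples in ten minutes, so a cap of 10000 is the limit.
fast_test = examples_within_budget(ten_minutes_ms, 50, cap=10_000)

# A 500ms test only fits 1200 examples in the same budget.
slow_test = examples_within_budget(ten_minutes_ms, 500, cap=10_000)

# Expected wall-clock time, in seconds, for 10000 examples at 50ms each.
total_seconds = 10_000 * 50 / 1000

print(fast_test, slow_test, total_seconds)
```

Integer milliseconds are used on purpose: floor division with floats such as 0.05 can land one below the value you expect.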
Stateful Machine State Not Resetting Between Examples
RuleBasedStateMachine re-instantiates the class for each example, but if you stash data on a class-level (not instance-level) attribute, that data leaks across examples. The symptom: a test passes in isolation but fails when run after another example of the same class. Always initialize state in __init__, never in class-level defaults. The same applies to module-level globals touched by @rule methods.
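The leak can be reproduced without Hypothesis at all. In this plain-Python sketch the class names are hypothetical, and run_two_examples only mimics what the stateful runner does: create a fresh instance per example and call one rule on it.

```python
class LeakyState:
    seen = []  # class-level: one list shared by every instance

    def record(self, value):
        self.seen.append(value)


class CleanState:
    def __init__(self):
        self.seen = []  # instance-level: a new list per instance

    def record(self, value):
        self.seen.append(value)


def run_two_examples(state_class):
    """Mimic two examples: a fresh instance each time, one record() call each."""
    sizes = []
    for value in ("first", "second"):
        state = state_class()
        state.record(value)
        sizes.append(len(state.seen))
    return sizes


leaky_sizes = run_two_examples(LeakyState)
clean_sizes = run_two_examples(CleanState)
print(leaky_sizes, clean_sizes)
```

The second "example" of LeakyState already sees the first one's data, which is the pattern behind tests that pass alone and fail after another example.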
Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.