Fix: Pandera Not Working — Schema Validation Errors, DataFrame Types, and Lazy Mode
Quick Answer
How to fix Pandera errors — SchemaError column not in DataFrame, dtype mismatch, Check failed, lazy validation not collecting errors, SchemaModel vs DataFrameModel API, Polars support, and coercion errors.
The Error
You validate a DataFrame and Pandera raises on the first problem:
pandera.errors.SchemaError: Column 'email' not in DataFrame.
Columns in DataFrame: ['id', 'name', 'Email']
Or a schema with multiple issues reports only one:
SchemaError: non-nullable series 'age' contains null values: [3, 7]
# But there are also dtype errors, length issues, check failures
Or you migrate from SchemaModel and everything breaks:
ImportError: cannot import name 'SchemaModel' from 'pandera'
# Tutorials use SchemaModel but pandera 0.20+ only has DataFrameModel
Or your custom Check silently passes when it should fail:
@pa.check("amount")
def non_negative(cls, series):
    return (series > 0).any()  # Scalar: True if ANY value is positive
# A negative value should fail, but the check returned True anyway
Or Pandera rejects valid data because dtype coercion is off:
SchemaError: expected series 'created_at' to have type datetime64[ns],
got datetime64[us]
Pandera is the standard DataFrame validation library for Pandas and (increasingly) Polars. It catches schema violations early — before a downstream pipeline crashes on wrong data. But its API has evolved rapidly (SchemaModel → DataFrameModel), the lazy vs eager modes confuse newcomers, and dtype coercion rules have subtle gotchas. This guide covers each.
Why This Happens
Pandera validates DataFrames against schemas. By default it’s eager — the first validation failure raises an exception, so you only see one error at a time. For data discovery and batch pipelines, you usually want lazy mode that collects all errors.
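The eager-versus-lazy distinction is easy to see with a toy validator in plain pandas (no Pandera involved; the column names and rules here are illustrative): instead of raising on the first problem, it accumulates every problem and reports them together.

```python
import pandas as pd

def collect_errors(df: pd.DataFrame) -> list[str]:
    """Lazy-style validation: gather every problem instead of raising on the first."""
    errors = []
    if "email" not in df.columns:
        errors.append("missing column: email")
    if "age" in df.columns and df["age"].isna().any():
        errors.append("null values in: age")
    return errors

df = pd.DataFrame({"age": [30.0, None], "name": ["Alice", "Bob"]})
print(collect_errors(df))  # ['missing column: email', 'null values in: age']
```

An eager validator would have stopped at the missing column; the lazy one also surfaces the null-value problem in the same run. Pandera's lazy=True mode does this for every column and check in the schema.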
Pandera renamed SchemaModel to DataFrameModel (class-based schemas; the old name was deprecated in 0.14 and removed in later releases) and added first-class Polars support in 0.19. Code written against earlier tutorials breaks on the new API. The pa.DataFrameSchema(...) function-based API still works but looks different from modern class-based examples.
Fix 1: Two Schema Styles — Object vs Class-Based
Function-based (DataFrameSchema):
import pandera as pa
import pandas as pd
schema = pa.DataFrameSchema({
    "id": pa.Column(int, pa.Check.greater_than(0)),
    "name": pa.Column(str),
    "age": pa.Column(int, pa.Check.in_range(0, 120), nullable=False),
    "email": pa.Column(str, pa.Check.str_matches(r".+@.+\..+")),
})
df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "age": [30, 25, 45],
    "email": ["[email protected]", "[email protected]", "[email protected]"],
})
validated = schema.validate(df)  # Raises SchemaError if invalid
Class-based (DataFrameModel, pandera 0.14+):
import pandera as pa
from pandera.typing import Series
class UserSchema(pa.DataFrameModel):
    id: Series[int] = pa.Field(gt=0)
    name: Series[str]
    age: Series[int] = pa.Field(in_range={"min_value": 0, "max_value": 120})
    email: Series[str] = pa.Field(str_matches=r".+@.+\..+")

    class Config:
        strict = True  # Extra columns fail validation
        coerce = True  # Attempt to coerce types
validated = UserSchema.validate(df)
Class-based advantages:
- Type checking support (mypy)
- Better IDE autocomplete
- Cleaner syntax for complex schemas
- Composable (inheritance, nested)
Use class-based for new code. The function-based API is still supported but feels older.
Common Mistake: Copying tutorials that use SchemaModel — the class was renamed to DataFrameModel (SchemaModel has been a deprecated alias since pandera 0.14 and is removed in recent versions). Search-and-replace SchemaModel → DataFrameModel when porting old code.
Fix 2: Lazy Validation for Batch Error Reporting
# Eager (default) — raises on first error
try:
    schema.validate(df)
except pa.errors.SchemaError as e:
    print(e)  # Only the first error
Lazy mode collects all errors:
try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as e:  # Plural — "Errors"
    print(e.failure_cases)
    # DataFrame with columns: schema_context, column, check, failure_case, index
    for _, row in e.failure_cases.iterrows():
        print(f"{row['column']}: {row['check']} → {row['failure_case']}")
SchemaErrors vs SchemaError — plural for lazy, singular for eager. Distinct exception classes:
try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as e:
    errors = e.failure_cases  # All failures as a DataFrame
except pa.errors.SchemaError as e:  # Shouldn't fire in lazy mode
    pass
Pro Tip: Always use lazy=True in data pipelines. Getting one error per run means you fix it, re-run, get the next error, fix, re-run — slow iteration. Lazy mode surfaces everything at once, so you can triage and fix in parallel.
Print failures as a readable report:
try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as e:
    summary = e.failure_cases.groupby(["column", "check"]).size().reset_index(name="count")
    print(summary.to_string(index=False))
# column               check  count
#    age    in_range: 0..120      1
#  email  str_matches: .+@.+      3
Fix 3: Type Coercion and Nullability
import pandera as pa
from pandera.typing import Series
import pandas as pd
class Schema(pa.DataFrameModel):
    # Reading a CSV gives object dtype; coerce to int
    count: Series[int] = pa.Field(coerce=True)
df = pd.DataFrame({"count": ["1", "2", "3"]})  # strings
schema = Schema.to_schema()
validated = schema.validate(df)
print(validated["count"].dtype)  # int64 (coerced from object)
Global coerce:
class Schema(pa.DataFrameModel):
    count: Series[int]
    name: Series[str]

    class Config:
        coerce = True  # All columns attempt coercion
Nullable fields:
from typing import Optional
class Schema(pa.DataFrameModel):
    required_id: Series[int]  # Column required, no nulls
    optional_email: Optional[Series[str]]  # Column itself may be absent
    nullable_field: Series[float] = pa.Field(nullable=True)  # Allows NaN
Note that Optional marks the column as optional (it may be missing from the DataFrame); to allow null values in a present column, use nullable=True.
Datetime coercion — pandas datetime has nanosecond precision by default, causing mismatches:
from typing import Annotated
class Schema(pa.DataFrameModel):
    # DatetimeTZDtype must be parametrized with unit and timezone
    created_at: Series[Annotated[pd.DatetimeTZDtype, "ns", "UTC"]]
For date handling:
import pandera as pa
from pandera.typing import Series
from datetime import date
class Schema(pa.DataFrameModel):
    order_date: Series[date]  # pandas converts to datetime64[ns]
# If you get "expected date, got datetime64[ns]", add coerce
class Schema(pa.DataFrameModel):
    order_date: Series[date] = pa.Field(coerce=True)
Fix 4: Custom Checks
import pandera as pa
from pandera.typing import Series
class OrderSchema(pa.DataFrameModel):
    id: Series[int] = pa.Field(gt=0)
    amount: Series[float]
    currency: Series[str] = pa.Field(isin=["USD", "EUR", "GBP", "JPY"])

    # Column-level check
    @pa.check("amount")
    def amount_non_negative(cls, amount: Series[float]) -> Series[bool]:
        return amount >= 0  # Return boolean Series — True means valid

    # Class-level check (operates on whole DataFrame)
    @pa.dataframe_check
    def usd_amounts_reasonable(cls, df):
        usd_rows = df[df["currency"] == "USD"]
        return usd_rows["amount"].max() < 100_000
@pa.check semantics:
- Decorate a method on a DataFrameModel
- Return a boolean Series (same length as the column) — True = valid
- Return a scalar bool for class-level checks
Common Mistake: Returning a scalar instead of a boolean Series. With return amount > 0, the check passes only when every element is True, and Pandera reports exactly the rows where your boolean is False — per-row diagnostics work as expected. But if you return a scalar like amount.min() > 0, the check passes or fails as one unit with no per-row reporting, and a scalar like (amount > 0).any() silently passes whenever a single value is positive.
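The difference is visible in plain pandas, before Pandera is involved at all:

```python
import pandas as pd

amount = pd.Series([10.0, -3.0, 7.0])

# Element-wise result: Pandera can report exactly which rows failed
row_result = amount >= 0
print(row_result.tolist())  # [True, False, True]

# Scalar result: one verdict for the whole column, no per-row report
print(bool((amount >= 0).all()))  # False

# The dangerous scalar: passes because at least ONE value is positive
print(bool((amount > 0).any()))  # True
```

Prefer returning the element-wise Series and let Pandera do the aggregation and failure reporting.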
Registering reusable checks:
import pandera.extensions as extensions

@extensions.register_check_method(statistics=["currency"])
def is_valid_currency(pandas_obj, *, currency: str):
    valid = {"USD": ["0.01", "9999999.99"], "JPY": ["1", "9999999"]}
    if currency not in valid:
        return False
    return (pandas_obj >= float(valid[currency][0])) & (pandas_obj <= float(valid[currency][1]))

# Use in schema
class Schema(pa.DataFrameModel):
    usd_amount: Series[float] = pa.Field(is_valid_currency={"currency": "USD"})
Fix 5: Multi-Column and Uniqueness Constraints
from typing import Annotated
import pandas as pd
from pandera.typing import Series, Index
class Schema(pa.DataFrameModel):
    user_id: Series[int]
    session_id: Series[str]
    event_type: Series[str]
    timestamp: Series[Annotated[pd.DatetimeTZDtype, "ns", "UTC"]]

    class Config:
        # Combined uniqueness — (user_id, session_id, event_type) must be unique
        unique = ["user_id", "session_id", "event_type"]

    @pa.dataframe_check
    def max_sessions_per_user(cls, df):
        # Each user has at most 10 sessions
        return df.groupby("user_id")["session_id"].nunique().max() <= 10
Index schema:
class Schema(pa.DataFrameModel):
    idx: Index[int] = pa.Field(unique=True, ge=0)
    value: Series[float]
Multiindex:
from pandera.typing import MultiIndex, Index
class Schema(pa.DataFrameModel):
    idx_level_0: Index[str]
    idx_level_1: Index[int]
    value: Series[float]

    class Config:
        multiindex_strict = True
Fix 6: Function Decorators — Validate Inputs and Outputs
import pandera as pa
from pandera.typing import DataFrame, Series
class InputSchema(pa.DataFrameModel):
    id: Series[int]
    amount: Series[float]
class OutputSchema(pa.DataFrameModel):
    id: Series[int]
    amount: Series[float]
    amount_with_tax: Series[float]

@pa.check_types
def add_tax(df: DataFrame[InputSchema]) -> DataFrame[OutputSchema]:
    df = df.copy()
    df["amount_with_tax"] = df["amount"] * 1.08
    return df

# Raises SchemaError if input or output doesn't match
validated = add_tax(my_df)
@pa.check_io for explicit input/output validation:
import pandera as pa
input_schema = pa.DataFrameSchema({"x": pa.Column(int)})
output_schema = pa.DataFrameSchema({"y": pa.Column(int)})
@pa.check_io(df=input_schema, out=output_schema)
def transform(df):
    df = df.copy()
    df["y"] = df["x"] * 2
    return df.drop(columns=["x"])
@pa.check_input for input only (useful when output is hard to specify):
@pa.check_input(input_schema)
def process(df):
    return some_complex_output(df)
Fix 7: Polars Support (pandera 0.19+)
pip install pandera[polars]
import pandera.polars as pla
import polars as pl
from pandera.typing.polars import Series
class UserSchema(pla.DataFrameModel):
    id: Series[int] = pla.Field(gt=0)
    name: Series[str]
    age: Series[int] = pla.Field(in_range={"min_value": 0, "max_value": 120})
df = pl.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "age": [30, 25, 45],
})
UserSchema.validate(df)  # Validates a Polars DataFrame
Lazy Polars validation (works on LazyFrame):
lazy_df = pl.scan_csv("data.csv")
validated = UserSchema.validate(lazy_df) # Returns validated LazyFrame
result = validated.collect()  # Execute the plan
Validation happens when you .collect() — until then it’s added to the query plan.
For Polars-specific patterns and errors, see Polars not working.
Polars check syntax slightly differs from Pandas:
class Schema(pla.DataFrameModel):
    amount: Series[float]

    @pla.check("amount")
    def non_negative(cls, amount: pl.Expr) -> pl.Expr:
        return amount >= 0
Polars checks use pl.Expr (polars expressions) instead of pandas Series. The expression system is lazy and composable.
Fix 8: Schema Inference and Generation
Bootstrap a schema from existing data:
import pandera as pa
import pandas as pd
df = pd.read_csv("data.csv")
# Infer schema from the DataFrame
schema = pa.infer_schema(df)
print(schema)
# DataFrameSchema(
#     columns={
#         "id": Column(int64, ...),
#         "name": Column(str, ...),
#         ...
#     }
# )
# Save for version control
with open("schema.yaml", "w") as f:
    f.write(schema.to_yaml())
# Load later
schema = pa.DataFrameSchema.from_yaml("schema.yaml")
Generate equivalent schema code from the inferred schema:
schema = pa.infer_schema(df)
print(schema.to_script())  # Prints equivalent DataFrameSchema code
Common Mistake: Using inferred schemas as-is in production. Inferred schemas are a starting point — they reflect the data you inferred from, including its quirks (specific values in an isin check, narrow ranges). Always review and tighten the inferred schema before committing it as your contract.
Still Not Working?
Pandera vs Pydantic for DataFrames
- Pandera — Native DataFrame support, column-level checks, lazy validation, schema inference. Best for DataFrame-centric pipelines.
- Pydantic — Row-level validation, serialization, rich ecosystem. Best when you’re iterating over rows or interfacing with APIs.
You can combine them: validate individual row records with Pydantic, then assemble into a DataFrame and validate structure with Pandera.
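A minimal sketch of that combined pattern, assuming Pydantic v2 (the final Pandera call is left as a comment because it depends on a schema of your choosing):

```python
import pandas as pd
from pydantic import BaseModel, PositiveInt

class UserRecord(BaseModel):
    id: PositiveInt
    name: str

raw = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]

# Row-level validation: each record is checked individually
rows = [UserRecord(**r).model_dump() for r in raw]

# Frame-level validation: structure checked once assembled
df = pd.DataFrame(rows)
# UserSchema.validate(df)  # a Pandera DataFrameModel of your choosing
print(df.shape)  # (2, 2)
```

Pydantic gives precise per-record error messages at ingestion; Pandera then guards column-level invariants (dtypes, uniqueness, cross-column checks) on the assembled frame.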
For Pydantic-specific validation patterns, see Pydantic validation error.
Integration with pytest
import pytest
import pandas as pd
import pandera as pa
from pandera.typing import Series
class Schema(pa.DataFrameModel):
    x: Series[int] = pa.Field(gt=0)

def test_schema_valid():
    df = pd.DataFrame({"x": [1, 2, 3]})
    Schema.validate(df)

def test_schema_invalid():
    df = pd.DataFrame({"x": [-1, 2, 3]})
    with pytest.raises(pa.errors.SchemaError):
        Schema.validate(df)

def test_schema_lazy():
    df = pd.DataFrame({"x": [-1, 0, 3]})
    with pytest.raises(pa.errors.SchemaErrors) as exc:
        Schema.validate(df, lazy=True)
    assert len(exc.value.failure_cases) == 2  # Two values fail
For pytest fixture patterns with DataFrames, see pytest fixture not found.
Integration with dbt Tests
Pandera fits alongside dbt’s SQL-level tests: use Pandera for Python transformations, dbt tests for warehouse data. They validate different layers of the same pipeline. For dbt-specific test patterns, see dbt not working.
Integration with Pandas and NumPy
# When Pandera's dtype error is confusing, inspect pandas dtypes directly
print(df.dtypes)
print(df["col"].dtype.name)
For NumPy-level dtype issues that bubble up as Pandera errors, see NumPy not working.
Nested Schemas and Composition
Schemas inherit cleanly — define common fields in a base schema and extend for specific variants:
class BaseEvent(pa.DataFrameModel):
    id: Series[int] = pa.Field(gt=0)
    user_id: Series[int] = pa.Field(gt=0)
    timestamp: Series[pd.DatetimeTZDtype]

class PurchaseEvent(BaseEvent):
    product_id: Series[int]
    amount: Series[float] = pa.Field(gt=0)
    currency: Series[str] = pa.Field(isin=["USD", "EUR", "JPY"])

class LoginEvent(BaseEvent):
    ip_address: Series[str] = pa.Field(str_matches=r"^\d+\.\d+\.\d+\.\d+$")
    user_agent: Series[str]

# Both schemas include id, user_id, timestamp from BaseEvent
PurchaseEvent.validate(purchase_df)
LoginEvent.validate(login_df)
Composition keeps common constraints DRY. Override specific fields in subclasses to tighten or relax rules.
Error Message Customization
Default error messages sometimes lack context for business users. Customize:
class OrderSchema(pa.DataFrameModel):
    amount: Series[float] = pa.Field(
        gt=0,
        description="Total order amount in cents",
    )

    @pa.check("amount", error="Amount exceeds $10,000 fraud threshold")
    def under_fraud_threshold(cls, amount):
        return amount <= 1_000_000
Error messages appear in the failure_cases DataFrame when using lazy validation — helpful for non-technical stakeholders reviewing data quality reports.
Sync and Async Validation
Pandera is synchronous. For async pipelines, validate in a background thread:
import asyncio
async def validate_async(df, schema):
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, schema.validate, df)

validated_df = await validate_async(df, MySchema.to_schema())
Validation can be CPU-intensive on large DataFrames — offloading to an executor prevents blocking the event loop.
Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.