
Fix: Pandera Not Working — Schema Validation Errors, DataFrame Types, and Lazy Mode

FixDevs

Quick Answer

How to fix Pandera errors — SchemaError column not in DataFrame, dtype mismatch, Check failed, lazy validation not collecting errors, SchemaModel vs DataFrameModel API, Polars support, and coercion errors.

The Error

You validate a DataFrame and Pandera raises on the first problem:

pandera.errors.SchemaError: Column 'email' not in DataFrame.
Columns in DataFrame: ['id', 'name', 'Email']

Or a schema with multiple issues reports only one:

SchemaError: non-nullable series 'age' contains null values: [3, 7]
# But there are also dtype errors, length issues, check failures

Or you migrate from SchemaModel and everything breaks:

ImportError: cannot import name 'SchemaModel' from 'pandera'
# Tutorials use SchemaModel but pandera 0.20+ only has DataFrameModel

Or your custom Check silently passes when it should fail:

@pa.check("amount")
def non_negative(cls, series):
    return bool((series > 0).any())   # Bug: .any() is True if ANY value is positive
# One positive value makes the whole check pass, masking the negatives

Or Pandera rejects valid data because dtype coercion is off:

SchemaError: expected series 'created_at' to have type datetime64[ns],
got datetime64[us]

Pandera is the standard DataFrame validation library for Pandas and (increasingly) Polars. It catches schema violations early — before a downstream pipeline crashes on wrong data. But its API has evolved rapidly (SchemaModel → DataFrameModel), the lazy vs eager modes confuse newcomers, and dtype coercion rules have subtle gotchas. This guide covers each.

Why This Happens

Pandera validates DataFrames against schemas. By default it’s eager — the first validation failure raises an exception, so you only see one error at a time. For data discovery and batch pipelines, you usually want lazy mode that collects all errors.

Pandera 0.18 renamed SchemaModel to DataFrameModel (class-based schemas) and added first-class Polars support. Code written against 0.17 or earlier tutorials breaks on the new API. The pa.DataFrameSchema(...) function-based API still works but looks different from modern class-based examples.

Fix 1: Two Schema Styles — Object vs Class-Based

Function-based (DataFrameSchema):

import pandera as pa
import pandas as pd

schema = pa.DataFrameSchema({
    "id": pa.Column(int, pa.Check.greater_than(0)),
    "name": pa.Column(str),
    "age": pa.Column(int, pa.Check.in_range(0, 120), nullable=False),
    "email": pa.Column(str, pa.Check.str_matches(r".+@.+\..+")),
})

df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "age": [30, 25, 45],
    "email": ["[email protected]", "[email protected]", "[email protected]"],
})

validated = schema.validate(df)   # Raises SchemaError if invalid

Class-based (DataFrameModel, pandera 0.18+):

import pandera as pa
from pandera.typing import Series

class UserSchema(pa.DataFrameModel):
    id: Series[int] = pa.Field(gt=0)
    name: Series[str]
    age: Series[int] = pa.Field(in_range={"min_value": 0, "max_value": 120})
    email: Series[str] = pa.Field(str_matches=r".+@.+\..+")

    class Config:
        strict = True   # Extra columns fail validation
        coerce = True   # Attempt to coerce types

validated = UserSchema.validate(df)

Class-based advantages:

  • Type checking support (mypy)
  • Better IDE autocomplete
  • Cleaner syntax for complex schemas
  • Composable (inheritance, nested)

Use class-based schemas for new code. The function-based API is still fully supported and remains useful when schemas must be built dynamically at runtime.

Common Mistake: Copying tutorials that use SchemaModel — the class was renamed to DataFrameModel in pandera 0.18 (January 2024). SchemaModel survived as a deprecated alias and has been removed in recent versions. Search-and-replace SchemaModel → DataFrameModel when porting old code.

Fix 2: Lazy Validation for Batch Error Reporting

# Eager (default) — raises on first error
try:
    schema.validate(df)
except pa.errors.SchemaError as e:
    print(e)   # Only the first error

Lazy mode collects all errors:

try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as e:   # Plural — "Errors"
    print(e.failure_cases)
    # DataFrame with columns: schema_context, column, check, failure_case, index

    for _, row in e.failure_cases.iterrows():
        print(f"{row['column']}: {row['check']} → {row['failure_case']}")

SchemaErrors vs SchemaError — plural for lazy, singular for eager. Distinct exception classes:

try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as e:
    errors = e.failure_cases   # All failures as a DataFrame
except pa.errors.SchemaError as e:   # Shouldn't fire in lazy mode
    pass

Pro Tip: Always use lazy=True in data pipelines. Getting one error per run means you fix it, re-run, get the next error, fix, re-run — slow iteration. Lazy mode surfaces everything at once, so you can triage and fix in parallel.

Print failures as a readable report:

try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as e:
    summary = e.failure_cases.groupby(["column", "check", "failure_case"]).size().reset_index(name="count")
    print(summary.to_string(index=False))
    # column  check                failure_case  count
    # age     in_range: 0..120     -5            1
    # email   str_matches: .+@.+   not-an-email  3

Fix 3: Type Coercion and Nullability

import pandera as pa
from pandera.typing import Series
import pandas as pd

class Schema(pa.DataFrameModel):
    # Read CSV gives object dtype; coerce to int
    count: Series[int] = pa.Field(coerce=True)

df = pd.DataFrame({"count": ["1", "2", "3"]})   # strings
schema = Schema.to_schema()
validated = schema.validate(df)
print(validated["count"].dtype)   # int64 (coerced from object)

Global coerce:

class Schema(pa.DataFrameModel):
    count: Series[int]
    name: Series[str]

    class Config:
        coerce = True   # All columns attempt coercion

Nullable fields:

from typing import Optional

class Schema(pa.DataFrameModel):
    required_id: Series[int]                    # Column required, nulls rejected
    optional_email: Optional[Series[str]]       # Column itself may be absent
    nullable_field: Series[float] = pa.Field(nullable=True)   # Column required, NaN allowed

Note the distinction: Optional marks the column as optional — the DataFrame may omit it entirely. To allow missing values inside a required column, use nullable=True.

Datetime precision — pandas defaults to nanosecond precision, while other sources (Parquet readers, pandas 2.x non-nanosecond units) can produce datetime64[us], triggering the mismatch shown earlier. Coerce, or pin the dtype parameters explicitly:

from typing import Annotated

class Schema(pa.DataFrameModel):
    # Parametrized dtype: DatetimeTZDtype takes unit and timezone arguments
    created_at: Series[Annotated[pd.DatetimeTZDtype, "ns", "UTC"]]

For date handling:

import pandera as pa
from pandera.typing import Series
from datetime import date

class Schema(pa.DataFrameModel):
    order_date: Series[date]   # pandas converts to datetime64[ns]

# If you get "expected date, got datetime64[ns]", add coerce
class Schema(pa.DataFrameModel):
    order_date: Series[date] = pa.Field(coerce=True)

Fix 4: Custom Checks

import pandera as pa
from pandera.typing import Series

class OrderSchema(pa.DataFrameModel):
    id: Series[int] = pa.Field(gt=0)
    amount: Series[float]
    currency: Series[str] = pa.Field(isin=["USD", "EUR", "GBP", "JPY"])

    # Column-level check
    @pa.check("amount")
    def amount_non_negative(cls, amount: Series[float]) -> Series[bool]:
        return amount >= 0   # Return boolean Series — True means valid

    # Class-level check (operates on whole DataFrame)
    @pa.dataframe_check
    def usd_amounts_reasonable(cls, df):
        usd_rows = df[df["currency"] == "USD"]
        return usd_rows["amount"].max() < 100_000

@pa.check semantics:

  • Decorate a method on a DataFrameModel
  • Return a boolean Series (same length as the column) — True = valid
  • Return a scalar bool for class-level checks

Common Mistake: Mixing up element-wise and scalar checks. An element-wise check returns a boolean Series; the check passes only if every element is True, and Pandera reports the rows where it is False. A scalar check (e.g. return amount.min() > 0) passes or fails as a single unit, so you get no per-row failure reporting. Both are valid — just be deliberate about which one you write.

Registering reusable checks:

import pandera.extensions as extensions

@extensions.register_check_method(statistics=["currency"])
def is_valid_currency(pandas_obj, *, currency: str):
    valid = {"USD": ["0.01", "9999999.99"], "JPY": ["1", "9999999"]}
    if currency not in valid:
        return False
    return (pandas_obj >= float(valid[currency][0])) & (pandas_obj <= float(valid[currency][1]))

# Use in schema
class Schema(pa.DataFrameModel):
    usd_amount: Series[float] = pa.Field(is_valid_currency={"currency": "USD"})

Fix 5: Multi-Column and Uniqueness Constraints

import pandas as pd
import pandera as pa
from typing import Annotated
from pandera.typing import Series, Index

class Schema(pa.DataFrameModel):
    user_id: Series[int]
    session_id: Series[str]
    event_type: Series[str]
    timestamp: Series[Annotated[pd.DatetimeTZDtype, "ns", "UTC"]]

    class Config:
        # Combined uniqueness — (user_id, session_id, event_type) must be unique
        unique = ["user_id", "session_id", "event_type"]

    @pa.dataframe_check
    def max_ten_sessions_per_user(cls, df):
        # Each user has at most 10 sessions
        return df.groupby("user_id")["session_id"].nunique().max() <= 10

Index schema:

class Schema(pa.DataFrameModel):
    idx: Index[int] = pa.Field(unique=True, ge=0)
    value: Series[float]

Multiindex:

from pandera.typing import Index, Series   # multiple Index annotations form a MultiIndex

class Schema(pa.DataFrameModel):
    idx_level_0: Index[str]
    idx_level_1: Index[int]
    value: Series[float]

    class Config:
        multiindex_strict = True

Fix 6: Function Decorators — Validate Inputs and Outputs

import pandera as pa
from pandera.typing import DataFrame, Series

class InputSchema(pa.DataFrameModel):
    id: Series[int]
    amount: Series[float]

class OutputSchema(pa.DataFrameModel):
    id: Series[int]
    amount: Series[float]
    amount_with_tax: Series[float]

@pa.check_types
def add_tax(df: DataFrame[InputSchema]) -> DataFrame[OutputSchema]:
    df = df.copy()
    df["amount_with_tax"] = df["amount"] * 1.08
    return df

# Raises SchemaError if input or output doesn't match
validated = add_tax(my_df)

@pa.check_io for explicit input/output validation:

import pandera as pa

input_schema = pa.DataFrameSchema({"x": pa.Column(int)})
output_schema = pa.DataFrameSchema({"y": pa.Column(int)})

@pa.check_io(df=input_schema, out=output_schema)
def transform(df):
    df = df.copy()
    df["y"] = df["x"] * 2
    return df.drop(columns=["x"])

@pa.check_input for input only (useful when output is hard to specify):

@pa.check_input(input_schema)
def process(df):
    return some_complex_output(df)

Fix 7: Polars Support (pandera 0.18+)

pip install 'pandera[polars]'

import pandera.polars as pla
import polars as pl
from pandera.typing.polars import Series

class UserSchema(pla.DataFrameModel):
    id: Series[int] = pla.Field(gt=0)
    name: Series[str]
    age: Series[int] = pla.Field(in_range={"min_value": 0, "max_value": 120})

df = pl.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "age": [30, 25, 45],
})

UserSchema.validate(df)   # Validates a Polars DataFrame

Lazy Polars validation (works on LazyFrame):

lazy_df = pl.scan_csv("data.csv")
validated = UserSchema.validate(lazy_df)   # Returns validated LazyFrame
result = validated.collect()   # Execute the plan

Validation happens when you .collect() — until then it’s added to the query plan.

For Polars-specific patterns and errors, see Polars not working.

Polars check syntax differs from pandas — custom checks receive a PolarsData object (a LazyFrame plus the name of the column under check) rather than a Series:

from pandera.polars import PolarsData

class Schema(pla.DataFrameModel):
    amount: Series[float]

    @pla.check("amount")
    def non_negative(cls, data: PolarsData) -> pl.LazyFrame:
        # data.key is the column being checked; return a boolean column
        return data.lazyframe.select(pl.col(data.key).ge(0))

Polars checks are built from pl.Expr (polars expressions) instead of pandas Series. The expression system is lazy and composable.

Fix 8: Schema Inference and Generation

Bootstrap a schema from existing data:

import pandera as pa
import pandas as pd

df = pd.read_csv("data.csv")

# Infer schema from the DataFrame
schema = pa.infer_schema(df)
print(schema)
# DataFrameSchema(
#     columns={
#         "id": Column(int64, ...),
#         "name": Column(str, ...),
#         ...
#     }
# )

# Save for version control
with open("schema.yaml", "w") as f:
    f.write(schema.to_yaml())

# Load later
schema = pa.DataFrameSchema.from_yaml("schema.yaml")

Generate equivalent schema code instead of YAML:

schema = pa.infer_schema(df)
print(schema.to_script())   # Prints equivalent DataFrameSchema code

Common Mistake: Using inferred schemas as-is in production. Inferred schemas are a starting point — they reflect the data you inferred from, including its quirks (specific values in an isin check, narrow ranges). Always review and tighten the inferred schema before committing it as your contract.

Still Not Working?

Pandera vs Pydantic for DataFrames

  • Pandera — Native DataFrame support, column-level checks, lazy validation, schema inference. Best for DataFrame-centric pipelines.
  • Pydantic — Row-level validation, serialization, rich ecosystem. Best when you’re iterating over rows or interfacing with APIs.

You can combine them: validate individual row records with Pydantic, then assemble into a DataFrame and validate structure with Pandera.

For Pydantic-specific validation patterns, see Pydantic validation error.

Integration with pytest

import pytest
import pandas as pd
import pandera as pa
from pandera.typing import Series

class Schema(pa.DataFrameModel):
    x: Series[int] = pa.Field(gt=0)

def test_schema_valid():
    df = pd.DataFrame({"x": [1, 2, 3]})
    Schema.validate(df)

def test_schema_invalid():
    df = pd.DataFrame({"x": [-1, 2, 3]})
    with pytest.raises(pa.errors.SchemaError):
        Schema.validate(df)

def test_schema_lazy():
    df = pd.DataFrame({"x": [-1, 0, 3]})
    with pytest.raises(pa.errors.SchemaErrors) as exc:
        Schema.validate(df, lazy=True)
    assert len(exc.value.failure_cases) == 2   # Two values fail

For pytest fixture patterns with DataFrames, see pytest fixture not found.

Integration with dbt Tests

Pandera fits alongside dbt’s SQL-level tests: use Pandera for Python transformations, dbt tests for warehouse data. They validate different layers of the same pipeline. For dbt-specific test patterns, see dbt not working.

Integration with Pandas and NumPy

# When Pandera's dtype error is confusing, inspect pandas dtypes directly
print(df.dtypes)
print(df["col"].dtype.name)

For NumPy-level dtype issues that bubble up as Pandera errors, see NumPy not working.

Nested Schemas and Composition

Schemas inherit cleanly — define common fields in a base schema and extend for specific variants:

class BaseEvent(pa.DataFrameModel):
    id: Series[int] = pa.Field(gt=0)
    user_id: Series[int] = pa.Field(gt=0)
    timestamp: Series[pd.DatetimeTZDtype]

class PurchaseEvent(BaseEvent):
    product_id: Series[int]
    amount: Series[float] = pa.Field(gt=0)
    currency: Series[str] = pa.Field(isin=["USD", "EUR", "JPY"])

class LoginEvent(BaseEvent):
    ip_address: Series[str] = pa.Field(str_matches=r"^\d+\.\d+\.\d+\.\d+$")
    user_agent: Series[str]

# Both schemas include id, user_id, timestamp from BaseEvent
PurchaseEvent.validate(purchase_df)
LoginEvent.validate(login_df)

Composition keeps common constraints DRY. Override specific fields in subclasses to tighten or relax rules.

Error Message Customization

Default error messages sometimes lack context for business users. Customize:

class OrderSchema(pa.DataFrameModel):
    amount: Series[float] = pa.Field(
        gt=0,
        description="Total order amount in cents",
    )

    @pa.check("amount", error="Amount exceeds $10,000 fraud threshold")
    def under_fraud_threshold(cls, amount):
        return amount <= 1_000_000

Error messages appear in the failure_cases DataFrame when using lazy validation — helpful for non-technical stakeholders reviewing data quality reports.

Sync and Async Validation

Pandera is synchronous. For async pipelines, validate in a background thread:

import asyncio

async def validate_async(df, schema):
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, schema.validate, df)

validated_df = await validate_async(df, MySchema.to_schema())

Validation can be CPU-intensive on large DataFrames — offloading to an executor prevents blocking the event loop.


FixDevs

Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.
