Fix: Pandera Not Working — Schema Validation Errors, DataFrame Types, and Lazy Mode

FixDevs

Quick Answer

How to fix Pandera errors — SchemaError column not in DataFrame, dtype mismatch, Check failed, lazy validation not collecting errors, SchemaModel vs DataFrameModel API, Polars support, and coercion errors.

The Error

You validate a DataFrame and Pandera raises on the first problem:

pandera.errors.SchemaError: Column 'email' not in DataFrame.
Columns in DataFrame: ['id', 'name', 'Email']

Or a schema with multiple issues reports only one:

SchemaError: non-nullable series 'age' contains null values: [3, 7]
# But there are also dtype errors, length issues, check failures

Or you migrate from SchemaModel and everything breaks:

ImportError: cannot import name 'SchemaModel' from 'pandera'
# Tutorials use SchemaModel but pandera 0.20+ only has DataFrameModel

Or your custom Check silently passes when it should fail:

@pa.check("amount")
def non_negative(cls, series):
    return series > 0   # Returns True when all values > 0
# A negative value should fail, but the check returned True anyway

Or Pandera rejects valid data because dtype coercion is off:

SchemaError: expected series 'created_at' to have type datetime64[ns],
got datetime64[us]

Pandera is a widely used DataFrame validation library for Pandas and, increasingly, Polars. It catches schema violations early, before a downstream pipeline crashes on bad data. But its API has evolved quickly (SchemaModel → DataFrameModel), the lazy and eager modes confuse newcomers, and dtype coercion has subtle gotchas. This guide covers each.

Why This Happens

Pandera validates DataFrames against schemas. By default it’s eager — the first validation failure raises an exception, so you only see one error at a time. For data discovery and batch pipelines, you usually want lazy mode that collects all errors.

Pandera 0.14 renamed SchemaModel to DataFrameModel (class-based schemas), 0.19 added first-class Polars support, and 0.20 removed the SchemaModel alias. Code written against older tutorials breaks on the new API. The object-based pa.DataFrameSchema(...) API still works but looks different from modern class-based examples.

Fix 1: Two Schema Styles — Object vs Class-Based

Object-based (DataFrameSchema):

import pandera as pa
import pandas as pd

schema = pa.DataFrameSchema({
    "id": pa.Column(int, pa.Check.greater_than(0)),
    "name": pa.Column(str),
    "age": pa.Column(int, pa.Check.in_range(0, 120), nullable=False),
    "email": pa.Column(str, pa.Check.str_matches(r".+@.+\..+")),
})

df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "age": [30, 25, 45],
    "email": ["alice@example.com", "bob@example.com", "charlie@example.com"],
})

validated = schema.validate(df)   # Raises SchemaError if invalid

Class-based (DataFrameModel, pandera 0.14+):

import pandera as pa
from pandera.typing import Series

class UserSchema(pa.DataFrameModel):
    id: Series[int] = pa.Field(gt=0)
    name: Series[str]
    age: Series[int] = pa.Field(in_range={"min_value": 0, "max_value": 120})
    email: Series[str] = pa.Field(str_matches=r".+@.+\..+")

    class Config:
        strict = True   # Extra columns fail validation
        coerce = True   # Attempt to coerce types

validated = UserSchema.validate(df)

Class-based advantages:

  • Type checking support (mypy)
  • Better IDE autocomplete
  • Cleaner syntax for complex schemas
  • Composable (inheritance, nested)

Use class-based for new code. The object-based API is still supported but reads as the older style.

Common Mistake: Copying tutorials that use SchemaModel. The class was renamed to DataFrameModel in pandera 0.14; SchemaModel lingered as a deprecated alias and was removed in 0.20. Search-and-replace SchemaModel → DataFrameModel when porting old code.

Fix 2: Lazy Validation for Batch Error Reporting

# Eager (default) — raises on first error
try:
    schema.validate(df)
except pa.errors.SchemaError as e:
    print(e)   # Only the first error

Lazy mode collects all errors:

try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as e:   # Plural — "Errors"
    print(e.failure_cases)
    # DataFrame with columns: schema_context, column, check, failure_case, index

    for _, row in e.failure_cases.iterrows():
        print(f"{row['column']}: {row['check']} -> {row['failure_case']}")

SchemaErrors vs SchemaError — plural for lazy, singular for eager. Distinct exception classes:

try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as e:
    errors = e.failure_cases   # All failures as a DataFrame
except pa.errors.SchemaError as e:   # Shouldn't fire in lazy mode
    pass

Pro Tip: Always use lazy=True in data pipelines. Getting one error per run means you fix it, re-run, get the next error, fix, re-run — slow iteration. Lazy mode surfaces everything at once, so you can triage and fix in parallel.

Print failures as a readable report:

try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as e:
    summary = e.failure_cases.groupby(["column", "check"]).size().reset_index(name="count")
    print(summary.to_string(index=False))
    # Illustrative output (one row per column/check pair):
    # column  check                      count
    # age     in_range(0, 120)           1
    # email   str_matches('.+@.+\..+')   3

Fix 3: Type Coercion and Nullability

import pandera as pa
from pandera.typing import Series
import pandas as pd

class Schema(pa.DataFrameModel):
    # Read CSV gives object dtype; coerce to int
    count: Series[int] = pa.Field(coerce=True)

df = pd.DataFrame({"count": ["1", "2", "3"]})   # strings
schema = Schema.to_schema()
validated = schema.validate(df)
print(validated["count"].dtype)   # int64 (coerced from object)

Global coerce:

class Schema(pa.DataFrameModel):
    count: Series[int]
    name: Series[str]

    class Config:
        coerce = True   # All columns attempt coercion

Nullable fields:

from typing import Optional

class Schema(pa.DataFrameModel):
    required_id: Series[int]                    # Column required, no nulls allowed
    optional_email: Optional[Series[str]]       # Column may be absent (this does NOT allow nulls)
    nullable_field: Series[float] = pa.Field(nullable=True)   # Allows NaN / None values

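Datetime dtypes: pandas 2.x supports non-nanosecond resolutions, so a column can arrive as datetime64[us] while the schema expects datetime64[ns]. Enable coerce or cast the column before validating. Timezone-aware columns need a parametrized dtype, because a bare pd.DatetimeTZDtype cannot be instantiated without a timezone:

from typing import Annotated

class Schema(pa.DataFrameModel):
    # Unit and timezone are passed as Annotated arguments
    created_at: Series[Annotated[pd.DatetimeTZDtype, "ns", "UTC"]]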

For date handling:

import pandera as pa
from pandera.typing import Series
from datetime import date

class Schema(pa.DataFrameModel):
    order_date: Series[date]   # pandas converts to datetime64[ns]

# If you get "expected date, got datetime64[ns]", add coerce
class Schema(pa.DataFrameModel):
    order_date: Series[date] = pa.Field(coerce=True)

Fix 4: Custom Checks

import pandera as pa
from pandera.typing import Series

class OrderSchema(pa.DataFrameModel):
    id: Series[int] = pa.Field(gt=0)
    amount: Series[float]
    currency: Series[str] = pa.Field(isin=["USD", "EUR", "GBP", "JPY"])

    # Column-level check
    @pa.check("amount")
    def amount_non_negative(cls, amount: Series[float]) -> Series[bool]:
        return amount >= 0   # Return boolean Series — True means valid

    # Class-level check (operates on whole DataFrame)
    @pa.dataframe_check
    def usd_amounts_reasonable(cls, df):
        usd_rows = df[df["currency"] == "USD"]
        return usd_rows["amount"].max() < 100_000

@pa.check semantics:

  • Decorate a method on a DataFrameModel
  • Return a boolean Series (same length as the column) — True = valid
  • Return a scalar bool for class-level checks

Common Mistake: Misreading how the return value is interpreted. When a check returns a boolean Series, it passes only if every element is True, and Pandera reports the rows where the value is False. So return amount > 0 does fail on negative values (and on zeros, so use >= 0 if zero is valid). If you return a scalar instead, such as amount.min() > 0, the check passes or fails as one unit with no per-row reporting.

Registering reusable checks:

import pandera as pa
import pandera.extensions as extensions
from pandera.typing import Series

@extensions.register_check_method(statistics=["currency"])
def is_valid_currency(pandas_obj, *, currency: str):
    valid = {"USD": ["0.01", "9999999.99"], "JPY": ["1", "9999999"]}
    if currency not in valid:
        return False
    return (pandas_obj >= float(valid[currency][0])) & (pandas_obj <= float(valid[currency][1]))

# Use in schema
class Schema(pa.DataFrameModel):
    usd_amount: Series[float] = pa.Field(is_valid_currency={"currency": "USD"})

Fix 5: Multi-Column and Uniqueness Constraints

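from typing import Annotated

import pandas as pd
import pandera as pa
from pandera.typing import Index, Series

class Schema(pa.DataFrameModel):
    user_id: Series[int]
    session_id: Series[str]
    event_type: Series[str]
    timestamp: Series[Annotated[pd.DatetimeTZDtype, "ns", "UTC"]]   # unit and tz are required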

    class Config:
        # Combined uniqueness — (user_id, session_id, event_type) must be unique
        unique = ["user_id", "session_id", "event_type"]

    @pa.dataframe_check
    def max_ten_sessions_per_user(cls, df):
        # Each user has at most 10 distinct sessions
        return df.groupby("user_id")["session_id"].nunique().max() <= 10

Index schema:

class Schema(pa.DataFrameModel):
    idx: Index[int] = pa.Field(unique=True, ge=0)
    value: Series[float]

Multiindex:

from pandera.typing import Index, Series

class Schema(pa.DataFrameModel):
    idx_level_0: Index[str]
    idx_level_1: Index[int]
    value: Series[float]

    class Config:
        multiindex_strict = True

Fix 6: Function Decorators — Validate Inputs and Outputs

import pandera as pa
from pandera.typing import DataFrame, Series

class InputSchema(pa.DataFrameModel):
    id: Series[int]
    amount: Series[float]

class OutputSchema(pa.DataFrameModel):
    id: Series[int]
    amount: Series[float]
    amount_with_tax: Series[float]

@pa.check_types
def add_tax(df: DataFrame[InputSchema]) -> DataFrame[OutputSchema]:
    df = df.copy()
    df["amount_with_tax"] = df["amount"] * 1.08
    return df

# Raises SchemaError if input or output doesn't match
validated = add_tax(my_df)

@pa.check_io for explicit input/output validation:

import pandera as pa

input_schema = pa.DataFrameSchema({"x": pa.Column(int)})
output_schema = pa.DataFrameSchema({"y": pa.Column(int)})

@pa.check_io(df=input_schema, out=output_schema)
def transform(df):
    df = df.copy()
    df["y"] = df["x"] * 2
    return df.drop(columns=["x"])

@pa.check_input for input only (useful when output is hard to specify):

@pa.check_input(input_schema)
def process(df):
    return some_complex_output(df)

Fix 7: Polars Support (pandera 0.19+)

pip install 'pandera[polars]'

import pandera.polars as pla
import polars as pl
from pandera.typing.polars import Series

class UserSchema(pla.DataFrameModel):
    id: Series[int] = pla.Field(gt=0)
    name: Series[str]
    age: Series[int] = pla.Field(in_range={"min_value": 0, "max_value": 120})

df = pl.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "age": [30, 25, 45],
})

UserSchema.validate(df)   # Validates a Polars DataFrame

Lazy Polars validation (works on LazyFrame):

lazy_df = pl.scan_csv("data.csv")
validated = UserSchema.validate(lazy_df)   # Returns validated LazyFrame
result = validated.collect()   # Execute the plan

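Validating a LazyFrame does not defer checks to .collect(). By default Pandera runs only schema-level checks (column names and dtypes) on a LazyFrame so the query plan is not materialized. Data-level checks run when you validate a pl.DataFrame, or on a LazyFrame when the PANDERA_VALIDATION_DEPTH environment variable is set to SCHEMA_AND_DATA.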

For Polars-specific patterns and errors, see Polars not working.

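Polars check syntax differs from Pandas:

import polars as pl
import pandera.polars as pla
from pandera.polars import PolarsData
from pandera.typing.polars import Series

class Schema(pla.DataFrameModel):
    amount: Series[float]

    @pla.check("amount")
    def non_negative(cls, data: PolarsData) -> pl.LazyFrame:
        # data.lazyframe is the frame under validation, data.key the column name
        return data.lazyframe.select(pl.col(data.key) >= 0)

Polars checks receive a PolarsData object (the LazyFrame plus the column key) instead of a pandas Series, and return a LazyFrame of booleans built from Polars expressions.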

Fix 8: Schema Inference and Generation

Bootstrap a schema from existing data:

import pandera as pa
import pandas as pd

df = pd.read_csv("data.csv")

# Infer schema from the DataFrame
schema = pa.infer_schema(df)
print(schema)
# DataFrameSchema(
#     columns={
#         "id": Column(int64, ...),
#         "name": Column(str, ...),
#         ...
#     }
# )

# Save for version control
with open("schema.yaml", "w") as f:
    f.write(schema.to_yaml())

# Load later
schema = pa.DataFrameSchema.from_yaml("schema.yaml")

Generate Python code from an inferred schema:

# Serialization helpers need the io extra: pip install 'pandera[io]'
schema = pa.infer_schema(df)
print(schema.to_script())   # Prints equivalent DataFrameSchema code to paste into a module and edit

Common Mistake: Using inferred schemas as-is in production. Inferred schemas are a starting point — they reflect the data you inferred from, including its quirks (specific values in an isin check, narrow ranges). Always review and tighten the inferred schema before committing it as your contract.

Fix 9: Pandera vs Other Runtime Data Validators

Picking the right validator depends on what shape your data is in when it arrives. Most “Pandera not working” issues come from forcing Pandera on data that another tool handles natively.

Pandera vs Great Expectations. Both validate DataFrames. Great Expectations is heavier — it ships with a docs site, a “Data Docs” generator, a YAML-based suite configuration, and an opinionated workflow for cataloguing expectations across many tables. Pandera is lighter: schemas live as Python classes, validation runs inline, no separate config artifact. Choose Great Expectations for organization-wide data quality programs where non-engineers contribute expectations and need browsable docs. Choose Pandera when validation is part of the code itself (pipeline functions, model inputs, API boundaries) and you don’t want a parallel artifact to maintain.

Pandera vs Pydantic. Pandera is column-first (validates DataFrames); Pydantic is record-first (validates one dict at a time). For row-level objects (API request payloads, NoSQL documents, queue messages), Pydantic wins — its serialization, JSON Schema export, and error messages are richer. For DataFrames, Pydantic forces you to iterate row-by-row, which is slow on millions of rows and loses column-level summary stats. You can combine them: validate individual row records with Pydantic at API ingress, then assemble into a DataFrame and validate structure + cross-row invariants with Pandera. For Pydantic-specific validation patterns, see Pydantic validation error.

Pandera vs dataclass + manual validate. Some teams resist adding a validation library and write @dataclass + a __post_init__ validator. For tiny scripts, this works. The pain shows up when (a) you need lazy validation (manual __post_init__ raises on first error, like Pandera’s eager mode), (b) you want JSON Schema export for OpenAPI, or (c) you need cross-row checks (which dataclasses can’t express). Once any of these matter, Pandera or Pydantic is less code in the long run.

Pandera vs Polars expressions. Polars itself has pl.Expr.is_in, pl.Expr.is_between, and other check expressions. You can hand-build validation with df.select(pl.col("age").is_between(0, 120).all()) and assert on the result. It’s fast (runs in Rust) but you reimplement schema-as-data and lose error reporting. Pandera’s Polars backend gets you the same speed plus structured failure_cases.

Pandera vs DataFusion / Arrow-native validation. If your pipeline is Arrow-first (DataFusion, DuckDB, Polars LazyFrame) and you don’t want a pandas detour, Pandera’s Polars backend (Fix 7) is the only mature option. DataFusion has SQL-level CHECK constraints but no row-by-row validation report. For pure Arrow pipelines, run validation at the Polars layer.

Runtime data validation comparison:

Tool | DataFrame | Record | Lazy errors | Schema as code | Docs gen
Pandera | Yes (pandas, Polars) | No | Yes | Yes | Limited
Great Expectations | Yes (pandas, Spark, SQL) | No | Yes | Mixed (YAML + Python) | Yes (Data Docs)
Pydantic | No (row-only) | Yes | Limited | Yes | OpenAPI export
dataclass + validate | No | Yes (manual) | No | Yes | None
Polars exprs | Yes (Polars only) | No | No (assertions) | No | None
DataFusion CHECK | SQL only | No | No | SQL DDL | None

Pro Tip: Don’t use two DataFrame validators in the same pipeline. Pandera and Great Expectations doing the same checks on the same DataFrame doubles the runtime and splits the error surface. Pick one per pipeline, keep its schemas versioned, and add the other only if you have a clear reason (e.g., GE for org-wide reporting, Pandera for inline runtime checks).

Still Not Working?

Integration with pytest

import pytest
import pandas as pd
import pandera as pa
from pandera.typing import Series

class Schema(pa.DataFrameModel):
    x: Series[int] = pa.Field(gt=0)

def test_schema_valid():
    df = pd.DataFrame({"x": [1, 2, 3]})
    Schema.validate(df)

def test_schema_invalid():
    df = pd.DataFrame({"x": [-1, 2, 3]})
    with pytest.raises(pa.errors.SchemaError):
        Schema.validate(df)

def test_schema_lazy():
    df = pd.DataFrame({"x": [-1, 0, 3]})
    with pytest.raises(pa.errors.SchemaErrors) as exc:
        Schema.validate(df, lazy=True)
    assert len(exc.value.failure_cases) == 2   # Two values fail

For lazy validation, assert on SchemaErrors (plural) with pytest.raises; eager mode raises the singular SchemaError. Mixing the two up is a common source of tests that pass or fail for the wrong reason.

Integration with dbt Tests

Pandera fits alongside dbt’s SQL-level tests: use Pandera for Python transformations, dbt tests for warehouse data. They validate different layers of the same pipeline. For dbt-specific test patterns, see dbt not working.

Integration with Pandas and NumPy

# When Pandera's dtype error is confusing, inspect pandas dtypes directly
print(df.dtypes)
print(df["col"].dtype.name)

NumPy-level dtype issues frequently surface as Pandera errors — when in doubt, inspect df["col"].dtype.name directly before assuming the schema is wrong.
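
Two of the errors at the top of this article (a column-name casing mismatch and a datetime64[us] column) can be diagnosed and normalized with pandas alone. A minimal sketch, assuming pandas 2.x for non-nanosecond datetimes; the frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical frame reproducing both symptoms
df = pd.DataFrame({
    "Email": ["alice@example.com"],
    "created_at": pd.to_datetime(["2024-01-01"]),
})
df["created_at"] = df["created_at"].astype("datetime64[us]")  # pandas 2.x

print(list(df.columns))  # inspect names and dtypes before blaming the schema
print(df.dtypes)

# Normalize column-name casing and datetime resolution before validating
df = df.rename(columns=str.lower)
df["created_at"] = df["created_at"].astype("datetime64[ns]")
```

Whether to normalize the data or relax the schema is a design choice; normalizing at the pipeline boundary keeps the schema strict for everything downstream.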

Nested Schemas and Composition

Schemas inherit cleanly — define common fields in a base schema and extend for specific variants:

class BaseEvent(pa.DataFrameModel):
    id: Series[int] = pa.Field(gt=0)
    user_id: Series[int] = pa.Field(gt=0)
    timestamp: Series[pd.DatetimeTZDtype]

class PurchaseEvent(BaseEvent):
    product_id: Series[int]
    amount: Series[float] = pa.Field(gt=0)
    currency: Series[str] = pa.Field(isin=["USD", "EUR", "JPY"])

class LoginEvent(BaseEvent):
    ip_address: Series[str] = pa.Field(str_matches=r"^\d+\.\d+\.\d+\.\d+$")
    user_agent: Series[str]

# Both schemas include id, user_id, timestamp from BaseEvent
PurchaseEvent.validate(purchase_df)
LoginEvent.validate(login_df)

Composition keeps common constraints DRY. Override specific fields in subclasses to tighten or relax rules.

Error Message Customization

Default error messages sometimes lack context for business users. Customize:

class OrderSchema(pa.DataFrameModel):
    amount: Series[float] = pa.Field(
        gt=0,
        description="Total order amount in cents",
    )

    @pa.check("amount", error="Amount exceeds $10,000 fraud threshold")
    def under_fraud_threshold(cls, amount):
        return amount <= 1_000_000

Error messages appear in the failure_cases DataFrame when using lazy validation — helpful for non-technical stakeholders reviewing data quality reports.

Sync and Async Validation

Pandera is synchronous. For async pipelines, validate in a background thread:

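import asyncio

async def validate_async(df, schema):
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, schema.validate, df)

async def main(df):
    # 'await' is only valid inside a coroutine; MySchema is your own DataFrameModel
    return await validate_async(df, MySchema.to_schema())

validated_df = asyncio.run(main(df))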

Validation can be CPU-intensive on large DataFrames — offloading to an executor prevents blocking the event loop.

Sampling Validation on Huge DataFrames

Full validation on a 50M-row DataFrame is slow. For staging-pipeline sanity checks, sample first:

import pandas as pd
import pandera as pa

class Schema(pa.DataFrameModel):
    user_id: pa.typing.Series[int] = pa.Field(gt=0)
    amount: pa.typing.Series[float] = pa.Field(ge=0)

def validate_sample(df, schema, n=10_000):
    if len(df) <= n:
        return schema.validate(df, lazy=True)
    sample = df.sample(n=n, random_state=42)
    return schema.validate(sample, lazy=True)

validate_sample(huge_df, Schema)

This catches gross schema breakage (column missing, wrong dtype) in seconds. Run full validation only in nightly batch or pre-deploy gates — not on every interactive transformation.

Versioning Schemas Across Pipeline Changes

When the upstream schema changes (a new optional column, a renamed field), keeping schemas in source control with semantic versions is the fastest debug path:

# schemas/v1.py
class OrderV1(pa.DataFrameModel):
    id: pa.typing.Series[int]
    amount: pa.typing.Series[float]

# schemas/v2.py — added 'currency'
class OrderV2(OrderV1):
    currency: pa.typing.Series[str] = pa.Field(isin=["USD", "EUR"])

def load_orders(version="v2") -> pa.DataFrameSchema:
    if version == "v1":
        return OrderV1.to_schema()
    return OrderV2.to_schema()

When a producer downgrades or upgrades, you point readers at the matching version. Without explicit versioning, schema drift causes mysterious test failures months after the actual change.

Combining Pandera With pytest Parametrization

A common pattern is testing multiple schema variants with the same fixture DataFrame:

import pytest
import pandas as pd
import pandera as pa

@pytest.fixture
def base_df():
    return pd.DataFrame({"x": [1, 2, 3]})

@pytest.mark.parametrize("min_val,expected", [
    (0, True),
    (5, False),
])
def test_min_threshold(base_df, min_val, expected):
    schema = pa.DataFrameSchema({"x": pa.Column(int, pa.Check.ge(min_val))})
    if expected:
        schema.validate(base_df)
    else:
        with pytest.raises(pa.errors.SchemaError):
            schema.validate(base_df)

For pytest fixture patterns that often combine with DataFrame tests, see pytest fixture not found.

FixDevs

Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.
