Clean and Validate Your Data Using Pandera
Stop wasting time on dirty data! Learn how to clean it up in minutes with Pandera.

Image by Author | Canva
When working with data, it's important to perform checks to make sure our data isn't dirty or invalid, like checking for missing values, wrong data types, or values that fall outside the range a column allows. These checks are essential because bad data can lead to wrong analysis, failed models, and a lot of wasted time and resources.
You've probably already seen the usual way of cleaning and validating data with plain old Pandas, but in this tutorial I want to show you something better: a powerful Python library called Pandera. Pandera offers a flexible and expressive API for performing data validation on DataFrame-like objects, and it scales much better than writing ad hoc checks by hand. You create schemas that define how your data is supposed to look: structure, data types, and rules. Pandera then checks your data against those schemas and flags anything that doesn't fit, so you can catch and fix issues early instead of running into problems later.
This guide assumes you already know a bit of Python and Pandas. Let’s walk through the step-by-step process of using Pandera in your workflows.
Step 1: Setting Up Your Environment
First, you need to install the necessary packages:
pip install pandera pandas
After installation, import the required libraries and verify installation:
import pandas as pd
import pandera as pa
print("pandas version:", pd.__version__)
print("pandera version:", pa.__version__)
This should display the versions of pandas and Pandera, confirming they're installed correctly (your exact version numbers will likely differ):
pandas version: 2.2.2
pandera version: 0.0.0+dev0
Step 2: Creating a Sample Dataset
Let’s create a sample dataset of customer information with intentional errors to demonstrate cleaning and validation:
import pandas as pd
# Customer dataset with errors
data = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, "invalid"],  # "invalid" is not an integer
    "name": ["Maryam", "Jane", "", "Alice", "Bobby"],  # Empty name
    "age": [25, -5, 30, 45, 35],  # Negative age is invalid
    "email": ["mrym@gmail.com", "jane.s@yahoo.com", "invalid_email", "alice@google.com", None]  # Invalid email and None
})
print("Original DataFrame:")
print(data)
Output:
Original DataFrame:
  customer_id    name  age             email
0           1  Maryam   25    mrym@gmail.com
1           2    Jane   -5  jane.s@yahoo.com
2           3           30     invalid_email
3           4   Alice   45  alice@google.com
4     invalid   Bobby   35              None
Issues in the dataset:
- customer_id: Contains a string ("invalid") instead of integers.
- name: Has an empty string.
- age: Includes a negative value (-5).
- email: Has an invalid format (invalid_email) and a missing value (None).
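If you want to confirm these problems programmatically before writing any schema, a few plain pandas checks will surface them. This snippet is just a quick sanity check, not part of the Pandera workflow itself:
# Quick pandas sanity checks that surface the issues listed above
print(data.dtypes)                               # customer_id and email come back as object
print(data["email"].isna().sum())                # one missing email
print((data["age"] < 0).sum())                   # one negative age
print((data["name"].str.strip() == "").sum())    # one empty name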
Step 3: Defining a Pandera Schema
A Pandera schema defines the expected structure and constraints for the DataFrame. We’ll use DataFrameSchema to specify rules for each column:
import pandera as pa
from pandera import Column, Check, DataFrameSchema
# Define the schema
schema = DataFrameSchema({
    "customer_id": Column(
        dtype="int64",  # Use int64 for consistency
        checks=[
            Check.isin(range(1, 1000)),  # IDs between 1 and 999
            Check(lambda x: x > 0, element_wise=True)  # IDs must be positive
        ],
        nullable=False
    ),
    "name": Column(
        dtype="string",
        checks=[
            Check.str_length(min_value=1),  # Names cannot be empty
            Check(lambda x: x.strip() != "", element_wise=True)  # No empty strings
        ],
        nullable=False
    ),
    "age": Column(
        dtype="int64",
        checks=[
            Check.greater_than(0),  # Age must be positive
            Check.less_than_or_equal_to(120)  # Age must be reasonable
        ],
        nullable=False
    ),
    "email": Column(
        dtype="string",
        checks=[
            Check.str_matches(r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$")  # Email regex
        ],
        nullable=False
    )
})
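One option worth knowing about: a Pandera Column accepts a coerce argument that converts values to the declared dtype during validation, which can save a manual astype later. Here is a minimal sketch of what the age column could look like with coercion enabled (not used in the rest of this tutorial):
# Variant of the "age" column that coerces values (e.g. the string "45") to int64
# before running the checks, instead of failing on dtype; shown only as an option
age_with_coercion = Column(
    dtype="int64",
    checks=[
        Check.greater_than(0),
        Check.less_than_or_equal_to(120)
    ],
    coerce=True,
    nullable=False
)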
Step 4: Initial Validation
Now, let's validate our DataFrame against the schema. Pandera provides the validate method to check whether the data conforms to the schema. Set lazy=True so Pandera collects every error instead of stopping at the first one:
print("\nInitial Validation:")
try:
    validated_df = schema.validate(data, lazy=True)
    print("Data is valid!")
    print(validated_df)
except pa.errors.SchemaErrors as e:
    print("Validation failed with these problems:")
    print(e.failure_cases[['column', 'check', 'failure_case', 'index']])
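    # Optional: e.failure_cases is a regular DataFrame, so you can also persist it
    # for later review (the CSV filename below is just an illustrative choice, not
    # something from the original workflow)
    e.failure_cases.to_csv("validation_errors.csv", index=False)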
The validation will fail because of the issues in our dataset. The error message will look something like this:
Output:
Initial Validation:
Validation failed with these problems:
        column                                              check  \
0  customer_id                               isin(range(1, 1000))
1         name                                str_length(1, None)
2         name
3          age                                    greater_than(0)
4        email                                       not_nullable
5        email  str_matches('^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\....
6  customer_id                                     dtype('int64')
7  customer_id
8         name                            dtype('string[python]')
9        email                            dtype('string[python]')

                                        failure_case index
0                                            invalid     4
1                                                        2
2                                                        2
3                                                 -5     1
4                                               None     4
5                                      invalid_email     2
6                                             object  None
7  TypeError("'>' not supported between instances...  None
8                                             object  None
9                                             object  None
Step 5: Cleaning the Data
Now that we’ve identified the issues, let’s clean the data to make it conform to the schema. We’ll handle each issue step by step:
- customer_id: Remove rows with non-integer or invalid IDs
- name: Remove rows with empty names
- age: Remove rows with negative or unreasonable ages
- email: Remove rows with invalid or missing emails
# Step 5: Clean the data
# Step 5a: Clean customer_id (convert to int and filter valid IDs)
data["customer_id"] = pd.to_numeric(data["customer_id"], errors="coerce") # Convert to numeric, invalid to NaN
data = data[data["customer_id"].notna()] # Remove NaNs first
data = data[data["customer_id"].isin(range(1, 1000))] # Filter valid IDs
data["customer_id"] = data["customer_id"].astype("int64") # Force int64
# Step 5b: Clean name (remove empty or whitespace-only names)
data = data[data["name"].str.strip() != ""]
data["name"] = data["name"].astype("string[python]")
# Step 5c: Clean age (keep positive and reasonable ages)
data = data[data["age"] > 0]
data = data[data["age"] <= 120]
# Step 5d: Clean email (remove invalid or missing emails)
data = data[data["email"].notna()]
data = data[data["email"].str.match(r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$")]
data["email"] = data["email"].astype("string[python]")
# Display cleaned data
print("Cleaned DataFrame:")
print(data)
After cleaning, the DataFrame should look like this:
Output:
Cleaned DataFrame:
   customer_id    name  age             email
0            1  Maryam   25    mrym@gmail.com
3            4   Alice   45  alice@google.com
Step 6: Re-Validating the Data
Let’s re-validate the cleaned DataFrame to ensure it now conforms to the schema:
print("\nFinal Validation:")
try:
    validated_df = schema.validate(data, lazy=True)
    print("Cleaned data is valid!")
    print(validated_df)
except pa.errors.SchemaErrors as e:
    print("Validation failed after cleaning. Errors:")
    print(e.failure_cases[['column', 'check', 'failure_case', 'index']])
Output:
Final Validation:
Cleaned data is valid!
   customer_id    name  age             email
0            1  Maryam   25    mrym@gmail.com
3            4   Alice   45  alice@google.com
The validation passes, confirming that our cleaning steps resolved all issues.
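As a side note, depending on your Pandera version, some of this manual cleaning can be delegated to the schema itself: schema-level coerce=True converts dtypes during validation, and newer releases also support a drop_invalid_rows=True option that, combined with lazy=True, removes failing rows instead of raising. Exact behavior varies by version and by the kind of failure, so treat the following as a rough sketch rather than a drop-in replacement for the steps above:
# Rough sketch only: assumes a Pandera release that supports drop_invalid_rows
# (dtype/coercion failures may still raise rather than drop rows)
lenient_schema = DataFrameSchema(
    schema.columns,          # reuse the column definitions from the schema above
    coerce=True,             # coerce dtypes during validation
    drop_invalid_rows=True   # drop rows that fail element-wise checks
)
# auto_cleaned = lenient_schema.validate(raw_data, lazy=True)  # raw_data is a hypothetical uncleaned DataFrame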
Step 7: Building a Reusable Pipeline
To make your workflow reusable, you can encapsulate the cleaning and validation in a pipeline like this:
def process_data(df, schema):
    """
    Process and validate a DataFrame using a Pandera schema.

    Args:
        df: Input pandas DataFrame
        schema: Pandera DataFrameSchema

    Returns:
        Validated and cleaned DataFrame, or None if validation fails
    """
    # Create a copy for cleaning
    data_clean = df.copy()

    # Clean customer_id
    data_clean["customer_id"] = pd.to_numeric(data_clean["customer_id"], errors="coerce")
    data_clean = data_clean[data_clean["customer_id"].notna()]
    data_clean = data_clean[data_clean["customer_id"].isin(range(1, 1000))]
    data_clean["customer_id"] = data_clean["customer_id"].astype("int64")

    # Clean name
    data_clean = data_clean[data_clean["name"].str.strip() != ""]
    data_clean["name"] = data_clean["name"].astype("string")

    # Clean age
    data_clean = data_clean[data_clean["age"] > 0]
    data_clean = data_clean[data_clean["age"] <= 120]
    data_clean["age"] = data_clean["age"].astype("int64")

    # Clean email
    data_clean = data_clean[data_clean["email"].notna()]
    data_clean = data_clean[data_clean["email"].str.match(r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$")]
    data_clean["email"] = data_clean["email"].astype("string")

    # Reset index
    data_clean = data_clean.reset_index(drop=True)

    # Validate
    try:
        validated_df = schema.validate(data_clean, lazy=True)
        print("Data processing successful!")
        return validated_df
    except pa.errors.SchemaErrors as e:
        print("Validation failed after cleaning. Errors:")
        print(e.failure_cases[['column', 'check', 'failure_case', 'index']])
        return None
# Test the pipeline
print("\nTesting Pipeline:")
final_df = process_data(data, schema)
print("Final Processed DataFrame:")
print(final_df)
Output:
Testing Pipeline:
Data processing successful!
Final Processed DataFrame:
   customer_id    name  age             email
0            1  Maryam   25    mrym@gmail.com
1            4   Alice   45  alice@google.com
This pipeline can now be reused for any other dataset that is expected to follow the same schema.
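For instance, here is a hypothetical second batch of customers run through the same pipeline; the names and emails below are made up purely for illustration:
# Hypothetical new batch of customers, reusing the same pipeline and schema
new_batch = pd.DataFrame({
    "customer_id": [10, 11],
    "name": ["Sara", "Omar"],
    "age": [29, 41],
    "email": ["sara@example.com", "omar@example.com"]
})
processed_batch = process_data(new_batch, schema)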
Conclusion
Pandera is a powerful tool for ensuring data quality in your pandas workflows. By defining schemas, you can catch errors early, enforce consistency, and automate data cleaning. In this article, we:
- Installed Pandera and set up a sample dataset
- Defined a schema with rules for data types and constraints
- Validated the data and identified issues
- Cleaned the data to conform to the schema
- Re-validated the cleaned data
- Built a reusable pipeline for processing data
Pandera also offers advanced features for complex validation scenarios, such as class-based schemas, cross-field validation, partial validation, and more, which you can explore in the official Pandera documentation.
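As a small taste of the class-based style, here is a minimal sketch that mirrors the DataFrameSchema defined earlier. It assumes a recent Pandera release where the class-based API is called DataFrameModel, and the exact Field arguments may differ slightly across versions:
# Class-based version of the same schema (sketch; assumes a recent Pandera release)
from pandera.typing import Series

class CustomerModel(pa.DataFrameModel):
    # Same constraints as the DataFrameSchema above, expressed as typed fields
    customer_id: Series[int] = pa.Field(gt=0, lt=1000)
    name: Series[str] = pa.Field(str_length={"min_value": 1})
    age: Series[int] = pa.Field(gt=0, le=120)
    email: Series[str] = pa.Field(str_matches=r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$")

# Usage mirrors DataFrameSchema: CustomerModel.validate(data, lazy=True)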
Kanwal Mehreen is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook "Maximizing Productivity with ChatGPT". As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She's also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.