-->
Written by: Marlon Colca
Posted on 12 May 2025 - 3 months ago
python pandas analytics
Learn how to clean messy data using Pandas. We'll fix missing values, rename columns, convert data types, and prepare our dataset for analysis.
Messy data is the rule, not the exception.
Before any serious analysis or machine learning, we need to make sure the data is clean, consistent, and usable.
In this post, we’ll go over how to:
✅ Fix missing values
✅ Rename columns
✅ Convert data types
✅ Clean strings and categories
Let’s get to work.
We’ll use the same dataset from before (prices_sample.csv
).
import pandas as pd
df = pd.read_csv("prices_sample.csv")
# Check for missing values
print(df.isnull().sum())
Even if this dataset is clean, it’s good practice to check. In real-world data, missing prices, categories, or names are super common.
You have a few options here:
df_clean = df.dropna()
df["price"] = df["price"].fillna(0)
df["brand"] = df["brand"].fillna("Unknown")
df["price"] = df["price"].fillna(df["price"].mean())
Make column names consistent, lowercase, and readable:
# Rename with a dictionary
df.rename(columns={"product_name": "name", "in_stock": "available"}, inplace=True)
# Or make all lowercase
df.columns = [col.lower() for col in df.columns]
Why? Because you don’t want to type Product Name
100 times, or deal with spaces.
Let’s make sure each column has the right type:
# Convert price to float
df["price"] = df["price"].astype(float)
# Convert in_stock to boolean
df["available"] = df["available"].astype(bool)
For dates (in future entries):
df["date"] = pd.to_datetime(df["date"])
Sometimes, text columns come with extra whitespace, wrong casing, or typos.
# Strip whitespace and lowercase everything
df["category"] = df["category"].str.strip().str.lower()
# Standardize brand names
df["brand"] = df["brand"].replace({"Laive ": "Laive", "laive": "Laive"})
You can also remove symbols or accents using regular expressions (str.replace()), or normalize text with unidecode if needed.
Now that we’ve cleaned up, let’s check how things look:
print(df.info())
print(df.head())
.isnull()
, .fillna()
, and .astype()
Next up in the series: ➡️ Filtering, selecting, and slicing your data like a ninja 🥷
We’ll explore .loc
, .iloc
, boolean indexing, and more to really dig into the rows and columns that matter.
See you in Part 4!
Master the art of selecting rows and columns using Pandas. Learn `loc`, `iloc`, conditions, and slicing techniques.