Data Cleaning & Preprocessing
Learn techniques for handling missing data and outliers and for preparing data for analysis.
The Reality of Data
Real-world data is messy. It often contains missing values, duplicates, and inconsistent formatting.
Handling Missing Values
# Find missing values
null_counts = df.isnull().sum()
# Drop rows with any missing values
df_cleaned = df.dropna()
# Fill missing values with a constant or the column mean (assign the result back to the column)
df["Age"] = df["Age"].fillna(df["Age"].mean())
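To see what each of these calls does, here is a minimal self-contained sketch on a small made-up DataFrame. The table `people` and its columns are assumptions for illustration only and do not come from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
people = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cara", "Dan"],
    "Age": [30.0, np.nan, 50.0, np.nan],
    "City": ["Oslo", "Rome", None, "Rome"],
})

# Count missing values per column.
null_counts = people.isnull().sum()

# dropna() returns a new DataFrame with only the fully populated rows.
complete_rows = people.dropna()

# Fill on a copy and assign the result back to the column,
# so the original table stays untouched.
filled = people.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())

print(null_counts)
print(complete_rows)
print(filled)
```

Dropping rows is the simplest option but can discard a lot of data: here only one of the four rows has no missing value. Filling keeps every row at the cost of introducing estimated values.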
Removing Duplicates
df.drop_duplicates(inplace=True)
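It usually helps to check how many duplicates exist before removing them. The sketch below uses a hypothetical `orders` table (the name and columns are assumptions) and also shows the `subset` and `keep` parameters of `drop_duplicates`, which the one-liner above leaves at their defaults.

```python
import pandas as pd

# Hypothetical sample data; the second and third rows are identical.
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "customer": ["ann", "bob", "bob", "ann"],
})

# duplicated() marks every row that repeats an earlier row.
duplicate_mask = orders.duplicated()
n_duplicates = int(duplicate_mask.sum())

# Keep the first occurrence of each fully identical row.
deduped = orders.drop_duplicates()

# Compare rows on the "customer" column only, keeping the first match.
one_per_customer = orders.drop_duplicates(subset=["customer"], keep="first")

print(n_duplicates)
print(deduped)
print(one_per_customer)
```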
Data Type Conversion
# Convert column to datetime
df["Date"] = pd.to_datetime(df["Date"])
# Convert column to numeric
df["Price"] = pd.to_numeric(df["Price"], errors="coerce")
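The effect of `errors="coerce"` is easiest to see on data that contains unparseable entries. In this sketch the `raw` table and its values are made up for illustration. Passing `errors="coerce"` to `pd.to_datetime` as well is an addition here, not something the snippet above does.

```python
import pandas as pd

# Hypothetical raw data as it might arrive from a CSV: everything is text.
raw = pd.DataFrame({
    "Date": ["2024-01-05", "2024-02-10", "not a date"],
    "Price": ["19.99", "N/A", "5"],
})

converted = raw.copy()

# Values that cannot be parsed become NaT (dates) or NaN (numbers)
# instead of raising an exception.
converted["Date"] = pd.to_datetime(converted["Date"], errors="coerce")
converted["Price"] = pd.to_numeric(converted["Price"], errors="coerce")

# Count how many entries failed to convert in each column.
n_bad_dates = int(converted["Date"].isna().sum())
n_bad_prices = int(converted["Price"].isna().sum())

print(converted.dtypes)
print(n_bad_dates, n_bad_prices)
```

Coercion hides bad input silently, so counting the resulting missing values right after the conversion is a reasonable sanity check.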
Renaming and Replacing
df.rename(columns={"old_name": "new_name"}, inplace=True)
df["Gender"] = df["Gender"].replace({"M": "Male", "F": "Female"})
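A small end-to-end sketch of both operations follows. The `survey` table is hypothetical. It assigns results back rather than using `inplace=True`, which is one common style and not the only valid one.

```python
import pandas as pd

# Hypothetical sample data using the column names from the snippets above.
survey = pd.DataFrame({
    "old_name": [1, 2, 3],
    "Gender": ["M", "F", "M"],
})

# rename() maps old column labels to new ones; unlisted columns are unchanged.
survey = survey.rename(columns={"old_name": "new_name"})

# replace() maps old cell values to new ones; unlisted values are unchanged.
survey["Gender"] = survey["Gender"].replace({"M": "Male", "F": "Female"})

print(survey)
```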
Practice (20 minutes)
- Create a DataFrame with several missing values (np.nan).
- Fill numeric columns with the median and categorical columns with the mode.
- Check for duplicate rows and remove them.
- Use str.strip() and str.lower() to clean string columns.
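One possible way to work through these steps is sketched below. The `practice` table and its values are invented for illustration. Cleaning the strings before filling and de-duplicating is a design choice made here, not a requirement of the exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical practice data with missing values and untidy strings.
practice = pd.DataFrame({
    "score": [10.0, np.nan, 30.0, 10.0],
    "city": [" Oslo", "rome ", None, " Oslo"],
})

# Clean strings first so that " Oslo" and "oslo" count as the same value.
practice["city"] = practice["city"].str.strip().str.lower()

# Numeric column: fill missing values with the median.
practice["score"] = practice["score"].fillna(practice["score"].median())

# Categorical column: mode() returns a Series, so take its first entry.
practice["city"] = practice["city"].fillna(practice["city"].mode()[0])

# Remove rows that are now exact duplicates and renumber the index.
practice = practice.drop_duplicates().reset_index(drop=True)

print(practice)
```

Doing the string cleanup first matters: values that differ only by whitespace or letter case would otherwise be counted separately by the mode and missed by the duplicate check.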