๐Ÿ Python Examples - Comprehensive Code Library
โ† Back to PranavKulkarni.org
Lesson 4 · Data Science

Data Cleaning & Preprocessing

Learn techniques for handling missing data and outliers and for preparing data for analysis.

The Reality of Data

Real-world data is messy. It often contains missing values, duplicates, and inconsistent formatting.

Handling Missing Values

# Find missing values
null_counts = df.isnull().sum()

# Drop rows with any missing values
df_cleaned = df.dropna()

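# Fill missing values with a constant or the column mean.
# Assign the result back: fillna(inplace=True) on a selected column
# may not update df in recent pandas versions.
df["Age"] = df["Age"].fillna(df["Age"].mean())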
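
To see the three operations above on concrete data, here is a minimal sketch. The sample DataFrame and its column names ("Name", "Age", "City") are assumptions made up for illustration; only the pandas calls come from the lesson.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data (assumed for illustration only)
df = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cara", "Dev"],
    "Age": [20.0, np.nan, 40.0, 30.0],
    "City": ["Pune", "Mumbai", None, "Delhi"],
})

# Count missing values per column
null_counts = df.isnull().sum()

# Drop every row that has at least one missing value
df_dropped = df.dropna()

# Fill missing ages with the column mean, working on a copy
# so the original df stays available for comparison
df_filled = df.copy()
df_filled["Age"] = df_filled["Age"].fillna(df_filled["Age"].mean())

print(null_counts)
print(df_dropped)
print(df_filled)
```

Note that dropna() discards two of the four rows here, while fillna() keeps every row. Which trade-off is right depends on how much data you can afford to lose.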

Removing Duplicates

df.drop_duplicates(inplace=True)
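
Before dropping duplicates it often helps to check how many there are, and to decide whether "duplicate" means the whole row or only certain columns. The sketch below uses made-up sample data; the `subset` and `keep` arguments are standard `drop_duplicates` parameters.

```python
import pandas as pd

# Hypothetical sample data (assumed for illustration only)
df = pd.DataFrame({
    "Name": ["Ana", "Ben", "Ana", "Ben"],
    "City": ["Pune", "Mumbai", "Pune", "Delhi"],
})

# Boolean mask: True for rows that exactly repeat an earlier row
dup_mask = df.duplicated()
n_dups = int(dup_mask.sum())

# Keep the first occurrence of each fully duplicated row
df_unique = df.drop_duplicates()

# Treat rows as duplicates when only the "Name" column matches
df_unique_names = df.drop_duplicates(subset=["Name"], keep="first")

print(n_dups)
print(df_unique)
print(df_unique_names)
```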

Data Type Conversion

import pandas as pd

# Convert column to datetime
df["Date"] = pd.to_datetime(df["Date"])

# Convert column to numeric; values that cannot be parsed become NaN
df["Price"] = pd.to_numeric(df["Price"], errors="coerce")
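
The effect of errors="coerce" is easiest to see on data that contains a few bad entries. This is a small sketch with invented values. The explicit `format="%Y-%m-%d"` is an assumption about how the dates are written, so adjust it to match your data.

```python
import pandas as pd

# Hypothetical sample data with one unparseable value per column
df = pd.DataFrame({
    "Date": ["2024-01-15", "2024-02-01", "not a date"],
    "Price": ["19.99", "N/A", "5"],
})

# Unparseable dates become NaT instead of raising an error
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d", errors="coerce")

# Unparseable numbers become NaN instead of raising an error
df["Price"] = pd.to_numeric(df["Price"], errors="coerce")

# Count how many values failed to convert in each column
bad_dates = int(df["Date"].isna().sum())
bad_prices = int(df["Price"].isna().sum())

print(df.dtypes)
print(bad_dates, bad_prices)
```

Coercion hides bad values as missing ones, so counting them afterwards (as above) is a cheap way to check that the conversion did not silently discard more data than expected.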

Renaming and Replacing

df.rename(columns={"old_name": "new_name"}, inplace=True)
# Assign the result back instead of calling replace(inplace=True) on a selected column
df["Gender"] = df["Gender"].replace({"M": "Male", "F": "Female"})
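
A short runnable sketch of both operations follows. The sample values, including the extra "X" entry, are assumptions chosen to show that values missing from the mapping are left unchanged.

```python
import pandas as pd

# Hypothetical sample data (assumed for illustration only)
df = pd.DataFrame({
    "old_name": [1, 2, 3],
    "Gender": ["M", "F", "X"],
})

# Rename a column; rename() returns a new DataFrame
df = df.rename(columns={"old_name": "new_name"})

# Map short codes to full labels; "X" is not in the mapping and stays as-is
df["Gender"] = df["Gender"].replace({"M": "Male", "F": "Female"})

print(df)
```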

✅ Practice (20 minutes)

  • Create a DataFrame with several missing values (np.nan).
  • Fill numeric columns with the median and categorical columns with the mode.
  • Check for duplicate rows and remove them.
  • Use str.strip() and str.lower() to clean string columns.
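
If you get stuck, here is one possible way to approach the tasks above. Treat it as a sketch, not the only solution. The sample data and column names are invented, and cleaning the strings before checking for duplicates is a design choice made here so that " Ana " and "ana" would count as the same value.

```python
import numpy as np
import pandas as pd

# Hypothetical practice data with missing values and messy strings
df = pd.DataFrame({
    "Name": [" Ana ", "BEN", "Cara", "ben "],
    "Age": [20.0, np.nan, 40.0, np.nan],
    "City": ["Pune", "Mumbai", None, "Mumbai"],
})

# Clean string columns: trim whitespace and lowercase
for col in ["Name", "City"]:
    df[col] = df[col].str.strip().str.lower()

# Numeric column: fill missing values with the median
df["Age"] = df["Age"].fillna(df["Age"].median())

# Categorical column: fill missing values with the most frequent value
df["City"] = df["City"].fillna(df["City"].mode()[0])

# Count duplicate rows, then remove them
n_dups = int(df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)

print(n_dups)
print(df)
```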