Updated: Mar 29, 2026 | 7 min

The Cleaning Machine: 6 Steps to Trustworthy Data with Pandas

Master the 6-step framework for data wrangling. Learn to handle missing values, remove outliers using IQR, and validate data quality using Python and Pandas.


Once your environment is ready and you know how to run Python code, the next crucial step in any data science project is cleaning the data, often called data wrangling or data pre-processing.

Real-world data is messy. It’s incomplete, inconsistent, duplicated, noisy, badly formatted, or simply wrong.

This chapter will give you a practical framework, your “Cleaning Machine”, to turn raw, chaotic data into something trustworthy.

Why Data Cleaning Matters

Imagine trying to build a machine-learning model with flawed data. Models learn from the data that we feed them, so bad data leads to bad results. Even the most advanced algorithms can’t fix:

  • Duplicated entries
  • Wrong categorical labels
  • Missing or impossible values
  • Unexpected formatting
  • Outliers that distort your analysis

Cleaning is often 80% of the work in data science… and also where the biggest improvements happen.

In this chapter, we will walk through the cleaning steps and learn how to apply them with Python tools like Pandas.

The Data

Before we begin with the cleaning steps, we first need to have data. Kaggle is a website that offers a wide range of datasets for you to use in practising data science. For this post, we will not use a specific dataset, so feel free to choose any one you like.

Now that we have downloaded a dataset, we need to start by adding our Python packages to the notebook. Add a code cell with the following import statements, and make sure to run the cell so that the packages are available in our notebook:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from scipy import stats

Loading the Dataset

How you load your dataset depends on where you’re running Python.

1. Google Colab

To load data from Google Drive, we first need to mount it to our notebook. We can do that by running the following code:

from google.colab import drive
drive.mount('/content/drive')

You will see a link. Log in to your Google account, copy the authorisation code, and paste it back into Colab.

After this step, your Drive files are available at:

/content/drive/MyDrive/

Now we can load the dataset. If your dataset is inside a folder named datasets in Google Drive:

df = pd.read_csv("/content/drive/MyDrive/datasets/data.csv")

# Quickly inspect the first few rows of the DataFrame
df.head()

2. Locally (Jupyter, VS Code, etc.)

If your dataset file is in the same folder as your notebook:

# Assuming the file is named data.csv and is in the same directory as the notebook
df = pd.read_csv("data.csv")

# Quickly inspect the first few rows of the DataFrame
df.head()

Note: every time you rerun the dataset-loading cell:

df = pd.read_csv(...)

the data is reloaded from disk into memory, and all previous modifications are lost. This is useful when you want to reset and restart the cleaning process.
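A related pattern that avoids repeated disk reads is to keep an untouched copy of the raw DataFrame and reset from it. A minimal sketch, with toy data standing in for the real CSV:

```python
import pandas as pd

# Toy data stands in for the real CSV (values are illustrative)
df_raw = pd.DataFrame({"price": [10, 20, 30]})

# Work on a copy; df_raw stays pristine
df = df_raw.copy()
df["price"] = df["price"] * 2  # some cleaning/transformation step

# Reset without touching the disk: copy the raw frame again
df = df_raw.copy()
```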

The Six Steps of Data Cleaning

Now that we have loaded the data into our notebook, we can begin with the six steps of data cleaning.

1. Remove Irrelevant Data

Not all collected data is useful to us, and in this step, we will remove columns or rows that do not contribute to the analysis goal. This often includes columns with a single constant value (no variance) or columns with non-essential identifiers.

Before removal, we need to assess our data. We can start by getting an idea of the columns and their data types:

df.info()

If you identify a column (let’s call it irrelevant_id) that is not needed for your analysis, you can permanently remove it using the drop() method:

# columns=["irrelevant_id"] tells drop() which column labels to remove
df = df.drop(columns=["irrelevant_id"])
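To spot the constant columns mentioned above, one option is nunique(), which counts distinct values per column. A small sketch with assumed column names:

```python
import pandas as pd

# Toy frame; "source" is a hypothetical constant column (no variance)
df = pd.DataFrame({
    "price": [10, 20, 30],
    "source": ["web", "web", "web"],
})

# Columns with a single unique value carry no information for the analysis
constant_cols = [col for col in df.columns if df[col].nunique(dropna=False) <= 1]
df = df.drop(columns=constant_cols)
```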

2. Remove Duplicate Data

In this step, we will identify and eliminate rows that are identical copies of others to prevent analysis bias. Duplicates can sneak into datasets during merging, scraping, or logging and may inflate counts and skew statistical summaries.

We first check how many duplicates exist, and then remove them. Pandas makes this a one-line process:

# Check the number of duplicate rows
print(f"Number of duplicates: {df.duplicated().sum()}")

# Delete duplicates
df = df.drop_duplicates()
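Sometimes rows are duplicates only on certain key columns rather than the whole row. The subset parameter of duplicated() and drop_duplicates() covers this case; the column names below are assumptions for illustration:

```python
import pandas as pd

# Toy frame: rows 0 and 1 repeat the same user_id/city pair
df = pd.DataFrame({
    "user_id": [1, 1, 2],
    "city": ["ny", "ny", "la"],
    "note": ["a", "b", "c"],
})

# Count duplicates on the key columns only, then keep the first occurrence
n_dupes = df.duplicated(subset=["user_id", "city"]).sum()
df = df.drop_duplicates(subset=["user_id", "city"], keep="first")
```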

3. Repair Structural Errors

The aim for this step is to fix inconsistencies that arise from human error, different data formats, or typos. This is common in categorical columns (e.g., city names or gender).

Examples:

  • “NY”, “new_york”, and “New York” in the same column
  • Mis-typed numerical values (100O instead of 1000)
  • Inconsistent casing (“Male” vs “male”)

Fixing these often involves:

Normalise text:

df["city"] = df["city"].str.lower().str.replace("_", " ")

Standardise categories:

df["gender"] = df["gender"].replace({
    "M": "Male",
    "male": "Male",
    "F": "Female",
    "female": "Female"
})

Convert data types:

# Convert to numeric; values that cannot be parsed become NaN
df["price"] = pd.to_numeric(df["price"], errors="coerce")
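Before applying replacements like these, it helps to see what is actually in the column; value_counts() surfaces typos and variants at a glance. A small sketch with illustrative city values:

```python
import pandas as pd

# Toy column with inconsistent labels (values are illustrative)
df = pd.DataFrame({"city": ["NY", "new_york", "New York", "boston"]})

# List every raw label with its frequency to spot variants
print(df["city"].value_counts())

# Normalise, then map the remaining abbreviation to a canonical label
df["city"] = df["city"].str.lower().str.replace("_", " ")
df["city"] = df["city"].replace({"ny": "new york"})
```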

4. Address Missing Data / Gaps

Missing data are common and can bias models or cause errors. In this step, we will decide how to handle them.

The method you choose depends on the context. Never treat missing data blindly. But before deciding on a strategy, we must first quantify how much data is missing and where.

# Display the count of missing (NaN) values for every column
df.isnull().sum()
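The counts can be complemented with the share of missing values per column, which makes the drop-versus-impute decision easier. A minimal sketch with toy data:

```python
import pandas as pd

# Toy frame with gaps (column names and values are illustrative)
df = pd.DataFrame({
    "price": [10, None, 30, None],
    "city": ["ny", "la", None, "sf"],
})

# isnull() gives booleans; mean() turns them into the fraction missing
missing_pct = df.isnull().mean() * 100
print(missing_pct)  # price: 50%, city: 25%
```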

  1. Removal (Dropping): Removing the entire row containing the missing value. This is appropriate if only a small percentage of data is missing.

    # Remove rows where ANY column has a missing value (use with caution)
    df = df.dropna()
    
    # Remove rows where the price column has a missing value
    df = df.dropna(subset=["price"])
  2. Imputation (Filling): Replacing the missing value with a calculated value.

    • For numerical columns, you can use the mean, median, or mode. The median is often safer as it is less sensitive to outliers.
      # Fill missing values in a numerical column ('age') with the median
      median_age = df["age"].median()
      df["age"] = df["age"].fillna(median_age)
    • For categorical columns, you can fill with the mode (most frequent value) or a placeholder like “Unknown”.
      # Fill missing values in a categorical column ("status") with "Unknown"
      df["status"] = df["status"].fillna("Unknown")

When working with time-series data, where a pattern or seasonality exists, simple imputation can destroy the underlying structure. This is where the Fourier Transform comes into play. This advanced technique decomposes the time series into its core frequencies (patterns), allowing us to model and predict the missing values from the dominant cyclical and trend components of the data, which gives a much more accurate fill than simple averages.

The essential functions for this are available directly in the NumPy library we already imported (numpy.fft). It involves:

  1. Performing the Fast Fourier Transform (FFT) on the data.
  2. Zeroing out selected high-frequency components (often associated with noise) and potentially using the remaining signal to interpolate the missing points.
  3. Applying the Inverse Fast Fourier Transform (IFFT) to return the smoothed, complete data back to the time domain.

Note: While the core tools are in NumPy, setting up the interpolation logic around the FT requires understanding time series signals and is best suited for advanced analysis.
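The three FFT steps above can be sketched with numpy.fft. This is a toy illustration rather than production code: the signal shape, gap location, and frequency cutoff are all assumptions, and the gap is pre-filled with linear interpolation so the FFT has no NaNs to trip over:

```python
import numpy as np

# Toy seasonal signal with a gap (period, noise level, and gap are assumptions)
t = np.arange(100)
rng = np.random.default_rng(0)
signal = np.sin(2 * np.pi * t / 25) + 0.1 * rng.normal(size=100)
signal[40:45] = np.nan  # simulate missing values

# Pre-fill the gap with linear interpolation so the FFT can run
filled = signal.copy()
mask = np.isnan(filled)
filled[mask] = np.interp(t[mask], t[~mask], filled[~mask])

# 1. Fast Fourier Transform of the (pre-filled) series
spectrum = np.fft.rfft(filled)

# 2. Zero out high-frequency components (keep the 5 lowest bins;
#    the dominant seasonal frequency here sits at bin 100/25 = 4, so it survives)
spectrum[5:] = 0

# 3. Inverse FFT back to the time domain, then use the smoothed curve
#    to replace the originally missing points
smoothed = np.fft.irfft(spectrum, n=len(filled))
signal[mask] = smoothed[mask]
```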

5. Filter Outliers

Outliers can skew descriptive statistics (especially the mean and standard deviation) and distort the training of machine learning models. In this step, we will identify data points that are significantly different from other observations.

A robust way to identify outliers in a numerical column is the Interquartile Range (IQR) method. It flags values that lie more than 1.5 × IQR below the first quartile (Q1) or above the third quartile (Q3).

Before we remove them, a good practice is to visualise them using a Box Plot:

# Create a box plot to visually identify outliers in a column (e.g., "price")
plt.figure(figsize=(8, 6))
sns.boxplot(y=df['price'])
plt.title('Box Plot of Price before Outlier Removal')
plt.show()

Identifying and Removing Outliers using IQR:

# 1. Calculate Q1 (25th percentile) and Q3 (75th percentile) for a column ("price")
Q1 = df["price"].quantile(0.25)
Q3 = df["price"].quantile(0.75)
IQR = Q3 - Q1

# 2. Define the boundaries for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# 3. Filter the DataFrame to keep only the non-outliers
df = df[(df["price"] >= lower_bound) & (df["price"] <= upper_bound)]
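An alternative, since scipy.stats is already imported, is the z-score method: flag values lying more than about 3 standard deviations from the mean. This assumes roughly normally distributed data; the toy prices below are illustrative:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy price column: 20 typical values plus one extreme (values are illustrative)
df = pd.DataFrame({"price": [10, 12, 11, 13, 12] * 4 + [500]})

# z-score: how many standard deviations each value lies from the mean
z = np.abs(stats.zscore(df["price"]))

# Keep only rows within 3 standard deviations of the mean
df = df[z < 3]
```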

6. Validate That the Data Is Correct

In the last step, we need to perform a final check to ensure the cleaned data adheres to the expected rules and formats.

Data validation often involves checking logical conditions specific to your dataset. Your checks should confirm that the steps you just completed actually worked:

| Check | Objective | Python Code Example |
| --- | --- | --- |
| No Negative Values | Ensures numerical data (like age or price) is logically positive | (df["age"] >= 0).all() |
| Within Bounds | Ensures numerical data falls within the expected bounds | df["rating"].between(1, 5).all() |
| Categorical Purity | Ensures the standardised categories are the only values present | df["gender"].isin(["Male", "Female"]).all() |
| No New Nulls | Confirms no new missing values were created during transformations | df["category"].notnull().all() |

If any of these checks return False, we still have remaining errors that need to be addressed. Otherwise, our data is officially considered “clean” and ready for the next phase: Visualisation and Analysis!
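The checks can also be bundled into a single cell that reports exactly which rule failed. A sketch over a toy cleaned frame (column names and values are illustrative):

```python
import pandas as pd

# Toy "cleaned" frame (columns and values are illustrative)
df = pd.DataFrame({
    "age": [25, 40, 31],
    "rating": [3, 5, 4],
    "gender": ["Male", "Female", "Male"],
})

# Each check returns True only if every row satisfies the rule
checks = {
    "no_negative_age": bool((df["age"] >= 0).all()),
    "rating_in_bounds": bool(df["rating"].between(1, 5).all()),
    "gender_is_standardised": bool(df["gender"].isin(["Male", "Female"]).all()),
}

failed = [name for name, passed in checks.items() if not passed]
print("All checks passed!" if not failed else f"Failed checks: {failed}")
```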

Series: Data Science with Python

3 Chapters