Dealing with Missing Data & Duplicates



Published Sep 29 2025, updated Oct 24 2025


In real-world datasets, missing data is extremely common. Pandas provides multiple ways to detect, remove, and impute (fill in) missing values.

How to handle missing data depends on the dataset and has to be decided case by case, but these are the methods available.

Best Practices

Always check missing values first
- df.isnull().sum() gives a quick per-column overview.
Don’t drop too eagerly
- Dropping rows may introduce bias if the values are not missing at random.
- Dropping a column is usually reasonable when most of its values are missing.
Choose imputation carefully
- Categorical: use a placeholder such as "Unknown".
- Numeric: use the mean or median, or interpolate.
- Time series: forward or backward fill often works well.
View outputs before applying
- inplace=True: use this only once you are ready to change the original data. Without it, the method returns a new DataFrame and leaves the original untouched.
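
The last point can be sketched as follows: call the method without inplace=True, inspect the returned copy, and only then decide whether to overwrite the original. The small DataFrame and the names raw, preview, and original_missing are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only
raw = pd.DataFrame({
    "Age": [25, np.nan, 35],
    "City": ["New York", None, "Chicago"],
})

# Without inplace=True, dropna() returns a new DataFrame
preview = raw.dropna()
print(preview)  # inspect the result first

# The original still contains its missing values
original_missing = int(raw.isnull().sum().sum())
print(original_missing)

# Once satisfied, reassign explicitly, e.g. raw = preview
```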

Example data for this page's examples:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", None],
    "Age": [25, np.nan, 35, 40],
    "City": ["New York", "Los Angeles", None, "Chicago"]
})

.isnull() / .isna()

Returns a Boolean mask indicating missing values (True = missing).

Example:

df.isnull()

Output:

    Name    Age   City
0  False  False  False
1  False   True  False
2  False  False   True
3   True  False  False

.isnull().sum()

Counts missing values per column.

df.isnull().sum()

Output:

Name    1
Age     1
City    1
dtype: int64
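
Building on the counts above, here is a minimal sketch of two related checks: the share of missing values per column, and the rows that contain at least one missing value. It reuses this page's example data; the names missing_share and rows_with_missing are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", None],
    "Age": [25, np.nan, 35, 40],
    "City": ["New York", "Los Angeles", None, "Chicago"],
})

# Fraction of missing values per column (mean of the Boolean mask)
missing_share = df.isnull().mean()
print(missing_share)

# Rows that contain at least one missing value
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing)
```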

.dropna()

Removes rows or columns with missing values.

df.dropna()

Options:

axis=0 → drop rows (default)
axis=1 → drop columns
how="any" → drop if any value missing (default)
how="all" → drop if all values missing
thresh=n → keep only rows (or columns, with axis=1) that have at least n non-missing values

# Drop rows where all values are missing
df.dropna(how="all")

# Drop columns with any missing values
df.dropna(axis=1, how="any")
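
As a rough sketch of thresh, along with the subset parameter (not covered above), which limits the check to chosen columns. The data is this page's example with one extra, mostly empty row added purely for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", None, None],
    "Age": [25, np.nan, 35, 40, np.nan],
    "City": ["New York", "Los Angeles", None, "Chicago", "Boston"],
})

# Keep rows with at least 2 non-missing values (the last row has only 1)
at_least_two = df.dropna(thresh=2)
print(at_least_two)

# Drop rows only when "Age" is missing; other columns are ignored
age_known = df.dropna(subset=["Age"])
print(age_known)
```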

.fillna()

Fills missing values with a specified value (a single value, or a dict giving one value per column).

df.fillna("Unknown")

Output:

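      Name      Age         City
0    Alice     25.0     New York
1      Bob  Unknown  Los Angeles
2  Charlie     35.0      Unknown
3  Unknown     40.0      Chicago

Note that the string also fills the numeric Age column, which turns it into an object column. To fill only selected columns, pass a dict instead: df.fillna({"Name": "Unknown", "City": "Unknown"}).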

Common strategies:

Constant value: df['City'].fillna('Unknown')
Forward fill: df.ffill() (propagates the last valid value forward; a value stays missing if nothing valid precedes it)
Backward fill: df.bfill() (propagates the next valid value backward; a value stays missing if nothing valid follows it)
The older spellings df.fillna(method="ffill") and df.fillna(method="bfill") are deprecated since pandas 2.1.
Column mean/median/mode (fill the gaps with a typical value of the column; mode() returns a Series, so take its first element with .mode()[0]):

df['Age'].fillna(df['Age'].mean())
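
A minimal sketch comparing these strategies side by side, using the .ffill() and .bfill() methods for the forward and backward fills. The temps series is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical time-ordered readings with gaps at both ends and in the middle
temps = pd.Series([np.nan, 20.0, np.nan, np.nan, 26.0, np.nan])

forward = temps.ffill()    # first value stays NaN: nothing precedes it
backward = temps.bfill()   # last value stays NaN: nothing follows it
median_filled = temps.fillna(temps.median())  # median of 20.0 and 26.0 is 23.0

print(pd.DataFrame({
    "original": temps,
    "ffill": forward,
    "bfill": backward,
    "median": median_filled,
}))
```

Forward and backward fill copy neighbouring values, so they suit ordered data; the median fill ignores order and uses the same value for every gap.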

.replace()

Sometimes missing values are encoded with placeholders ("NA", "?", -999).
You can standardise them with .replace() before handling:

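# on the whole DataFrame
df.replace("?", np.nan, inplace=True)

# on just a column: assign the result back, because inplace=True on a
# selected column may not update df (pandas warns about this chained pattern)
df["City"] = df["City"].replace("?", np.nan)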

You can map multiple replacements at once:

df["Name"] = df["Name"].replace({"NA": "Unknown", "Alice": "Alicia"})

.replace() works on:

Whole DataFrame → df.replace("?", np.nan)
Single column (Series) → df["City"].replace("?", np.nan) (then assign back to df["City"])
Single row → df.loc[row_index].replace(old, new) (then assign back to df.loc[row_index])
Supports single values, lists, or dictionaries for mapping.
inplace=True avoids reassigning when called on the DataFrame itself; for a selected column or row, assign the result back instead.
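
A small sketch of standardising several placeholders before counting missing values. The survey DataFrame and its placeholder values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data where "?", "NA", and -999 all mean "missing"
survey = pd.DataFrame({
    "City": ["Chicago", "?", "NA", "Boston"],
    "Score": [7, -999, 9, -999],
})

# Replace the text placeholders, then the numeric one; each call returns a new DataFrame
cleaned = survey.replace(["?", "NA"], np.nan).replace(-999, np.nan)

# The placeholders are now counted as missing
print(cleaned.isnull().sum())
```

After this step the usual tools (.isnull(), .dropna(), .fillna()) see the placeholders as missing values.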

.interpolate()

Fills missing values (NaN) in a numeric column or DataFrame by estimating values based on other existing values.
Useful when you want to preserve trends rather than simply forward/backward filling or replacing with a constant.

Linear Interpolation (default):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "A": [1, np.nan, 3, np.nan, 5]
})

print(df.interpolate())

Output:

     A
0  1.0
1  2.0   # filled (1 → 3 linear gap)
2  3.0
3  4.0   # filled (3 → 5 linear gap)
4  5.0

Options:

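.interpolate() estimates missing values instead of copying neighbours or filling in a constant.
Default: linear interpolation down each column (axis=0), treating the values as equally spaced.
Handles time-aware interpolation with method="time" when the index is a DatetimeIndex.
Supports other methods such as polynomial, spline, and nearest; these rely on SciPy, and polynomial and spline also need an order argument.
Use limit and limit_direction to control how many consecutive missing values are filled and in which direction.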
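
A brief sketch of limit, limit_direction, and time-aware interpolation. The series and dates below are invented for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Fill at most one consecutive NaN, moving forward from the last valid value
limited = s.interpolate(limit=1)

# Same limit, but fill from both ends of the gap
both = s.interpolate(limit=1, limit_direction="both")

# Unevenly spaced dates: one day, then three days
idx = pd.to_datetime(["2025-01-01", "2025-01-02", "2025-01-05"])
readings = pd.Series([10.0, np.nan, 50.0], index=idx)

by_position = readings.interpolate()           # treats points as equally spaced
by_time = readings.interpolate(method="time")  # weights by elapsed time

print(limited.tolist(), both.tolist())
print(by_position.iloc[1], by_time.iloc[1])
```

With the default method the gap in readings is filled with the midpoint of its neighbours, while method="time" places the estimate a quarter of the way between them, because one of the four elapsed days has passed.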