Dealing with Missing Data & Duplicates
Published Sep 29 2025, updated Sep 30 2025
In real-world datasets, missing data is extremely common. Pandas provides multiple ways to detect, remove, and impute (fill in) missing values.
How you handle missing data depends on the dataset and should be decided case by case; the methods below cover the common options.
Best Practices
- Always check missing values first: df.isnull().sum() gives a quick per-column overview.
- Don’t drop too eagerly:
  - Dropping rows may introduce bias if the values are not missing at random.
  - Dropping columns is reasonable if they are mostly missing.
- Choose imputation carefully:
  - Categorical: use "Unknown".
  - Numeric: use the mean/median, or interpolate.
  - Time series: forward/backward fill works well.
- View outputs before applying inplace=True: without it, a method returns a new DataFrame and leaves the original data untouched. Use inplace=True only once you are ready to change the original data.
Example data for this page's examples:
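A small stand-in dataset can be sketched as follows. The column names (Name, Age, City) and every value are assumptions chosen for illustration; only the City column is named elsewhere on this page.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: values are illustrative only.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dana"],
    "Age": [25, np.nan, 35, np.nan],        # two missing ages
    "City": ["London", "Paris", None, None],  # two missing cities
})

print(df)
```

Both np.nan and None are treated as missing by the methods described below.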
.isnull() / .isna()
Returns a Boolean mask indicating missing values (True = missing). The two names are aliases and behave identically.
Example:
Output:
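As a minimal sketch of what the mask looks like, assuming a small hypothetical DataFrame (the column names and values are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: values are illustrative only.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dana"],
    "Age": [25, np.nan, 35, np.nan],
    "City": ["London", "Paris", None, None],
})

mask = df.isna()            # same result as df.isnull()
print(mask)

# Individual cells of the mask are plain Booleans.
print(mask.loc[1, "Age"])   # True: Bob's age is missing
print(mask.loc[0, "City"])  # False: Alice's city is present
```

The mask has the same shape and labels as the original DataFrame, so it can be used directly for filtering, for example df[df["Age"].isna()].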
.isnull().sum()
Counts missing values per column.
Output:
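A short sketch of the per-column count, again assuming a hypothetical DataFrame with illustrative values:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: values are illustrative only.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dana"],
    "Age": [25, np.nan, 35, np.nan],
    "City": ["London", "Paris", None, None],
})

# True counts as 1 when summed, so this yields missing values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Summing again gives the total number of missing cells.
total_missing = int(missing_per_column.sum())
print(total_missing)  # 4
```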
.dropna()
Removes rows or columns with missing values.
Options:
- axis=0 → drop rows (default)
- axis=1 → drop columns
- how="any" → drop if any value is missing (default)
- how="all" → drop only if all values are missing
- thresh=n → keep rows with at least n non-NA values
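The options above can be compared side by side. This is a sketch on a hypothetical DataFrame (names and values are assumptions) built so that each option gives a different result:

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 2 is entirely missing, row 4 has one value.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", None, "Dana", "Eve"],
    "Age": [25, np.nan, np.nan, 41, np.nan],
    "City": ["London", "Paris", None, "Oslo", None],
})

rows_any = df.dropna()              # keeps only complete rows: 0 and 3
rows_all = df.dropna(how="all")     # drops only the fully empty row 2
rows_thresh = df.dropna(thresh=2)   # keeps rows with >= 2 non-NA values

# axis=1 applies the same logic to columns: keep columns
# that have at least 4 non-NA values (only "Name" here).
cols_thresh = df.dropna(axis=1, thresh=4)

print(list(rows_any.index))      # [0, 3]
print(list(rows_all.index))      # [0, 1, 3, 4]
print(list(rows_thresh.index))   # [0, 1, 3]
print(list(cols_thresh.columns)) # ['Name']
```

None of these calls modifies df; each returns a new DataFrame, which makes it easy to inspect the result before committing to it.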
.fillna()
Fill missing values with a specified value or method.
Output:
Common strategies:
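- Constant value: df['City'].fillna('Unknown')
- Forward fill: df.ffill() propagates the last valid value forward and leaves a value missing if there is no earlier data to fill from. Older code writes this as df.fillna(method="ffill"), which is deprecated since pandas 2.1.
- Backward fill: df.bfill() propagates the next valid value backward and leaves a value missing if there is no later data to fill from. The older form is df.fillna(method="bfill"), deprecated in the same way.
- Column mean/median/mode: populate the field with a summary value computed from the column itself.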
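The strategies listed above might look like this in practice. The DataFrame, its Age column, and all values are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps at the start, middle and end.
df = pd.DataFrame({
    "Age": [np.nan, 20.0, np.nan, 40.0, np.nan],
    "City": ["Paris", None, "Paris", "Oslo", None],
})

# Constant value for a categorical column.
city_filled = df["City"].fillna("Unknown")

# Forward fill: the leading NaN stays, nothing precedes it.
age_ffill = df["Age"].ffill()   # NaN, 20, 20, 40, 40

# Backward fill: the trailing NaN stays, nothing follows it.
age_bfill = df["Age"].bfill()   # 20, 20, 40, 40, NaN

# Summary statistics computed from the column itself.
age_mean = df["Age"].fillna(df["Age"].mean())       # mean is 30.0
age_median = df["Age"].fillna(df["Age"].median())   # median is 30.0
city_mode = df["City"].fillna(df["City"].mode()[0]) # mode is "Paris"

print(city_filled.tolist())
print(age_mean.tolist())  # [30.0, 20.0, 30.0, 40.0, 30.0]
```

mode() returns a Series because several values can tie for most frequent; taking element [0] picks the first of them.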
.replace()
Sometimes missing values are encoded with placeholders ("NA", "?", -999).
You can standardise them with .replace() before handling:
You can map multiple replacements at once:
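Both forms can be sketched on a hypothetical DataFrame that uses the placeholders mentioned above (the column names and values are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical data where missing values are encoded as placeholders.
df = pd.DataFrame({
    "Age": [25, -999, 35],
    "City": ["London", "?", "NA"],
})

# Replace a single placeholder with a real missing value.
step_one = df.replace("?", np.nan)

# Map several placeholders at once with a dictionary.
cleaned = df.replace({"?": np.nan, "NA": np.nan, -999: np.nan})

# After standardising, the usual detection tools see every gap.
print(cleaned.isna().sum())
```

Once the placeholders are real missing values, .dropna(), .fillna() and .interpolate() treat them like any other gap.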
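.replace() works on:
- Whole DataFrame → df.replace("?", np.nan)
- Single column (Series) → df["City"].replace("?", np.nan)
- Single row → df.loc[row_index].replace(old, new) (then reassign back to df.loc[row_index])
Other notes:
- Supports single values, lists, or dictionaries for mapping.
- Can be combined with inplace=True on the DataFrame itself to avoid reassigning. On a selected column or row, reassign the result instead, since an in-place change to the selection may not reach the original DataFrame in recent pandas versions.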
.interpolate()
- Fills missing values (NaN) in a numeric column or DataFrame by estimating them from the existing values around them.
- Useful when you want to preserve trends rather than simply forward/backward filling or replacing with a constant.
Linear Interpolation (default):
Output:
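A minimal sketch of the default behaviour, assuming a hypothetical numeric column called Temp with illustrative values:

```python
import numpy as np
import pandas as pd

# Hypothetical readings with a two-value gap in the middle.
df = pd.DataFrame({"Temp": [10.0, np.nan, np.nan, 40.0]})

# Default method is linear: the gap is filled with evenly
# spaced values between the surrounding known points.
filled = df.interpolate()

print(filled["Temp"].tolist())  # [10.0, 20.0, 30.0, 40.0]
```

A forward fill on the same data would give 10, 10, 10, 40, which flattens the upward trend that interpolation keeps.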
Options:
- .interpolate() estimates missing values instead of just copying or filling constants.
- Default = linear interpolation down each column (axis=0), treating the values as equally spaced.
- Can handle time-aware interpolation (method="time") when using datetime indices.
- Flexible: supports linear, time, polynomial, spline, nearest, etc. (polynomial, spline and nearest require SciPy).
- Use limit and limit_direction for finer control.
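The limit, limit_direction and time options can be sketched as follows. The Series and their values are hypothetical, chosen so the effect of each option is visible:

```python
import numpy as np
import pandas as pd

# Hypothetical series with a three-value gap.
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# limit caps how many consecutive NaNs are filled;
# limit_direction chooses which side the filling starts from.
fwd = s.interpolate(limit=1)                             # 1, 2, NaN, NaN, 5
bwd = s.interpolate(limit=1, limit_direction="backward") # 1, NaN, NaN, 4, 5
both = s.interpolate(limit=1, limit_direction="both")    # 1, 2, NaN, 4, 5

# Time-aware interpolation uses the real spacing of a datetime index.
ts = pd.Series(
    [0.0, np.nan, 30.0],
    index=pd.to_datetime(["2025-01-01", "2025-01-02", "2025-01-04"]),
)
by_position = ts.interpolate()             # midpoint by position: 15.0
by_time = ts.interpolate(method="time")    # one third of the way: about 10.0

print(both.tolist())
print(by_position.iloc[1], by_time.iloc[1])
```

The time method matters when observations are unevenly spaced: here the missing reading is one day after the first point and two days before the last, so it lands closer to the first value.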
.duplicated()
Detect duplicates - all columns:
Returns a Boolean Series marking duplicates as True. By default it checks across all columns and flags every occurrence after the first.
Detect duplicates - select columns:
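Both checks can be sketched on a hypothetical DataFrame (names and values are assumptions) in which one row is an exact repeat and another repeats only the name:

```python
import pandas as pd

# Hypothetical data: row 2 repeats row 0 exactly,
# row 3 repeats only the Name.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Alice"],
    "City": ["London", "Paris", "London", "Oslo"],
})

# All columns must match for a row to count as a duplicate.
dup_all = df.duplicated()

# Only the listed columns are compared.
dup_name = df.duplicated(subset=["Name"])

print(dup_all.tolist())   # [False, False, True, False]
print(dup_name.tolist())  # [False, False, True, True]
```

The mask can be used to inspect the flagged rows before removing anything, for example df[df.duplicated()].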
.drop_duplicates()
To remove duplicate rows:
You can also restrict the comparison to a subset of columns:
Options for keep:
- "first" → keep first occurrence (default)
- "last" → keep last occurrence
- False → drop all duplicates