Dealing with Missing Data & Duplicates

Pandas Basics


Published Sep 29 2025, updated Sep 30 2025



Pandas, Python

In real-world datasets, missing data is extremely common. Pandas provides multiple ways to detect, remove, and impute (fill in) missing values.

How to handle missing data depends on the dataset and has to be decided case by case. The methods below cover the most common options.


Best Practices

  1. Always check missing values first
    • df.isnull().sum() → quick overview
  2. Don’t drop too eagerly
    • Dropping rows may introduce bias if the values are not missing at random.
    • Dropping columns is fine if they’re mostly missing.
  3. Choose imputation carefully
    • Categorical: Use "Unknown".
    • Numeric: Use mean/median or interpolate.
    • Time series: Forward/backward fill often works well.
  4. View outputs before applying
    • Use inplace=True only once you are ready to modify the original data. Without it, the method returns a new DataFrame and leaves the original untouched.

Example data used throughout this page:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", None],
    "Age": [25, np.nan, 35, 40],
    "City": ["New York", "Los Angeles", None, "Chicago"]
})






.isnull() / .isna()

Returns a Boolean mask indicating missing values (True = missing).

Example:

df.isnull()

Output:

    Name    Age   City
0  False  False  False
1  False   True  False
2  False  False   True
3   True  False  False
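
The mask can also be used to pull out the incomplete rows for inspection. The following is a minimal sketch that rebuilds the example data; the variable names row_has_missing and incomplete_rows are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", None],
    "Age": [25, np.nan, 35, 40],
    "City": ["New York", "Los Angeles", None, "Chicago"],
})

# True for each row that contains at least one missing value
row_has_missing = df.isnull().any(axis=1)

# Boolean indexing keeps only the rows flagged above
incomplete_rows = df[row_has_missing]
print(incomplete_rows)
```

Here rows 1, 2 and 3 are returned, because each of them has one missing field.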





.isnull().sum()

Counts missing values per column.

df.isnull().sum()


Output:

Name 1
Age 1
City 1
dtype: int64
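
Counts are easier to judge as a share of the column. Because True counts as 1, taking the mean of the mask gives the fraction of missing values. This is a small sketch on the example data; missing_share and total_missing are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", None],
    "Age": [25, np.nan, 35, 40],
    "City": ["New York", "Los Angeles", None, "Chicago"],
})

# Fraction of missing values per column (1 missing out of 4 rows = 0.25)
missing_share = df.isnull().mean()
print(missing_share)

# Total number of missing cells in the whole DataFrame
total_missing = int(df.isnull().sum().sum())
print(total_missing)
```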





.dropna()

Removes rows or columns with missing values.

df.dropna()

Options:

  • axis=0 → drop rows (default)
  • axis=1 → drop columns
  • how="any" → drop if any value missing (default)
  • how="all" → drop if all values missing
  • thresh=n → keep only rows with at least n non-NA values

# Drop rows where all values are missing
df.dropna(how="all")

# Drop columns with any missing values
df.dropna(axis=1, how="any")
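
As a rough illustration of thresh on the example data, the sketch below also uses subset, a dropna parameter not listed above that restricts the check to the named columns. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", None],
    "Age": [25, np.nan, 35, 40],
    "City": ["New York", "Los Angeles", None, "Chicago"],
})

# Keep rows with at least 3 non-missing values (only row 0 is complete)
complete_enough = df.dropna(thresh=3)

# Drop rows only when "Age" is missing; gaps in other columns are ignored
age_known = df.dropna(subset=["Age"])

print(complete_enough)
print(age_known)
```

Neither call changes df; both return a new DataFrame.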





.fillna()

Fill missing values with a specified value or method.

df.fillna("Unknown")

Output:

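      Name      Age         City
0    Alice     25.0     New York
1      Bob  Unknown  Los Angeles
2  Charlie     35.0      Unknown
3  Unknown     40.0      Chicago

Note that the missing Age is filled with "Unknown" as well, which turns the numeric column into an object column. Fill column by column when different columns need different values.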

Common strategies:

  • Constant value: df['City'].fillna('Unknown')
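  • Forward fill: df.ffill() (propagates the last valid value forward; a value stays missing if there is nothing before it to fill from). The older form df.fillna(method="ffill") is deprecated since pandas 2.1.
  • Backward fill: df.bfill() (propagates the next valid value backward; a value stays missing if there is nothing after it to fill from). The older form df.fillna(method="bfill") is deprecated since pandas 2.1.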
  • Column mean/median/mode (populate the field using an average value):
df['Age'].fillna(df['Age'].mean())
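
The strategies above can be sketched on the example data as follows. This is one possible combination, not a recommendation: the median is used for Age and a constant for City purely for illustration, and df_filled is a hypothetical name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", None],
    "Age": [25, np.nan, 35, 40],
    "City": ["New York", "Los Angeles", None, "Chicago"],
})

# Forward fill: each gap takes the last valid value above it
filled_forward = df.ffill()

# Backward fill: each gap takes the next valid value below it;
# the last Name stays missing because nothing follows it
filled_backward = df.bfill()

# Per-column fill on a copy, so the original df is left untouched
df_filled = df.copy()
df_filled["Age"] = df_filled["Age"].fillna(df_filled["Age"].median())
df_filled["City"] = df_filled["City"].fillna("Unknown")

print(filled_forward)
print(filled_backward)
print(df_filled)
```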





.replace

Sometimes missing values are encoded with placeholders ("NA", "?", -999).
You can standardise them with .replace() before handling:

# on the whole DataFrame
df.replace("?", np.nan, inplace=True)

# on just a column (assign the result back so df is updated)
df["City"] = df["City"].replace("?", np.nan)

You can map multiple replacements at once:

df["Name"] = df["Name"].replace({"NA": "Unknown", "Alice": "Alicia"})

.replace() works on:

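  • Whole DataFrame: df.replace("?", np.nan)
  • Single column (Series): df["City"].replace("?", np.nan)
  • Single row: df.loc[row_index].replace(old, new) (then assign the result back to df.loc[row_index])
  • Supports single values, lists, or dictionaries for mapping.
  • Returns a new object by default, so assign the result back. inplace=True works on a whole DataFrame, but calling it on a selected column (df["City"].replace(..., inplace=True)) may not update df in recent pandas versions.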
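
A short sketch of standardising several placeholders before counting missing values. The raw DataFrame and its Score column are made up for this illustration and are not part of the page's example data.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data where gaps are encoded as "?", "NA" or -999
raw = pd.DataFrame({
    "City": ["New York", "?", "NA", "Chicago"],
    "Score": [10, -999, 30, -999],
})

clean = raw.copy()

# A list replaces several placeholders in one call
clean["City"] = clean["City"].replace(["?", "NA"], np.nan)

# A single sentinel value in a numeric column
clean["Score"] = clean["Score"].replace(-999, np.nan)

# The placeholders now show up as real missing values
print(clean.isnull().sum())
```

Once the placeholders are converted, .isnull(), .dropna() and .fillna() treat them like any other missing value.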






.interpolate

  • Fills missing values (NaN) in a numeric column or DataFrame by estimating values based on other existing values.
  • Useful when you want to preserve trends rather than simply forward/backward filling or replacing with a constant.

Linear Interpolation (default):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "A": [1, np.nan, 3, np.nan, 5]
})

print(df.interpolate())

Output:

     A
0 1.0
1 2.0 # filled (1 → 3 linear gap)
2 3.0
3 4.0 # filled (3 → 5 linear gap)
4 5.0

Options:

  • .interpolate() estimates missing values instead of just copying or filling constants.
  • Default = linear interpolation down each column (axis=0), treating the values as equally spaced.
  • Can handle time-aware interpolation when using datetime indices.
  • Flexible: supports linear, time, polynomial, spline, nearest, etc.
  • Use limit and limit_direction for finer control.
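
To illustrate limit, limit_direction and the time method, here is a small sketch on made-up Series. The values and dates are assumptions chosen so the arithmetic is easy to follow.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Fill at most one consecutive gap, working forward from the last valid value
limited = s.interpolate(limit=1)

# Fill at most one gap from each side; the middle value stays missing
both_sides = s.interpolate(limit=1, limit_direction="both")

# With a datetime index, method="time" weights by the actual time distance
ts = pd.Series(
    [10.0, np.nan, 40.0],
    index=pd.to_datetime(["2025-01-01", "2025-01-02", "2025-01-04"]),
)
by_time = ts.interpolate(method="time")  # 1 of 3 days elapsed: about 20
by_position = ts.interpolate()           # ignores the dates: midpoint, 25

print(limited)
print(both_sides)
print(by_time)
print(by_position)
```

The difference between the last two results shows why the time method matters when observations are unevenly spaced.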




.duplicated

Detect duplicates - all columns:

df.duplicated()

Returns a Boolean Series in which True marks a row that repeats an earlier row. By default, all columns are compared and the first occurrence is marked False.



Detect duplicates - select columns:

df.duplicated(subset=["Name"])
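
The example data on this page has no duplicates, so the sketch below uses a small made-up DataFrame named people. It also passes keep=False, which .duplicated() accepts in the same way as .drop_duplicates(), to flag every copy rather than only the repeats.

```python
import pandas as pd

# Hypothetical data: row 2 repeats row 0 exactly, row 3 repeats only the name
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Alice"],
    "City": ["New York", "Chicago", "New York", "Boston"],
})

# Compare all columns: only the exact repeat is flagged
full_dupes = people.duplicated()

# Compare the Name column only: every later "Alice" is flagged
name_dupes = people.duplicated(subset=["Name"])

# keep=False flags all members of a duplicate group, including the first
all_copies = people.duplicated(subset=["Name"], keep=False)

print(people[all_copies])
```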




.drop_duplicates

To remove duplicate rows:

df.drop_duplicates(inplace=True)

Can also specify subset:

df.drop_duplicates(subset=["Name"], keep="first", inplace=True)

Options for keep:

  • "first" → keep first occurrence
  • "last" → keep last occurrence
  • False → drop every row that has a duplicate, keeping none of the copies
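
The three keep options can be compared side by side. This is a minimal sketch on a made-up orders DataFrame; it returns new DataFrames instead of using inplace=True so the results can be inspected first.

```python
import pandas as pd

# Hypothetical data: "Alice" appears twice
orders = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice"],
    "Amount": [10, 20, 30],
})

keep_first = orders.drop_duplicates(subset=["Name"], keep="first")  # rows 0, 1
keep_last = orders.drop_duplicates(subset=["Name"], keep="last")    # rows 1, 2
keep_none = orders.drop_duplicates(subset=["Name"], keep=False)     # row 1 only

print(keep_first)
print(keep_last)
print(keep_none)
```

The original index labels are preserved; call .reset_index(drop=True) on the result if a clean 0..n-1 index is needed.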

© 2025 SimpleSteps.guide