Handling Missing Data
Feature-engine, a Python library for feature engineering
Published Oct 3 2025
Mean Median Imputer
It replaces missing data with the mean or median value of the variable. It works only with numerical variables.
Configured with imputation_method='median' and variables=['Col1', 'Col5'], the imputer calculates the median of Col1 and of Col5 and uses each to fill that column's missing values. After the fit() method is run, you can view the learned parameters by inspecting imputer.imputer_dict_, which would display something like {'Col1': 3.0, 'Col5': 2.1}.
Arbitrary Number
It replaces missing data in numerical variables with an arbitrary number determined by the user.
With arbitrary_number=200 and variables=['Col9'], the imputer sets all missing values in Col9 to a hard-coded 200. After the fit() method is run, you can view the learned parameters by inspecting imputer.imputer_dict_, which would display {'Col9': 200}.
Categorical Imputer
It replaces missing data in categorical variables by an arbitrary value (typically with the label 'missing') or by the most frequent category.
Arbitrary value example:
This fills any missing departments or grades with an 'Unknown' value.
Most frequent example:
This fills any missing countries with whichever country is most frequent in the populated data.
Drop Missing Data
It deletes rows with missing values, similar to pandas' DataFrame.dropna(). It can handle numerical and categorical variables.
Used without arguments, it drops all rows with missing values. Columns can be restricted with the variables parameter, as with the other imputers, or a threshold parameter can be set to control which rows are dropped.