Handling Missing Data

Feature-engine, a Python library for feature engineering

Published Oct 3 2025

Mean Median Imputer

It replaces missing data with the mean or median value of the variable. It works only with numerical variables.

from feature_engine.imputation import MeanMedianImputer
imputer = MeanMedianImputer(imputation_method='median', variables=['Col1', 'Col5'])

This configures the imputer to fill missing values in Col1 and Col5 with each column's median. After the fit() method is run, you can view the learnt parameters by calling imputer.imputer_dict_, which would display something like {'Col1': 3.0, 'Col5': 2.1}.

Arbitrary Number

It replaces missing data in numerical variables with an arbitrary number determined by the user.

from feature_engine.imputation import ArbitraryNumberImputer
imputer = ArbitraryNumberImputer(arbitrary_number=200, variables=['Col9'])

This sets all missing values in Col9 to a hard-coded 200. After the fit() method is run, you can view the learnt parameters by calling imputer.imputer_dict_, which would display {'Col9': 200}.

Categorical Imputer

It replaces missing data in categorical variables by an arbitrary value (typically with the label 'missing') or by the most frequent category.

Arbitrary value example:

from feature_engine.imputation import CategoricalImputer
imputer = CategoricalImputer(imputation_method='missing', fill_value='Unknown', variables=['Department', 'Grade'])

This fills any missing departments or grades with the value 'Unknown'.

Most frequent example:

imputer = CategoricalImputer(imputation_method='frequent', variables=['Country'])

This fills any missing countries with the most frequent country found in the data when fit() is called.

Drop Missing Data

It deletes rows with missing values, similar to the pandas DataFrame.dropna() method. It can handle numerical and categorical variables.

from feature_engine.imputation import DropMissingData
imputer = DropMissingData()

This drops all rows that contain a missing value. As in the examples above, the variables parameter restricts the check to specific columns, and the threshold parameter sets the fraction of non-missing values a row must have to be kept.