How Feature-engine integrates with scikit-learn

Feature-engine, a Python library for feature engineering

3 min read

Published Oct 3 2025



Feature Engineering, Feature-engine, Machine Learning, Pandas, Python, scikit-learn, Transformers

What is the scikit-learn transformer API?

The scikit-learn transformer API is a standardised interface for preprocessing and modifying data in a way that can be consistently applied across training, testing, and new datasets. All transformers implement two main methods: fit() and transform(). The fit() method is used to learn parameters from the training data, such as the mean and standard deviation for scaling, the most frequent value for imputation, or the set of unique categories for encoding. Importantly, fit() does not change the data itself; it only calculates and stores the information required to apply the transformation later.


The transform() method is where the actual modifications to the data occur. It applies the parameters learned during fit() to generate the transformed output, such as filling missing values, scaling numeric features, encoding categories, or creating new derived columns. Many transformers also provide a combined fit_transform() method that runs both steps sequentially for convenience. Because transformers follow this API, they can be chained together in scikit-learn pipelines, allowing multiple preprocessing steps to be applied in order, ensuring that transformations are learned from the training set and applied consistently to test or unseen data. This design underpins the reproducibility and reliability of machine learning workflows in scikit-learn.
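The contract described above is easy to see in a custom transformer. Below is a minimal, hypothetical sketch (the class name ColumnMeanImputer is illustrative, not part of scikit-learn or Feature-engine) showing that fit() only stores parameters while transform() applies them:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

# Hypothetical transformer following the fit()/transform() contract.
class ColumnMeanImputer(BaseEstimator, TransformerMixin):
    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        # Learn and store the parameters; the data itself is not modified here.
        self.means_ = X[self.variables].mean()
        return self

    def transform(self, X):
        # Apply the stored parameters to produce the transformed output.
        X = X.copy()
        X[self.variables] = X[self.variables].fillna(self.means_)
        return X

df = pd.DataFrame({"a": [1.0, None, 3.0]})
imputer = ColumnMeanImputer(variables=["a"])
out = imputer.fit_transform(df)  # TransformerMixin provides fit_transform()
print(out)  # the missing value is replaced with the stored mean, 2.0
```

Because the class inherits from BaseEstimator and TransformerMixin, it also gets fit_transform() and plays nicely inside a Pipeline.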






What happens at fit vs transform?

  • fit():
    • Only computes and stores parameters from the training data.
    • Example: for a MeanImputer, it calculates the column mean and stores it.
    • Example: for a YearExtractor, it doesn’t need to compute anything — it just notes what column it will operate on.
  • transform():
    • Actually modifies the data: fills missing values, encodes categories, extracts new columns (like year or age), etc.
    • So new columns are created at transform time, not at fit.
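The YearExtractor mentioned above is a hypothetical example, but a minimal sketch of such a transformer makes the split concrete: fit() has nothing to learn, and the new column only appears at transform time.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

# Hypothetical YearExtractor; illustrative only, not a real library class.
class YearExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, variable):
        self.variable = variable  # the column this transformer will operate on

    def fit(self, X, y=None):
        # Nothing to compute or store; fit() simply returns self.
        return self

    def transform(self, X):
        # The new column is created here, at transform time.
        X = X.copy()
        X[self.variable + "_year"] = X[self.variable].dt.year
        return X

df = pd.DataFrame({"dob": pd.to_datetime(["1980-01-01", "1990-06-15"])})
out = YearExtractor(variable="dob").fit_transform(df)
print(out["dob_year"].tolist())  # [1980, 1990]
```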





How Feature-engine Works & Integrates with scikit-learn

Feature-engine is pandas-first:

  • Its transformers implement both the fit() and transform() methods, and both take a DataFrame as input.
  • transform() always returns a DataFrame as output.
  • If a transformer adds columns (like age), the new DataFrame includes those columns — with proper column names preserved.

This means that Feature-engine transformers can be used in a scikit-learn pipeline, and the next transformer in the pipeline automatically “sees” the new columns, no extra manual work needed. It also allows you to view/chart the data at each transform step when initially setting up and deciding on transformers that are required.


Here is an example (the data set is tiny, but it illustrates the sort of transforms that can be done):

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from feature_engine.imputation import MeanMedianImputer
from feature_engine.encoding import OneHotEncoder
from feature_engine.scaling import MeanNormalizationScaler

# Test dataset
data = pd.DataFrame({
    "Department": ["HR", "IT", "Sales", "Admin", "IT", "Sales"],
    "StartYear": [2010, 2018, 2015, None, 2020, 2018],
    "Salary": [40000, 60000, None, 40000, 50000, 80000],
    "DateOfBirth": pd.to_datetime([
        "1980-01-01", "1990-06-15", "1975-09-30",
        "2000-12-25", "1992-10-28", "1982-04-20"
    ]),
    "AnnualBonus": [2000, 8500, 3400, 2784, 3242, 6643]
})

# Calculate age and drop DateOfBirth
current_year = pd.Timestamp.now().year
data["Age"] = current_year - data["DateOfBirth"].dt.year
X = data.drop(["AnnualBonus", "DateOfBirth"], axis=1)
y = data["AnnualBonus"]

# Build pipeline
pipe = Pipeline([
    # 1. Impute missing numeric values with median
    ("imputer", MeanMedianImputer(imputation_method="median", variables=["StartYear", "Salary"])),

    # 2. Encode Department column
    ("encoder", OneHotEncoder(variables=["Department"], drop_last=True)),

    # 3. Scale numeric features
    ("scaler", MeanNormalizationScaler(variables=["StartYear", "Salary", "Age"])),

    # 4. Train model
    ("model", LinearRegression())
])

# Fit pipeline
pipe.fit(X, y)

# Transform features to see intermediate output without the train step
X_transformed = pipe[:-1].transform(X)
print(X_transformed)

Steps 1-3 (numbered in the code comments) are Feature-engine transformers; the final step in the pipeline is the machine learning model itself:

  • 1. MeanMedianImputer
    • fit() calculates median of StartYear and Salary.
    • transform() fills missing values with the stored median.
  • 2. OneHotEncoder
    • fit() identifies unique categories in Department.
    • transform() converts categories into dummy variables.
  • 3. MeanNormalizationScaler
    • fit() learns mean of numeric columns.
    • transform() scales numeric features consistently.
  • 4. LinearRegression
    • fit() trains model using transformed features.

Output:

   StartYear    Salary   Age  Department_HR  Department_IT  Department_Sales
0      -0.65 -0.333333  0.26              1              0                 0
1       0.15  0.166667 -0.14              0              1                 0
2      -0.15 -0.083333  0.46              0              0                 1
3       0.15 -0.333333 -0.54              0              0                 0
4       0.35 -0.083333 -0.22              0              1                 0
5       0.15  0.666667  0.18              0              0                 1

This is the output of the final transformer in the pipeline and is what would be fed into the final model step.






Including in scikit-learn pipeline

As the transformers implement the .fit() and .transform() methods, they can be included as part of a scikit-learn machine learning pipeline, allowing the transformers to be chained.

pipe = Pipeline([
    ("imputer", MeanMedianImputer(imputation_method="median", variables=["StartYear", "Salary"])),
    ("encoder", OneHotEncoder(variables=["Department"], drop_last=True)),
    ("scaler", MeanNormalizationScaler(variables=["StartYear", "Salary", "Age"])),
    ("model", LinearRegression())
])


© 2025 SimpleSteps.guide