How Feature-engine integrates with scikit-learn
Feature-engine, a Python library for feature engineering
3 min read
Published Oct 3 2025
What is the scikit-learn transformer API?
The scikit-learn transformer API is a standardised interface for preprocessing and modifying data in a way that can be consistently applied across training, testing, and new datasets. All transformers implement two main methods: fit() and transform(). The fit() method learns parameters from the training data, such as the mean and standard deviation for scaling, the most frequent value for imputation, or the set of unique categories for encoding. Importantly, fit() does not change the data itself; it only calculates and stores the information required to apply the transformation later.
The transform() method is where the actual modifications to the data occur. It applies the parameters learned during fit() to generate the transformed output, such as filling missing values, scaling numeric features, encoding categories, or creating new derived columns. Many transformers also provide a combined fit_transform() method that runs both steps sequentially for convenience. Because transformers follow this API, they can be chained together in scikit-learn pipelines, allowing multiple preprocessing steps to be applied in order and ensuring that transformations are learned from the training set and applied consistently to test or unseen data. This design underpins the reproducibility and reliability of machine learning workflows in scikit-learn.
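As a concrete sketch of this contract, here is scikit-learn's StandardScaler (chosen purely for illustration): fit() stores the learned statistics without touching the data, transform() applies them, and the same stored parameters are reused on new data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()

# fit() only learns and stores parameters; X_train itself is not modified.
scaler.fit(X_train)
print(scaler.mean_)  # [2.] -- stored, but not yet applied to any data

# transform() applies the stored parameters to produce the output.
X_train_scaled = scaler.transform(X_train)

# The same parameters learned from training data are applied to new data.
X_test_scaled = scaler.transform(X_test)

# fit_transform() runs both steps sequentially for convenience.
X_train_scaled_2 = StandardScaler().fit_transform(X_train)
```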
What happens at fit() vs transform()?
fit():
- Only computes and stores parameters from the training data.
- Example: for a MeanImputer, it calculates the column mean and stores it.
- Example: for a YearExtractor, it doesn't need to compute anything; it simply notes which column it will operate on.

transform():
- Actually modifies the data: fills missing values, encodes categories, extracts new columns (like year or age), etc.
- So new columns are created at transform time, not at fit time.
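The split can be made concrete with a minimal, hypothetical MeanImputer written against the scikit-learn API. The class name and behaviour here are illustrative only, not Feature-engine's actual implementation:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class MeanImputer(BaseEstimator, TransformerMixin):
    """Illustrative imputer: fit() stores column means, transform() fills NaNs."""

    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        # fit() only computes and stores parameters; X is left untouched.
        self.means_ = X[self.variables].mean().to_dict()
        return self

    def transform(self, X):
        # transform() is where the data is actually modified.
        X = X.copy()
        for col in self.variables:
            X[col] = X[col].fillna(self.means_[col])
        return X

df = pd.DataFrame({"Salary": [30000.0, None, 50000.0]})
imputer = MeanImputer(variables=["Salary"])
imputer.fit(df)              # df still contains the NaN after this call
out = imputer.transform(df)  # the NaN is replaced with the stored mean
```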
How Feature-engine Works & Integrates with scikit-learn
Feature-engine is pandas-first:
- It implements both fit() and transform() methods, and they take a DataFrame as input.
- transform() always returns a DataFrame as output.
- If a transformer adds columns (like age), the new DataFrame includes those columns, with proper column names preserved.
This means that Feature-engine transformers can be used in a scikit-learn pipeline, and the next transformer in the pipeline automatically “sees” the new columns, with no extra manual work needed. It also lets you view or chart the data at each transform step when initially setting up the pipeline and deciding which transformers are required.
Here is an example (with an extremely small data set, but it gives an idea of the sort of transforms that can be done). The first three steps in the pipeline are Feature-engine transformers; the final step is the actual machine learning model:
1. MeanMedianImputer: fit() calculates the median of StartYear and Salary; transform() fills missing values with the stored median.
2. OneHotEncoder: fit() identifies the unique categories in Department; transform() converts the categories into dummy variables.
3. MeanNormalizationScaler: fit() learns the mean of the numeric columns; transform() scales the numeric features consistently.
4. LinearRegression: fit() trains the model using the transformed features.
Output:
This is the output of the final transformer in the pipeline, and is what would be fed into the final model step.
Including in scikit-learn pipeline
As the transformers implement the .fit() and .transform() methods, they can be included as part of a scikit-learn machine learning pipeline, allowing the transformers to be chained.