How Feature-engine integrates with scikit-learn

Feature-engine, a Python library for feature engineering

3 min read

Published Oct 3 2025



Feature Engineering, Feature-engine, Machine Learning, Pandas, Python, scikit-learn, Transformers

What is the scikit-learn transformer API?

The scikit-learn transformer API is a standardised interface for preprocessing and modifying data in a way that can be consistently applied across training, testing, and new datasets. All transformers implement two main methods: fit() and transform(). The fit() method is used to learn parameters from the training data, such as the mean and standard deviation for scaling, the most frequent value for imputation, or the set of unique categories for encoding. Importantly, fit() does not change the data itself; it only calculates and stores the information required to apply the transformation later.


The transform() method is where the actual modifications to the data occur. It applies the parameters learned during fit() to generate the transformed output, such as filling missing values, scaling numeric features, encoding categories, or creating new derived columns. Many transformers also provide a combined fit_transform() method that runs both steps sequentially for convenience. Because transformers follow this API, they can be chained together in scikit-learn pipelines, allowing multiple preprocessing steps to be applied in order, ensuring that transformations are learned from the training set and applied consistently to test or unseen data. This design underpins the reproducibility and reliability of machine learning workflows in scikit-learn.
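The contract described above is easy to see in a custom transformer. Below is a minimal, hypothetical sketch (the class name ColumnMeanImputer is illustrative, not part of scikit-learn or Feature-engine) showing that fit() only stores parameters while transform() applies them:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

# Hypothetical transformer following the fit()/transform() contract.
class ColumnMeanImputer(BaseEstimator, TransformerMixin):
    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        # Learn and store the parameters; the data itself is not modified here.
        self.means_ = X[self.variables].mean()
        return self

    def transform(self, X):
        # Apply the stored parameters to produce the transformed output.
        X = X.copy()
        X[self.variables] = X[self.variables].fillna(self.means_)
        return X

df = pd.DataFrame({"a": [1.0, None, 3.0]})
imputer = ColumnMeanImputer(variables=["a"])
out = imputer.fit_transform(df)  # TransformerMixin provides fit_transform()
print(out)  # the missing value is replaced with the stored mean, 2.0
```

Because the class inherits from BaseEstimator and TransformerMixin, it also gets fit_transform() and plays nicely inside a Pipeline.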






What happens at fit vs transform?

  • fit():
    • Only computes and stores parameters from the training data.
    • Example: for a MeanImputer, it calculates the column mean and stores it.
    • Example: for a YearExtractor, it doesn’t need to compute anything — it just notes what column it will operate on.
  • transform():
    • Actually modifies the data: fills missing values, encodes categories, extracts new columns (like year or age), etc.
    • So new columns are created at transform time, not at fit.
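The YearExtractor mentioned above is a hypothetical example, but a minimal sketch of such a transformer makes the split concrete: fit() has nothing to learn, and the new column only appears at transform time.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

# Hypothetical YearExtractor; illustrative only, not a real library class.
class YearExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, variable):
        self.variable = variable  # the column this transformer will operate on

    def fit(self, X, y=None):
        # Nothing to compute or store; fit() simply returns self.
        return self

    def transform(self, X):
        # The new column is created here, at transform time.
        X = X.copy()
        X[self.variable + "_year"] = X[self.variable].dt.year
        return X

df = pd.DataFrame({"dob": pd.to_datetime(["1980-01-01", "1990-06-15"])})
out = YearExtractor(variable="dob").fit_transform(df)
print(out["dob_year"].tolist())  # [1980, 1990]
```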





How Feature-engine Works & Integrates with scikit-learn

Feature-engine is pandas-first:

  • Its transformers implement both the fit() and transform() methods, and both take a DataFrame as input.
  • transform() always returns a DataFrame as output.
  • If a transformer adds columns (like age), the new DataFrame includes those columns — with proper column names preserved.

This means that Feature-engine transformers can be used in a scikit-learn pipeline, and the next transformer in the pipeline automatically “sees” the new columns, no extra manual work needed. It also allows you to view/chart the data at each transform step when initially setting up and deciding on transformers that are required.


Here is an example (the data set is tiny, but it illustrates the sort of transforms that can be done):

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from feature_engine.imputation import MeanMedianImputer
from feature_engine.encoding import OneHotEncoder
from feature_engine.scaling import MeanNormalizationScaler

# Test dataset
data = pd.DataFrame({
    "Department": ["HR", "IT", "Sales", "Admin", "IT", "Sales"],
    "StartYear": [2010, 2018, 2015, None, 2020, 2018],
    "Salary": [40000, 60000, None, 40000, 50000, 80000],
    "DateOfBirth": pd.to_datetime([
        "1980-01-01", "1990-06-15", "1975-09-30",
        "2000-12-25", "1992-10-28", "1982-04-20"
    ]),
    "AnnualBonus": [2000, 8500, 3400, 2784, 3242, 6643]
})

# Calculate age and drop DateOfBirth
current_year = pd.Timestamp.now().year
data["Age"] = current_year - data["DateOfBirth"].dt.year
X = data.drop(["AnnualBonus", "DateOfBirth"], axis=1)
y = data["AnnualBonus"]

# Build pipeline
pipe = Pipeline([
    # 1. Impute missing numeric values with median
    ("imputer", MeanMedianImputer(imputation_method="median", variables=["StartYear", "Salary"])),

    # 2. Encode Department column
    ("encoder", OneHotEncoder(variables=["Department"], drop_last=True)),

    # 3. Scale numeric features
    ("scaler", MeanNormalizationScaler(variables=["StartYear", "Salary", "Age"])),

    # 4. Train model
    ("model", LinearRegression())
])

# Fit pipeline
pipe.fit(X, y)

# Transform features to see intermediate output without the train step
X_transformed = pipe[:-1].transform(X)
print(X_transformed)

Steps 1-3 (numbered in the code comments) are Feature-engine transformers; the final step in the pipeline is the machine learning model itself:

  • 1. MeanMedianImputer
    • fit() calculates median of StartYear and Salary.
    • transform() fills missing values with the stored median.
  • 2. OneHotEncoder
    • fit() identifies unique categories in Department.
    • transform() converts categories into dummy variables.
  • 3. MeanNormalizationScaler
    • fit() learns mean of numeric columns.
    • transform() scales numeric features consistently.
  • 4. LinearRegression
    • fit() trains model using transformed features.

Output:

   StartYear    Salary   Age  Department_HR  Department_IT  Department_Sales
0      -0.65 -0.333333  0.26              1              0                 0
1       0.15  0.166667 -0.14              0              1                 0
2      -0.15 -0.083333  0.46              0              0                 1
3       0.15 -0.333333 -0.54              0              0                 0
4       0.35 -0.083333 -0.22              0              1                 0
5       0.15  0.666667  0.18              0              0                 1

This is the output of the final transformer in the pipeline and is what would be fed into the final model step.






Including in scikit-learn pipeline

As the transformers implement the .fit() and .transform() methods, they can be included as part of a scikit-learn machine learning pipeline, allowing the transformers to be chained.

pipe = Pipeline([
    ("imputer", MeanMedianImputer(imputation_method="median", variables=["StartYear", "Salary"])),
    ("encoder", OneHotEncoder(variables=["Department"], drop_last=True)),
    ("scaler", MeanNormalizationScaler(variables=["StartYear", "Salary", "Age"])),
    ("model", LinearRegression())
])


© 2025 SimpleSteps.guide