Data Preprocessing and Feature Engineering

Scikit-learn Basics


Raw data is rarely ready for machine learning. It often contains missing values, inconsistent scales, or categorical variables that need encoding.


Data preprocessing is the process of transforming raw inputs into a clean, numerical format suitable for training.
Feature engineering extends this idea: it means creating or modifying features to improve model performance.


Scikit-learn provides robust, consistent tools for both preprocessing and feature engineering — all following the same familiar API (fit, transform, fit_transform).


This section covers:

  • Handling missing data
  • Scaling and normalisation
  • Encoding categorical variables
  • Feature selection and dimensionality reduction
  • Combining steps with Pipeline and ColumnTransformer





Why Preprocessing Matters

Machine learning algorithms assume data is numeric, clean, and comparable across features.


Without preprocessing:

  • Some models may fail outright, e.g. SVMs can’t handle NaNs.
  • Scale-sensitive algorithms may perform poorly.
  • Categorical data cannot be directly interpreted by numerical models.

Preprocessing ensures:

  • Numerical features are standardised
  • Categorical features are encoded
  • Missing data is handled consistently
  • The same transformation is applied to training and test sets





Handling Missing Data

Missing values are common in real datasets and must be handled before training. Scikit-learn provides the SimpleImputer class for this.


Strategies:

  • mean - replace missing values with the mean of the column
  • median - replace with the median (robust to outliers)
  • most_frequent - replace with the mode (useful for categorical)
  • constant - replace with a specific value (e.g. “unknown”)

Example: Using SimpleImputer

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([
    [1, 2],
    [np.nan, 3],
    [7, 6]
])

imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

print("Imputed data:\n", X_imputed)

Output:

Imputed data:
[[1. 2. ]
 [4. 3. ]
 [7. 6. ]]
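The constant strategy is handy for categorical columns. A minimal sketch, using a made-up city column:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# A made-up categorical column with one missing entry
cities = pd.DataFrame({'city': ['London', np.nan, 'Paris']})

# Replace missing categories with the placeholder string "unknown"
cat_imputer = SimpleImputer(strategy='constant', fill_value='unknown')
print(cat_imputer.fit_transform(cities))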

Always fit the imputer on your training data and then transform both train and test sets.
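For example, with a hypothetical train/test split (the array below is made up for illustration):

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

X = np.array([[1, 2], [np.nan, 3], [7, 6], [4, np.nan], [5, 8], [9, 1]])
X_train, X_test = train_test_split(X, test_size=0.33, random_state=0)

imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)  # column means learned from training rows only
X_test_imputed = imputer.transform(X_test)        # the same means are reused on the test rows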






Feature Scaling and Normalisation

Different features may have different numerical ranges, e.g. “income” (in thousands) vs “age” (in years).
Some models (like SVMs, KNN, and gradient descent–based algorithms) are sensitive to these differences.


Standardisation

Centres data to zero mean and unit variance:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

After scaling:

  • Mean = 0
  • Standard deviation = 1
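A quick check on a small made-up array confirms this:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately [0. 0.]
print(X_scaled.std(axis=0))   # [1. 1.]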

Normalisation

Scales individual samples (rows) to unit length:

from sklearn.preprocessing import Normalizer

normaliser = Normalizer()
X_norm = normaliser.fit_transform(X)

When to use:

  • StandardScaler: For most models (e.g. logistic regression, SVM, PCA)
  • MinMaxScaler: When values must remain within a specific range (e.g. [0,1]) - see the sketch below
  • Normalizer: For sparse data or distance-based models (cosine similarity)
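MinMaxScaler is not shown above; here is a minimal sketch with a made-up array:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

scaler = MinMaxScaler()  # default feature_range is (0, 1)
print(scaler.fit_transform(X))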





Encoding Categorical Features

Machine learning algorithms can’t directly interpret text labels. Encoding converts categories into numeric representations.


1. One-Hot Encoding

Creates binary (0/1) columns for each category.

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

data = pd.DataFrame({
    'city': ['London', 'Paris', 'London', 'Berlin']
})

encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(data[['city']])

print(encoder.get_feature_names_out())
print(encoded)

Output:

['city_Berlin' 'city_London' 'city_Paris']
[[0. 1. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]

2. Ordinal Encoding

Maps categories to integer values (use only for ordered categories):

from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

data = pd.DataFrame({'size': ['small', 'medium', 'large', 'medium']})
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
encoded = encoder.fit_transform(data[['size']])
print(encoded)
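Output:

[[0.]
 [1.]
 [2.]
 [1.]]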

Do not use ordinal encoding for unordered categories, as it implies a false numerical relationship.






Feature Selection

Not all features are equally informative. Removing irrelevant or redundant features can improve model performance, interpretability, and training speed.


Scikit-learn offers several feature selection methods:

  • VarianceThreshold - VarianceThreshold() removes features whose variance falls below a threshold
  • SelectKBest - SelectKBest(score_func=...) keeps the top K features according to a statistical test
  • RFE - RFE(estimator) recursively eliminates the weakest features

Example: SelectKBest with ANOVA F-test

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)

print("Original shape:", X.shape)
print("Reduced shape:", X_new.shape)

Output:

Original shape: (150, 4)
Reduced shape: (150, 2)
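RFE from the list above works in a similar way; a minimal sketch, using logistic regression purely as an example estimator:

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Recursively drop the weakest feature until two remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
X_rfe = rfe.fit_transform(X, y)

print(rfe.support_)   # boolean mask of the selected features
print(X_rfe.shape)    # (150, 2)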





Dimensionality Reduction (PCA)

Principal Component Analysis (PCA) is a popular method for reducing dimensionality while retaining as much variance as possible.
It projects features into a smaller number of orthogonal components.


Example:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print("Explained variance ratio:", pca.explained_variance_ratio_)

Visualising the first two principal components:

import matplotlib.pyplot as plt

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('PCA projection of Iris data')
plt.show()


PCA helps simplify visualisation and modelling, especially with high-dimensional data.






Combining Steps with ColumnTransformer

Real datasets often contain both numeric and categorical columns. You can preprocess each type separately using ColumnTransformer.


Example:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

data = pd.DataFrame({
    'age': [25, 30, 35, None],
    'income': [50000, 60000, None, 80000],
    'city': ['London', 'Paris', 'London', 'Berlin'],
    'purchased': [1, 0, 0, 1]
})

X = data[['age', 'income', 'city']]
y = data['purchased']

numeric_features = ['age', 'income']
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_features = ['city']
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

model.fit(X, y)

This ensures all preprocessing steps are learned from the training data and applied consistently.
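Once fitted, the same pipeline can make predictions on new rows; the values below are made up for illustration:

new_data = pd.DataFrame({
    'age': [28],
    'income': [55000],
    'city': ['Madrid']   # unseen category, handled by handle_unknown='ignore'
})

print(model.predict(new_data))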






Avoiding Data Leakage

Data leakage happens when information from the test set is used, directly or indirectly, during training.
This leads to overly optimistic performance estimates.


Common causes:

  • Scaling or imputing the entire dataset before splitting
  • Performing feature selection before splitting
  • Tuning hyperparameters using the test set

Prevention:

  • Always split data before any transformations (see the sketch below).
  • Use Pipeline or ColumnTransformer to encapsulate preprocessing.
  • Keep test data completely unseen until the final evaluation.
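A minimal sketch of the split-then-fit pattern, using the Iris dataset and a scaler plus logistic regression pipeline purely as an illustration:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Split first, before fitting any transformations
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000))
])

pipe.fit(X_train, y_train)         # scaling statistics come from the training set only
print(pipe.score(X_test, y_test))  # test data is only touched at final evaluation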




Best Practices

  1. Always standardise numeric features (especially for linear models or distance-based algorithms).
  2. Handle missing data systematically - avoid dropping rows unless absolutely necessary.
  3. Encode categorical variables carefully - use one-hot encoding for unordered categories.
  4. Combine steps into pipelines - it prevents leakage and simplifies your workflow.
  5. Document transformations - especially when sharing models for production or deployment.
