Data Preprocessing and Feature Engineering

Scikit-learn Basics


Raw data is rarely ready for machine learning. It often contains missing values, inconsistent scales, or categorical variables that need encoding.


Data preprocessing is the process of transforming raw inputs into a clean, numerical format suitable for training.
Feature engineering extends this idea: it means creating or modifying features to improve model performance.


Scikit-learn provides robust, consistent tools for both preprocessing and feature engineering — all following the same familiar API (fit, transform, fit_transform).


This section covers:

  • Handling missing data
  • Scaling and normalisation
  • Encoding categorical variables
  • Feature selection and dimensionality reduction
  • Combining steps with Pipeline and ColumnTransformer





Why Preprocessing Matters

Machine learning algorithms assume data is numeric, clean, and comparable across features.


Without preprocessing:

  • Some models may fail outright, e.g. SVMs can’t handle NaNs.
  • Scale-sensitive algorithms may perform poorly.
  • Categorical data cannot be directly interpreted by numerical models.

Preprocessing ensures:

  • Numerical features are standardised
  • Categorical features are encoded
  • Missing data is handled consistently
  • The same transformation is applied to training and test sets





Handling Missing Data

Missing values are common in real datasets and must be handled before training. Scikit-learn provides the SimpleImputer class for this.


Strategies:

  • mean - replace missing values with the mean of the column
  • median - replace with the median (robust to outliers)
  • most_frequent - replace with the mode (useful for categorical)
  • constant - replace with a specific value (e.g. “unknown”)

Example: Using SimpleImputer

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([
    [1, 2],
    [np.nan, 3],
    [7, 6]
])

imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

print("Imputed data:\n", X_imputed)

Output:

Imputed data:
[[1. 2. ]
 [4. 3. ]
 [7. 6. ]]
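The constant strategy is handy for categorical columns. A minimal sketch, using a made-up city column:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# A made-up categorical column with one missing entry
cities = pd.DataFrame({'city': ['London', np.nan, 'Paris']})

# Replace missing categories with the placeholder string "unknown"
cat_imputer = SimpleImputer(strategy='constant', fill_value='unknown')
print(cat_imputer.fit_transform(cities))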

Always fit the imputer on your training data and then transform both train and test sets.
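For example, with a hypothetical train/test split (the array below is made up for illustration):

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

X = np.array([[1, 2], [np.nan, 3], [7, 6], [4, np.nan], [5, 8], [9, 1]])
X_train, X_test = train_test_split(X, test_size=0.33, random_state=0)

imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)  # column means learned from training rows only
X_test_imputed = imputer.transform(X_test)        # the same means are reused on the test rows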






Feature Scaling and Normalisation

Different features may have different numerical ranges, e.g. “income” (in thousands) vs “age” (in years).
Some models (like SVMs, KNN, and gradient descent–based algorithms) are sensitive to these differences.


Standardisation

Centres data to zero mean and unit variance:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

After scaling:

  • Mean = 0
  • Standard deviation = 1
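A quick check on a small made-up array confirms this:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately [0. 0.]
print(X_scaled.std(axis=0))   # [1. 1.]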

Normalisation

Scales individual samples (rows) to unit length:

from sklearn.preprocessing import Normalizer

normaliser = Normalizer()
X_norm = normaliser.fit_transform(X)

When to use:

  • StandardScaler: For most models (e.g. logistic regression, SVM, PCA)
  • MinMaxScaler: When values must remain within a specific range (e.g. [0,1]) - see the sketch below
  • Normalizer: For sparse data or distance-based models (cosine similarity)
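MinMaxScaler is not shown above; here is a minimal sketch with a made-up array:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

scaler = MinMaxScaler()  # default feature_range is (0, 1)
print(scaler.fit_transform(X))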





Encoding Categorical Features

Machine learning algorithms can’t directly interpret text labels. Encoding converts categories into numeric representations.


1. One-Hot Encoding

Creates binary (0/1) columns for each category.

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

data = pd.DataFrame({
    'city': ['London', 'Paris', 'London', 'Berlin']
})

encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(data[['city']])

print(encoder.get_feature_names_out())
print(encoded)

Output:

['city_Berlin' 'city_London' 'city_Paris']
[[0. 1. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]

2. Ordinal Encoding

Maps categories to integer values (use only for ordered categories):

from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

data = pd.DataFrame({'size': ['small', 'medium', 'large', 'medium']})
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
encoded = encoder.fit_transform(data[['size']])
print(encoded)
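Output:

[[0.]
 [1.]
 [2.]
 [1.]]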

Do not use ordinal encoding for unordered categories, as it implies a false numerical relationship.






Feature Selection

Not all features are equally informative. Removing irrelevant or redundant features can improve model performance, interpretability, and training speed.


Scikit-learn offers several feature selection methods:

  • VarianceThreshold - VarianceThreshold() removes features whose variance falls below a threshold
  • SelectKBest - SelectKBest(score_func=...) keeps the top K features according to a statistical test
  • RFE - RFE(estimator) recursively eliminates the weakest features

Example: SelectKBest with ANOVA F-test

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)

print("Original shape:", X.shape)
print("Reduced shape:", X_new.shape)

Output:

Original shape: (150, 4)
Reduced shape: (150, 2)
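RFE from the list above works in a similar way; a minimal sketch, using logistic regression purely as an example estimator:

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Recursively drop the weakest feature until two remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
X_rfe = rfe.fit_transform(X, y)

print(rfe.support_)   # boolean mask of the selected features
print(X_rfe.shape)    # (150, 2)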





Dimensionality Reduction (PCA)

Principal Component Analysis (PCA) is a popular method for reducing dimensionality while retaining as much variance as possible.
It projects features into a smaller number of orthogonal components.


Example:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print("Explained variance ratio:", pca.explained_variance_ratio_)

Visualising the first two principal components:

import matplotlib.pyplot as plt

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('PCA projection of Iris data')
plt.show()


PCA helps simplify visualisation and modelling, especially with high-dimensional data.






Combining Steps with ColumnTransformer

Real datasets often contain both numeric and categorical columns. You can preprocess each type separately using ColumnTransformer.


Example:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

data = pd.DataFrame({
    'age': [25, 30, 35, None],
    'income': [50000, 60000, None, 80000],
    'city': ['London', 'Paris', 'London', 'Berlin'],
    'purchased': [1, 0, 0, 1]
})

X = data[['age', 'income', 'city']]
y = data['purchased']

numeric_features = ['age', 'income']
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_features = ['city']
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

model.fit(X, y)

This ensures all preprocessing steps are learned from the training data and applied consistently.
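Once fitted, the same pipeline can make predictions on new rows; the values below are made up for illustration:

new_data = pd.DataFrame({
    'age': [28],
    'income': [55000],
    'city': ['Madrid']   # unseen category, handled by handle_unknown='ignore'
})

print(model.predict(new_data))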






Avoiding Data Leakage

Data leakage happens when information from the test set is used, directly or indirectly, during training.
This leads to overly optimistic performance estimates.


Common causes:

  • Scaling or imputing the entire dataset before splitting
  • Performing feature selection before splitting
  • Tuning hyperparameters using the test set

Prevention:

  • Always split data before any transformations (see the sketch below).
  • Use Pipeline or ColumnTransformer to encapsulate preprocessing.
  • Keep test data completely unseen until the final evaluation.
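A minimal sketch of the split-then-fit pattern, using the Iris dataset and a scaler plus logistic regression pipeline purely as an illustration:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Split first, before fitting any transformations
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000))
])

pipe.fit(X_train, y_train)         # scaling statistics come from the training set only
print(pipe.score(X_test, y_test))  # test data is only touched at final evaluation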




Best Practices

  1. Always standardise numeric features (especially for linear models or distance-based algorithms).
  2. Handle missing data systematically - avoid dropping rows unless absolutely necessary.
  3. Encode categorical variables carefully - use one-hot encoding for unordered categories.
  4. Combine steps into pipelines - it prevents leakage and simplifies your workflow.
  5. Document transformations - especially when sharing models for production or deployment.
